It’d be interesting to see how much this changes if you restricted the training dataset to books written in the last twenty years; I suspect the model would be a lot less negative. Older books tend to include material that doesn’t fit modern ideals, and it’d be a real struggle to avoid this if such texts are used for training.
For example, I was recently reading a couple of the sequels to The Thirty-Nine Steps (written during WW1), and they contain multiple passages that really date them to an earlier era, with the main character casually throwing out jarringly racist remarks about black South Africans, Germans, the Irish, and basically anyone else who wasn’t properly English. Train an AI on that and you introduce the chance of problematic output - and chances are most LLMs have been trained on this series, since it’s now public domain and easily available.
I think the main problem with searching for fediverse posts isn’t that they’re not indexed, but that there’s no single tag to append when you want to find them. Searching for reddit posts was easy: you could type your keywords and stick ‘reddit’ or ‘site:reddit.com’ on the end. Now there are too many domains to keep track of, and you can’t rely on appending ‘lemmy’ to point a search engine at all Lemmy instances, let alone kbin/mbin ones.
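One workaround (a rough sketch, not a real fix - the instance list here is illustrative and far from exhaustive, since there’s no canonical registry to pull from) is to OR together ‘site:’ operators for the instances you care about:

```python
# Sketch: approximate the missing single search tag by combining
# "site:" filters for a hand-picked set of instances. The domains
# below are examples only -- you'd have to maintain this list yourself.
instances = ["lemmy.world", "lemmy.ml", "kbin.social", "fedia.io"]

def fediverse_query(keywords: str, domains: list[str]) -> str:
    """Build a search string with OR'd site: operators.

    Most major search engines support site: and OR, though how well
    they honour long OR chains varies between engines.
    """
    sites = " OR ".join(f"site:{d}" for d in domains)
    return f"{keywords} ({sites})"

print(fediverse_query("selfhosting tips", instances))
# -> selfhosting tips (site:lemmy.world OR site:lemmy.ml OR site:kbin.social OR site:fedia.io)
```

It works for the handful of big instances, but it’s exactly the maintenance burden a single ‘lemmy’-style tag would avoid.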