What are the most common public websites that LLMs train on?

Answer
Based on the search results, large language models (LLMs) are most commonly trained on the following kinds of public web sources:
  1. General web crawl data - LLMs are typically trained on large web crawl datasets, such as Common Crawl, that include a wide variety of content from public websites across the internet.
  2. Openly available datasets - Some LLMs are trained on specific public corpora such as Wikipedia and other freely available text collections.
  3. Social media data - LLMs may also incorporate data from public social media platforms like Twitter, Reddit, and others to capture more conversational and user-generated content.
  4. News articles - Many LLMs are trained on large collections of news articles and other journalistic content from reputable online publications.
  5. Books and ebooks - Some LLMs leverage digitized book and ebook data as part of their training corpus.
The search results indicate that the training data for state-of-the-art LLMs tends to be very large and diverse, drawing from a wide range of publicly available web sources. However, the exact details of the training data used for specific LLMs are often not fully disclosed by the companies and research labs that develop them.
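Because the exact training sets are rarely published, one practical way to see which websites dominate an openly released corpus is to tally hostnames across a sample of its document URLs. The following is a minimal Python sketch of that idea. The function name count_hosts and the sample URLs are made up for illustration and do not come from any real dataset.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(urls):
    """Count how often each hostname appears in an iterable of URLs.

    Hostnames are lowercased by urlsplit; a leading "www." is dropped so
    that "www.example.com" and "example.com" are counted together.
    Strings without a parsable hostname are skipped.
    """
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # not an absolute URL, nothing to count
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts


# Hypothetical sample; a real corpus would supply millions of URLs.
sample_urls = [
    "https://en.wikipedia.org/wiki/Language_model",
    "https://www.reddit.com/r/MachineLearning/",
    "https://en.wikipedia.org/wiki/Web_crawler",
    "https://example.com/news/story",
    "not a url",
]

host_counts = count_hosts(sample_urls)
top_hosts = host_counts.most_common(2)
```

Calling host_counts.most_common(n) returns the n most frequent hostnames with their counts. This sketch counts full hostnames, so subdomains such as en.wikipedia.org and de.wikipedia.org are tallied separately. Grouping them by registered domain would need a public-suffix list, which is left out here.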