According to the search results, OpenAI reportedly transcribed more than a million hours of YouTube videos to train GPT-4, its most advanced large language model at the time of the reports. The effort was part of a push to gather high-quality training data, which is crucial for developing and improving models like GPT-4. To make the transcription possible, the company developed its Whisper audio transcription model. OpenAI reportedly regarded the practice as legally questionable but believed it qualified as fair use. OpenAI president Greg Brockman was personally involved in collecting the videos.

OpenAI spokesperson Lindsay Held stated that the company curates unique datasets for each of its models to help them understand the world, drawing on numerous sources, including publicly available data and partnerships for non-public data. Google, which owns YouTube, has robots.txt files and Terms of Service that prohibit unauthorized scraping or downloading of YouTube content. Google spokesperson Matt Bryant said the company takes technical and legal measures to prevent such unauthorized use when it has a clear legal or policy basis to do so.

The search results indicate that training GPT-4 on YouTube transcripts was part of a broader strategy among AI companies to overcome the difficulty of finding enough diverse data to train their models effectively. Other sources used in this strategy included GitHub, chess move databases, and schoolwork content from Quizlet.
