A new study reveals that Meta's latest artificial intelligence model can reproduce nearly half of the first Harry Potter book from memory, raising fresh concerns about copyright infringement in AI training as the company faces mounting legal pressure from authors and publishers.
Researchers from Stanford, Cornell, and West Virginia University found that Meta's Llama 3.1 70B model has memorized 42 percent of "Harry Potter and the Sorcerer's Stone" and can accurately reproduce 50-word passages more than half the time when prompted. The findings, published this month, mark a dramatic increase from earlier versions of Meta's AI, which retained only 4.4 percent of the same book.
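The study's criterion can be made concrete: a 50-word passage counts as memorized when the model's chance of reproducing it verbatim, given the preceding text, exceeds 50 percent. Because that chance is the product of the model's per-token probabilities, every token must be near-certain for a passage to clear the bar. A minimal sketch of the arithmetic, using made-up per-token probabilities in place of a real model (the function names and numbers are illustrative, not from the study):

```python
import math

def passage_reproduction_prob(token_probs):
    """Probability of emitting the whole passage verbatim:
    the product of the per-token continuation probabilities
    (computed via log-space sum for numerical stability)."""
    return math.exp(sum(math.log(p) for p in token_probs))

def is_memorized(token_probs, threshold=0.5):
    """The 50-percent criterion described in the article: a passage
    counts as memorized when its reproduction probability exceeds 0.5."""
    return passage_reproduction_prob(token_probs) > threshold

# Hypothetical per-token probabilities for a short excerpt.
confident = [0.99, 0.98, 0.97, 0.99, 0.96]  # product ≈ 0.89 → memorized
uncertain = [0.90, 0.80, 0.85, 0.70, 0.75]  # product ≈ 0.32 → not memorized

print(is_memorized(confident))  # True
print(is_memorized(uncertain))  # False
```

The sketch shows why the threshold is demanding: even modest per-token uncertainty compounds multiplicatively, so only text the model has effectively stored verbatim clears 50 percent over a full 50-word span.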
The research team tested five major AI models on their ability to reproduce text from the Books3 dataset, a collection of approximately 200,000 books that included copyrighted works. Llama 3.1 70B showed the highest rates of verbatim reproduction, particularly for popular titles including "The Hobbit" and George Orwell's "1984."
"We'd expected to see some kind of low level of replicability on the order of one or two percent," Stanford law professor Mark Lemley told Understanding AI. "The first thing that surprised me is how much variation there is."
Memorization rates varied strikingly from book to book. While the model retained large portions of well-known works, it memorized only 0.13 percent of "Sandman Slim," a 2009 novel by Richard Kadrey.
The findings arrive as Meta faces a class-action copyright lawsuit, with Kadrey serving as the lead plaintiff. The study's authors suggest the results could influence ongoing litigation against AI companies that train models on copyrighted material without permission.
The U.S. Copyright Office released a report in May stating that where AI-generated outputs are substantially similar to training data, there is a "strong argument" that the models themselves infringe reproduction and derivative work rights.
Researchers cannot determine from their findings alone whether Llama 3.1 70B was trained directly on complete book texts or absorbed content through secondary sources like fan sites and reviews. The higher memorization rates for popular books compared with obscure titles suggest the model may have encountered these works repeatedly during training.
"There are really striking differences among models in terms of how much verbatim text they have memorized," said James Grimmelmann, a Cornell law professor who collaborated with the study's authors.