OpenAI’s models ‘memorized’ copyrighted content, new study suggests

TechCrunch - Apr 4th, 2025

A recent study co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford suggests that OpenAI's AI models, including GPT-4 and GPT-3.5, may have been trained on copyrighted materials without permission. The study introduces a novel method for detecting 'memorized' content in AI models by using 'high-surprisal' words to identify portions of text that models might have learned verbatim from their training data. Tests indicated that GPT-4 memorized sections of popular fiction and New York Times articles, highlighting potential copyright violations. This development could bolster ongoing lawsuits against OpenAI, brought by authors and programmers who claim unauthorized use of their works.
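The 'high-surprisal' probing idea described above can be sketched in miniature. This is a hedged illustration, not the study's actual implementation: the real method relies on an LLM's token probabilities and careful prompt construction, whereas here a toy unigram frequency table stands in for the language model, and `model_guess` is a hypothetical stand-in for querying the model being audited. The intuition is the same: rare, distinctive words are hard to guess from context, so a model that reliably fills them in may have seen the passage verbatim.

```python
# Toy sketch of high-surprisal probing for memorization (assumptions:
# a unigram frequency table replaces real LLM token probabilities, and
# model_guess() is a hypothetical hook for the model under audit).
import math
from collections import Counter

def surprisal(word, freqs, total):
    """Surprisal in bits: -log2 P(word). Unseen words get a smoothed count."""
    count = freqs.get(word, 0.5)  # additive smoothing for unseen words
    return -math.log2(count / total)

def pick_probe_words(passage, freqs, total, k=2):
    """Return the k highest-surprisal words in the passage: the rarest,
    most distinctive tokens, which a model is unlikely to guess from
    context alone unless it has memorized the passage."""
    words = passage.lower().split()
    return sorted(set(words), key=lambda w: -surprisal(w, freqs, total))[:k]

def memorization_score(passage, freqs, total, model_guess, k=2):
    """Fraction of masked high-surprisal words the model restores exactly.
    A high score hints at verbatim memorization; a sketch only, since
    substring masking and unigram surprisal are crude simplifications."""
    probes = pick_probe_words(passage, freqs, total, k)
    hits = 0
    for w in probes:
        masked = passage.lower().replace(w, "[MASK]")
        if model_guess(masked) == w:
            hits += 1
    return hits / len(probes)
```

In this framing, a model that restores the masked rare word ("sextant" in a sea of common words, say) scores high, while one that falls back on a contextually plausible guess scores low; the study applies the same logic at scale to excerpts from fiction and news articles.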

The implications of this study are significant in the ongoing debate over AI and copyright law. While OpenAI has advocated for broader fair use policies and has some licensing deals and opt-out mechanisms, the study underscores the need for greater transparency in AI training data. Co-author Abhilasha Ravichander emphasized the importance of having the ability to probe and audit AI models to ensure their trustworthiness. This research could influence future regulations and legal frameworks regarding the use of copyrighted content in AI training, potentially impacting how AI models are developed and deployed globally.

Story submitted by Fairstory

RATING

7.2
Fair Story
Consider it well-founded

The article presents a timely and relevant examination of the legal and ethical issues surrounding AI training and copyright infringement. It effectively highlights the controversies and ongoing legal battles involving OpenAI, providing a clear and accessible overview of complex topics. The use of credible sources, such as a study from reputable universities, adds authority to the claims made, though the article could benefit from more direct evidence and diverse perspectives to enhance its balance and transparency. Overall, the article serves as a valuable contribution to discussions about AI ethics and copyright law, with the potential to influence public opinion and policy debates.

RATING DETAILS

7
Accuracy

The story presents a detailed account of allegations that OpenAI used copyrighted content to train its AI models. It accurately reports the existence of lawsuits from authors and rights-holders, a verifiable fact, and its discussion of OpenAI's fair use defense aligns with ongoing legal debates about AI and copyright. However, the claim that a study by researchers from reputable universities identified memorization of copyrighted content in OpenAI's models needs further verification: while the study's methodology is described, the article offers no direct evidence from the study for the specific content the models allegedly memorized, leaving readers to seek out the study themselves for confirmation.

6
Balance

The article presents the issue primarily from the perspective of those accusing OpenAI of copyright infringement, including the study's findings and the lawsuits against OpenAI. While it notes OpenAI's fair use defense and advocacy for looser copyright restrictions, it does not delve deeply into OpenAI's rationale or provide quotes from OpenAI representatives. This creates an imbalance, as the article could benefit from more detailed representation of OpenAI's viewpoint or legal arguments to provide a fuller picture of the controversy.

8
Clarity

The article is well-structured and clearly explains complex topics, such as AI model training and the concept of 'high-surprisal' words. The language is accessible, and the flow of information is logical, which aids reader comprehension. However, the article could benefit from a more explicit delineation of the different perspectives involved, particularly OpenAI's, to enhance clarity regarding the multifaceted nature of the issue.

8
Source quality

The article references a study co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford, which are credible institutions. It also cites a doctoral student from the University of Washington, adding authority to the claims made. However, the article does not provide direct links to the study or additional sources that could validate the findings discussed. Including more diverse sources or expert opinions could enhance the credibility and depth of the reporting.

7
Transparency

The article provides a reasonable amount of context about the ongoing legal battles and the study's methodology for identifying memorized content. However, it lacks transparency in terms of specific data or excerpts from the study that support the claims made. The article could improve transparency by offering direct access to the study or more detailed explanations of the methods and findings, allowing readers to better assess the validity of the claims.

Sources

  1. https://arxiv.org/abs/2412.06370
  2. https://techcrunch.com/2025/04/01/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books/
  3. https://generative-ai-newsroom.com/fair-use-copyright-and-the-challenge-of-memorization-in-the-nyt-vs-openai-7f6c0a13f703
  4. https://hls.harvard.edu/today/does-chatgpt-violate-new-york-times-copyrights/
  5. https://dsi.appstate.edu/news/copyright-and-artificial-intelligence-your-academic-research-risk