Researchers suggest OpenAI trained AI models on paywalled O’Reilly books

TechCrunch - Apr 1st, 2025

OpenAI faces serious allegations in a new paper by the AI Disclosures Project, which accuses the company of training its GPT-4o model on non-public, paywalled books without proper licensing. The paper reports that GPT-4o strongly recognizes content from O'Reilly Media books, even though no licensing agreement is in place. The accusation rests on a method called DE-COP, which tests whether a language model has prior knowledge of specific texts, and it points to potential misuse of copyrighted material in AI training.

The findings are significant because they underscore ongoing concerns about copyright infringement in AI model training. OpenAI, which advocates for more lenient regulations on using copyrighted data, already faces several lawsuits over its data practices. Although the company says it has some licensing agreements and offers opt-out mechanisms for copyright owners, this new development could affect its legal battles and reputation. The broader AI industry also faces scrutiny as companies increasingly seek high-quality data sources, sometimes employing domain experts to enhance AI capabilities.

Story submitted by Fairstory

RATING

7.2
Fair Story
Consider it well-founded

The article provides a well-rounded examination of the accusations against OpenAI regarding the use of copyrighted content in AI training. It is timely and addresses issues of public interest, such as AI ethics and copyright law. The article is clear and readable, with a logical structure that aids in understanding complex topics. However, it could benefit from a more balanced representation of perspectives, including a response from OpenAI, and a more detailed exploration of the study's methodology and findings. Overall, the article effectively raises awareness of the ethical and legal challenges in AI development, but its impact may be limited by the lack of definitive conclusions or responses from all parties involved.

RATING DETAILS

7
Accuracy

The article accurately reports on the accusations against OpenAI regarding the use of copyrighted content without permission, as highlighted by the AI Disclosures Project. The claim that OpenAI's GPT-4o model may have been trained on paywalled O'Reilly books is supported by the study's findings, although the article notes the limitations of the methodology used. It correctly mentions that these limitations, together with potential alternative explanations such as user interactions contributing to the data pool, prevent a definitive conclusion. However, the article could improve verifiability by providing more detailed references to the study's methodology and results.

6
Balance

The article presents a reasonably balanced view by acknowledging both the accusations against OpenAI and the company's potential defenses, such as existing licensing agreements and opt-out mechanisms for copyright owners. However, it focuses predominantly on the accusations and the study's findings, with less emphasis on OpenAI's perspective or responses. The article notes that OpenAI did not comment, but it could benefit from a more comprehensive exploration of other viewpoints, such as broader industry practices regarding AI training data.

8
Clarity

The article is well-structured and uses clear, concise language to convey complex information about AI training and copyright issues. It logically outlines the accusations, the study's findings, and the potential implications for OpenAI. The tone is neutral, avoiding sensationalism, which aids in maintaining clarity. However, the article could improve by simplifying some technical aspects of the DE-COP method for readers unfamiliar with AI research methodologies.

8
Source quality

The article relies on a study by the AI Disclosures Project, a credible source co-founded by notable figures in the tech and economics fields. The involvement of Tim O'Reilly and Ilan Strauss lends authority to the claims made. However, the article could improve by including more diverse sources, such as independent experts or industry analysts, to provide additional context and perspectives. The absence of a direct response from OpenAI slightly weakens the source quality, as it limits the representation of all involved parties.

7
Transparency

The article is transparent about the study's methodology, mentioning the DE-COP method and its limitations. It clearly states the co-authors' acknowledgment of the method's imperfections and the potential for alternative explanations. However, the article could enhance transparency by providing more detailed information on how the study was conducted and any potential conflicts of interest, particularly given Tim O'Reilly's dual role as a co-author and CEO of O'Reilly Media.

Sources

  1. https://techcrunch.com/2025/04/01/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books/
  2. https://20fix.com
  3. https://asimovaddendum.substack.com/p/did-openai-train-on-copyrighted-book
  4. https://bitcoinworld.co.in/openai-gpt4o-paywalled-books-claim/
  5. https://www.ssrc.org/publications/beyond-public-access-in-llm-pre-training-data-non-public-book-content-in-openais-models/