Recent news articles have sensationalized claims that DeepSeek, an emerging AI company, "stole" OpenAI’s data in training its models. While intellectual property concerns in AI are legitimate, this narrative oversimplifies the legal and ethical nuances surrounding AI training data. As a lawyer specializing in AI and intellectual property law, I believe these reports misrepresent key facts and ignore critical context. Let’s break this down into two crucial points.
1. Much of OpenAI’s ChatGPT Training Data Isn’t Owned by OpenAI
A foundational flaw in the “stolen data” argument is the assumption that OpenAI possesses exclusive ownership over the training data that underlies ChatGPT. The reality is far more complex. OpenAI has publicly acknowledged that its models are trained on vast datasets scraped from the internet, including publicly available web content, books, and other sources OpenAI does not own. This means that OpenAI itself operates in a legal gray area concerning data usage—relying heavily on fair use arguments and open web content.
If DeepSeek’s model was trained on data obtained via methods similar to those OpenAI itself has employed, the "stolen data" argument becomes paradoxical. While proprietary architectures, model weights, and undisclosed proprietary datasets are protected, OpenAI cannot legally claim exclusivity over datasets that it never owned in the first place.
2. DeepSeek’s Use of ChatGPT Outputs Likely Violates OpenAI’s Terms of Use, But That’s a Contract Issue, Not Theft
What is more legally problematic for DeepSeek is not the data itself but how it was obtained. If DeepSeek utilized OpenAI’s API or ChatGPT to generate outputs for training its own model, this would likely constitute a breach of OpenAI’s Terms of Use, which explicitly prohibit using ChatGPT-generated content to train competing models. However, a violation of terms of service is a contract dispute, not theft.
Terms of service violations typically result in remedies such as account suspension, monetary damages, or injunctions—not criminal allegations. Equating this to "stealing OpenAI’s data" conflates a civil contract breach with outright misappropriation, misleading the public about the true nature of the dispute.
The Real Issue: Enforceability of AI Training Restrictions
This controversy highlights a broader issue in AI law: how enforceable are training restrictions in the first place? If OpenAI’s models are trained on data it does not own, with a significant share scraped from publicly accessible web sources, can OpenAI turn around and seek to restrict competitors from using OpenAI-generated output to train their models?
Courts have yet to decide definitively whether AI-generated outputs enjoy any copyright protection, and OpenAI generally cannot claim copyright in the training data from which its outputs derive. As a result, its ability to restrict the use of such AI-generated output may prove quite limited.
Conclusion: A Sensationalized Mischaracterization
While DeepSeek’s methods may raise valid contractual concerns, the media’s framing of this dispute as "theft" is likely misleading. DeepSeek’s possible violations of OpenAI’s Terms of Use—while potentially actionable—are a far cry from outright misappropriation. Instead of inflating this into a case of AI espionage, the focus should be on how AI companies can establish clear, enforceable norms for data usage in a rapidly evolving landscape.