Who gets your data when OpenAI goes bankrupt?
As more of our clients look to leverage AI, I’m often asked: What happens to my data if an AI provider (say, OpenAI) goes out of business? Let’s unpack that beyond headlines and gossip.
AI companies feed on personal and protected data
We’re seeing “AI-as-a-Service” firms ramp up data capture from free or non-enterprise tiers. The full body of content (or corpus) that LLMs train on keeps growing, and increasingly, in addition to public information and copyright-protected published content, it includes personal and unpublished data. That can mean inputs and outputs from developer prompts (integrated tooling), chats, and publicly accessible repositories like GitHub (which often contain sensitive information that was never intended to be public).
Consider Anthropic:
- In August 2025 they updated their Consumer Terms: Claude Free, Pro, and Max accounts now default to sharing your chat and code history for model training, with retention extended to five years, unless you explicitly opt out by September 28, 2025 (Anthropic, The Verge).
- Anthropic has also been mired in a copyright lawsuit, Bartz v. Anthropic, which alleges it trained on millions of copyrighted books downloaded from pirate websites. The suit remains in flux as of this writing, but while it is unclear whether AI companies can call training LLMs on copyrighted works fair use, it is clear that doing so on pirated copies is not. (The Authors Guild)
Training data can also be extracted from published LLMs—which has generated more litigation
Data can be pilfered even after it ends up inside an LLM. There are real, documented techniques (what researchers call “training data extraction attacks”) that can recover verbatim text used in training, including sensitive or copyrighted material (arXiv).
Further academic work clarifies why memorization and extraction are distinct concerns: models can inadvertently memorize substrings of their training data (a byproduct of overtraining), and those substrings can then surface in outputs under certain prompts (arXiv).
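To make the extraction risk concrete, here is a minimal, provider-agnostic sketch of how a verbatim-memorization probe works: feed the model the prefix of a document you suspect was in its training corpus and measure how closely its continuation matches the true remainder. The `complete()` function is a hypothetical placeholder for whatever SDK you use, not any specific provider API, and a high score is a signal worth investigating, not proof of training.

```python
# Minimal sketch of a verbatim-memorization probe (hedged illustration only).
# `complete()` is a hypothetical stand-in for your LLM provider's completion call.
from difflib import SequenceMatcher

def complete(prompt: str) -> str:
    """Hypothetical wrapper around your provider's completion API."""
    raise NotImplementedError("wire this up to your provider's SDK")

def memorization_score(known_text: str, prefix_chars: int = 200) -> float:
    """Prompt the model with the prefix of a known document and measure how
    closely its continuation matches the real remainder. Scores near 1.0
    suggest the passage may have been memorized during training."""
    prefix, remainder = known_text[:prefix_chars], known_text[prefix_chars:]
    continuation = complete(f"Continue this text exactly:\n\n{prefix}")
    return SequenceMatcher(None, continuation[:len(remainder)], remainder).ratio()
```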
These technical vulnerabilities are also playing out in court:
- Tremblay v. OpenAI: As early as January 27, 2025, a judge ordered OpenAI to produce its full training dataset for plaintiffs to review. (Debevoise Data).
- New York Times v. OpenAI: The NYT alleges that GPT models have reproduced Times articles verbatim, and OpenAI has been dragged into court over its data retention and discovery practices (WIRED, arXiv).
There are numerous class-action suits specifically related to scraping copyrighted content to train LLMs. These include authors suing over unauthorized use of their works and a suit against OpenAI by a coalition of Canadian news outlets. (Bloomberg Law, American Bar Association)
Business failure puts your data in greater danger of becoming merchandise
Most AI companies aren’t profitable yet. When one fails, creditors or buyers of the failed company’s assets (including model weights, training datasets, and prompts) could obtain and monetize that data. Think of 23andMe, where bankruptcy and liquidation risked exposing users’ genetic data to unscrupulous actors operating outside existing consumer protections. (Richard Gottlieb)
Your private prompts, developer chats, repositories, and even personal information (direct or inferred) could become accessible to parties with uncertain intentions, unless those parties strictly follow contractual safeguards or the data is appropriately segmented (both of which need to be in place now for a bankruptcy court or consumer privacy ombudsman to enforce them).
What to do, today: protect your GenAI usage and data
If you’re already using Generative AI tools to boost your developer (or personal) productivity, this section is for you. You should proactively take steps to protect your data.
- Require your provider to offer a “no training” tier, specifically one that does not train on prompt inputs or model outputs.
- Avoid tools or MCP (Model Context Protocol) integrations that log or archive your prompts/outputs where they could leak or be acquired in a bankruptcy or acquisition.
- Contractually enforce data control by insisting on specific data retention policies, deletion assurances, and, in the event of business failure or acquisition, clear legal assurance that your data cannot be transferred to third parties.
Example providers and which usage tiers protect your data from being used for training (summarized in a short sketch after this list):
- Anthropic – Use Claude Gov, Claude for Work, or the API through Bedrock/Vertex. Free, Pro, and Max all default to training unless you opt out by Sept 28, 2025 (The Verge, Tom’s Guide).
- OpenAI – Use Enterprise or a dedicated API with a “no-training” setting. The free tier and most non-enterprise usage may be used for training unless you opt out or move to an enterprise/API tier.
- xAI – Business and Enterprise tiers are excluded from training. Personal accounts must opt out. (Grok)
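If your organization uses several GenAI tools, it can help to encode your own summary of which provider/tier combinations are excluded from training and gate usage on it. The sketch below reflects the list above as of this writing; the provider names and tier labels are illustrative, and policies change, so treat it as a checklist you keep current rather than an authoritative source.

```python
# Sketch: encode the provider/tier summary above and gate GenAI usage on it.
# Verify against each provider's current terms before relying on this mapping.
NO_TRAINING_TIERS = {
    "anthropic": {"claude gov", "claude for work", "api via bedrock/vertex"},
    "openai": {"enterprise", "dedicated api (no-training)"},
    "xai": {"business", "enterprise"},
}

def is_training_exempt(provider: str, tier: str) -> bool:
    """Return True only for provider/tier pairs on your vetted no-training list;
    default to False, i.e. assume data may be used for training."""
    return tier.lower() in NO_TRAINING_TIERS.get(provider.lower(), set())

assert is_training_exempt("Anthropic", "Claude for Work")
assert not is_training_exempt("OpenAI", "Free")  # free tiers default to training
```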
In summary: ensure your AI provider offers a usage tier that explicitly exempts your data from training, and don’t rely on vague “privacy” marketing; the default usually grants the provider license to use your data unless you pay or opt out.
Don’t wait on this!
There is no shortage of ongoing discovery and debate regarding the potential risk of AI training on aggressively sourced datasets, whether professional or personal. Privacy and copyright law will evolve, but you need to understand the risks and take steps now to protect yourself and your organization.
- There is real risk that your data will be (or has already been) used as training data: AI companies can be unscrupulous in their quest for robust training data. Allegations of training on stolen data are currently playing out in court, and AI companies are updating their usage policies with increasingly favorable terms (for them!) to enable data harvesting.
- Your data, once in a training corpus, may be used to train many LLMs. This compounds risk, because there is no guarantee that any given trained LLM will have complete protections to safeguard its training data.
- Monitor industry developments that affect your data safety, such as acquisitions, changes in data storage practices, and bankruptcies.
- Proactively protect your data if you are using Generative AI: choose commercial tiers that prevent your data from being exposed or used without your consent.