Are We Running Out of Data?

// The Data Wall Thesis

In 2024, a quietly alarming paper from Epoch AI put numbers to a concern that had been circulating at the frontier of AI research for several years: the world may be approaching the limits of high-quality, human-generated text available for training large language models. The estimate was stark: if training datasets keep growing at their recent pace, frontier AI models could exhaust the stock of high-quality public text sometime between roughly 2026 and 2032. The "data wall" thesis subsequently became mainstream, appearing in earnings calls, research notes, and the strategic planning discussions of the major AI labs. GPT-4 reportedly trained on approximately 13 trillion tokens; GPT-5-class models are rumored to require several multiples of that. The math is not complicated: the internet is large, but it is finite, and models trained on web-scale data have now seen most of the usable portion, much of it multiple times. The immediate reaction from many commentators was to conclude that AI progress would plateau. We think this conclusion is wrong, but understanding why it is wrong reveals a set of investment implications that are more important than the data wall thesis itself.
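The arithmetic behind this kind of projection can be sketched in a few lines. The example below is a minimal illustration, not Epoch AI's model: it assumes training sets grow by a constant factor each year, and the stock sizes and growth rates are placeholder assumptions chosen only to show how sensitive the exhaustion date is to them. Only the ~13T starting point comes from the reported GPT-4 figure above.

```python
import math


def years_until_exhaustion(current_tokens: float,
                           stock_tokens: float,
                           annual_growth: float) -> float:
    """Years until a training set that grows by `annual_growth`x per year
    reaches the total available stock of tokens.

    Solves current_tokens * annual_growth**t = stock_tokens for t.
    """
    if annual_growth <= 1.0:
        raise ValueError("annual_growth must be > 1 for the stock to be reached")
    if current_tokens >= stock_tokens:
        return 0.0  # the stock is already exhausted
    return math.log(stock_tokens / current_tokens) / math.log(annual_growth)


# Starting point: the ~13T tokens reportedly used to train GPT-4.
CURRENT_TOKENS = 13e12

# Hypothetical scenarios (illustrative assumptions, not figures from this article).
STOCK_SCENARIOS = [100e12, 300e12, 1000e12]   # total usable tokens
GROWTH_SCENARIOS = [2.0, 3.0]                 # dataset growth per year

for stock in STOCK_SCENARIOS:
    for growth in GROWTH_SCENARIOS:
        years = years_until_exhaustion(CURRENT_TOKENS, stock, growth)
        print(f"stock={stock / 1e12:>6.0f}T  growth={growth:.1f}x/yr  -> {years:.1f} years")
```

Under these assumptions, even a tenfold larger stock buys only a few additional years, because exponential growth in dataset size quickly absorbs any fixed supply.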

  • ~13T: Tokens in GPT-4 training corpus (est.)
  • ~300B: High-quality English web pages indexed
  • 10–100×: Synthetic data scale-up potential

// Three Responses That Change the Calculus

The data wall thesis, taken in isolation, treats "data" as synonymous with "human-generated internet text." This is a category error that misses three structural responses already reshaping the landscape.

First: synthetic data generation. Models can generate training data for other models, provided the pipeline guards against the quality degradation known as "model collapse," in which models trained repeatedly on generated output lose the rarer cases in the original distribution. Data vendors such as Scale AI and the frontier labs' internal programs are producing high-quality synthetic datasets that could extend the effective training corpus by an estimated 10–100×.

Second: test-time compute scaling. OpenAI's o1 and o3 models demonstrated a different form of scaling: rather than pre-training on more data, they apply more compute at inference time through extended reasoning, which can continue to improve model capability without requiring more training data. This is a fundamental shift in where the value of compute is applied.

Third: specialized domain data. The "data wall" applies primarily to generic internet text. Much of the world's highest-value data (clinical trial records, legal case histories, engineering specifications, financial transaction data) sits inside companies and institutions, and has never been published, consistently structured, or made available for model training. Companies that hold these proprietary data assets and have the legal right to use them for training occupy a structurally advantaged position that the scaling laws debate does not touch.
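The intuition behind the second response, that spending more compute at inference can raise accuracy with no additional training data, can be illustrated with a deliberately simple stand-in. The sketch below uses majority voting over repeated independent samples (often called self-consistency), which is not OpenAI's method for o1 or o3. It assumes a binary task and a hypothetical model that answers each sample correctly with a fixed probability; the function name and the 0.6 figure are illustrative assumptions.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, n_samples: int) -> float:
    """Probability that a strict majority of `n_samples` independent answers
    is correct, when each answer is correct with probability `p_correct`.

    Assumes a binary task and an odd number of samples so ties cannot occur.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    if n_samples < 1 or n_samples % 2 == 0:
        raise ValueError("n_samples must be a positive odd integer")
    needed = n_samples // 2 + 1  # smallest count that forms a majority
    return sum(
        comb(n_samples, k) * p_correct**k * (1.0 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


# Hypothetical model that is right 60% of the time on a single attempt.
SINGLE_SAMPLE_ACCURACY = 0.6

for n in (1, 3, 11, 51):
    acc = majority_vote_accuracy(SINGLE_SAMPLE_ACCURACY, n)
    print(f"samples per question={n:>2}  accuracy={acc:.3f}")
```

In this toy setting, three samples already lift accuracy from 0.600 to 0.648, and the gain keeps growing with more samples. The model never sees new training data; the improvement comes entirely from additional inference-time compute, and it holds only when single-sample accuracy is above one half.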

// The Proprietary Data Moat Investment Thesis

For private equity investors, the data scarcity conversation resolves to a single, powerful investment thesis: companies with proprietary, structured, high-quality domain data are becoming structurally more valuable as the public internet corpus approaches exhaustion. This is not a new observation ("data moat" has been an investment buzzword for a decade), but the mechanics are now much sharper. A clinical diagnostics company that has been digitizing patient records for 20 years, a logistics platform that has processed 500 million shipments, a legal technology company with structured case outcome data: these businesses are sitting on assets that even the world's best-resourced technology companies cannot simply buy or recreate. The value of these assets will compound as frontier model development increasingly depends on proprietary domain data for fine-tuning, evaluation, and specialized capability development. At Covalent, our investment framework explicitly weights "proprietary data exhaust" and "training rights" as top-tier characteristics in evaluating potential portfolio companies. The businesses that generate irreplaceable data as a byproduct of doing their core work, as operational exhaust rather than as a data collection exercise, are the businesses most defensible against AI disruption and most likely to benefit from AI amplification. The data wall, properly understood, is not a threat to AI progress. It is an accelerant for the value of proprietary data assets, and a thesis we are investing behind now.

  • High-quality public internet text for AI pre-training faces a genuine long-term supply constraint.
  • Synthetic data, test-time compute scaling, and domain-specific datasets are the three structural responses.
  • Companies with proprietary, structured, high-quality domain data are becoming structurally more valuable.
  • "Data exhaust" — data generated as a byproduct of operations — is the most defensible form of data moat.
  • The data wall is not a constraint on AI progress; it is an accelerant for proprietary data asset values.

// SOURCES & FURTHER READING

  1. Muennighoff, N. et al. "Scaling Data-Constrained Language Models." arXiv, 2023. [arXiv]
  2. Shumailov, I. et al. "The Curse of Recursion: Training on Generated Data Makes Models Forget." arXiv, 2023. [arXiv]
  3. OpenAI. "Learning to Reason with LLMs." OpenAI Blog, 2024. (o1 model card) [OpenAI]
  4. Villalobos, P. et al. "Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data." Epoch AI, 2024. [arXiv]
  5. Scale AI. "Enterprise AI Data Report 2024." [Scale AI]