Challenging the AI Data Shortage Myth
Do smaller purpose-built AIs change the narrative?
A friend sent me Leopold Aschenbrenner's take on artificial intelligence's likely near-term evolution, "Situational Awareness: The Decade Ahead." Aschenbrenner, who worked on superalignment at OpenAI, wrote a treatise that is worth the read: it is both expansive and detailed. I got caught up on Aschenbrenner's assertion that "…we're running out of internet data." I questioned this assertion back in November and am still skeptical when it's made. If we shift from web-scale general-purpose AIs to domain-specific or purpose-built AIs, I think the "we're running out of data!" meme all but evaporates, leaving the boogeyman of data curation, clean-up, and enrichment in its place.
Looking at the “Running Out of Data” Meme
When I wrote my piece back in late November, I cited Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn's "Will we run out of data? Limits of LLM scaling based on human-generated data." Their key finding (at the time) was often referenced without an important qualifier (in italics below, italics mine):
Our analysis indicates that the stock of high-quality language data will be exhausted soon; likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later; between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images).
Click the link now and you get the most recent version of the paper, which is substantively different (even though it sits under the same title at the same link). The abstract now reads:
Our findings indicate that if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained.
…with no qualifier.
(Quick aside: to an analyst, version control matters. It was disconcerting to see what is effectively a new paper using the same title and link as an earlier version, because I could be talking with a colleague about Villalobos et al.'s paper asking "Will we run out of data?" and we could be talking about two substantively different papers. This pushes knowledge management onto the user and muddies the waters around important questions.)
In any case, Villalobos et al.'s discussion of the methodology underpinning their analysis is interesting:
They note that Common Crawl—an open-source collection of data scraped from over 250 billion web pages—serves as the basis for most open web datasets. Villalobos et al. describe Common Crawl as “a subset of the indexed web.”
In a table, they estimate Common Crawl at roughly 130 trillion tokens, the indexed web at roughly 510 trillion tokens, and the entire web at roughly 3,100 trillion tokens.
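The relative sizes are worth pausing on. Here is the back-of-envelope arithmetic on those table estimates (the token counts are theirs; the percentages are simply my arithmetic on top of them):

```python
# Villalobos et al.'s table estimates, in tokens; the ratios below are my own
# back-of-envelope arithmetic on top of their numbers.
common_crawl = 130e12    # ~130 trillion tokens
indexed_web = 510e12     # ~510 trillion tokens
entire_web = 3_100e12    # ~3,100 trillion tokens

print(f"Common Crawl as a share of the indexed web: {common_crawl / indexed_web:.0%}")  # ~25%
print(f"Indexed web as a share of the entire web:   {indexed_web / entire_web:.0%}")    # ~16%
print(f"Common Crawl as a share of the entire web:  {common_crawl / entire_web:.1%}")   # ~4.2%
```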
So, when we are talking about using web data for training LLMs, we are talking about a subset of a subset…and it turns out that you can trim the dataset even more for optimal performance. Villalobos et al. note, “Marion et al. (2023) found that pruning around 50% of deduplicated data from a subset of Common Crawl using a perplexity measure led to optimal performance.”
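Marion et al.'s pipeline is more involved than what fits here, but the core move is simple: score every document with a smallish language model and keep only a slice of the perplexity distribution. Below is a minimal sketch of that idea; the scoring model (GPT-2 via Hugging Face transformers), the keep-the-middle selection rule, and the 50% keep fraction are illustrative assumptions on my part rather than the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in scoring model; Marion et al. used their own reference models.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of a single document under the scoring model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def prune(docs: list[str], keep_fraction: float = 0.5) -> list[str]:
    """Sort documents by perplexity and keep the middle keep_fraction,
    dropping the lowest- and highest-perplexity extremes equally."""
    scored = sorted(docs, key=perplexity)
    drop = int(len(scored) * (1 - keep_fraction) / 2)
    return scored[drop:len(scored) - drop] if drop else scored

docs = [
    "The mitochondria is the powerhouse of the cell.",
    "click here click here click here buy now buy now",
    "Common Crawl is a nonprofit that publishes web crawl data.",
    "asdf qwer zxcv uiop hjkl 12345 67890",
]
print(prune(docs, keep_fraction=0.5))
```

The specific heuristic matters less than the underlying point: the "usable" training set is already a curated fraction of what the crawlers collect.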
Data Volume and AI Function
As a liberal arts guy with an intelligence background who was an enterprise-level system architect for a cadre of top-tier analytic users, I continue to grapple with one question:
How large does any one organization need the language model that underpins its enterprise generative AI to be?
If I am OpenAI (or Google or Anthropic), I need to work at web scale because I have built a general-purpose platform designed to serve a broad but poorly defined user base (and simultaneously compete against other web-scale general purpose AIs). That these platforms work so well with non-average questions in specialized domains of knowledge is remarkable given that they are working off samples of (massive amounts of) human knowledge.
If, however, I wanted to build the best purpose-built generative AI for an organization's mission (or missions), I wonder how my data collection, enrichment, and processing strategies would change. Does a "smaller" (which I'd assume to still be fairly large) mission-focused language model outperform massive, general-purpose language models at the enterprise level?
For as much important and interesting basic science as is still being done on generative AI, we're at a point where I feel like we should be seeing a lot more purpose-built generative AIs…and those AIs are really about collecting, organizing, and enriching information so that it is aligned to organizational mission areas.
Aligning Information, Artificial Intelligences, and Organizational Missions
Consider Harvard University’s mission:
Harvard’s mission is to advance new ideas and promote enduring knowledge.
My read is that there would be two core categories of information: information they produce (a backwards-looking, knowledge-management-based AI to support enduring knowledge) and information they use to advance new ideas (an AI that uses information to support research being done in any of its 11 degree-awarding schools).
If I look at the Kennedy School of Government, it has 12 centers and 80+ initiatives. I suspect that if the initiatives were clustered, we could probably tease out 2-3 dozen subject areas that could inform information collection and AI design. The amount of data that'd be used for the knowledge management AI likely would be relatively small. The amount of information used for the research support AI would be orders of magnitude larger…but nowhere near web scale.
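For what it's worth, the clustering step I'm hand-waving at is not exotic. Here is a minimal sketch of one way to do it, grouping short initiative descriptions into candidate subject areas; the initiative titles are invented placeholders (not the Kennedy School's actual list), and TF-IDF plus k-means is just one deliberately simple choice of method.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented placeholder descriptions; a real run would use the actual
# center/initiative descriptions pulled from the organization's own material.
initiatives = [
    "Democratic governance and electoral integrity",
    "Carbon pricing and clean energy transitions",
    "Digital platforms, misinformation, and public opinion",
    "Municipal finance and urban infrastructure",
    "Nuclear nonproliferation and arms control",
    "Climate adaptation policy for coastal cities",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(initiatives)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cluster in range(3):
    print(f"Candidate subject area {cluster}:")
    for text, label in zip(initiatives, labels):
        if label == cluster:
            print(f"  - {text}")
```

In practice I'd expect sentence embeddings rather than TF-IDF, a tuned cluster count, and subject-matter experts sanity-checking the groupings before any of it drives collection.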
Polluting the Internet: The Irony of Generative AI and Internet Data
Looking back at Aschenbrenner's piece and Villalobos et al.'s research in the context of enterprise or purpose-built AIs, I struggle to worry about the supply of data. Data processing and clean-up will unlock new supplies of information, and it would not surprise me if that work is done first (and possibly best) at the level of the enterprise or organization working in domains that are central to its mission.
The larger looming problem, as I see it, is the ability of people to rapidly generate oceans of low-quality text using generative AIs. Readers have struggled for years to differentiate between original and sponsored content in newspapers (e.g., "Fewer than one in 10 people can distinguish online sponsored content from news articles"); now the cost of generating plausible-sounding content in almost any domain is dropping to nil.
For the enterprise, this can be mitigated by its data strategy, how it designs its AIs, and how it trains its people. For consumers, though, the story is more complex: anyone can create a polished, professional website containing AI-generated content that may have little to no relationship with reality (e.g., "ChatGPT and Fake Citations"), and broad-spectrum collectors like Common Crawl might not have the mechanisms needed to discern low-quality content.
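To make the "mechanisms" point a bit more concrete: many open training-data pipelines lean on shallow heuristic filters (the C4 and Gopher rule sets are well-known examples). A minimal sketch of that kind of filter is below; the specific thresholds are illustrative assumptions, not published values. Notably, fluent AI-generated text would sail straight through checks like these, which is exactly the problem.

```python
import re

def passes_quality_heuristics(doc: str) -> bool:
    """Shallow document-quality checks, loosely in the spirit of C4/Gopher-style
    filters. All thresholds are illustrative assumptions."""
    words = doc.split()
    if len(words) < 50:                                   # too short to be substantive
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):                    # gibberish or run-together tokens
        return False
    if sum(c.isalpha() for c in doc) / len(doc) < 0.6:    # symbol- or markup-heavy
        return False
    sentences = [s.strip() for s in re.split(r"[.!?]", doc) if s.strip()]
    if sentences and len(set(sentences)) / len(sentences) < 0.7:  # heavy repetition
        return False
    return True
```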
More broadly, for as much as we talk about the need for trustworthy AI, we also need to be talking about information quality, media literacy, and critical thinking skills, all of which will be challenged as AI continues to replace search (or discovery-driven scanning) as a common means of interrogating and interacting with web-based data.

