The task of separating useful data from deepfakes, junk, and spam is getting harder for the data scientists looking to train the next generation of AI.

It’s difficult to say exactly how much data exists on the internet at any one time. Billions of gigabytes are created and destroyed every day. Still, if we try to capture the scope of the data that exists online, estimates put the figure at around 175 zettabytes as of 2022.

A zettabyte is equal to 1,000 exabytes, or 1 trillion gigabytes, by the way. That’s (roughly) 3.5 trillion Blu-ray copies of Blade Runner: The Director’s Cut. If you converted all the data on the internet into Blu-ray copies of Blade Runner: The Director’s Cut, and smashed every disc after watching it, you could keep watching Blade Runner for the better part of a billion years before you ran out of copies.
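For the sceptical, the back-of-the-envelope maths looks something like this (assuming roughly 50 GB per dual-layer disc and a 117-minute runtime, both of which are my own assumptions rather than anything official):

```python
# Rough back-of-the-envelope maths. Assumed figures: ~50 GB per dual-layer
# Blu-ray disc and a 117-minute runtime for the Director's Cut.
DATA_BYTES = 175e21           # ~175 zettabytes of data online (2022 estimate)
DISC_BYTES = 50e9             # ~50 GB per dual-layer Blu-ray
RUNTIME_HOURS = 117 / 60      # Blade Runner: The Director's Cut runtime

copies = DATA_BYTES / DISC_BYTES                      # ~3.5 trillion discs
viewing_years = copies * RUNTIME_HOURS / (24 * 365)   # ~0.8 billion years

print(f"{copies:.2e} copies, {viewing_years:.2e} years of viewing")
```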

Was that a weird, tortured metaphor? Yes. Was it any weirder or more unnecessary than Jared Leto’s presence in Blade Runner 2049? Absolutely not. But I digress. The sheer amount of data out there in the world is mind-boggling. It’s hard to fit into metaphors and defies real-world comparisons.

Also, it seems we’re going to run out of it, and it might happen as early as 2030. 

We’re running out of (good) data?

The value of data has skyrocketed over the past few years. A global preoccupation with extracting, measuring, analysing, and—above all—monetising data defined the past decade. Big data has profoundly impacted our politics, entertainment, social spheres, and economies. 

Awareness of what can be accomplished with data—from optimising e-commerce revenues to cybercrime and putting people like Donald Trump in positions of political power—has led to a frenzied scramble for the stuff. Organisations have frantically gathered as much data as possible: any and all information about environmental conditions, personal spending habits, racial demographics, political bias, financial markets, and more has been swept up into huge pools of Big Data. Data is the world’s most valuable resource, and like many other valuable resources, the rate at which we’re consuming it is turning out to be unsustainable.

AI training models are to blame

However, there’s a problem related to the hot new use for huge data sets: training AI models.

“The gigantic volume of data that people stored but couldn’t use has found applications,” wrote Atanu Biswas, a professor at the Indian Statistical Institute in Kolkata. “The development and effectiveness of AI systems — their ability to learn, adapt and make informed decisions — are fuelled by data.”

Training a large language model like the one that fuels OpenAI’s ChatGPT takes a lot of data. It took approximately 570 gigabytes of text data, about 300 billion words, to train ChatGPT. AI image generators are even hungrier: diffusion models like those powering DALL-E and Midjourney are trained on billions of image-text pairs (the open dataset behind Stable Diffusion contains over 5.8 billion of them), all to generate weird, unpleasant pictures where the hands are all wrong, the kind of AI imagery Hayao Miyazaki described as “an insult to life itself.”

This is because these generative AI models “learn” by ingesting an almost unfathomable amount of data and then using statistical probability to create results based on the patterns observable in that data.

Basically, what you put in defines what you get out.
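To make that concrete, here’s a deliberately tiny sketch of the same idea (a toy bigram model, nothing remotely like a real large language model): it counts which word tends to follow which in its training text, then generates new text by sampling from those counts. Change the training text and the output changes with it.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word in the text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, start, length=10):
    """Pick each next word in proportion to how often it was observed."""
    word, output = start, [start]
    for _ in range(length):
        followers = counts.get(word)
        if not followers:
            break
        choices, weights = zip(*followers.items())
        word = random.choices(choices, weights=weights)[0]
        output.append(word)
    return " ".join(output)

corpus = "the model learns patterns from data and the model repeats the patterns it learns"
model = train_bigram(corpus)
print(generate(model, "the"))
```

Real models work with billions of parameters rather than a word-count table, but the principle is the same: statistical patterns in, statistical patterns out.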

Bad data poisons AI models

The huge reserves of data used to train these generative AI models are increasingly starting to look thin on the ground. Sure, there’s a brain-breakingly large amount of data out there, but putting low-quality—even dangerous—data into a model can produce low-quality—even dangerous—results.

Information sourced from social media platforms can carry bias and prejudice, or spread disinformation and illicit material, any of which may be unwittingly absorbed by the model.

For example, Microsoft trained an AI chatbot, Tay, on Twitter data in 2016. Almost immediately, the endeavour produced outputs tainted with racism and misogyny. Another problem is that, as the amount of AI-generated content on the internet increases, new models could end up being trained by cannibalising the content created by old models. Since AI can’t create anything “new”, only rephrase existing content, development would stagnate.
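That cannibalisation problem has a name in the research literature: model collapse. You can watch the basic mechanism in a crude toy experiment (again, a sketch under simplified assumptions, not how anyone actually measures this): fit a simple word-frequency model on a corpus, sample a new “synthetic” corpus from it, refit on the sample, and repeat. Rare words drop out of each generation’s sample, so the vocabulary keeps shrinking.

```python
import random
from collections import Counter

def sample_corpus(freqs, n_words):
    """Draw a synthetic corpus of n_words from a word-frequency model."""
    words, weights = zip(*freqs.items())
    return random.choices(words, weights=weights, k=n_words)

# A "real" starting corpus: a few very common words plus many rare ones.
corpus = ["the", "and", "of"] * 200 + [f"word{i}" for i in range(500)]

for generation in range(5):
    freqs = Counter(corpus)              # "train" on the current corpus
    corpus = sample_corpus(freqs, 1000)  # generate the next corpus from the model
    print(f"generation {generation}: {len(set(corpus))} distinct words survive")
```

Run it and the distinct-word count falls with every generation, which is roughly what happens, far more subtly, when large models are trained on each other’s output.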

As a result, developers are locked in an increasingly desperate hunt for “better” content sources. These include books, online articles, scientific papers, Wikipedia, and specially curated web material. For instance, Google’s AI Assistant was trained using around 11,000 romance novels. The nature of the data supposedly made it a better conversationalist (and, one presumes, a hornier one?). The problem is that this kind of data—books, research papers, and so on—is a limited resource.

The paper “Will we run out of data?” suggests that the point of data exhaustion could be alarmingly close. Comparing the projected “growth of training datasets for vision and language models” to the growth of available data, the authors concluded that “we will likely run out of language data between 2030 and 2050” and that “we will likely run out of vision data between 2030 and 2070.”

Where will we get our AI training data in the future? 

There are several ways this problem could resolve itself. Popular proposals include smaller language models and even synthetic data created specifically to train AIs. There has even been an open letter calling for a pause on training the most powerful AI systems, signed by Elon Musk and Steve Wozniak, among others.

“This is an existential risk,” commented Geoffrey Hinton, one of AI’s most prominent figures, shortly after quitting Alphabet last year. “It’s close enough that we ought to be … putting a lot of resources into figuring out what we can do about it.”

One hellish vision of the future appeared during the 2023 actors’ strike, when the MIT Technology Review reported that tech firms were offering out-of-work actors $150 per hour to portray a range of emotions on camera. The captured footage was then used to help ‘train’ AI systems.

At least we won’t all lose our jobs. Some of us will be paid to write new erotic fiction to power the next generation of Siri. 
