Social media sites are seeking new revenue by selling users’ content to train generative AI models.

Generative artificial intelligence (AI) companies like OpenAI, Google, and Microsoft are on the hunt for new training data. In 2022, a research paper warned that we could run out of high-quality data on which to train image-generating diffusion models and large language models (LLMs) as soon as 2026. Since then, AI firms have reportedly found a potential source of new information: social media.

Social media offers “vast” amounts of usable training data

In February, it was revealed that the social media site Reddit had struck a deal with a large AI company. The $60 million per year agreement will see the company train its generative AI using content created by Reddit’s users. The buyer was later revealed to be Google, which is locked in a bitter AI race with OpenAI and Microsoft.

The deal will reportedly provide Google with an “efficient and structured way to access the vast corpus of existing content on Reddit.”

The move caused significant controversy in the run-up to Reddit’s expected public offering. A week later, social media platform Tumblr and blog hosting platform WordPress also announced that they would be selling their users’ data to Midjourney and OpenAI.

The race for AI training data  

These developments mark an evolution of an existing trend. Increasingly, the AI industry is shifting from unpaid data scraping towards a model in which the owners of data are paid for it. Recently, OpenAI was revealed to be paying between $1 million and $5 million a year to license copyrighted news articles from outlets like the New York Times and the Washington Post to train its AI models.

In December 2023, OpenAI also signed an agreement with Axel Springer. The German publisher is being paid an undisclosed sum for access to articles published by Politico and Business Insider. OpenAI has also struck deals with other organisations, including the Associated Press, and is reportedly in licensing talks with CNN, Fox, and Time.

However, it is one thing for a content creation (or journalistic) organisation to license out the content it creates and distributes. The sale of public and private user data generated on social media is an entirely different matter. Of course, such data is already sold and mined heavily for advertising purposes. Income from the sale of personal data makes up the majority of revenue at social media sites like Facebook.

If social media content is mined to train the next generation of AI, it’s essential that user data is anonymised. This may be less of an issue on sites like Reddit and Tumblr, where user identities are already concealed. However, the race for AI training data continues to gather pace. Soon, AI companies may look towards less anonymised sites like Instagram and X (formerly Twitter).
