Small language model AIs trained on more data have the potential to be more ethical than large models trained on less information.

The emergence of sophisticated generative artificial intelligence (AI) applications—including image generators like Midjourney and conversational chatbots like OpenAI’s Chat-GPT—has sent shockwaves through the economy and popular culture in equal measure. The technology, made accessible to a massive audience in a short span of time, has attracted immense interest, investment, and controversy.

Aside from criticisms rooted in the role played by generative AI in creating sexually explicit deepfakes of Taylor Swift, spreading misinformation, and enforcing prejudicial biases, the most prominent controversy surrounding the technology stems from the legal and ethical issues relating to the data used to train large language models (LLMs).

Generative AI large language models on unstable ethical ground

According to Chat-GPT 3.5 itself, LLMs are “trained on a vast dataset of text from various sources, including books, articles, websites, and other publicly available written material. This data helps us learn patterns and structures of language to generate responses and assist users.” 

Essentially, an LLM scrapes billions of lines of text from across the internet in order to train its learning model. Because generative AI consumes so much information, it can convincingly mimic human writing and “create” responses based on the data it has examined. However, authors, journalists, and several news organisations have raised concerns. The issue they highlight is that an LLM scraping content written by human authors amounts, in effect, to uncredited and unpaid use of those writers’ work.

Chat-GPT generates the response that “while large language models learn from existing text, they do so within legal and ethical boundaries, aiming to respect intellectual property rights and promote responsible usage.” 

A statement by the European Writers’ Council contradicts the claim. “Already, numerous criminal and damaging ‘AI business models’ have developed in the book sector – with fake authors, fake books and also fake readers,” the council says in a letter. The fundamental process of developing large language models such as GPT, Meta, StableLM, and BERT, the council argues, rests on using uncredited copyrighted work. These works, it asserts, are sourced from “shadow libraries such as Library Genesis (LibGen), Z-Library (Bok), Sci-Hub and Bibliotik – piracy websites.”

More ethical generative AI? Start by thinking smaller

AI developers train the most publicly visible forms of generative AI, like Chat-GPT and Midjourney, using billions of parameters. Therefore, these large language models need to crawl the web for every possible scrap of information in order to build up the quality of their responses. However, several recent developments in generative AI are “challenging the notion that scale is needed for performance.” 

For example, the most recent version of OpenAI’s engine, Chat-GPT-4, operates using 1.5 billion parameters. That might sound like a lot, but the previous version, GPT-3.5, uses 175 billion parameters.

Large language models are, one generation at a time, shrinking in size while their performance improves. Microsoft has created two small language models (SLMs) called Phi and Orca which, under certain circumstances, outperform large language models. 

Unlike earlier generations—trained on vast diets of disorganised, unvetted data—SLMs use “curated, high-quality training data” according to Vanessa Ho from Microsoft.

They are more specific in scope, use less computing power (and therefore less energy—another relevant criticism of generative AI models), and could produce more reliable results when trained with the right data—potentially making them more useful from a business point of view. In 2022, DeepMind demonstrated that training smaller models on more data yields better performance than training larger models on less data.
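The DeepMind result rests on a simple compute trade-off: training cost grows with both model size and data size, so a fixed budget can buy a smaller model fed far more data. A minimal sketch of that arithmetic, assuming the commonly cited approximations that training compute is roughly 6 FLOPs per parameter per token and that the compute-optimal data size is around 20 tokens per parameter (the published model and token counts for DeepMind's Gopher and Chinchilla are used as illustrative inputs):

```python
# Sketch of the "compute-optimal" training trade-off DeepMind reported in 2022.
# Assumptions (rules of thumb, not exact figures): training compute is about
# 6 FLOPs per parameter per token, and the compute-optimal dataset is roughly
# 20 tokens per parameter.

def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

def optimal_tokens(params: float) -> float:
    """Rule-of-thumb compute-optimal data size: ~20 tokens per parameter."""
    return 20 * params

# Chinchilla: 70B parameters trained on 1.4T tokens.
small = training_flops(70e9, 1.4e12)
# Gopher: 280B parameters trained on 300B tokens.
large = training_flops(280e9, 300e9)

# The two budgets are of the same order of magnitude...
print(f"smaller model compute: {small:.2e} FLOPs")
print(f"larger model compute:  {large:.2e} FLOPs")
# ...but the smaller model sees ~20 tokens per parameter versus ~1.
print(f"tokens per parameter (smaller): {1.4e12 / 70e9:.0f}")
print(f"tokens per parameter (larger):  {300e9 / 280e9:.1f}")
```

Under these assumptions the two training runs cost roughly the same, yet the smaller model is trained on nearly five times as much text—which is the sense in which "more data, fewer parameters" wins at a fixed budget.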

AI needs to find a way of escaping its ethically dubious beginnings if the technology is to live up to its potential. The transition from large language models to smaller, higher quality data training sets would be a valuable step in the right direction.

