Karolis Toleikis, Chief Executive Officer at IPRoyal, takes a closer look at large language models and how they’re powering the generative AI future.

Since the launch of ChatGPT captured the global imagination, the technology has attracted questions about how it works. Some of these questions stem from a growing interest in the field of AI design. Others are the result of suspicion as to whether AI models are being trained ethically.

Indeed, there’s good reason to have some level of skepticism towards generative AI. After all, current iterations of Large Language Models use underlying technology that’s extremely data-hungry. Even a cursory glance at the amount of information needed to train models like GPT-4 indicates that documents in the public domain were never going to be enough.

But I’m going to leave the ethical and legal questions to better-trained specialists in those fields and look at the technical side of AI. The development of generative AI is a fascinating occurrence, as several distinct yet closely related disciplines had to progress to the point where such an achievement became possible.

While there are numerous different AI models, each accomplishing a separate goal, most share similar underlying technologies and requirements. So, I’ll be focusing on Large Language Models, as they’re likely the most familiar type of AI model to most people.

How do LLMs work?

There are a few key concepts everyone should understand about AI models, as I often see them conflated into one:

A Large Language Model (LLM) is a broad term for any language model that is trained on a large amount of (usually) human-written text and is primarily used to understand and generate human-like language. Every LLM is part of the Natural Language Processing (NLP) field.

A Generative Pre-trained Transformer (GPT) is a type of LLM introduced by OpenAI. Unlike some other LLMs, its primary goal is specifically to generate human-like text (hence, “generative”). Pre-trained means the model is first trained on a massive corpus of text before being fine-tuned for specific tasks.

The transformer is the part of GPT that people are most often confused by. While GPTs were introduced by OpenAI, transformers were initially developed by Google researchers in a breakthrough paper called “Attention Is All You Need”.

One of the major breakthroughs was the implementation of self-attention. This allows a transformer-based model to evaluate all the words in a sequence at once. Previous iterations of language models had numerous issues, such as putting more emphasis on the most recent words.
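
To make self-attention concrete, here’s a minimal sketch of scaled dot-product attention in Python using NumPy. It is an illustration under simplifying assumptions, not production code: the token vectors are made up, and real transformers project each token into separate learned query, key, and value matrices, which this sketch skips.

import numpy as np

def self_attention(X):
    # X has shape (sequence_length, embedding_dim): one vector per token.
    # Every token attends to every other token at once, instead of the
    # sequence being read strictly left to right.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise relevance between tokens
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ X                               # each output mixes information from all tokens

# Three toy 4-dimensional token vectors standing in for a short sentence.
tokens = np.array([
    [0.1, 0.3, 0.0, 0.5],
    [0.2, 0.1, 0.4, 0.0],
    [0.0, 0.5, 0.1, 0.2],
])
print(self_attention(tokens))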

While the underlying technology of a transformer is extremely complex, the basics are that it converts words (for language models) into mathematical vectors in a high-dimensional space. Earlier approaches would only convert single words, placing them in that space so that related words sit closer together (such as “king” and “queen” being nearer to each other than “cat” and “king”). A transformer is able to evaluate an entire sentence, allowing much better contextual understanding.
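
As a rough illustration of that vector intuition, the toy embeddings below are hand-picked purely for demonstration – real embeddings are learned during training and have hundreds or thousands of dimensions – but they show how distance in the vector space can encode relatedness.

import numpy as np

# Hand-made three-dimensional "embeddings" for illustration only;
# real models learn these vectors and use far more dimensions.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "cat":   np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    # Close to 1.0 means the vectors point the same way; near 0 means unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high – related words
print(cosine_similarity(embeddings["king"], embeddings["cat"]))    # noticeably lower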

Almost all current LLMs use transformers as their underlying technology. Some refer to non-OpenAI models as “GPT-like.” However, that may be a bit of an oversimplification. Nevertheless, it’s a handy umbrella term.

Scaling and data

Anyone who has spent some time analysing natural human language will quickly realize that language, as a concept or technology, is one of the most complicated things ever created. Philosophers and linguists have spent decades trying to decipher even small aspects of natural language.

Computers have another problem – they don’t get to experience language as we do. So, as with the aforementioned transformers, language has to be converted into a mathematical representation, which poses significant challenges by itself. Couple that with the enormous complexity of our daily use of language – from humor to ambiguity to domain-specific jargon – and you get a set of largely unspoken rules that most of us understand only intuitively.
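
To give a flavour of what “converting language into a mathematical representation” means in practice, here is a deliberately simplistic sketch of the first step – turning words into numbers (tokenization). The word-level scheme below is an assumption made for brevity; production LLMs use learned subword tokenizers such as byte-pair encoding.

# A deliberately simplistic word-level tokenizer; real LLMs use learned
# subword schemes (e.g. byte-pair encoding) built from a huge corpus.
def build_vocabulary(texts):
    vocabulary = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary

def tokenize(text, vocabulary):
    # Words the model has never seen fall back to a reserved ID of -1.
    return [vocabulary.get(word, -1) for word in text.lower().split()]

corpus = ["the king spoke to the queen", "the cat slept"]
vocab = build_vocabulary(corpus)
print(tokenize("the queen spoke to the cat", vocab))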

Intuitive understanding, however, isn’t all that useful when you need to convert those rules into mathematical representations. So, instead of attempting to feed the rules to machines directly, the idea was to give them enough data to glean the intricacies of language themselves. Unavoidably, that means machine learning models have to encounter lots of different expressions, uses, applications, and other aspects of language. There’s simply no way to provide all of these within a single text or even a modest corpus of texts.

Finally, most machine learning models face scaling law problems. Most business folk will be familiar with diminishing returns – at some point, each dollar invested into an aspect of a business starts generating smaller returns. Machine learning models, GPTs included, face exactly the same issue. To get from 50% accuracy to 60% accuracy, you may need twice as much data and computing power as before. Getting from 90% to 95% may require hundreds of times more.
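
Purely to illustrate the shape of the problem, the snippet below assumes a simple power-law relationship between error and data (a common simplification in scaling-law discussions, with a made-up exponent, not a claim about any specific model). It shows how the data needed for each extra point of accuracy balloons.

# Illustrative only: assume error shrinks as a power law of dataset size,
# i.e. error = data_size ** -0.1 (the exponent is invented for demonstration).
def data_needed_for_accuracy(target_accuracy, exponent=0.1):
    error = 1.0 - target_accuracy
    return error ** (-1.0 / exponent)

for accuracy in (0.50, 0.60, 0.90, 0.95):
    print(f"{accuracy:.0%} accuracy -> ~{data_needed_for_accuracy(accuracy):,.0f} data units")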

Currently, the challenge seems largely unavoidable, as it’s simply part of how the technology works; it can only be optimised.

Web scraping and AI

It should be clear by now that no matter how many books were written before the invention of copyright, there wouldn’t be nearly enough data for models like GPT-4 to exist. Given the enormous data requirements – and the existence of an OpenAI web crawler – it’s safe to assume that, beyond publicly available datasets, OpenAI (and likely many of its competitors) used web scraping to gather the information needed to build their LLMs.

Web scraping is the process of creating automated scripts that visit websites, download the HTML files, and store them internally. HTML files are intended for browser rendering, not data analysis, so the downloaded information is largely gibberish. Web scraping systems therefore also include a parsing step that cleans up the HTML so that only the valuable data remains. Many companies already use these tools to extract information such as product pricing or descriptions. LLM companies parse and format content in such a way that it resembles regular text, like a blog post. Once a website has been parsed, it’s ready to be fed into the LLM.
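
A minimal sketch of that download-and-parse step in Python, using the requests and BeautifulSoup libraries. The URL is a placeholder, and keeping only paragraph tags is an assumption for illustration; real pipelines add rate limiting, robots.txt checks, retries, and far more cleanup.

import requests
from bs4 import BeautifulSoup

def scrape_article_text(url):
    # Download the raw HTML, exactly as a browser would receive it.
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Parse the HTML and keep only the text inside paragraph tags,
    # discarding navigation, scripts, styling and other markup.
    soup = BeautifulSoup(response.text, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return "\n\n".join(paragraphs)

# Placeholder URL for illustration only.
print(scrape_article_text("https://example.com/some-blog-post"))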

All of this is used to acquire the contents of blog posts, articles, and other textual content. It’s being done at a remarkable scale.

Problems with web scraping

However, web scraping runs into two issues. First, websites aren’t usually all that happy about a legion of bots sending thousands of requests per second. Second, there’s the question of copyright. To deal with the first, most web scraping companies use proxies – intermediary servers that make changing IP addresses easy – which circumvents blocks, intentional or not. Proxies also allow companies to acquire localised data, which is extremely important to some business models, such as travel fare aggregation.
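
For illustration, this is roughly what routing a scraping request through a proxy looks like with the requests library. The proxy address and credentials are placeholders; real setups rotate through large pools of such endpoints, often picked by country or city when localised data is needed.

import requests

# Placeholder proxy endpoint and credentials – not a real service.
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

# The target site sees the proxy's IP address rather than the scraper's,
# so blocks tied to a single IP are far less effective.
response = requests.get("https://example.com/", proxies=proxies, timeout=10)
print(response.status_code)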

Copyright is a burning question in both the data acquisition and AI model industries. While the current stance is that publicly available data is, in most cases, alright to scrape, there are questions about basing an entire business model on data that is, in some sense, replicated as text through an AI model.

Conclusion

There are a few key technologies that have collided to create the current iteration of AI models. Most of the familiar ones are based on machine learning, particularly the invention of the transformer.

Transformers can take textual data and convert it into vectors; however, their key advantage is the ability to take larger pieces of text (such as sentences) and look at them in their entirety. Previous technologies were usually only capable of evaluating individual words.

Machine learning, however, has the problem of being data-hungry, and exponentially so. Web scraping was utilized in many cases to acquire terabytes of information from publicly available sources.

All of that data, in OpenAI’s case, was cleaned up and fed into a GPT. The model is then often fine-tuned through human feedback to get better results out of the same corpus of data.

Inventions like ChatGPT (or chatbots with LLMs in general) are simply wrappers that make interacting with GPTs a lot easier. In fact, the chatbot part of the model might just be the simplest part of it.
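
As a hedged sketch of what such a wrapper amounts to: the loop below simply accumulates the conversation and hands the full history to a text-generation function on every turn. generate_reply() is a stand-in for whatever model or API is actually behind the chatbot, not a real library call.

def generate_reply(conversation):
    # Stand-in for a call to an actual LLM or its API.
    return "placeholder model output"

def chat():
    conversation = []  # the full history is re-sent on every turn for context
    while True:
        user_message = input("You: ")
        if user_message.lower() in {"quit", "exit"}:
            break
        conversation.append({"role": "user", "content": user_message})
        reply = generate_reply(conversation)
        conversation.append({"role": "assistant", "content": reply})
        print("Assistant:", reply)

if __name__ == "__main__":
    chat()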
