Able to understand multiple types of input, multi-modal models represent the next big step in generative AI refinement.

Generative artificial intelligence (AI) has arrived. If 2022 was the year that generative AI exploded into the public consciousness, 2023 was the year the money started rolling in. Now, 2024 is the year when investors start to scrutinise their returns. PitchBook estimates that generative AI startups raised about $27 billion from investors last year. OpenAI alone was projected to rake in as much as $1 billion in revenue in 2024, according to Reuters.

This year, then, is when AI must take all-important steps towards maturity. If generative AI is to deliver on its promises, it needs to develop new capabilities and find real-world applications.

Currently, it looks like multimodal AI is going to be the next true step-change in what the technology can deliver. If investors are right, multimodal AI will deliver the kind of universal-input-to-universal-output functionality that would make generative AI commercially viable.

What is multimodal AI? 

A multimodal AI model is a form of machine learning that can process information from different “modalities”, including images, video, and text. Such a model can then, in theory, produce results in a variety of formats as well.

For example, an AI with a multimodal machine learning model at its core could be fed a picture of a cake and generate a written recipe in response, or vice versa.
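
To make the idea concrete, here is a minimal sketch of how that image-to-recipe request might look in code, using the OpenAI Python SDK as one example of a multimodal API. The model name, image URL, and prompt are illustrative placeholders, and other providers expose similar interfaces.

# Minimal sketch: send an image plus a text prompt to a multimodal model
# and get text back. The image URL and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable (multimodal) model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Write a recipe for the cake in this photo."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cake.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)  # the generated recipe, as plain text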

Why is multimodal AI a big deal? 

Multimodal models represent the next big step forward in how developers enhance AI for future applications. 

For instance, according to Google, its Gemini AI can understand and generate high-quality code in popular languages like Python, Java, C++, and Go, freeing up developers to create more feature-rich apps. This code could be generated in response to anything from simple images to a voice note. 
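
As a rough illustration of that workflow, the sketch below uses Google's google-generativeai Python SDK to ask a Gemini model to turn a whiteboard photo into Python code. The API key, model name, file name, and prompt are assumptions made for illustration, and exact SDK details may differ between versions.

# Rough sketch: ask a Gemini model to generate code from an image prompt.
# The API key, model name, and image file are illustrative placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-flash")  # a multimodal Gemini model
sketch = Image.open("whiteboard_sketch.png")       # e.g. a photo of a UI mock-up

response = model.generate_content(
    [sketch, "Write Python code that implements the app flow shown in this sketch."]
)

print(response.text)  # the generated Python code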

According to Google, this brings us closer to AI that acts less like software and more like an expert assistant.

“Multimodality has the power to create more human-like experiences that can better take advantage of the range of senses we use as humans, such as sight, speech and hearing,” says Jennifer Marsman, principal engineer in the office of Microsoft’s chief technology officer, Kevin Scott.
