Noam Rosen, EMEA Director of HPC & AI at Lenovo ISG, unpacks the role of liquid cooling in helping data centre operators meet the growing demands of AI.

With businesses racing to harness the potential of generative artificial intelligence (AI), the energy requirements of the technology have come into sharp focus for organisations around the world. 

Training and building generative AI models requires not only a huge amount of power, but also dense computational resources packed into a small space, generating heat. 

The Graphics Processing Units (GPUs) used to deliver such technology are highly energy intensive, and as generative AI becomes more ubiquitous, data centres will need more power and generate ever more heat. For businesses hoping to reap the rewards of generative AI, the need for new solutions to cool data centres is becoming urgent. 

Air cooling is no longer enough

Energy-intensive GPUs that power AI platforms require five to ten times more energy than Central Processing Units (CPUs) because of their larger number of transistors. This is already impacting data centres. 

There are also new, cost-effective design methodologies incorporating features such as 3D silicon stacking, which allows GPU manufacturers to pack more components into a smaller footprint. This again increases power density, meaning data centres need more energy and generate more heat. 

Another trend running in parallel is a steady fall in TCase (case temperature) in the latest chips. TCase is the maximum safe temperature for the surface of chips such as GPUs. It is a limit set by the manufacturer to ensure the chip runs smoothly and does not overheat or require throttling, which impacts performance. On newer chips, TCase is coming down from 90 to 100 degrees Celsius to 70 or 80 degrees, or even lower. This is further driving the demand for new ways to cool GPUs. 

As a result of these factors, air cooling is no longer doing the job when it comes to AI. It is not just the power of the components, but the density of those components in the data centre. Unless servers become three times bigger than they were before, data centres need a way to remove heat more efficiently. That requires special handling, and liquid cooling will be essential to support the mainstream roll-out of AI. 

The dawn of liquid cooling

Liquid cooling is growing in popularity. Public research institutions were amongst the first users, because they usually request the latest and greatest in data centre technology to drive high performance computing (HPC) and AI, and they tend to have fewer fears around the risk of adopting new technology. 

Enterprise customers are more risk averse. They need to make sure what they deploy will immediately provide return on investment. We are now seeing more and more financial institutions – often conservative due to regulatory requirements – adopt the technology, alongside the automotive industry. 

The latter are big users of HPC systems to develop new cars. Service providers running colocation data centres are also adopting the technology. Generative AI has huge power requirements that most enterprises cannot fulfil on their own premises, so they need to go to a colocation data centre, to service providers that can deliver those computational resources. Those service providers are now transitioning to new GPU architectures, and to liquid cooling. By deploying liquid cooling, they can be much more efficient in their operations. 

Cooling the perimeter

Liquid cooling delivers results both within individual servers and across the wider data centre. By transitioning from a server with fans to a server with liquid cooling, businesses can significantly reduce energy consumption. 

But this is only at device level. Perimeter cooling – removing heat from the data centre as a whole – requires additional energy. That can mean a data centre puts only two thirds of the energy it consumes towards computing, the task it was designed to do. The rest is used to keep the data centre cool.

Power usage effectiveness (PUE) measures how efficient a data centre is: the power required to run the whole facility, including the cooling systems, divided by the power consumed by the IT equipment alone. Data centres optimised with liquid cooling are achieving PUE figures of 1.1, and some even 1.04, which means very little energy is spent on anything other than computing. That’s before we even consider the opportunity to take the hot liquid or water coming out of the racks and reuse that heat to do something useful, such as heating the building in the winter, which we see some customers doing today. 
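As a rough illustration of the calculation (the figures below are hypothetical, chosen only to show how an air-cooled facility spending roughly a third of its energy on cooling compares with the 1.1 or 1.04 achievable with liquid cooling):

```python
def pue(total_facility_power_kw: float, it_equipment_power_kw: float) -> float:
    """Power usage effectiveness: total facility power divided by IT equipment power."""
    return total_facility_power_kw / it_equipment_power_kw

# Hypothetical example: 1,000 kW of IT load in an air-cooled facility
# that draws 1,500 kW in total (about a third of the energy goes to cooling).
print(pue(1500, 1000))   # 1.5

# The same IT load in a liquid-cooled facility drawing 1,100 kW overall.
print(pue(1100, 1000))   # 1.1
```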

Density is also very important. Liquid cooling allows a great deal of equipment to be packed in at high rack density. With liquid cooling, we can fully populate those racks and use less data centre space overall: less real estate, which is going to be very important for AI.

An essential tool

With generative AI’s energy demands set to grow, liquid-cooled systems will become an essential tool to deliver energy-efficient AI today, and to scale towards future advancements. Air cooling is simply no longer up to the job in the era of energy-hungry generative AI. 

The emergence of generative AI has put the power demands of data centres under the spotlight in an unprecedented way. For business leaders, this is an opportunity to act proactively and embrace new technology to meet the challenge. 
