Evolution of liquid cooling systems

AI Data Centre Liquid Cooling: Performance, Reliability and Safe Maintenance

By Emiliano Torresi - Product Management & Marketing Director, Faster Srl

Liquid cooling is quickly establishing itself as a core element of data centre infrastructure, as AI workloads push thermal requirements beyond traditional design assumptions. Whether in new builds or retrofitted environments, liquid-based systems offer a direct and efficient way to remove heat at the source, supporting higher performance, improved energy efficiency, and tighter temperature control in increasingly dense compute environments. This shift is not happening in isolation. It reflects a broader transformation in how data centres are designed and operated, where thermal management is becoming closely integrated with system architecture and long-term operational strategy.

However, adopting liquid cooling is not only a thermal decision. It is an operational one. As fluid systems move closer to compute, they become part of the infrastructure layer, subject to the same expectations of uptime, reliability, and risk control as any other critical system. For operators, the question is no longer whether liquid cooling will be deployed, but how it will behave under real operating conditions, across different workloads and usage scenarios.

From Thermal Management to System Reliability

Direct-to-chip liquid cooling has changed how heat is managed. By moving from room-level to component-level cooling, it improves efficiency but also introduces tighter tolerances across the system. Cooling is no longer peripheral. It is embedded within the IT hardware. This shift increases the importance of every connection point within the cooling loop. Interfaces between servers, racks, and distribution systems must operate consistently to maintain system integrity and uptime. What were once considered standard mechanical components now play a more critical role in overall reliability. Maintenance also changes. What was once a routine task becomes an operation that must be carried out without risk. The data centre becomes a continuity-critical environment, where even minor issues can lead to service interruptions. Designing for this context requires balancing performance, safety, and serviceability from the start.

Performance Under Increasing Density

AI-driven infrastructure is placing new demands on cooling systems. The challenge is not only removing heat, but doing so consistently across dynamic workloads and increasingly dense configurations. Flow rate, pressure stability, and thermal uniformity all become critical factors. Any restriction or inefficiency within the cooling loop can affect overall system performance and reduce the effectiveness of the cooling strategy. For this reason, system design is increasingly supported by simulation tools such as Computational Fluid Dynamics (CFD), which allow engineers to analyse flow behaviour and optimise layouts before deployment. This includes not only the main cooling architecture, but also the performance of connection points and interfaces, which can influence flow distribution and efficiency. At the infrastructure level, higher flow requirements extend to distribution systems and coolant distribution units (CDUs). Components must be capable of handling increased volumes of coolant while maintaining stable and efficient operation across the entire system, even as demand fluctuates over time.
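To make the link between heat load and flow rate concrete, the steady-state heat balance Q = m_dot x c_p x dT can be sketched in a few lines of Python. This is a rough illustration, not a sizing method: the function name is hypothetical, the fluid properties are approximate values for plain water near room temperature (water-glycol mixtures differ), and the 100 kW load and 10 K temperature rise are illustrative assumptions rather than figures from any specific system.

```python
def required_flow_lpm(heat_load_w, delta_t_k,
                      density_kg_m3=997.0, cp_j_kg_k=4180.0):
    """Estimate the coolant flow (litres per minute) needed to carry
    heat_load_w watts with a coolant temperature rise of delta_t_k kelvin.

    Defaults are approximate properties of plain water (an assumption).
    """
    if heat_load_w < 0 or delta_t_k <= 0:
        raise ValueError("heat load must be >= 0 and temperature rise > 0")
    # Heat balance: Q = m_dot * c_p * dT  ->  m_dot = Q / (c_p * dT)
    mass_flow_kg_s = heat_load_w / (cp_j_kg_k * delta_t_k)
    # Convert mass flow to volumetric flow, then m^3/s to L/min
    vol_flow_m3_s = mass_flow_kg_s / density_kg_m3
    return vol_flow_m3_s * 1000.0 * 60.0


if __name__ == "__main__":
    # Hypothetical 100 kW load with a 10 K coolant temperature rise
    print(f"{required_flow_lpm(100_000, 10):.1f} L/min")  # about 144 L/min
```

Under these assumptions, halving the allowed temperature rise doubles the required flow, which is one reason higher densities place more demand on distribution systems and connection points.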

Safety as a Design Principle

If performance defines capability, safety defines viability. Liquid cooling systems in data centres are built around a strict requirement: zero leakage. Even minimal fluid escape can damage electronic components and disrupt operations. This makes sealing performance a critical aspect of system design, particularly at connection points within the cooling loop. Validation standards in this context are significantly stricter than in traditional industrial applications. Helium leak testing, which uses helium as a tracer gas, is used to detect extremely small defects and to verify sealing performance at very low leak-rate thresholds. Surface finishing is equally important, with tight control of roughness parameters such as Ra (arithmetic mean roughness) and Rz (maximum profile height) to avoid micro-leak paths. Material selection and manufacturing processes must also support consistent performance over time, reducing the risk of failure in sensitive environments and ensuring long-term reliability.
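For readers less familiar with the roughness parameters mentioned above, the following Python sketch shows how Ra and Rz can be estimated from a measured height profile. It is a simplified illustration under stated assumptions: the function name is hypothetical, the mean line is taken as the plain average of the profile with no filtering, and Rz is computed as the peak-to-valley height averaged over five equal sampling lengths. Real metrology software applies the filtering and evaluation rules of the relevant surface texture standards.

```python
from statistics import fmean


def roughness_ra_rz(heights_um, n_sampling_lengths=5):
    """Return (Ra, Rz) in micrometres for a list of profile heights.

    Simplified: no profile filtering, mean line = arithmetic mean.
    """
    if n_sampling_lengths < 1 or len(heights_um) < n_sampling_lengths:
        raise ValueError("profile too short for the requested sampling lengths")
    mean_line = fmean(heights_um)
    deviations = [z - mean_line for z in heights_um]
    # Ra: arithmetic mean of the absolute deviations from the mean line
    ra = fmean([abs(d) for d in deviations])
    # Rz: peak-to-valley height per sampling length, averaged over all of them
    seg = len(deviations) // n_sampling_lengths  # trailing remainder is ignored
    per_segment = []
    for i in range(n_sampling_lengths):
        chunk = deviations[i * seg:(i + 1) * seg]
        per_segment.append(max(chunk) - min(chunk))
    rz = fmean(per_segment)
    return ra, rz


if __name__ == "__main__":
    # Hypothetical profile alternating between +0.2 and -0.2 micrometres
    ra, rz = roughness_ra_rz([0.2, -0.2] * 10)
    print(f"Ra = {ra:.2f} um, Rz = {rz:.2f} um")
```

Ra averages out isolated scratches, while Rz is sensitive to them, which is why the two are typically controlled together on sealing surfaces.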

Serviceability in Always-On Environments

Data centres are designed to operate continuously, which means infrastructure must support intervention without disruption. This makes serviceability a key requirement in liquid-cooled environments. Systems must allow components to be connected and disconnected safely, without introducing risk or downtime. The ability to isolate and access specific parts of the cooling loop is essential for maintenance, upgrades, and repairs. This requires connection solutions, such as fluid couplings, that make each connection and disconnection controlled, repeatable, and free of leakage. As rack density increases and system complexity grows, the importance of consistent behaviour during maintenance becomes more pronounced. Reliability must extend beyond normal operation to include every interaction with the infrastructure, particularly during planned or unplanned interventions.

Standardisation and Interoperability

As liquid cooling adoption grows, standardisation is becoming increasingly important. Industry initiatives such as the Open Compute Project (OCP) are defining shared specifications for liquid cooling systems, helping to ensure compatibility across vendors and simplify integration. For operators, this is particularly relevant in environments that evolve over time. Many facilities are being upgraded incrementally, rather than replaced entirely. In this context, interoperability and retrofit capability become key considerations. Components must integrate with existing infrastructure while supporting future scalability, reducing complexity and enabling smoother transitions as technology requirements change.

Precision Manufacturing and Operational Outcomes

In high-density environments, reliability depends on consistency at the smallest scale. For couplings, manufacturing precision is critical. Surface defects or handling damage can compromise sealing performance. To address this, production processes focus on maintaining component integrity throughout the entire lifecycle. Controlled handling prevents damage during machining, cleaning, and storage, reducing the risk of defects affecting performance. As systems become more complex, acceptable variation decreases. Precision is no longer optional.

Bridging Industrial Expertise and Digital Infrastructure

Liquid cooling combines multiple engineering disciplines, including fluid dynamics, sealing technology, and material science. Many of these capabilities come from sectors such as hydraulics and industrial engineering. In data centres, however, the margin for error is much smaller. Systems must operate reliably under conditions where even minor deviations can have consequences. This requires a combination of mechanical reliability, precise manufacturing, and a deep understanding of operational requirements. Companies with experience in fluid systems, such as Faster Srl, are increasingly contributing to this transition by adapting established engineering principles to the specific demands of digital infrastructure.

Conclusion: Designing for AI-Ready Infrastructure

Liquid cooling is becoming a defining element of modern data centre design. Its effectiveness depends not only on thermal performance, but on how well systems operate under real conditions—across installation, operation, and maintenance. As AI continues to drive higher density and greater complexity, the focus is shifting towards integrated system design, where performance, safety, and serviceability are considered together. For operators, the challenge is not simply adopting new technologies, but ensuring they work reliably within existing and evolving infrastructure. This requires careful attention to every part of the system, from high-level architecture to individual components and interfaces. In this context, reliability is not defined by a single element, but by how well the entire system performs over time.