Clockwork launches FleetIQ to dramatically improve AI training
Briefly

"In large-scale AI training, billions in computing power are currently lost due to communication problems between processors. Clockwork tackles this problem with FleetIQ, a Software-Driven Fabric that provides real-time insight into GPU clusters. The system detects bottlenecks within microseconds and automatically redirects traffic via alternative routes. In addition, stateful fault tolerance prevents entire AI jobs from having to be restarted after a failure."
"AI training has become a communication problem. Whereas pure computing power used to be the bottleneck, the challenge now lies in synchronizing thousands of GPUs in a cluster. If one connection falters, the entire system comes to a standstill. This negates the billions of dollars in combined hardware costs and power consumption. GPU clusters only achieve 30 to 55 percent of their theoretical performance."
FleetIQ is a software layer that increases GPU cluster efficiency by providing a Software-Driven Fabric with real-time observability. The system detects bottlenecks within microseconds and automatically redirects traffic via alternative routes, and its stateful fault tolerance avoids restarting entire AI jobs after a failure. FleetIQ is hardware-agnostic: it supports Nvidia and AMD processors, InfiniBand and Ethernet, and both on-premises and cloud deployments. GPU clusters currently achieve only 30 to 55 percent of their theoretical performance, which at large scale amounts to billions of dollars in wasted capacity. Early adopters such as Uber have cut problem diagnosis from hours to minutes using FleetIQ's observability tooling.
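To put the 30 to 55 percent figure in perspective, the short Python sketch below works out how much of a cluster's theoretical capacity sits idle at each end of that range. The cluster cost is a hypothetical round number chosen purely for illustration and does not come from the article; the function name `idle_share` is likewise made up for this example.

```python
def idle_share(utilization: float) -> float:
    """Return the fraction of theoretical capacity left unused."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return 1.0 - utilization


# Hypothetical spend, assumed for illustration only (not from the article).
ASSUMED_CLUSTER_COST_USD = 1_000_000_000

# The two ends of the utilization range quoted in the article.
for utilization in (0.30, 0.55):
    idle = idle_share(utilization)
    idle_cost_millions = idle * ASSUMED_CLUSTER_COST_USD / 1e6
    print(
        f"{utilization:.0%} utilized -> {idle:.0%} idle, "
        f"roughly ${idle_cost_millions:,.0f}M of the assumed spend"
    )
```

Under these assumptions, between 45 and 70 percent of the capacity paid for goes unused, which is how the losses reach the billions once spending is summed across many large clusters.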
Read at Techzine Global