Thinking Machines wants to make AI more predictable
Briefly

"Because additions with floating point numbers are not associative, the order of calculations can cause small differences. Combined with the fact that GPUs execute thousands of threads in parallel and the order of execution is not always the same, this seemed to be the logical explanation. However, the new research shows that this picture is not entirely accurate. Many GPU kernels do indeed deliver bit-identical results when executed multiple times with the same input."
"This means that the outcome of a calculation for a single input can change depending on the batch size in which that input is processed or the number of other requests running simultaneously on the server. Three core components of transformer architectures appear to be sensitive: RMSNorm, matrix multiplication, and attention. The way these operations are optimized for performance means that the calculation order can change with different batch sizes, which in turn leads to minor rounding differences that ultimately become visible in the output."
Thinking Machines Lab, founded by Mira Murati, targets unpredictability in AI model outputs by addressing GPU-induced randomness. Earlier explanations pointed to floating-point rounding and non-associative addition combined with unpredictable thread ordering on GPUs. Many GPU kernels, however, produce bit-identical outputs when run repeatedly with the same input. The primary source of variability is a lack of batch invariance: the result for a single input changes with the batch size it is processed in or with the number of concurrent requests on the server. The sensitive operations in transformer architectures are RMSNorm, matrix multiplication, and attention. Rewriting these operations so that the reduction and addition order stays the same across batch sizes eliminates the small numerical differences and yields truly deterministic results.
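To illustrate the batch-invariance idea in the simplest possible terms, the sketch below fixes the chunking of a row reduction once, independently of batch size, so that a given input row produces bit-identical results whether it is processed alone or inside a larger batch. This is a NumPy illustration of the principle only, not Thinking Machines' GPU implementation:

```python
# Minimal sketch of a batch-invariant reduction: every row is reduced
# with the same fixed tiling, so the result does not depend on how many
# other rows happen to be in the batch.
import numpy as np

FIXED_CHUNK = 256  # chosen once, never derived from the batch size

def batch_invariant_row_sum(batch):
    # `batch` has shape (batch_size, dim); each row is reduced with the
    # same chunk size, so the addition order per row never changes.
    out = []
    for row in batch:
        partial = np.float32(0.0)
        for i in range(0, row.shape[0], FIXED_CHUNK):
            partial += row[i:i + FIXED_CHUNK].sum(dtype=np.float32)
        out.append(partial)
    return np.array(out, dtype=np.float32)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

alone = batch_invariant_row_sum(x[None, :])[0]        # batch of 1
in_batch = batch_invariant_row_sum(np.stack([x] * 8))[0]  # batch of 8

print(alone == in_batch)  # True: same row, same reduction order, same bits
```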
Read at Techzine Global