
"Using this technology, the GPU giant can now serve massive trillion parameter large language models (LLMs) at hundreds or even thousands of tokens a second per user, Ian Buck, VP of Hyperscale and HPC at Nvidia told press ahead of Huang's keynote Sunday."
"With up to 50 petaFLOPs each, Nvidia's newly announced Rubin GPUs aren't hurting for compute, but with 22 TB/s of HBM4 memory bandwidth, Groq's latest chip tech is nearly 7x faster, achieving 150 TB/s apiece. This makes Groq's LPU an ideal decode accelerator."
"By combining its GPUs with Groq's LPUs, Nvidia wagers inference providers will be able to charge as much as $45 per million tokens generated. To put that in perspective, OpenAI currently charges about $15 per million output tokens for API access to its top GPT-5.4 model."
Nvidia announced the integration of Groq's language processing units (LPUs) into its Vera Rubin rack systems to boost inference performance for trillion-parameter large language models. LPUs excel at ultra-low-latency inference, a capability previously dominated by specialized chip makers such as Cerebras and SambaNova. The combined GPU-LPU architecture leverages complementary strengths: Nvidia's Rubin GPUs handle compute-intensive prompt processing at up to 50 petaFLOPs, while Groq's LPUs accelerate the decode phase with 150 TB/s of memory bandwidth, nearly 7x Rubin's 22 TB/s. This hybrid approach lets inference providers charge premium rates of roughly $45 per million tokens, well above OpenAI's current $15 per million output tokens. Each LPX rack houses 256 LPUs and connects to Vera Rubin systems via a custom Spectrum-X interconnect.
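Why memory bandwidth dominates the decode phase: generating each token requires streaming the model's weights from memory, so the per-user throughput ceiling is roughly bandwidth divided by model size in bytes. Below is a minimal back-of-envelope sketch in Python, assuming a dense trillion-parameter model in 4-bit precision and ideal, fully bandwidth-bound decode; all figures are illustrative assumptions, not vendor benchmarks.

# Back-of-envelope decode ceiling: tokens/s = memory bandwidth / model bytes.
# Assumes (hypothetically) a dense model whose full weights are streamed
# once per generated token, with no overlap losses or interconnect overhead.
TB = 1e12  # bytes per terabyte

def decode_tokens_per_sec(mem_bw_tb_s: float, params: float, bytes_per_param: float) -> float:
    """Upper bound on tokens/s when every token re-reads all weights."""
    model_bytes = params * bytes_per_param
    return (mem_bw_tb_s * TB) / model_bytes

params = 1e12          # trillion-parameter LLM
bytes_per_param = 0.5  # 4-bit quantized weights

for name, bw in [("Rubin GPU, 22 TB/s", 22.0), ("Groq LPU, 150 TB/s", 150.0)]:
    print(f"{name}: ~{decode_tokens_per_sec(bw, params, bytes_per_param):.0f} tokens/s ceiling")

Under these assumptions the dense-model ceiling works out to roughly 44 tokens/s on Rubin versus 300 on an LPU, mirroring the 7x bandwidth gap; mixture-of-experts models stream only their active experts per token, which is how per-user rates can climb into the thousands.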
#llm-inference-acceleration #gpu-lpu-hybrid-architecture #language-processing-units #token-generation-performance #ai-infrastructure
Read at The Register