#llm-inference

from InfoQ
1 day ago

NVIDIA Dynamo Planner Brings SLO-Driven Automation to Multi-Node LLM Inference

The new capabilities center on two integrated components: the Dynamo Planner Profiler and the SLO-based Dynamo Planner. Together they address the "rate matching" challenge in disaggregated serving, where inference workloads are split so that prefill operations, which process the input context, and decode operations, which generate output tokens, run on separate GPU pools. Without such tooling, teams spend considerable time determining the optimal GPU allocation for each phase.
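To make the rate-matching idea concrete, here is a minimal Python sketch of sizing the two pools so the prefill pool's token throughput keeps pace with the decode pool at a target request rate. The function name and the per-GPU throughput figures are illustrative assumptions, not measured Dynamo Planner values or its API.

```python
# Hypothetical rate-matching sketch: choose prefill/decode GPU counts so both
# pools can sustain the same request rate. Throughput numbers are assumed,
# not measured Dynamo values.
from math import ceil

def plan_gpu_split(target_rps: float,
                   avg_prompt_tokens: int,
                   avg_output_tokens: int,
                   prefill_tok_per_gpu: float = 40_000.0,  # assumed prefill tokens/s per GPU
                   decode_tok_per_gpu: float = 4_000.0):   # assumed decode tokens/s per GPU
    """Return (prefill_gpus, decode_gpus) sized for the same request rate."""
    prefill_demand = target_rps * avg_prompt_tokens   # prompt tokens/s to ingest
    decode_demand = target_rps * avg_output_tokens    # output tokens/s to generate
    prefill_gpus = ceil(prefill_demand / prefill_tok_per_gpu)
    decode_gpus = ceil(decode_demand / decode_tok_per_gpu)
    return prefill_gpus, decode_gpus

# Example: 20 req/s, 2,000-token prompts, 300-token completions
print(plan_gpu_split(20, 2_000, 300))  # -> (1, 2) with the assumed throughputs
```

An SLO-driven planner would refine this by profiling real per-GPU throughput at the target latency targets rather than using fixed constants.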
Artificial intelligence
from The Register
3 weeks ago

Nvidia says DGX Spark is now 2.5x faster than at launch

Nvidia's DGX Spark and other GB10-based systems gain significant software-driven performance improvements and broader software integrations, notably boosting prefill compute performance for generative AI workflows.
Python
from PyImageSearch
3 months ago

Introduction to KV Cache Optimization Using Grouped Query Attention - PyImageSearch

Grouped Query Attention shrinks the KV cache by letting multiple query heads share a smaller number of KV heads, cutting memory use with minimal accuracy loss.
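A quick back-of-the-envelope sketch shows where the savings come from: the KV cache scales with the number of KV heads, so sharing KV heads across query heads reduces it proportionally. The model dimensions below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Illustrative KV-cache sizing: grouped-query attention stores KV tensors for
# fewer heads than it has query heads, so the cache shrinks by n_query/n_kv.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_el=2):
    # 2x for keys and values; bytes_per_el=2 assumes fp16/bf16 storage
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_el

# Hypothetical 32-layer model, 32 query heads, head_dim 128, 8k context, batch 8
mha = kv_cache_bytes(32, 32, 128, 8192, 8)  # full multi-head KV cache
gqa = kv_cache_bytes(32, 8, 128, 8192, 8)   # 8 KV heads shared by 32 query heads
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB ({mha / gqa:.0f}x smaller)")
```

With these assumed dimensions, going from 32 KV heads to 8 cuts the cache from 32 GiB to 8 GiB per batch.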
from The Register
3 months ago

DGX Spark Nvidia's desktop supercomputer: first look

But the machine is far from the fastest GPU in Nvidia's lineup. It's not going to beat out an RTX 5090 in large language model (LLM) inference, fine tuning, or even image generation - never mind gaming. What the DGX Spark, and the slew of GB10-based systems hitting the market tomorrow, can do is run models the 5090 or any other consumer graphics card on the market today simply can't.
Artificial intelligence
from InfoQ
4 months ago

Disaggregation in Large Language Models: The Next Evolution in AI Infrastructure

Disaggregated serving separates LLM prefill and decode onto specialized hardware pools, improving throughput, reducing latency variance, and cutting infrastructure costs through better hardware allocation.
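The request flow behind that split can be sketched in a few lines: a compute-bound prefill worker processes the prompt and hands its KV cache to a memory-bandwidth-bound decode worker that streams tokens. Every class and function below is a hypothetical toy, not a real serving framework's API.

```python
# Toy sketch of disaggregated serving: prefill and decode run on separate
# worker pools, with the prefill phase's KV cache handed off to decode.
from dataclasses import dataclass

@dataclass
class PrefillResult:
    kv_cache_ref: str   # handle to the KV cache, e.g. moved over NVLink/RDMA
    first_token: str

class PrefillWorker:
    def run(self, prompt: str) -> PrefillResult:
        # Processes the whole prompt in one compute-bound pass.
        return PrefillResult(kv_cache_ref=f"kv://{hash(prompt) & 0xffff:x}",
                             first_token="The")

class DecodeWorker:
    def stream(self, result: PrefillResult, max_tokens: int = 3):
        # Bandwidth-bound loop: one token per step, reusing the KV cache.
        token = result.first_token
        for _ in range(max_tokens):
            yield token
            token = "<next>"  # stand-in for the next sampled token

def serve(prompt: str) -> list[str]:
    prefill_pool, decode_pool = PrefillWorker(), DecodeWorker()
    handoff = prefill_pool.run(prompt)         # phase 1: prompt processing
    return list(decode_pool.stream(handoff))   # phase 2: token generation

print(serve("Explain disaggregated serving in one sentence."))
```

Because the two phases stress hardware differently, each pool can be scaled and provisioned independently, which is where the cost and latency benefits come from.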
Scala
from HackerNoon
1 year ago

Related Work: vAttention in LLM Inference Optimization Landscape | HackerNoon

Optimizing LLM inference, including KV-cache management approaches such as vAttention, is essential for reducing latency and improving throughput in AI applications.