#vllm

from Hackernoon
1 month ago

KV-Cache Fragmentation in LLM Serving & PagedAttention Solution | HackerNoon

Reserving KV-cache memory up front wastes it even when context lengths are known in advance, exposing the inefficiency of contiguous allocation strategies in production serving systems.
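The contrast is easy to see in a toy model. The sketch below is illustrative only (not vLLM's allocator): it compares how many KV-cache slots sit idle mid-generation under up-front contiguous reservation versus paged, block-at-a-time allocation in the style of PagedAttention. The block size of 16 matches vLLM's default; the request sizes are made up.

```python
# Toy comparison of idle KV-cache slots mid-generation (illustrative only).

BLOCK_SIZE = 16  # tokens per KV-cache block; vLLM's default block size

def reserved_idle_slots(final_len: int, generated: int) -> int:
    """Contiguous reservation: all final_len slots are claimed at admission,
    so slots for tokens not yet generated sit idle until the request ends."""
    return final_len - generated

def paged_idle_slots(generated: int) -> int:
    """Paged allocation: blocks are allocated on demand, so only the unfilled
    tail of the last block is idle (fewer than BLOCK_SIZE slots)."""
    blocks = -(-generated // BLOCK_SIZE)  # ceiling division
    return blocks * BLOCK_SIZE - generated

# Snapshot of a 2048-token request after 300 tokens have been produced.
print(reserved_idle_slots(final_len=2048, generated=300))  # 1748 idle slots
print(paged_idle_slots(generated=300))                     # 4 idle slots
```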
from Techzine Global
4 months ago

Microsoft expands AKS with RAG functionality and vLLM support

Microsoft enhances Azure Kubernetes Service with RAG support in KAITO, giving developers retrieval-augmented search capabilities.
The vLLM serving engine speeds up model inference workloads on AKS.
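For context, the snippet below is a minimal sketch of driving the vLLM engine through its offline Python API; the model name is an arbitrary example, and this is plain vLLM rather than anything AKS- or KAITO-specific.

```python
from vllm import LLM, SamplingParams

# Load a model into the vLLM engine (model name is an illustrative choice).
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Batched generation: vLLM schedules and batches requests internally.
for output in llm.generate(["Summarize retrieval-augmented generation."], params):
    print(output.outputs[0].text)
```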
from Hackernoon
1 year ago

The Distributed Execution of vLLM | HackerNoon

Large language models often exceed single-GPU memory limits, requiring distributed execution techniques that shard the model and its KV cache across devices.
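As a rough sketch of what that looks like in practice, vLLM exposes tensor parallelism through a single constructor argument; the example below assumes a node with four GPUs and an illustrative model choice.

```python
from vllm import LLM

# Shard the model's weights across 4 GPUs via tensor parallelism, letting a
# model that exceeds one device's memory be served from a single node.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # illustrative model choice
    tensor_parallel_size=4,             # number of GPUs to shard across
)
```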