In a previous post, Calculating LLMs GPU Sizing, I discussed the rule of thumb used to calculate vRAM requirements for LLM (Large Language Model) systems. Building on that, in this post I explain what happens during inference in decoder-only models, as a basis for further GPU performance optimization.
LLM inference for decoder-only, transformer-based models consists of two main stages: the prefill stage and the decode stage. The prefill stage sets things up for token generation and determines TTFT (Time to First Token). The decode stage is responsible for generating the output tokens and determines TPS (Tokens Per Second).
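To make these two metrics concrete, here is a minimal sketch of how TTFT and TPS can be measured against an OpenAI-compatible streaming endpoint. The base URL, API key, and model name are placeholder assumptions, and counting one token per streamed chunk is an approximation.

```python
# Minimal sketch: measuring TTFT and TPS from a streaming chat completion.
# Assumes an OpenAI-compatible endpoint; base_url and model are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
n_tokens = 0

stream = client.chat.completions.create(
    model="my-llm",  # hypothetical model name
    messages=[{"role": "user", "content": "Summarize the attention mechanism."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if not delta:
        continue
    if first_token_at is None:
        first_token_at = time.perf_counter()  # prefill done, first token arrived
    n_tokens += 1  # approximation: one streamed chunk ~ one token

end = time.perf_counter()
ttft = first_token_at - start
tps = n_tokens / (end - first_token_at) if n_tokens else 0.0
print(f"TTFT: {ttft:.3f}s, decode TPS: {tps:.1f}")
```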
Prefill Stage - Prompt Processing
In this stage, the model processes the entire input prompt. It is called prefill because the LLM computes the Key-Value (KV) cache, filling in the intermediate attention states for all input tokens. This step is compute-intensive: it is limited by FLOPs rather than memory bandwidth. The KV cache is computed at once, using a single parallel forward pass over all input tokens.
The number of input tokens predominantly determines the duration of the prefill stage. This is worth keeping in mind when implementing RAG systems or persisting chat history.
The latency of the prefill stage is essentially the TTFT, which is paramount in interactive streaming applications as well as in classification or scoring scenarios where the output is a single token.
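To make the "parallel forward pass" concrete, here is a toy NumPy sketch of how the K and V projections for the whole prompt are produced in one batched matrix multiply during prefill, while decode only appends one row per generated token. The dimensions are arbitrary illustrative values, not tied to any real model.

```python
# Toy sketch of prefill vs. decode KV computation (illustrative dimensions only).
import numpy as np

d_model, n_prompt = 64, 128           # hidden size, number of prompt tokens (assumed)
rng = np.random.default_rng(0)

W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

# Prefill: embeddings of ALL prompt tokens go through one parallel matmul,
# producing the full KV cache in a single forward pass (compute-bound).
X_prompt = rng.standard_normal((n_prompt, d_model))
k_cache = X_prompt @ W_k              # (n_prompt, d_model)
v_cache = X_prompt @ W_v

# Decode: each new token appends exactly one row to the cache (memory-bound,
# since the whole cache is read at every step for attention).
x_new = rng.standard_normal((1, d_model))
k_cache = np.vstack([k_cache, x_new @ W_k])
v_cache = np.vstack([v_cache, x_new @ W_v])
print(k_cache.shape, v_cache.shape)   # (129, 64) (129, 64)
```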
Optimization Strategies for TTFT
Maximizing computational throughput by using efficient hardware is the first step; on the software side, there are other optimization strategies that can be used, such as:
KV Cache Reuse
Typically, the KV cache is computed and stored for each new, unique prompt, to be used later by the decode phase. However, when implementing system prompts or context tracking, this can cause a lot of duplicate computation. Sharing the KV cache state so it can be reused by different requests that start with the same prompt prefix can minimize latency. The size of the latency improvement depends on how similar the new request's prompt is to a previously processed prompt.
Not all models support KV cache reuse. Paged context attention in the model architecture is a requirement. [2]
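The mechanics can be pictured as a lookup keyed on token prefixes: a new request only pays prefill cost for the tokens that do not match an already cached prefix. The sketch below is a simplified illustration of that idea, not any particular server's implementation (real servers manage this at KV block granularity).

```python
# Simplified sketch of KV cache reuse via longest-prefix matching.
from typing import Dict, List, Tuple

class PrefixKVCache:
    def __init__(self) -> None:
        # Maps a token prefix to an opaque KV-cache handle
        # (placeholder: we just store the prefix length).
        self._cache: Dict[Tuple[int, ...], int] = {}

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already covered by a cached prefix."""
        best = 0
        for prefix in self._cache:
            n = len(prefix)
            if n > best and tuple(tokens[:n]) == prefix:
                best = n
        return best

    def store(self, tokens: List[int]) -> None:
        self._cache[tuple(tokens)] = len(tokens)

cache = PrefixKVCache()
system_prompt = list(range(1000))        # e.g., a long shared system prompt
cache.store(system_prompt)

request = system_prompt + [7, 8, 9]      # new request sharing the same prefix
reused = cache.longest_cached_prefix(request)
print(f"prefill only needs {len(request) - reused} of {len(request)} tokens")
```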
Chunked Prefill
Context length is a major bottleneck in the prefill stage. The larger the input prompt, the more GPU compute is needed to calculate the KV cache. Chunked prefill, where the input is divided into chunks, is used as a solution to reduce latency in the prefill stage. There is a tradeoff between the number of chunks and the latency of the decode phase: the larger the chunk size, the faster the TTFT, but decoding will take longer. Some LLM inference servers, like TensorRT-LLM, handle chunked prefill dynamically based on available resources.
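The core loop is simple: split the prompt into chunks and grow the KV cache one chunk at a time. The sketch below assumes a hypothetical engine interface (`model.new_kv_cache()` and `model.forward(chunk, kv_cache)`) standing in for whatever a real inference engine exposes.

```python
# Sketch of chunked prefill: process the prompt in chunks, growing the KV cache.
# `model.forward` and `model.new_kv_cache` are hypothetical placeholders.
from typing import List

def chunked_prefill(model, prompt_tokens: List[int], chunk_size: int = 512):
    kv_cache = model.new_kv_cache()
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        # Each chunk attends to everything already in the cache plus itself,
        # so the final cache matches what a single full prefill pass would produce.
        kv_cache = model.forward(chunk, kv_cache)
    return kv_cache  # ready for the decode stage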
Prefill and Decode Stage Disaggregation
This advanced strategy involves separating the computational resources for the prefill and decode stages. Because the prefill stage is compute-bound and the decode stage is memory-bandwidth-bound, they have different hardware requirements. By disaggregating them, you can route prefill operations to hardware with high processing power (e.g., GPUs with high TFLOPs) and decode operations to hardware with high memory bandwidth. This optimizes resource allocation, improves efficiency, and can reduce overall costs by preventing one stage from being bottlenecked by hardware better suited for the other.
On the other hand, it introduces significant overhead and complexity. Moving multi-GB KV caches between GPUs requires ultra-fast interconnects, and replicating models doubles memory use. Empirical studies warn that if the workload is not large or not well-balanced, disaggregation can hurt performance. In practice, frameworks often use conditional or hybrid schemes (only splitting when beneficial).
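At a very high level, the architecture looks like the routing sketch below: a prefill pool builds the KV cache and hands it (or a handle to it) to a decode pool. The classes and the transfer step are hypothetical placeholders for what real frameworks implement over fast interconnects.

```python
# Conceptual sketch of prefill/decode disaggregation; all classes are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class KVHandle:
    request_id: str
    location: str  # which prefill worker currently holds the KV cache

class PrefillPool:
    """Compute-heavy GPUs: run the parallel prefill pass and return a KV handle."""
    def prefill(self, request_id: str, prompt_tokens: List[int]) -> KVHandle:
        # ... run the full-prompt forward pass on a high-TFLOPs GPU ...
        return KVHandle(request_id=request_id, location="prefill-gpu-0")

class DecodePool:
    """High-memory-bandwidth GPUs: pull the KV cache and generate tokens."""
    def decode(self, handle: KVHandle, max_new_tokens: int) -> List[int]:
        # ... transfer the KV cache over a fast interconnect, then decode ...
        return []

def serve(request_id: str, prompt_tokens: List[int]) -> List[int]:
    handle = PrefillPool().prefill(request_id, prompt_tokens)
    return DecodePool().decode(handle, max_new_tokens=256)
```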
Prompt Engineering
While often seen as a tool for improving model accuracy, prompt engineering is also a powerful lever for performance optimization. The core idea is to reduce the number of input tokens without losing critical context. A shorter prompt requires less computation for the KV cache, directly resulting in a lower TTFT (see the token-count sketch after this list). This can be achieved by:
- Conciseness: Rephrasing verbose prompts into direct and efficient instructions.
- Strategic Context: Carefully selecting what context or examples (in few-shot scenarios) are necessary, as every extra token adds to the prefill latency.
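Since prefill cost scales with the number of input tokens, it helps to measure token counts directly when trimming prompts. The sketch below uses tiktoken's cl100k_base encoding purely as an example; the actual tokenizer, and therefore the counts, depends on the model being served.

```python
# Comparing prompt token counts before and after tightening the wording.
# cl100k_base is used here only as an example tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

verbose = (
    "I would really appreciate it if you could please take a moment to carefully "
    "read the following customer review and then let me know whether the overall "
    "sentiment expressed in it is positive or negative."
)
concise = "Classify the sentiment of this review as positive or negative."

print(len(enc.encode(verbose)), "tokens ->", len(enc.encode(concise)), "tokens")
```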
Using a Production LLM Server
A production-level inference server can integrate many of these optimizations out of the box. For example, NVIDIA's Triton Inference Server is an open-source server that can host TensorRT-optimized LLMs and supports features including dynamic batching, concurrent multi-GPU support, KV cache management, and flexible scheduling to serve a large number of requests at once.
However, the setup can be a bit challenging. Additionally, using this stack means you're bound to NVIDIA hardware. In summary, you'll get far better TTFT and throughput by using a production LLM server like Triton with TensorRT-LLM on NVIDIA cards. The price you pay for that performance is a more complex setup and vendor lock-in.
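As an illustration of what a request to such a server can look like, here is a sketch using the tritonclient HTTP API. The model name ("ensemble") and tensor names ("text_input", "max_tokens", "text_output") are assumptions: they must match the deployed TensorRT-LLM model configuration, which varies between setups.

```python
# Sketch of a Triton HTTP inference request; model and tensor names are assumptions
# that depend on the deployed TensorRT-LLM model configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = httpclient.InferInput("text_input", [1, 1], "BYTES")
text.set_data_from_numpy(np.array([["What is KV cache reuse?"]], dtype=object))

max_tokens = httpclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[128]], dtype=np.int32))

result = client.infer(model_name="ensemble", inputs=[text, max_tokens])
print(result.as_numpy("text_output"))
```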
What’s Next?
→ A deeper look into TPS, latency constraints, and token streaming.
