The journey of accelerated LLM inference

Written by

AI Team

Published on

Nov 17, 2023

Introduction

The pursuit of performance in Perplexity’s answer engine drives us to adopt the latest technology that NVIDIA and AWS have to offer. In this blog post, we are excited to share the results of our latest experiments: a comparison of Llama 2 70B inference across various hardware and software settings.

Our LLM inference platform, pplx-api, is built on a cutting-edge stack powered by open-source libraries. In the time since pplx-api’s public beta began in October, we’ve been tackling scaling challenges and learning how best to tune our configuration to achieve massive scale. This led us to run experiments with the following guiding questions:

  1. What is the raw performance gain from switching our GPUs from NVIDIA A100 to NVIDIA H100, all other settings remaining the same?

  2. What is the efficiency gain of 8-bit floating point (fp8) quantization, which H100 adds native support for? What is the accuracy cost of this quantization?

  3. How do tensor parallelism and batch size affect latency and token throughput?

  4. Considering the above, which configuration results in the most scalable balance of performance and cost-efficiency?

Experimental setup

We ran the following experiment as a series of local benchmarks to avoid network latency.

Key Metrics

  1. Latency: The total time it takes for the inference server to generate its full response.

  2. Throughput: The number of output tokens, per second, per GPU, that the inference server can generate across all users and requests.

Constants

The following factors would influence the key metrics, so we kept them consistent across different trials of the experiment.

AI Model

Inference speed depends on the size of the LLM. More parameters require more computation, resulting in slower inference. For example, Llama 2 13B is faster than Llama 2 70B when other settings are equal. We stick to Llama 2 70B in this experiment because we want to optimize for serving the most capable open-source models.

Input/Output Token Dataset

The number of input and output tokens in each sample request/response pair can influence performance measurements. In general, output token generation dominates overall response time. Samples that induce only “yes/no” responses from the LLM complete faster than samples that ask the LLM to write essays. Our dataset is composed of synthetic requests with 1024 input tokens inducing 512 output tokens. This distribution was chosen to match the observed distribution of traffic on our public deployment of Llama 2 70B.

Software Version

NVIDIA TensorRT-LLM (release v0.5.0) is an open-source library for optimizing LLM inference. Released in late 2023, it synthesizes NVIDIA’s many inference optimizations and provides a flexible layer of customization for the key parameters of this experiment: batch size, quantization, and tensor parallelism.

Variables

We experimented across 4 axes of configuration: tensor parallelism, GPU architecture, quantization, and max batch size. These axes are interconnected because they each represent a tradeoff with respect to the critical bottleneck resource of GPU memory.

GPU architecture

The ninth-generation Hopper (H100-HBM3-80GB / p5.48xlarge) GPU architecture adds a long list of features over its predecessor, Ampere (A100-SXM4-80GB / p4de.24xlarge), including 2x-6x computation rates and nearly 2x GPU memory bandwidth. GPU memory bandwidth is a critical metric for inference because a primary latency bottleneck of inference’s matrix multiplications comes from loading gigabytes of model data from GPU memory into compute registers. Based on these stats, we hypothesized that an apples-to-apples comparison of NVIDIA H100 and A100 would exhibit a 2x improvement in both latency and throughput.

Another key difference between the NVIDIA H100 and A100 is that the H100 Tensor Core adds native support for 8-bit floating point (fp8) instructions, which opens the door to further optimizations detailed below. This is why we test fp8, alongside fp16, only on the H100.

To keep memory-per-GPU consistent in this experiment, we stuck to nodes with 8x80GB GPU memory for both our NVIDIA A100s and H100s. In addition to enabling higher batch sizes, GPU memory is important because the model’s parameters are loaded into GPU memory during server startup for fast access. For example, if each of the 70 billion parameters in our model is a 16-bit floating point number, then the model is around 140GB in size, which does not fit on a single GPU. Hence the need for tensor parallelism, which we explain below.
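The arithmetic in the paragraph above can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch counts only the weights (no KV cache, activations, or runtime overhead), so the GPU count is a lower bound rather than a deployment recommendation.

```python
import math

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Smallest number of GPUs whose pooled memory can hold the weights alone."""
    return math.ceil(weights_gb / gpu_memory_gb)

# 70 billion parameters stored as 16-bit floats (2 bytes each).
fp16_weights_gb = weight_memory_gb(70e9, 2)
print(fp16_weights_gb)                        # 140.0
print(min_gpus_for_weights(fp16_weights_gb))  # 2
```

With 80GB per GPU, the fp16 weights alone already need at least two devices, before any memory is set aside for batching.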

Tensor Parallelism

In this post, tensor parallelism (TP) refers to the number of GPU devices across which the model is split to run the inference server. When we allocate a number of GPUs, TensorRT-LLM pools their resources together to help us reach the minimum required memory budget for running Llama 2 70B. Our hypothesis is that lower tensor parallelism will result in higher latency (due to fewer resources consumed to satisfy each batch) but higher throughput per GPU (due to better utilization) when compared to higher tensor parallelism.
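To make the idea concrete, here is a conceptual pure-Python sketch of one common form of tensor parallelism: a column-wise split of a weight matrix. It is an illustration only. This post does not describe how TensorRT-LLM shards Llama 2 70B internally, and every function name below is hypothetical. Each shard stands in for the slice of weights held by one GPU.

```python
def matmul(x, w):
    """Multiply a (rows x inner) matrix by an (inner x cols) matrix of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split a weight matrix into `parts` equal column blocks, one per device."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly across devices"
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def column_parallel_matmul(x, w, tp):
    """Compute x @ w by giving each of `tp` simulated devices one column block."""
    shards = split_columns(w, tp)
    # On real hardware each of these products would run on a different GPU.
    partials = [matmul(x, shard) for shard in shards]
    # Gather step: stitch the per-device column blocks back together row by row.
    result = []
    for r in range(len(x)):
        row = []
        for partial in partials:
            row.extend(partial[r])
        result.append(row)
    return result

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, tp=2) == matmul(x, w))  # True
```

Each simulated device stores only 1/tp of the weight matrix, which is how pooling GPUs lowers the memory needed per device. The gather step is a stand-in for the cross-GPU communication that real tensor parallelism has to pay for.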

Quantization

Quantization is the reduction of precision in the weights and activations used by neural networks. We use this technique to halve the GPU memory consumption when we switch from fp16 to fp8. This makes it possible to run the same model with lower total GPU memory usage, enabling lower tensor parallelism, which drives up throughput.
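As a rough illustration of what reducing precision means, the sketch below applies symmetric 8-bit integer quantization to a handful of weights. This is deliberately simpler than the fp8 and SmoothQuant schemes evaluated in this post, and it is not how TensorRT-LLM implements them. The function names and sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    # Fall back to a scale of 1.0 when every value is zero.
    scale = max(abs(v) for v in values) / 127 or 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized)  # [50, -127, 2, 100]
# The round-trip error of each weight is bounded by half a quantization step.
print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale / 2 + 1e-12)  # True
```

Each quantized weight fits in one byte instead of the two bytes used by fp16, which is the source of the memory halving described above. The round-trip error is the accuracy cost that has to be measured.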

Implementations of quantization have the potential to degrade accuracy. Thus, we evaluated accuracy for different precisions by comparing their perplexity statistic, a measure of how well the LLM predicts each next token in a sentence, on the WikiText corpus. For 8-bit floating point and 8-bit weight with 8-bit activation and SmoothQuant (w8a8 SQ), there was no significant change in perplexity (< 1%) compared to fp16 on WikiText, so we felt confident proceeding. However, 4-bit weight with 16-bit activation (w4a16) exhibited a substantial 7% change in perplexity, potentially attributable to the even lower precision and the necessary dynamic conversions between int4 and fp16.
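Perplexity is conventionally computed as the exponential of the average negative log-likelihood per token. The sketch below assumes you already have per-token natural-log probabilities from the model. The function names and sample inputs are hypothetical and are not the measurements from this experiment.

```python
import math

def perplexity(token_log_probs):
    """Exponential of the mean negative log-probability over the evaluated tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_change(baseline, candidate):
    """Fractional change of a candidate perplexity relative to a baseline."""
    return (candidate - baseline) / baseline

# A model that assigns probability 0.25 to every correct token is as uncertain
# as a uniform choice among 4 options, so its perplexity is approximately 4.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

Comparing precisions then amounts to calling `relative_change` with the fp16 perplexity as the baseline and the quantized model's perplexity as the candidate, both measured on the same text.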

Batch Size

Parallelism via batching is a classic strategy to squeeze performance out of a resource-constrained system. By processing multiple requests in each forward pass through the neural network, batching is known to increase throughput at the cost of some latency. Batching also incurs higher GPU memory consumption because the size of the KV cache, which stores the attention keys and values of previous tokens, grows linearly with the batch size.

In the case of Llama 2 70B (which has 80 layers) at fp16 with batch size 32 and a 4096-token context, the size of the KV cache comes out to a substantial 40 GB. This ends up preventing Llama 2 70B fp16, whose weights alone take up 140GB, from comfortably fitting into the 160GB GPU memory available at tensor parallelism 2 (TP-2).
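The 40 GB figure can be reproduced with a back-of-the-envelope calculation. The 80 layers, batch size, context length, and fp16 precision come from the paragraph above. The 8 key/value heads (grouped-query attention) and head dimension of 128 are assumptions taken from Llama 2 70B's published model configuration, not from this post. The function name is hypothetical.

```python
def kv_cache_bytes(batch_size, context_len, num_layers=80,
                   num_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for every token of every request."""
    # Each layer stores 2 tensors (keys and values) per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * context_len * per_token

cache = kv_cache_bytes(batch_size=32, context_len=4096)
print(cache / 2**30)  # 40.0 (GiB)

# fp16 weights (70e9 parameters * 2 bytes) plus the cache, against 2 x 80GB GPUs.
weights = 70e9 * 2
print((weights + cache) / 1e9 > 160)  # True: does not fit at TP-2
```

Under these assumptions the cache is exactly 40 GiB, and weights plus cache exceed the 160GB pooled at TP-2. Halving `bytes_per_value` or the batch size halves the cache, which is why quantization and batch size interact with tensor parallelism.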

Results

We swept through compatible combinations of the 4 variables of the experiment and present the most insightful trends below.

Figure 1 - The latency of requests with varying batch size across five different configurations, all with tensor parallelism 8, which yields the best latency with 8 available GPUs. Within each configuration, latency generally doubles when increasing batch size from 1 → 32, and doubles again from 32 → 128. On sufficiently large batch sizes, H100 approximately halves the latency compared to A100. A100 uses mixed precision because the architecture lacks native support for fp8. w8a8 with SmoothQuant (SQ) is meant to resemble fp8.

The latency improvement of quantization is in the neighborhood of 10% when comparing H100 fp16 → fp8 and A100 fp16 → w8a8 with SmoothQuant. However, the mixed precision w4a16 actually performs better at low batch sizes and worse at higher batch sizes compared to fp16. This may be due to a number of factors, including less optimized compute kernels, casting time between int4 and fp16, and the fact that w4a16 still uses 16-bit floating point activations, resulting in no savings in the size of the KV cache. Because w4a16 also demonstrated lower accuracy, we conclude we should stick to w8a8 SQ for A100s and fp8 for H100s.


Figure 2 - The throughput across TP-8 configurations with different architecture, quantization, and batch size. For each architecture and quantization, the batch size was chosen as the largest which honored a latency requirement of 25600ms (20 tokens per second for 512 tokens), so that we compare configurations having similar latency. Under this requirement, H100 with BS-128 reaches 228% throughput compared to A100 BS-64 using the same quantization (fp16) and even has lower response latency despite the doubled batch size. Quantization with fp8 improves this factor to 251%.

In our first two figures, we only present configurations of TP-8. H100 achieves 54% latency and 184% throughput compared to A100 when both use fp16 / BS-128 / TP-8, which improves to 49% latency and 202% throughput when using fp8 on H100. This improvement in performance can be attributed to the increases in computation power and memory bandwidth between H100 and A100. Notably, the difference is less pronounced under lower batch sizes where utilization may be lower.

As we build our platform, we want to honor certain latency requirements for our users while maximizing throughput. Thus, rather than compare A100 vs. H100 at the same batch size, it actually makes more sense for us to compare their throughput only under configurations where they satisfy a latency requirement. We set a cutoff at 25600ms latency for completion of the 512 output tokens and found that H100 / TP-8 / fp8 / BS-128 yields 251% throughput compared to A100 / TP-8 / fp16 / BS-64, since it’s able to process double the batch size at a lower latency. Given that quantization provides GPU memory savings, we now need to evaluate how tensor parallelism can add a next layer of optimization.
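The 25600ms cutoff follows from requiring at least 20 output tokens per second over a 512-token response. A minimal sketch of that check is below. The function names are hypothetical, and the two latencies tested are the H100 fp8 results quoted later in this post for TP-4 / BS-128 and TP-2 / BS-32.

```python
def latency_budget_ms(output_tokens: int, min_tokens_per_second: float) -> float:
    """Longest acceptable response time for a given per-request token rate."""
    return output_tokens * 1000 / min_tokens_per_second

def meets_budget(latency_ms: float, budget_ms: float) -> bool:
    """True when a measured response latency satisfies the requirement."""
    return latency_ms <= budget_ms

budget = latency_budget_ms(output_tokens=512, min_tokens_per_second=20)
print(budget)                       # 25600.0
print(meets_budget(26188, budget))  # False: TP-4 / BS-128 narrowly misses
print(meets_budget(18821, budget))  # True: TP-2 / BS-32 fits
```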


Figure 3 - The latency across varying batch sizes and tensor parallelism for H100 fp8. Latency generally doubles when increasing batch size from 1 → 32, and doubles again from 32 → 128. TP-2 is consistently around twice as slow as TP-8 when batch sizes are equal. Doubling the tensor parallelism while doubling the batch size keeps latency relatively stable, as the number of requests processed per GPU remains even.

When it comes to quantization and architecture, there are clear winners: H100 dominates A100, and lower-precision quantization improves memory utilization, latency, and throughput. However, batch size and tensor parallelism present a tradeoff in our key metrics. A larger batch size optimizes for throughput at the cost of increased latency and memory consumption. On the other hand, higher tensor parallelism increases the overall pool of available memory and optimizes for latency, but trades off with throughput due to synchronization costs of distributed matrix multiplication and by virtue of locking up more GPU resources.

Figure 4 - The throughput across varying batch sizes and tensor parallelism for H100 fp8. The highest throughput comes from TP-2 BS-128, at 460% compared to the baseline of A100/TP-8/fp16/BS-64. However, TP-2 BS-128 is also the slowest result in Figure 3.

The throughput-maximizing configuration of our experiment is H100 / fp8 / TP-2 / BS-128, at 767 output tokens per second per GPU. This is 460% of the throughput of A100 / fp16 / TP-8 / BS-64. However, it comes at the cost of doubled latency (closer to 42000ms for 512 output tokens), so it may be unsuitable as a production configuration. The results of TP-4 BS-128 (626 tok/sec/gpu at 26188ms response time) and TP-2 BS-32 (435 tok/sec/gpu at 18821ms response time) may represent better tradeoffs on our key metrics.
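This post does not spell out how per-GPU throughput is derived from latency. One reading that reproduces the numbers above is sketched here, under the assumption that every request in a batch generates all 512 output tokens and the whole batch finishes within the reported latency. The function names are hypothetical.

```python
def throughput_per_gpu(batch_size, output_tokens, latency_ms, tensor_parallelism):
    """Output tokens per second per GPU for one batch served by a TP group."""
    total_tokens = batch_size * output_tokens
    return total_tokens / (latency_ms / 1000) / tensor_parallelism

def implied_latency_ms(batch_size, output_tokens, tokens_per_sec_per_gpu,
                       tensor_parallelism):
    """Invert the formula: latency implied by a reported per-GPU throughput."""
    total_tokens = batch_size * output_tokens
    return total_tokens / (tokens_per_sec_per_gpu * tensor_parallelism) * 1000

print(round(throughput_per_gpu(128, 512, 26188, 4)))  # 626 (TP-4 / BS-128)
print(round(throughput_per_gpu(32, 512, 18821, 2)))   # 435 (TP-2 / BS-32)
# Latency implied by 767 tok/sec/gpu at TP-2 / BS-128, in milliseconds.
print(implied_latency_ms(128, 512, 767, 2))
```

Under this assumption, the implied latency for TP-2 / BS-128 lands between 42000ms and 43000ms, in line with the rough figure given above.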

Conclusion

Our results demonstrate that:

  • We reach 54% latency and 184% throughput using H100 compared to A100 given the same configuration, which improves to 49% and 202% respectively when H100 takes advantage of its native support for fp8.

  • When maximizing throughput subject to a latency constraint, H100 / fp8 / TP-8 / BS-128 yields 251% throughput compared to A100 / fp16 / TP-8 / BS-64, as it can process double the batch at a faster speed.

  • Taking advantage of H100 with TP-2 and fp8, we can achieve 373% of the throughput of A100 / fp16 / TP-8 / BS-128, with less than a 10% increase in latency.

  • Batch size and tensor parallelism present a tradeoff between throughput and latency to the operator of an LLM inference system.

These results make us feel confident about a full transition to H100 GPUs in our previously A100-powered hardware stack. We are excited to be able to confirm the performance gains advertised by NVIDIA and look forward to their next breakthroughs in accelerated hardware.

What’s Next

Our next frontier for optimization is to examine the accuracy and performance impact of structured sparsity and int4 precision, which could significantly reduce Llama 2 70B’s GPU memory footprint and yield up to 2x improvements in latency. We are excited to continue the journey of blazing fast inference and hope to empower our users to build powerful applications on top of our platform.

In the near term, pplx-api will be lifting rate limits and offering more custom Perplexity LLMs, including an internet-powered LLM with grounding for facts.

Sign up for Perplexity Pro at perplexity.ai/pro. Get access to our cutting-edge pplx-api, and leverage these advanced capabilities in your projects. Discover more about pplx-api on our blog.

Interested in shaping the future of AI? We’re hiring! Be part of a team driving massive-scale, generative LLM infrastructure. Explore opportunities at Perplexity Careers.

Authors
Aarash Heydari, Grigorii Alekseev, Kevin Hu, Denis Yarats