Paiton: Dramatically Faster Startup and Performance for Llama-3.1-405B
With Paiton, we’re not merely pursuing peak inference speeds; we’re fundamentally reshaping the entire lifecycle of large language model (LLM) deployment. Our latest endeavor pairs AMD’s cutting-edge MI300X GPUs with the colossal Llama-3.1-405B-Instruct-FP8-KV model, achieving groundbreaking milestones:
- Instant-On Startup: Significantly reducing cold-start delays for massive LLM deployments.
- Advanced Tensor Parallelism (TP): Dramatically enhancing inference throughput and slashing latency through sophisticated TP optimizations.
Visual Demonstration: Startup Speed Showcase
We’re excited to share a visual demonstration of Paiton’s revolutionary startup performance. Watch below how Paiton transforms a typically sluggish startup process into an agile, responsive experience. After startup, we also showcase the first inference run, giving you the whole picture from startup to first request.
Benchmarking Testbed & Methodology
Ensuring transparency and reproducibility, our benchmarking approach includes detailed specifications:
- Inference Library: vLLM v0.9.0 with amd/Llama-3.1-405B-Instruct-FP8-KV
- Hardware: 8 × AMD MI300X GPUs (192 GB HBM3 each, 1,536 GB total)
- Software Stack: ROCm 6.3.1 on Ubuntu 22.04 (notably still utilizing an older driver stack)
- Batch Size: 32, representative of realistic, interactive AI workloads
- Measurements: Averages over 10 runs, covering startup times, cold-start and steady-state TTFT, and end-to-end latency metrics (see the measurement sketch below)
Key Highlight: Paiton consistently delivers stable and reliable performance, eliminating variability common in other inference solutions.
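To make the methodology concrete, here is a minimal sketch of how TTFT and end-to-end latency can be measured against a vLLM OpenAI-compatible server with streamed completions. The endpoint URL, prompt, and request loop are illustrative assumptions, not our actual benchmark harness:

```python
import time
import requests

URL = "http://localhost:8000/v1/completions"   # assumed local vLLM server
MODEL = "amd/Llama-3.1-405B-Instruct-FP8-KV"

def measure_once(prompt: str, max_tokens: int = 128):
    """Return (ttft_s, e2e_s) for one streamed completion request."""
    payload = {"model": MODEL, "prompt": prompt,
               "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    ttft = None
    with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or line == b"data: [DONE]":
                continue
            if ttft is None:                    # first streamed token chunk
                ttft = time.perf_counter() - start
    return ttft, time.perf_counter() - start

runs = [measure_once("Explain tensor parallelism.") for _ in range(10)]
print("mean TTFT (ms):", 1000 * sum(r[0] for r in runs) / len(runs))
print("mean E2EL (ms):", 1000 * sum(r[1] for r in runs) / len(runs))
```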
LLM Startup: The Critical Bottleneck
Deploying large-scale LLMs like Llama-3.1-405B-Instruct-FP8-KV is a significant engineering challenge. Startup delays commonly arise from:
- Model Weight Loading: Transferring massive sets of parameters from storage to GPU memory.
- Graph Compilation: Transforming high-level model definitions into optimized execution plans.
- Initial Warm-up: Performing preliminary inferences to reach peak operational efficiency.
These delays directly impact scalability, developer productivity, operational cost-efficiency, and end-user experience.
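For intuition, here is a minimal sketch of timing these phases end to end with stock vLLM’s Python API, using the model and TP degree from our testbed. It measures plain wall-clock time, not Paiton’s internal instrumentation:

```python
import time
from vllm import LLM, SamplingParams

t0 = time.perf_counter()
# Engine construction covers weight loading, memory profiling, and
# graph compilation on all eight GPUs.
llm = LLM(model="amd/Llama-3.1-405B-Instruct-FP8-KV",
          tensor_parallel_size=8)
t1 = time.perf_counter()
# The first generate() call acts as the warm-up inference.
llm.generate(["Hello"], SamplingParams(max_tokens=8))
t2 = time.perf_counter()
print(f"engine init: {t1 - t0:.1f}s, first inference: {t2 - t1:.1f}s")
```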
Instant-On Startup: Paiton’s Strategic Advantage
Paiton uniquely harnesses AMD’s GPU architecture combined with proprietary optimizations to substantially reduce startup times:
- Hyper-Optimized Weight Loading: Leveraging AMD’s ultra-fast HBM3 memory with parallel data transfers.
- Accelerated Graph Compilation: Custom routines that entirely eliminate compilation wait times.
- Intelligent Warm-up: Advanced priming strategies guaranteeing immediate and sustained responsiveness.
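To illustrate the general idea behind parallelized weight loading (not Paiton’s proprietary loader), here is a sketch that reads a model’s safetensors shards concurrently so storage reads and host-to-GPU transfers can overlap. The directory layout is a hypothetical example:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from safetensors.torch import load_file

def load_shards_parallel(model_dir: str, workers: int = 8) -> dict:
    """Load all *.safetensors shards in parallel and merge the tensors."""
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    weights = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker loads one shard; load_file returns a dict of tensors.
        for shard in pool.map(load_file, shards):
            weights.update(shard)
    return weights

# weights = load_shards_parallel("/models/Llama-3.1-405B-Instruct-FP8-KV")
```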
Startup Comparison: Llama-3.1-405B-Instruct-FP8-KV
| Stage | Standard vLLM (sec) | AMD + Paiton (sec) | Improvement |
| --- | --- | --- | --- |
| Model Weight Load | 71.24 | 64.28 | 9.7% Faster |
| Memory Profiling | 69.97 | 38.63 | 44.8% Faster |
| Graph Compilation | 27.00 | 0.00 | Eliminated |
| Initial Warm-up | 98.37 | 40.59 | 58.7% Faster |
| Total Startup | 266.58 | 143.50 | 46.2% Faster |
Real-world Impact: Deploy a fully operational 405B-parameter LLM in under 2.4 minutes, versus roughly 4.4 minutes with standard vLLM.
Deep Dive into Tensor Parallelism: The Paiton Edge
Tensor Parallelism is vital for harnessing the power of multi-GPU configurations. At Paiton, we’ve invested extensive effort in deeply optimized kernel development and an enhanced communication layer specifically tailored for AMD’s MI300X GPUs. Our proprietary approach to TP provides unparalleled performance:
- Highly Optimized Kernels: Precision-crafted to maximize GPU compute efficiency and reduce intra-node latency.
- Advanced Communication Layer: Significantly streamlined inter-GPU communication, drastically reducing overhead.
- Scalable Architecture: Consistent, predictable scaling even in complex multi-GPU deployments.
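To illustrate the core pattern these optimizations build on, here is a simplified sketch of row-parallel linear layers, the textbook Megatron-style scheme (not Paiton’s kernels): the weight matrix is split along its input dimension across GPUs, each GPU computes a partial product, and an all-reduce sums them. We simulate the ranks in one process to show the result matches the unsharded matmul:

```python
import torch

tp = 8                                   # tensor-parallel degree (8 x MI300X)
x = torch.randn(32, 1024)                # batch of activations
W = torch.randn(1024, 4096)              # full weight matrix

# Shard x on its feature dim and W on its input dim, one slice per "GPU".
x_shards = x.chunk(tp, dim=1)
w_shards = W.chunk(tp, dim=0)

# Each rank computes a partial result; summing over ranks stands in for
# the dist.all_reduce that a real multi-GPU deployment would issue.
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
y_tp = torch.stack(partials).sum(dim=0)

assert torch.allclose(y_tp, x @ W, atol=1e-3)
print("row-parallel result matches unsharded matmul")
```

The all-reduce in the real multi-GPU case is exactly the inter-GPU communication step our optimized communication layer targets.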
Sustained Performance Gains with Paiton’s TP
| Metric | Paiton Avg | Improvement vs. Aiter |
| --- | --- | --- |
| Request Throughput (req/s) | 2.35 | 20.5% higher |
| Output Token Throughput (tok/s) | 462.52 | 13.4% higher |
| Total Token Throughput (tok/s) | 1009.91 | 17.3% higher |
| Mean TTFT (ms) | 2581.94 | 42.8% lower |
| Mean E2EL (ms) | 11051.53 | 24.3% lower |
| Mean ITL (ms) | 43.28 | 10.6% lower |
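For readers less familiar with these metrics, here is a minimal sketch of how TTFT (time to first token), ITL (inter-token latency), and E2EL (end-to-end latency) are typically computed from per-token arrival timestamps of a single streamed request; the timestamps below are made up for illustration:

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - request_start          # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    e2el = token_times[-1] - request_start         # end-to-end latency
    return {"ttft_ms": 1000 * ttft,
            "mean_itl_ms": 1000 * sum(itl) / len(itl),
            "e2el_ms": 1000 * e2el}

# Example with illustrative timestamps (seconds since request start):
print(latency_metrics(0.0, [2.58, 2.62, 2.67, 2.71]))
```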
Paiton’s combination of rapid deployment capabilities and superior runtime performance uniquely positions AMD-based infrastructure for enterprise-grade AI deployments.
Shaping the Future of LLM Deployment
Our significant breakthroughs with Llama-3.1-405B-Instruct-FP8-KV represent a transformative shift, providing unprecedented agility, efficiency, and scalability for deploying large-scale AI workloads.
Transparency: Our Commitment to Authenticity
At Paiton, we pride ourselves on real, measurable results. Our journey is distinct:
- Lean & Independent: Just 3 engineers, entirely self-funded with zero external investments or support.
- Self-Reliance: No financial, technical, or promotional assistance from AMD or other external entities; fully self-financed investment in our MI300X hardware.
- Proven Expertise: Previously shipped over 250,000 GPUs, thousands of AMD EPYC CPUs, and numerous AI servers, successfully spinning up large-scale AI clusters worldwide.
- Original Innovation: Unlike many startups leveraging open-source software or superficial wrappers, we build everything, including deep kernel optimizations, from scratch.
- Direct Message to AMD: While AMD’s Aiter library is commendable, our compact team achieves consistently superior performance, demonstrating efficiency and innovation unmatched even by larger, funded teams.
- Results Over Hype: We never shout before we deliver. Unlike others that secure millions in funding with limited tangible outcomes, we achieve groundbreaking results first and let those results speak for themselves.
Together, let’s redefine what’s achievable with cutting-edge AI technology and AMD hardware.