Stop Overpaying: Paiton MI300X MoE Beats H200/B200 on $/1M Tokens

Short summary: We benchmarked Paiton with our new MoE support on Qwen/Qwen3-30B-A3B-Instruct-2507 to compare inference performance across several setups. Each configuration was run five times per batch size and we report the mean across runs.

Why this benchmark

Most published numbers use synthetic prompts or toy datasets. We focused on realistic conversational workloads (we always do) to highlight true latency and throughput behaviour. The model under test is Qwen/Qwen3-30B-A3B-Instruct-2507 running with Paiton’s custom-kernel runtime.

Meanwhile, billions are being funneled into shiny NVIDIA racks, yet here we are, crushing them with cheaper, “old-gen” AMD MI300X.

If your goal is outcomes per dollar, not logo per rack unit, you’re literally paying a premium to go slower.

While others rush to buy the newest GPUs, we’re focused on unlocking the full potential of both current and previous generations. GPU vendors often push new hardware before the previous generation is fully optimized, and we’re here to change that.

Methodology

To keep the comparison fair and the results applicable to real deployments, we used the same benchmarking procedure for every configuration:

Model Under Test: We selected Qwen/Qwen3-30B-A3B-Instruct-2507, a representative large MoE model, to ensure our results are relevant to contemporary LLM deployments.
Dataset: Instead of synthetic data, we used realistic conversational traces from the ShareGPT dataset. This choice is critical for accurately assessing performance in scenarios that mimic actual user interactions.
Output Length: To maintain consistency and reflect typical conversational turn lengths, the output length was capped at 256 tokens for all inference runs.
Hardware & Software Configurations:
- AMD MI300X (Paiton Optimized):
  - ROCm Versions: Tested with both 6.4.1 and 7.0.0 to evaluate the impact of different software stacks.
  - VRAM: 192 GB HBM, providing substantial memory capacity for large models.
  - Software Stack: Paiton’s custom-kernel runtime with MoE support, alongside stock vLLM v0.10.0 and other competitor stacks for a holistic comparison.
- NVIDIA H200 (Competitor Reference):
  - CUDA Version: CUDA 13, representing the latest NVIDIA software environment.
  - VRAM: 141 GB HBM.
  - Software Stack: vLLM 0.10.2 and competitor stacks.
- NVIDIA B200 (Competitor Reference):
  - CUDA Version: CUDA 13.
  - VRAM: 180 GB HBM.
  - Software Stack: vLLM 0.10.2 built from source and competitor stacks.
Benchmark Procedure: For each batch size in the set {1, 2, 4, 8, 16, 24, 32, 64, 128, 256}, inference jobs were executed five times end-to-end. We recorded throughput (tokens/sec) and latency for every run and report the mean across the five runs to smooth out run-to-run variance. Tokenization and generation settings were kept identical across all setups to ensure a fair comparison.
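The procedure above can be sketched as a small loop. This is a minimal illustration, not the harness we ran: `run_once` is a hypothetical callable that executes one end-to-end job at a given batch size and returns tokens/sec, and `fake_run` is a stand-in that exists only to show the call shape.

```python
from statistics import mean
from typing import Callable, Dict, List

# Batch sizes and repeat count as stated in the methodology.
BATCH_SIZES: List[int] = [1, 2, 4, 8, 16, 24, 32, 64, 128, 256]
RUNS_PER_BATCH = 5


def benchmark(run_once: Callable[[int], float],
              batch_sizes: List[int] = BATCH_SIZES,
              runs: int = RUNS_PER_BATCH) -> Dict[int, float]:
    """Return the mean throughput (tokens/sec) per batch size.

    `run_once` is a hypothetical callable: it runs one end-to-end
    inference job at the given batch size and returns tokens/sec.
    """
    results: Dict[int, float] = {}
    for batch_size in batch_sizes:
        # Repeat the same job `runs` times and average the samples.
        samples = [run_once(batch_size) for _ in range(runs)]
        results[batch_size] = mean(samples)
    return results


def fake_run(batch_size: int) -> float:
    """Stand-in for a real job; the numbers are not measurements."""
    return 100.0 * batch_size


if __name__ == "__main__":
    for batch_size, tokens_per_sec in benchmark(fake_run).items():
        print(f"batch {batch_size:>3}: {tokens_per_sec:,.2f} tokens/sec")
```

Swapping `fake_run` for a function that drives a real serving stack would keep the loop unchanged, which is the point of holding the procedure constant across setups.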

Results

Throughput and Cost-Efficiency Analysis

The following table presents the mean throughput (tokens/sec) observed for each configuration across varying batch sizes:

Batch Size	Paiton MI300X, ROCm 6.4.1 / vLLM 0.9.0 (tokens/sec)	Stock MI300X, ROCm 7.0 / vLLM 0.10.0 (tokens/sec)	NVIDIA H200, CUDA 13 / vLLM 0.10.2 (tokens/sec)	NVIDIA B200, CUDA 13 / vLLM 0.10.2 (tokens/sec)
1	189.37	162.24	189.51	180.11
2	347.77	299.53	331.47	339.00
4	580.07	496.88	551.72	610.16
8	1,397.40	1,122.12	1,277.69	1,457.11
16	2,855.09	1,989.90	2,222.61	2,840.19
32	4,613.32	3,209.90	3,862.38	4,588.88
64	7,234.05	5,573.63	6,687.79	8,368.82
128	9,554.87	8,717.82	10,489.65	13,884.28
256	14,672.35	12,587.27	16,129.35	21,980.31
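To make the gap between Paiton and stock vLLM on the same MI300X easier to read, the table above can be turned into a per-batch-size uplift. The snippet below is only arithmetic on the published means; the dictionary and function names are ours, chosen for illustration.

```python
# Mean throughput (tokens/sec) copied from the table above.
PAITON_MI300X = {
    1: 189.37, 2: 347.77, 4: 580.07, 8: 1397.40, 16: 2855.09,
    32: 4613.32, 64: 7234.05, 128: 9554.87, 256: 14672.35,
}
STOCK_MI300X = {
    1: 162.24, 2: 299.53, 4: 496.88, 8: 1122.12, 16: 1989.90,
    32: 3209.90, 64: 5573.63, 128: 8717.82, 256: 12587.27,
}


def uplift_percent(optimized: dict, baseline: dict) -> dict:
    """Percentage throughput gain of `optimized` over `baseline` per batch size."""
    return {
        batch: (optimized[batch] / baseline[batch] - 1.0) * 100.0
        for batch in optimized
    }


uplift = uplift_percent(PAITON_MI300X, STOCK_MI300X)
for batch, gain in uplift.items():
    print(f"batch {batch:>3}: +{gain:.1f}%")
```

On these numbers the uplift is smallest at batch size 128 and largest at batch size 32.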

Observations

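Raw throughput is competitive across the board. Paiton on MI300X posts the highest throughput at batch sizes 2, 16, and 32, and is effectively tied with the H200 at batch size 1 (189.37 vs. 189.51 tokens/sec). The B200 leads at batch sizes 4, 8, and 64 and above, and the H200 also passes Paiton at batch sizes 128 and 256. Against stock vLLM on the same MI300X, Paiton is faster at every batch size, by roughly 10% (batch size 128) to 44% (batch size 32). In our runs, Paiton's throughput scaled smoothly and latency stayed predictable at higher batch sizes, while some competitor stacks showed greater variance. The larger differentiator appears once cost is taken into account.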

Token Economics

On-Demand Cost Analysis

To provide a truly meaningful comparison, we analyzed the cost per 1 million tokens, a critical metric for production deployments. This analysis considers the cheapest credible on-demand $/GPU-hr pricing available for each GPU family, combined with our measured batch-size 32 throughput. Batch size 32 was chosen as it represents a common operating point for many conversational MoE deployments.

Transparent formula

The cost per 1 million tokens is calculated using the following formula:

$ per 1M tokens = (1,000,000 × R) / (T × 3600)

Where:

R = on-demand $/hr price per GPU
T = tokens/sec (per GPU)
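The formula translates directly into code. Below is a minimal sketch; the function names are ours, and the example inputs ($1.00/hr, 1,000 tokens/sec) are illustrative round numbers, not measurements.

```python
def cost_per_million_tokens(price_per_gpu_hour: float, tokens_per_sec: float) -> float:
    """$ per 1M tokens = (1,000,000 * R) / (T * 3600)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return (1_000_000 * price_per_gpu_hour) / (tokens_per_sec * 3600)


def tokens_per_dollar(price_per_gpu_hour: float, tokens_per_sec: float) -> float:
    """Tokens generated per dollar: the reciprocal view of the same formula."""
    if price_per_gpu_hour <= 0:
        raise ValueError("price_per_gpu_hour must be positive")
    return (tokens_per_sec * 3600) / price_per_gpu_hour


# Illustrative inputs only: R = $1.00 per GPU-hour, T = 1,000 tokens/sec.
example_cost = cost_per_million_tokens(1.00, 1000.0)
example_tokens = tokens_per_dollar(1.00, 1000.0)
print(f"${example_cost:.3f} per 1M tokens, {example_tokens:,.0f} tokens per $")
```

Because cost scales linearly with R and inversely with T, a GPU that costs twice as much per hour has to deliver twice the throughput just to break even.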

On-demand results (cheapest credible listings)

GPU	Throughput (tokens/sec)	$/GPU/hr	Tokens per $	$ per 1M tokens
Paiton MI300X	4,613.32	$1.50	11,071,968	$0.090
Stock AMD MI300X	3,209.90	$1.50	7,703,760	$0.130
NVIDIA H200	3,862.38	$2.59	5,368,559	$0.186
NVIDIA B200	4,588.88	$3.75	4,405,325	$0.227

Cheapest on-demand pricing with batch-size 32 throughput → Paiton MI300X leads on $/1M tokens.

Executive readout: The data clearly indicates that Paiton MI300X achieves approximately $0.090 per 1 million tokens, a significant reduction compared to NVIDIA H200 (~$0.186) and B200 (~$0.227). This substantial cost differential represents tangible return on investment, not a marginal improvement.
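One way to read these numbers is to ask what hourly price each GPU would need in order to match Paiton MI300X on $/1M tokens at batch size 32. The sketch below rearranges the formula for R; it assumes the throughput and prices from the on-demand table above, and the names `LISTINGS` and `break_even_price` are ours.

```python
# (tokens/sec at batch size 32, on-demand $/GPU/hr) from the on-demand table.
LISTINGS = {
    "Paiton MI300X": (4613.32, 1.50),
    "Stock AMD MI300X": (3209.90, 1.50),
    "NVIDIA H200": (3862.38, 2.59),
    "NVIDIA B200": (4588.88, 3.75),
}


def cost_per_million(tokens_per_sec: float, price_per_hour: float) -> float:
    """$ per 1M tokens = (1,000,000 * R) / (T * 3600)."""
    return (1_000_000 * price_per_hour) / (tokens_per_sec * 3600)


def break_even_price(tokens_per_sec: float, target_cost: float) -> float:
    """Hourly price R at which a GPU with throughput T hits `target_cost` per 1M tokens."""
    return target_cost * tokens_per_sec * 3600 / 1_000_000


# Target: the Paiton MI300X cost per 1M tokens.
target = cost_per_million(*LISTINGS["Paiton MI300X"])

break_even = {
    name: break_even_price(tps, target) for name, (tps, _price) in LISTINGS.items()
}
for name, (tps, price) in LISTINGS.items():
    print(f"{name:<17} listed ${price:.2f}/hr, "
          f"break-even ${break_even[name]:.2f}/hr, "
          f"cost ${cost_per_million(tps, price):.3f} per 1M tokens")
```

With these inputs, the H200 would need to rent for roughly $1.26/hr and the B200 for roughly $1.49/hr to match Paiton MI300X at this operating point. The result depends on the batch size and on the prices you can actually obtain.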

Why Paiton wins

Paiton’s compelling performance and cost-efficiency can be attributed to several key architectural and software optimizations:

Kernel-Level MoE Optimization: Paiton’s custom kernels are designed to maximize arithmetic intensity and memory locality. By bypassing generic library overheads, these kernels achieve a more efficient execution path for MoE models, directly translating to higher performance.
Strategic HBM Leverage on MI300X: The AMD MI300X’s generous 192 GB of High Bandwidth Memory (HBM) is strategically utilized by Paiton. This ample memory capacity ensures that expert weights and key-value (KV) caches remain “hot” in memory, drastically reducing the need for costly memory swaps. This is a critical factor in achieving sustained high performance and, consequently, lower operational costs.
Reduced Kernel Calls: Paiton fuses multiple operations into larger, expert-aware kernels, which means far fewer kernel launches at runtime. Fewer launches reduce latency, improve GPU occupancy, and increase arithmetic intensity, all of which translate to higher throughput and more stable performance compared to stacks that execute many small kernels.
ROCm Version Agnosticism: Paiton’s custom kernels are designed to be largely independent of specific ROCm version changes. This provides greater stability and flexibility for deployment, as performance remains consistent even with updates to the underlying ROCm software stack.

Reproducibility & Pricing Transparency

We are committed to transparency and reproducibility.

Throughput/Latency Data: All raw data is included within this post.
Formulas: The formulas used for cost calculations are explicitly stated.
Pricing: On-demand pricing reflects the lowest credible public listings available at the time of this benchmark.

Conclusion

The New Cost-Per-Token Baseline

For organizations deploying large Mixture-of-Experts models in production, Paiton on AMD MI300X sets the cost-per-token baseline. At the batch-size-32 operating point, it serves Qwen3-30B-A3B-Instruct-2507 at roughly half the on-demand cost per token of the H200 and about 40% of the cost of the B200. We expect this advantage to hold for both on-demand cloud deployments and on-premises “owned iron” infrastructures. Combining competitive performance with significant cost savings makes Paiton a strong choice for maximizing return on investment in LLM inference.

Furthermore, we are continuously optimizing our solutions and anticipate even more impressive results in the near future. Stay tuned for upcoming updates, including detailed FP8 results, which we will be posting very soon.

Reach out and let’s talk :)
