Cranking Out Faster Tokens for Fewer Dollars: AMD MI300X vs. NVIDIA H200

Qwen3-32B on Paiton + AMD MI300x vs. NVIDIA H200

1. Introduction

“While we’re actively training models for local customers, automating and streamlining critical business processes, we still found time to push our Paiton framework to the limit on Qwen3-32B.”

In the competitive realm of LLMs, next-gen hardware like the NVIDIA H200 often steals the headlines. But at a significantly lower price point, our AMD MI300x solution, optimized with Paiton, is emerging as the best of both worlds: on-par or better performance plus a compelling cost per million tokens.


2. What We Tested

We locked and loaded the newly released Qwen3-32B model on:

  • AMD MI300x using older ROCm 6.3.1 drivers (not even the latest 6.4!)
  • NVIDIA H200 with the newest drivers/toolchains
  • Paiton: Our concurrency + kernel-fusion framework, integrated with vLLM 0.8.4
  • Benchmarking with python3 benchmark_serving.py (various configurations, both with and without the --sharegpt-output-len=256 argument)

We also tested an unoptimized (stock) MI300x setup for reference, but the true star here is Paiton on the MI300x, our secret sauce for next-level throughput.

Below is a quick recap of our typical commands:

Without --sharegpt-output-len

python3 benchmark_serving.py --backend vllm --model Qwen/Qwen3-32B --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 32 --random-range-ratio 1.0 --host 0.0.0.0 --port 8888 --percentile-metrics ttft,tpot,itl,e2el

With --sharegpt-output-len=256

python3 benchmark_serving.py --backend vllm --model Qwen/Qwen3-32B --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 32 --random-range-ratio 1.0 --host 0.0.0.0 --port 8888 --percentile-metrics ttft,tpot,itl,e2el --sharegpt-output-len 256

We tested batch sizes 1, 2, 4, 8, 16, 32, 64, and 128 in each scenario, launching the vLLM server as follows (HIP_VISIBLE_DEVICES pins the run to a single GPU):

HIP_VISIBLE_DEVICES=1 vllm serve -tp 1 --swap-space 16 --port 8888 --disable-log-requests Qwen/Qwen3-32B --num-scheduler-steps 10

Our Paiton-optimized runs layer specialized concurrency and fused kernels on top of these flags.
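
For reference, sweeping those batch sizes can be scripted with a small wrapper like the sketch below. It simply re-issues the benchmark_serving.py command shown above with a varying --num-prompts value; this is an illustrative Python sketch, not our exact harness.

import subprocess

# Re-run the client benchmark once per batch size, mirroring the
# benchmark_serving.py invocation shown above.
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]

for bs in BATCH_SIZES:
    subprocess.run(
        [
            "python3", "benchmark_serving.py",
            "--backend", "vllm",
            "--model", "Qwen/Qwen3-32B",
            "--dataset-name", "sharegpt",
            "--dataset-path", "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "--num-prompts", str(bs),
            "--random-range-ratio", "1.0",
            "--host", "0.0.0.0",
            "--port", "8888",
            "--percentile-metrics", "ttft,tpot,itl,e2el",
        ],
        check=True,  # abort the sweep if any run fails
    )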


3. Headline Figures: Beating the H200 Again

Despite the H200 being the “new hotness,” Paiton + MI300x matches or exceeds H200 performance across multiple batch-size scenarios, at a lower total hardware cost:

8x H200 system vs. 8x MI300x system
$40,000 in savings with AMD.

That’s not chump change. When you factor in how many tokens you’ll generate over the system’s lifetime, the cost per million tokens dips even further in favor of the MI300x.


4. Detailed Performance Tables

We’ll show two sets of data:

  1. Without the --sharegpt-output-len flag
  2. With --sharegpt-output-len=256

In each set, we provide Throughput (Requests/s, Output Tokens/s, Total Tokens/s) and Latency (TTFT, TPOT, ITL, E2E).
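
A useful sanity check when reading these tables: under the standard definitions of these metrics, end-to-end latency decomposes approximately as E2E ≈ TTFT + TPOT * (output_tokens - 1). The fixed-length results in Section 4.2.B illustrate this nicely; here’s a quick Python check using the batch-1 Paiton row from that table (the row values come from the data below, while the decomposition itself is a rule of thumb, not an exact identity):

# E2E ~= TTFT + TPOT * (n_out - 1), checked against the batch-1
# Paiton row of the --sharegpt-output-len=256 latency table.
ttft_ms, tpot_ms, n_out = 117.14, 19.71, 256
e2e_estimate = ttft_ms + tpot_ms * (n_out - 1)
print(f"{e2e_estimate:.2f} ms")  # ~5143.19 ms vs. the measured 5144.32 ms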

4.1. Without --sharegpt-output-len

We compared three configurations:

  • H200 (latest drivers & Torch stack)
  • Stock MI300x (older ROCm 6.3.1 drivers)
  • Paiton (our specialized concurrency + kernel fusion)

4.1.A. Throughput (No sharegpt-output-len)

Batch | Config         | Req/s | Out Tok/s | Total Tok/s
1     | H200           | 0.33  | 39.48     | 43.46
1     | MI300x (Stock) | 0.33  | 39.26     | 43.22
1     | Paiton         | 0.43  | 51.14     | 56.30
2     | H200           | 0.11  | 48.24     | 50.19
2     | MI300x (Stock) | 0.10  | 45.89     | 47.75
2     | Paiton         | 0.13  | 59.04     | 61.43
4     | H200           | 0.21  | 67.46     | 71.11
4     | MI300x (Stock) | 0.20  | 67.23     | 70.86
4     | Paiton         | 0.26  | 86.65     | 91.34
8     | H200           | 0.40  | 109.04    | 166.55
8     | MI300x (Stock) | 0.40  | 108.94    | 166.39
8     | Paiton         | 0.52  | 141.50    | 216.13
16    | H200           | 0.78  | 178.20    | 333.99
16    | MI300x (Stock) | 0.78  | 178.48    | 334.51
16    | Paiton         | 1.00  | 229.26    | 429.68
32    | H200           | 1.47  | 329.48    | 677.57
32    | MI300x (Stock) | 1.47  | 329.93    | 679.47
32    | Paiton         | 1.87  | 417.91    | 860.65
64    | H200           | 2.69  | 564.33    | 1226.38
64    | MI300x (Stock) | 2.53  | 535.74    | 1157.55
64    | Paiton         | 3.30  | 699.03    | 1511.26
128   | H200           | 4.68  | 1021.46   | 2107.07
128   | MI300x (Stock) | 4.57  | 995.06    | 2056.82
128   | Paiton         | 5.33  | 1163.39   | 2401.25

Observations (No sharegpt-output-len):

  • Paiton + MI300x leads at every batch size in requests/s and total tokens/s.
  • H200 and stock MI300x are neck-and-neck through batch size 32; the stock MI300x slips slightly behind at 64 and 128, while Paiton stays well ahead throughout.
  • Even at the largest batch size (128), Paiton is roughly 14% ahead of the H200 in total tokens/s (2401.25 vs. 2107.07; see the quick check below).
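
For readers who want to verify ratios like the one above, the arithmetic is straightforward; here’s a tiny Python check using the batch-128 Total Tok/s values from the table:

# Recompute the batch-128 lead quoted in the observations.
paiton_total_tps = 2401.25  # Paiton, batch 128
h200_total_tps = 2107.07    # H200, batch 128
lead = paiton_total_tps / h200_total_tps - 1
print(f"Paiton lead at batch 128: {lead:.1%}")  # -> 14.0%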

4.1.B. Latency (No sharegpt-output-len)

Below, we show the mean values for TTFT (Time-to-First-Token), TPOT (Time per Output Token), ITL (Inter-Token Latency), and E2E (End-to-End latency). We’ve omitted the median and P99 values for brevity.

Batch | Config       | Mean TTFT (ms) | Mean TPOT (ms) | Mean ITL (ms) | Mean E2E (ms)
1     | H200         | 71.11          | 24.94          | 24.94         | 3013.7
1     | MI300x Stock | 96.59          | 24.86          | 24.86         | 3030.07
1     | Paiton       | 61.42          | 19.19          | 19.19         | 2325.87
2     | H200         | 70.30          | 24.15          | 23.95         | 10690.70
2     | MI300x Stock | 95.40          | 25.26          | 25.12         | 11235.03
2     | Paiton       | 66.48          | 19.38          | 19.46         | 8697.51
4     | H200         | 71.00          | 25.65          | 25.46         | 8409.96
4     | MI300x Stock | 94.67          | 25.65          | 25.49         | 8442.12
4     | Paiton       | 70.86          | 19.69          | 19.65         | 6506.55
8     | H200         | 190.87         | 26.25          | 26.00         | 7283.43
8     | MI300x Stock | 244.19         | 26.09          | 25.93         | 7317.94
8     | Paiton       | 205.81         | 19.91          | 19.90         | 5634.23

(Table shortened for readability, but the trend is consistent: Paiton reduces time-to-first-token across small batch sizes and can shave E2E latency by a meaningful margin at mid-range batch sizes.)


4.2. With --sharegpt-output-len=256

Now, let’s look at the scenario where we fix the output length to 256 tokens. This often helps concurrency and scheduling because the model no longer deals with variable or uncertain completion lengths.

4.2.A. Throughput (With sharegpt-output-len=256)

Batch | Config       | Req/s | Out Tok/s | Total Tok/s
1     | H200         | 0.15  | 39.32     | 41.17
1     | MI300x Stock | 0.15  | 39.12     | 40.95
1     | Paiton       | 0.19  | 49.75     | 52.09
2     | H200         | 0.30  | 76.36     | 81.73
2     | MI300x Stock | 0.30  | 76.30     | 81.66
2     | Paiton       | 0.39  | 99.84     | 106.86
4     | H200         | 0.59  | 152.27    | 162.83
4     | MI300x Stock | 0.59  | 151.23    | 161.72
4     | Paiton       | 0.76  | 194.67    | 208.17
8     | H200         | 1.14  | 291.00    | 455.11
8     | MI300x Stock | 1.13  | 289.74    | 453.14
8     | Paiton       | 1.44  | 369.11    | 577.28
16    | H200         | 2.08  | 531.34    | 996.58
16    | MI300x Stock | 2.13  | 545.21    | 1022.13
16    | Paiton       | 2.63  | 673.64    | 1262.91
32    | H200         | 3.75  | 959.10    | 1836.45
32    | MI300x Stock | 3.72  | 951.17    | 1820.83
32    | Paiton       | 4.35  | 1112.82   | 2130.28
64    | H200         | 6.35  | 1614.17   | 3186.41
64    | MI300x Stock | 5.19  | 1328.16   | 2613.51
64    | Paiton       | 6.49  | 1661.29   | 3269.05
128   | H200         | 9.21  | 2356.92   | 4547.69
128   | MI300x Stock | 8.18  | 2086.62   | 4032.20
128   | Paiton       | 8.94  | 2278.53   | 4405.65

Observations (With sharegpt-output-len=256):

  • Throughput jumps across the board because the scheduler can plan around a fixed, predictable completion length.
  • Paiton once again extends the MI300x lead at most batch sizes. At batch 64, Paiton narrowly edges out the H200; at batch 128, the H200 pulls slightly ahead in both requests/s and total tokens/s, though the gap is only about 3%.

4.2.B. Latency (With sharegpt-output-len=256)

Batch | Config       | Mean TTFT (ms) | Mean TPOT (ms) | Mean ITL (ms) | Mean E2E (ms)
1     | H200         | 143.90         | 24.96          | 24.96         | 6509.40
1     | MI300x Stock | 176.79         | 24.97          | 24.97         | 6543.64
1     | Paiton       | 117.14         | 19.71          | 19.71         | 5144.32
2     | H200         | 145.96         | 25.72          | 25.72         | 6703.84
2     | MI300x Stock | 171.51         | 25.64          | 25.64         | 6708.87
2     | Paiton       | 113.62         | 19.66          | 19.66         | 5126.54
4     | H200         | 145.09         | 25.80          | 25.80         | 6723.31
4     | MI300x Stock | 170.48         | 25.88          | 25.88         | 6768.63
4     | Paiton       | 117.72         | 20.16          | 20.16         | 5257.58
8     | H200         | 263.55         | 26.56          | 26.56         | 7035.76
8     | MI300x Stock | 320.44         | 26.45          | 26.45         | 7064.58
8     | Paiton       | 253.04         | 20.75          | 20.75         | 5544.18

(Again, showing partial data for brevity.)

Latency Takeaways:

  • Paiton consistently reduces Time-to-First-Token (TTFT) across small batch sizes.
  • Mean E2E Latency sees a noticeable drop with Paiton vs. stock MI300x or H200, particularly in the 1–16 batch range.
  • At higher batch sizes, latencies naturally scale up, but Paiton helps keep them in check.

5. Cost Per Million Tokens: Real ROI for Paiton + MI300x

From a purely corporate perspective, the $40,000 cost delta between the 8-GPU H200 system and the 8-GPU MI300x system is substantial. When normalized by total tokens processed over the system’s multi-year lifecycle, the math is in AMD’s favor:

  • H200 might produce marginally higher throughput at extremely large batch sizes, but requires a bigger cash outlay.
  • MI300x + Paiton meets or beats the H200 at a much lower hardware price, so your cost per million tokens can be significantly lower.

In large-scale inference scenarios (think: billions or trillions of tokens served monthly), that price gap pays off quickly.

“Cost per million tokens is calculated by taking each system’s approximate hardware cost and dividing by its sustained token throughput at a moderate concurrency level. The exact figure may vary in real-world deployments based on your usage patterns, operational overhead, and chosen batch sizes, but these numbers provide a clear illustration of the relative cost efficiency between the H200 and MI300x (Paiton) solutions.”
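
As a rough illustration of that methodology, here’s a minimal Python sketch. The per-GPU prices, lifetime, and utilization below are placeholder assumptions (this article states only the $40,000 system-level delta); the throughput figures come from the batch-32 rows of the fixed-output-length table in Section 4.2.A.

# Illustrative $/Mtok calculation following the quoted methodology.
# Prices, lifetime, and utilization are HYPOTHETICAL placeholders;
# only the ~$40k delta between 8-GPU systems comes from this article.
SECONDS_PER_YEAR = 365 * 24 * 3600

def cost_per_million_tokens(hw_cost_usd, tokens_per_sec,
                            years=3, utilization=0.6):
    """Amortize hardware cost over lifetime token output."""
    lifetime_tokens = tokens_per_sec * utilization * years * SECONDS_PER_YEAR
    return hw_cost_usd / (lifetime_tokens / 1e6)

# Total tokens/s at batch 32 with --sharegpt-output-len=256 (single GPU, -tp 1).
h200_tps, paiton_tps = 1836.45, 2130.28

# Placeholder per-GPU prices preserving a $40k delta per 8-GPU system.
h200_cost, mi300x_cost = 40000.0, 35000.0  # hypothetical

print(f"H200:            ${cost_per_million_tokens(h200_cost, h200_tps):.4f}/Mtok")
print(f"MI300x + Paiton: ${cost_per_million_tokens(mi300x_cost, paiton_tps):.4f}/Mtok")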


6. Paiton: The Game-Changer

While raw AMD silicon is impressive, Paiton is our in-house software layer that optimizes concurrency, kernel launches, and memory usage:

  • Kernel Fusion: Minimizes overhead by merging operations.
  • Adaptive Concurrency: Exploits the GPU’s HBM memory to handle multi-request bursts.
  • Robust Under Older Drivers: Even with ROCm 6.3.1, we’re beating the H200. Expect even bigger leaps when we move to 6.4+.

In nearly every table above, you’ll notice how “MI300x + Paiton” outpaces “H200” and “Stock MI300x.” That’s not purely hardware; it’s synergy between Paiton and AMD’s robust memory architecture.
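
To make the kernel-fusion idea concrete, here’s a toy PyTorch sketch. It is purely illustrative (Paiton’s actual fused kernels are proprietary and hand-tuned); it only shows the general principle of collapsing a chain of elementwise ops into fewer kernel launches, here via torch.compile:

import torch
import torch.nn.functional as F

def scale_shift_act(x, w, b):
    # In eager mode this is three separate GPU kernels (mul, add, SiLU),
    # each reading and writing the full tensor through HBM.
    return F.silu(x * w + b)

# A fusing compiler can emit a single kernel for the whole chain,
# cutting launch overhead and memory round-trips.
fused = torch.compile(scale_shift_act)

if __name__ == "__main__":
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(4096, 4096, device=dev)
    w = torch.randn(4096, 4096, device=dev)
    b = torch.randn(4096, 4096, device=dev)
    print(fused(x, w, b).shape)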


7. Bottom-Line Takeaways

  1. Paiton + MI300x outruns (or meets) the NVIDIA H200 on Qwen3-32B in small-to-mid batch sizes and often holds up well at larger sizes too.
  2. $40K Cheaper for an 8-GPU System: That’s a real difference in capital expenditure, culminating in a better cost-per-million-tokens in many real-world scenarios.
  3. Even More Performance Gains Ahead: New AMD drivers and expanded concurrency in Paiton will keep increasing the performance gap.

8. Looking Forward: Magic on the Horizon

“Keep an eye on us.
We’ve got some magical stuff brewing for FP8 and then some, stay tuned!”

We’re not just resting on these results. We’ll continue refining Paiton with advanced quantization strategies, deeper optimization techniques, and ongoing work on AMD’s MI300x platform. As more enterprises opt for large-scale in-house LLM deployments, the synergy of AMD MI300x hardware plus Paiton stands ready to slash costs while raising performance.

Thanks for reading, and feel free to reach out if you want more data, a private demo, or a deep dive into our concurrency model.

– The Paiton Team