MI300X vs H200 vs RX 7900 XTX vs Tenstorrent n300s with vLLM

As large language models (LLMs) become a foundational part of modern applications, picking the right server for deployment is more important than ever, whether you’re an enterprise scaling up inference, a startup optimizing for cost, or a researcher pushing throughput boundaries. This blog compares two high-profile server setups with two lower-profile setups that are not usually deployed as servers in a data center. Each has different GPU/accelerator hardware, and all of them run vLLM.

We’ll compare:

AMD MI300X
NVIDIA H200
Tenstorrent n300s
AMD RX 7900 XTX

While comparing the RX 7900 XTX to data center-class GPUs may seem unfair, it’s important to recognize its versatility. As a multi-purpose GPU capable of both high-end gaming and AI development, it offers a practical advantage: you can use the same system for development and leisure, making it a compelling option for individual developers, small teams, or data centers that want to dynamically switch between gaming and AI workloads based on demand.
That said, if we’re being honest, we’re including the RX 7900 XTX primarily because we already have it available and were curious to see how it stacks up against dedicated data center hardware.

The other odd one in this race is the Tenstorrent system. It has been showing great potential and promise, and it caught my eye about 3 years ago. We were very keen to test this system against its giant competitors because it adds some variety and excitement to a field dominated by the current two main players in AI (sorry Intel).

GPU Specs (as of Q2 2025)

Accelerator	Architecture	VRAM	Memory Type	TDP	Approx Price (USD)
AMD MI300X	CDNA3	192 GB	HBM3	750W	~$15,000
NVIDIA H200	Hopper	141 GB	HBM3e	700W	~$30,000
AMD RX 7900 XTX	RDNA3	24 GB	GDDR6	355W	~$1,000
Tenstorrent n300s	Custom RISC-V	24 GB	GDDR6	300W	~$1,399

Note: These values reflect typical specs for each GPU/accelerator. Actual performance can vary based on system integration (e.g., CPU, motherboard, cooling) and workload characteristics. The focus here is on the GPU/accelerator as the main differentiator.

Power Measurement Method

This comparison was challenging because an accurate per-GPU cost per million tokens is difficult to estimate for the MI300X and H200: individual GPU prices are not publicly available. We therefore calculated cost per million tokens from the full system price and the approximate system power consumption. This approach also makes more sense because, in real deployments, system-level costs (power, hardware, infrastructure) contribute to operational expenses beyond just the GPU. To account for the full system, we multiplied the measured single-GPU token throughput and the single-GPU power consumption by the number of GPUs in the server, then added the resulting GPU power to the idle system power. This effectively spreads system cost and power across all GPUs under the assumption of full utilization. Note that the system power values are idle measurements or estimates only; they do not account for additional components or for increased non-GPU draw under full system utilization.

The idle power consumption of the H200 system was not measured directly. Instead, we used an estimate from a ServeTheHome article, which suggests that the system typically consumes around 2200W when idle. For more details, refer to the ServeTheHome article.

The idle power consumption of the RX 7900 system could not be directly measured and was instead estimated to be around 400W.

The power consumption of the other systems was measured using ipmitool.
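The full-load power estimate described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the script used for this post: the function name is hypothetical, and it assumes every GPU draws its rated TDP under load, which is how the 8.4kW, 7.8kW, and 1.9kW figures in the cost calculations are obtained. The idle values are the ones listed in the server configuration table.

```python
def full_load_power_kw(idle_system_w: float, num_gpus: int, gpu_power_w: float) -> float:
    """Estimate system power at full GPU utilization, in kW.

    Assumption: non-GPU components stay at their idle draw, so the
    result is a lower bound on real full-load consumption.
    """
    return (idle_system_w + num_gpus * gpu_power_w) / 1000.0


# Idle system power, GPU count, and per-GPU TDP taken from the tables in this post.
mi300x_kw = full_load_power_kw(2400, 8, 750)
h200_kw = full_load_power_kw(2200, 8, 700)
loudbox_kw = full_load_power_kw(700, 4, 300)

print(f"MI300X server: {mi300x_kw:.1f} kW")
print(f"H200 server:   {h200_kw:.1f} kW")
print(f"Loudbox:       {loudbox_kw:.1f} kW")
```

The RX 7900 XTX workstation is left out on purpose: the 1.054kW used for it later in this post does not follow from the 355W TDP, so it is taken as given rather than recomputed here.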

Server Configurations and Pricing

Server	CPU	RAM	Storage	Cooling	# of GPUs	Idle System Power (Est.)	System Price (USD)
A+ Server 8125GS-TNMR2	2x EPYC 9654	1536GB	4TB NVMe	Air	8	~2400W	~$260,564
SuperServer 821GE-TNHR	2x Intel Xeon Platinum 8468H	1536GB	4TB NVMe	Air	8	~2200W	~$307,336
DIY AMD Workstation	Ryzen 9 7950X	64GB	2TB NVMe	Air	2	~400W	~$3,500
Tenstorrent Loudbox	2x Intel® Xeon® Silver 4309Y	512GB	4TB NVMe	Passive	4	~700W	$12,000

Note: Configurations are representative. Real-world builds may vary depending on components, vendors, and integration costs. Prices were taken from www.thinkmate.com.

Benchmark Setup

Note: All benchmarks were run using a single GPU or accelerator card per system to ensure a fair comparison across different hardware classes.

Framework: vLLM (paged attention + continuous batching)
Model: meta-llama/Meta-Llama-3-8B-Instruct
Workload: Concurrent prompts, batch size 32, fixed output length of 256 tokens
Dataset: ShareGPT
Metrics: Tokens/sec (throughput) and cost-performance
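For readers who want to try a similar measurement, here is a rough sketch of how output tokens/sec could be timed with vLLM's offline API. It is not the benchmark script used for this post. The helper names (measure_throughput, make_vllm_generate) are hypothetical, and mapping "batch size 32" to vLLM's max_num_seqs setting is our assumption for illustration.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(generate: Callable[[Sequence[str]], List[int]],
                       prompts: Sequence[str]) -> float:
    """Return generated tokens per second for one call to `generate`.

    `generate` receives the prompts and returns the number of tokens
    generated for each prompt.
    """
    start = time.perf_counter()
    token_counts = generate(prompts)
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


def make_vllm_generate(model: str = "meta-llama/Meta-Llama-3-8B-Instruct",
                       max_concurrency: int = 32,
                       output_len: int = 256) -> Callable[[Sequence[str]], List[int]]:
    """Build a `generate` callable backed by vLLM (requires vLLM and a GPU)."""
    # Imported lazily so the timing helper works without vLLM installed.
    from vllm import LLM, SamplingParams

    # max_num_seqs caps how many sequences are scheduled in one iteration.
    llm = LLM(model=model, max_num_seqs=max_concurrency)
    # ignore_eos keeps generating past end-of-sequence, up to max_tokens,
    # which approximates a fixed output length.
    params = SamplingParams(max_tokens=output_len, ignore_eos=True)

    def generate(prompts: Sequence[str]) -> List[int]:
        outputs = llm.generate(list(prompts), params)
        return [len(out.outputs[0].token_ids) for out in outputs]

    return generate
```

With vLLM installed, measure_throughput(make_vllm_generate(), prompts) would return tokens/sec for a list of prompt strings, for example ones sampled from ShareGPT. The timing includes prefill, so the numbers will not match a serving benchmark exactly.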

vLLM Benchmark Results (Batch Size 32 Only)

Server	# of GPUs	Tokens/sec per GPU	System Tokens/sec
AMD MI300X	8	7003.10	56,024.8
NVIDIA H200	8	8192.08	65,536.64
AMD RX 7900 XTX(*)	2	1113.59	2227.18
Tenstorrent Loudbox	4	1314.0	5256.0

(*) Note: We encountered out-of-memory (OOM) errors on the AMD RX 7900 XTX when the model context length was set to 131072. We lowered it to 22048, which is a significant change.
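A back-of-the-envelope KV-cache estimate shows why the longer context is a problem on a 24 GB card. This is a rough sketch, not a diagnosis of the actual failure. It assumes the publicly documented Llama 3 8B shape (32 layers, 8 key/value heads, head dimension 128) and a 16-bit KV cache, and it ignores the roughly 16 GB of 16-bit weights as well as activations. The function name is hypothetical.

```python
def kv_cache_gib(context_len: int,
                 num_layers: int = 32,
                 num_kv_heads: int = 8,
                 head_dim: int = 128,
                 bytes_per_value: int = 2) -> float:
    """KV-cache size in GiB for a single sequence of `context_len` tokens."""
    # The factor 2 covers the separate key and value tensors in every layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_len * bytes_per_token / 2**30


for context_len in (131072, 22048):
    print(f"context {context_len}: {kv_cache_gib(context_len):.2f} GiB of KV cache")
```

Under these assumptions, a single full-length sequence at the larger context needs far more cache than the card has left after loading the weights, while the reduced context fits comfortably. In vLLM the limit is set with the max_model_len option.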

Cost per Million Tokens

Note: For simplicity of comparison, we assume a 3-year lifespan (26280 hours). Power costs are calculated at a rate of $0.10 per kWh.

The calculation used to get the cost per 1M tokens is shown below for review. It assumes a 3-year depreciation and full system + GPU utilization.

AMD MI300X Server:

Assuming 2400W idle power consumption and 8 GPUs fully utilized at 750W each, the total would be 8.4kW.

System Tokens/sec: 7003.10 tokens/sec/GPU * 8 GPUs = 56,024.8 tokens/sec
Tokens per hour: 56,024.8 tokens/sec * 3600 sec/hour = 201,689,280 tokens/hour
Cost per hour: $260,564 / 26280 hours + 8.4kW * $0.10/kWh = $10.76
Cost per 1M tokens: $10.76 / (201,689,280 / 1,000,000) = $0.053 per 1M tokens

AMD MI300X Server + Paiton:

System Tokens/sec: 7637.12 tokens/sec/GPU * 8 GPUs = 61,096.96 tokens/sec
Tokens per hour: 61,096.96 tokens/sec * 3600 sec/hour = 219,949,056 tokens/hour
Cost per hour: $260,564 / 26280 hours + 8.4kW * $0.10/kWh = $10.76
Cost per 1M tokens: $10.76 / (219,949,056 / 1,000,000) = $0.049 per 1M tokens

Nvidia H200 Server:

System Tokens/sec: 8192.08 tokens/sec/GPU * 8 GPUs = 65,536.64 tokens/sec
Tokens per hour: 65,536.64 tokens/sec * 3600 sec/hour = 235,931,904 tokens/hour
Cost per hour: $307,336 / 26280 hours + 7.8kW * $0.10/kWh = $12.47
Cost per 1M tokens: $12.47 / (235,931,904 / 1,000,000) = $0.053 per 1M tokens

AMD RX 7900 XTX Workstation:

System Tokens/sec: 1113.59 tokens/sec/GPU * 2 GPUs = 2227.18 tokens/sec
Tokens per hour: 2227.18 tokens/sec * 3600 sec/hour = 8,017,848 tokens/hour
Cost per hour: $3,500 / 26280 hours + 1.054kW * $0.10/kWh = $0.24
Cost per 1M tokens: $0.24 / (8,017,848 / 1,000,000) = $0.030 per 1M tokens with a context length of 22048

Tenstorrent Loudbox:

System Tokens/sec: 1314 tokens/sec/card * 4 cards = 5256 tokens/sec
Tokens per hour: 5256 tokens/sec * 3600 sec/hour = 18,921,600 tokens/hour
Cost per hour: $12,000 / 26280 hours + 1.9kW * $0.10/kWh = $0.65
Cost per 1M tokens: $0.65 / (18,921,600 / 1,000,000) = $0.034 per 1M tokens
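The same arithmetic can be reproduced with a short Python script, which makes it easy to plug in a different lifespan or electricity rate. This is a sketch of the calculation as described in this post. The function and variable names are our own, and the inputs are the system prices, power figures, and per-GPU throughputs listed above.

```python
LIFESPAN_HOURS = 3 * 365 * 24  # 3-year depreciation = 26280 hours
USD_PER_KWH = 0.10


def cost_per_million_tokens(system_price_usd: float,
                            system_power_kw: float,
                            tokens_per_sec_per_gpu: float,
                            num_gpus: int,
                            lifespan_hours: float = LIFESPAN_HOURS,
                            usd_per_kwh: float = USD_PER_KWH) -> float:
    """Hardware depreciation plus electricity, per 1M generated tokens."""
    cost_per_hour = system_price_usd / lifespan_hours + system_power_kw * usd_per_kwh
    tokens_per_hour = tokens_per_sec_per_gpu * num_gpus * 3600
    return cost_per_hour / (tokens_per_hour / 1_000_000)


# (system price in USD, full-load power in kW, tokens/sec per GPU, number of GPUs)
systems = {
    "AMD MI300X": (260_564, 8.4, 7003.10, 8),
    "AMD MI300X + Paiton": (260_564, 8.4, 7637.12, 8),
    "NVIDIA H200": (307_336, 7.8, 8192.08, 8),
    "AMD RX 7900 XTX": (3_500, 1.054, 1113.59, 2),
    "Tenstorrent Loudbox": (12_000, 1.9, 1314.0, 4),
}

results = {name: cost_per_million_tokens(*args) for name, args in systems.items()}
for name, cost in results.items():
    print(f"{name}: ${cost:.3f} per 1M tokens")
```

Because the script does not round the hourly cost before dividing, its intermediate values can differ from the hand calculation in the last digit, but the per-1M-token results agree to three decimal places.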

Server	Cost per 1M Tokens
AMD MI300X	$0.053
AMD MI300X + Paiton	$0.049
NVIDIA H200	$0.053
AMD RX 7900 XTX	$0.030 (lower context length)
Tenstorrent Loudbox	$0.034

Observations

AMD MI300X: Competitive throughput with massive VRAM. Ideal for mid-to-large batch sizes and larger models like Llama 3-70B and beyond. We benchmarked Llama3-8B only because the other cards cannot load larger models on a single GPU; the MI300X should not be used for models this small unless GPU partitioning is used.

AMD MI300X + Paiton: In addition to the MI300X observations above, Paiton achieved a roughly 8% lower cost per million tokens than the standard MI300X setup ($0.049 vs. $0.053), demonstrating improved cost-efficiency from software optimizations. The Paiton Framework is continuously improving, and we expect this cost to keep decreasing as development progresses.
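Since the hardware price and power figures are identical for both MI300X rows, the relative saving depends only on the two throughput numbers. A quick check in Python, using the per-GPU tokens/sec values from the calculations above (variable names are ours):

```python
baseline_tps = 7003.10  # MI300X, tokens/sec per GPU
paiton_tps = 7637.12    # MI300X + Paiton, tokens/sec per GPU

# Same hourly cost spread over more tokens: cost scales with 1 / throughput.
speedup = paiton_tps / baseline_tps
cost_reduction = 1 - baseline_tps / paiton_tps

print(f"Throughput gain: {speedup - 1:.1%}")
print(f"Cost reduction per 1M tokens: {cost_reduction:.1%}")
```

The cost reduction is slightly smaller than the throughput gain because cost is inversely proportional to throughput, not linear in it.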

NVIDIA H200: Industry-leading speed, in the same range as the AMD MI300X, backed by a mature CUDA software stack. Ease of use is a clear strength of Nvidia.

RX 7900 XTX: A cost-effective choice for individual developers. Not ideal for larger workloads due to VRAM limitations and the annoying context-length limitation, but great for light inferencing and development. Useful for AI/ML workloads and any other general GPU-related workload.

Tenstorrent n300s: Innovative RISC-V architecture tailored for ML workloads. Emerging support in inference frameworks like vLLM, but the ecosystem is still growing. The cost per 1M tokens is competitive for smaller models, if the model is supported.

Final Thoughts

Choosing the right server hinges on your use case:

Startups or researchers: RX 7900 XTX workstation or Tenstorrent Loudbox for low-cost experimentation and inference (just make sure the model you want to use is supported by Tenstorrent).
Enterprises: MI300X and H200 both balance performance and efficiency well. However, the MI300X does provide significantly larger VRAM at a lower cost.

Given the surprisingly competitive cost-efficiency of the RX 7900 XTX in this analysis, we’re excited to announce that RDNA support for Paiton is currently in development to further unlock the potential of these GPUs.

In a fast-evolving AI hardware landscape, pairing the right accelerator with optimized inference frameworks like vLLM ensures you’re making the most of your infrastructure.