Paiton Returns to Its Diffusion Roots: Optimizing Wan2.2-T2V-A14B on AMD MI355X

When we first started building Paiton, one of our earliest focus areas was optimizing diffusion models. Stable Diffusion XL was one of the first large models where we showed that fused operators, efficient execution, and hardware-aware kernels could make a real difference.

Now we are returning to those origins.With the growing interest in text-to-video generation, we have added support for Wan-AI/Wan2.2-T2V-A14B, a large text-to-video diffusion model. This is an important step for Paiton because it shows that our compiler and runtime approach is not limited to LLM inference. Paiton is built to optimize real AI workloads across model families, including diffusion, video generation, and multimodal systems.

Benchmark setup

For this comparison, we used the Hugging Face Diffusers library to run the Wan2.2-T2V-A14B model on two high-end accelerators:

AMD MI355X
NVIDIA B200

The goal was simple: compare generation time for the same model using the same Diffusers-based workflow. Using our Paiton-Diffusers plugin on the AMD MI355X.

Results

Based on the average generation times, the AMD MI355X completed generation in 17.6% less time than the NVIDIA B200.

GPU	Average generation time	Result
NVIDIA B200	6.672s	Baseline
AMD MI355X	5.501s	17.6% faster generation time

This is exactly the type of workload Paiton was built for: large models, heavy GPU execution, and expensive inference paths where every optimization matters.

Why diffusion models still matter

Large language models have received most of the attention in recent years, but diffusion models remain one of the most important classes of AI workloads. Image generation, video generation, 3D generation, and multimodal pipelines all depend on many of the same performance-critical patterns:

large matrix operations
normalization layers
attention blocks
memory-heavy tensor transformations
repeated denoising steps
operator scheduling overhead

These are exactly the areas where Paiton can provide value.

Our work on Wan2.2-T2V-A14B is a continuation of what we started with SDXL: taking complex diffusion pipelines and making them run faster through compilation, kernel optimization, and hardware-aware execution.

We do not aim for day-zero support

We want to be clear about our philosophy.

Paiton does not try to provide day-zero support for every new model the moment it appears online. That is not how we work.

We are European. We like to take our time.

Our goal is not to be first with a fragile implementation. Our goal is to publish support when the model is stable, the runtime is tested, and the performance work has been done properly. We would rather spend more time understanding the model architecture, profiling the actual bottlenecks, and optimizing the execution path than rush out a superficial integration.

For us, support does not simply mean “it runs.”

Support means:

the model runs correctly
the model is profiled properly
the expensive operations are understood
the implementation is optimized
the result is useful in production

That is the standard we want for Paiton.

What this means for Paiton

Supporting Wan2.2-T2V-A14B is another step toward making Paiton a broader AI inference optimization platform.

We started with diffusion models. We expanded into LLMs. Now we are bringing that experience back to video generation and modern diffusion workloads.

The result is a simple but important message:

Paiton can help AMD GPUs compete at the highest level on real AI workloads.

In this benchmark, using the Diffusers library, the AMD MI355X achieved better generation time than the NVIDIA B200 on Wan2.2-T2V-A14B.

For customers building inference services around video generation, image generation, or large multimodal models, this matters. Faster generation means lower latency, better utilization, and lower cost per output.

Paiton is built for exactly that.

Thank you to AMD & Supermicro

We also want to thank AMD and SuperMicro for their continued support.

Building an IR graph compiler and runtime system like Paiton requires close access to modern hardware, a strong software ecosystem, and technical support from people who understand the platform deeply. AMD’s support has helped us test, profile, and optimize Paiton on the latest AMD Instinct GPUs, including the MI355X.

For a European company building high-performance AI infrastructure, this support matters. It allows us to move faster, validate our work on real hardware, and show that AMD GPUs can compete strongly on demanding AI workloads such as large-scale video generation.

Paiton is independent technology, but having access to AMD’s hardware and ecosystem helps us push our optimization work further.

We appreciate the collaboration and look forward to continuing to build high-performance AI inference solutions on AMD GPUs.