cuTile.jl Accelerates GPU Development with AI-Assisted Translation and Major v0.3 Performance Upgrades

Our recent blog post described how cuTile.jl brings NVIDIA’s tile-based GPU programming model to Julia, enabling developers to write high-performance CUDA kernels using intuitive tile abstractions instead of low-level thread management. Now the ecosystem is taking another major step forward, making it easier not only to write new GPU kernels in Julia but also to migrate existing high-performance implementations from Python with AI-assisted workflows.
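
For readers who have not yet seen the model, here is a minimal sketch of what a tile-style vector-add kernel can look like in Julia. The tile load/store helpers and block-index accessor below are assumed for illustration, not taken from the package; see the cuTile.jl documentation for the actual primitives.

```julia
using cuTile   # package name from the post; the API used below is assumed

# A tile-based vector add: each block handles one whole tile of `a` and `b`,
# with no manual thread indexing. `blockidx`, `load`, and `store` are
# hypothetical stand-ins for the real cuTile.jl primitives.
function vadd_kernel(a, b, c, tile_len)
    bid = cuTile.blockidx()                  # assumed: which tile this block owns
    ta  = cuTile.load(a, bid, (tile_len,))   # assumed: load an entire tile at once
    tb  = cuTile.load(b, bid, (tile_len,))
    cuTile.store(c, bid, ta .+ tb)           # elementwise add over the whole tile
    return
end
```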

A companion post on the NVIDIA Developer Blog explores how developers can systematically translate cuTile Python kernels into cuTile.jl using structured AI “skills” within the TileGym repository. The workflow formalizes the conversion process for kernels such as matrix multiplication and softmax, helping developers avoid subtle “semantic traps” where code compiles successfully but produces incorrect results.

These differences, including Julia’s 1-based indexing, column-major memory layout, and explicit broadcasting syntax, can easily introduce silent bugs during manual translation. To address this, the workflow combines API mappings, static validators, debugging guides, and reusable examples into a repeatable AI-assisted process that validates kernels before they ever reach the GPU.
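
To make those traps concrete, the following plain-Julia snippet (no GPU required) demonstrates each of the three differences:

```julia
# 1) 1-based indexing: the first element is x[1], not x[0].
x = collect(1:8)
@assert x[1] == 1            # x[0] would throw a BoundsError

# 2) Column-major layout: reshape fills columns first, so the element
#    adjacent in memory to A[1, 1] is A[2, 1] (one row down).
A = reshape(1:6, 2, 3)       # 2x3 matrix: [1 3 5; 2 4 6]
@assert A[2, 1] == 2

# 3) Explicit broadcasting: elementwise operations need a dot.
y = x .+ 1                   # elementwise add; plain `x + 1` is a MethodError
z = sin.(x)                  # elementwise function application
```

A Python kernel that silently assumes 0-based, row-major tiles will compile under a naive translation but walk memory in the wrong order, which is exactly the class of bug the validators are designed to catch before the kernel reaches the GPU.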

The result is a dramatically faster and more reliable path for bringing battle-tested GPU kernels into the Julia ecosystem, allowing developers in scientific computing, physics simulation, and AI to leverage NVIDIA Blackwell and Hopper architectures directly from Julia with minimal friction.

At the same time, cuTile.jl itself continues to mature rapidly. The newly released cuTile.jl v0.3 significantly improves both performance and usability, bringing tile-based GPU programming even closer to the standard Julia workflow.

One of the biggest updates is deep integration with CUDA.jl, allowing developers to launch cuTile kernels with the familiar @cuda macro simply by specifying backend=cuTile. The release also substantially reduces “time-to-first-launch” latency, bringing tile kernel compilation overhead much closer to that of traditional CUDA.jl kernels.
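
Only the backend=cuTile keyword comes from the release announcement; the rest of this launch sketch, which reuses the hypothetical vadd_kernel from above, is illustrative:

```julia
using CUDA
using cuTile

a = CUDA.rand(Float32, 4096)
b = CUDA.rand(Float32, 4096)
c = similar(a)

# Launch a tile kernel through the standard CUDA.jl entry point.
# The kernel and launch configuration are assumptions for illustration;
# only the `backend=cuTile` keyword is documented in the v0.3 release.
tile_len = 256
@cuda backend=cuTile blocks=cld(length(a), tile_len) vadd_kernel(a, b, c, tile_len)
```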

Performance Parity and Beyond

With a new dataflow framework and enhanced IR-level optimization pipeline, cuTile.jl v0.3 now matches or outperforms cuTile Python across every shipped benchmark. Several workloads demonstrate substantial performance gains, including:

  • Attention (FMHA): 63% faster than cuTile Python

  • Layer Norm (Forward): 37% faster than cuTile Python

  • Mixture of Experts: 34% faster than cuTile Python

Beyond performance improvements, v0.3 introduces production-ready features for scientific and AI workloads, including support for array slicing with @view and a new tile-vectorized Philox-based random number generator for high-performance stochastic workloads.
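
Since @view is the standard Julia slicing macro, its semantics carry over directly; here is a minimal host-side sketch of the no-copy behavior that slicing support brings to kernels (the Philox RNG API is described in the release notes and is not shown here):

```julia
A = rand(Float32, 1024, 1024)

# `@view` creates a SubArray that aliases A's memory instead of copying;
# the same no-copy slicing is what v0.3 enables inside tile kernels.
block = @view A[:, 1:128]
block[1, 1] = 0f0
@assert A[1, 1] == 0f0       # writes through the view to the parent array
```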

Together, these advances represent an important milestone for GPU computing in Julia: developers can now more easily migrate optimized kernels from Python while benefiting from a rapidly maturing runtime that delivers world-class performance directly within the Julia ecosystem.

Want to see cuTile.jl in action? Join Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) for a joint webinar on May 12, 2026, at 1 PM ET, where they will explore the design of CUDA Tile, the architecture of cuTile.jl, and practical GPU programming examples.

Read the full NVIDIA technical blog on AI-assisted kernel translation, explore the cuTile.jl v0.3 release notes, and sign up for the upcoming webinar.

Authors

JuliaHub, formerly Julia Computing, was founded in 2015 by the four co-creators of Julia (Dr. Viral Shah, Prof. Alan Edelman, Dr. Jeff Bezanson and Stefan Karpinski) together with Deepak Vinchhi and Keno Fischer. Julia is the fastest and easiest high-productivity language for scientific computing. Julia is used by over 10,000 companies and over 1,500 universities. Julia’s creators won the prestigious James H. Wilkinson Prize for Numerical Software and the Sidney Fernbach Award.

