cuTile.jl Accelerates GPU Development with AI-Assisted Translation and Major v0.3 Performance Upgrades

Our recent blog post described how cuTile.jl brings NVIDIA’s tile-based GPU programming model to Julia, enabling developers to write high-performance CUDA kernels using intuitive tile abstractions instead of low-level thread management. Now the ecosystem is taking another major step forward, making it easier not only to write new GPU kernels in Julia but also to migrate existing high-performance implementations from Python with AI-assisted workflows.
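
For readers who have not yet seen the model, here is a minimal sketch of what a tile-style vector-add kernel can look like in Julia. The tile load/store helpers and block-index accessor below are assumed for illustration, not taken from the package; see the cuTile.jl documentation for the actual primitives.

```julia
using cuTile   # package name from the post; the API used below is assumed

# A tile-based vector add: each block handles one whole tile of `a` and `b`,
# with no manual thread indexing. `blockidx`, `load`, and `store` are
# hypothetical stand-ins for the real cuTile.jl primitives.
function vadd_kernel(a, b, c, tile_len)
    bid = cuTile.blockidx()                  # assumed: which tile this block owns
    ta  = cuTile.load(a, bid, (tile_len,))   # assumed: load an entire tile at once
    tb  = cuTile.load(b, bid, (tile_len,))
    cuTile.store(c, bid, ta .+ tb)           # elementwise add over the whole tile
    return
end
```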

A companion post on the NVIDIA Developer Blog explores how developers can systematically translate cuTile Python kernels into cuTile.jl using structured AI “skills” within the TileGym repository. The workflow formalizes the conversion process for kernels such as matrix multiplication and softmax, helping developers avoid subtle “semantic traps” where code compiles successfully but produces incorrect results.

These differences, including Julia’s 1-based indexing, column-major memory layout, and explicit broadcasting syntax, can easily introduce silent bugs during manual translation. To address this, the workflow combines API mappings, static validators, debugging guides, and reusable examples into a repeatable AI-assisted process that validates kernels before they ever reach the GPU.
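
To make those traps concrete, the following plain-Julia snippet (no GPU required) demonstrates each of the three differences:

```julia
# 1) 1-based indexing: the first element is x[1], not x[0].
x = collect(1:8)
@assert x[1] == 1            # x[0] would throw a BoundsError

# 2) Column-major layout: reshape fills columns first, so the element
#    adjacent in memory to A[1, 1] is A[2, 1] (one row down).
A = reshape(1:6, 2, 3)       # 2x3 matrix: [1 3 5; 2 4 6]
@assert A[2, 1] == 2

# 3) Explicit broadcasting: elementwise operations need a dot.
y = x .+ 1                   # elementwise add; plain `x + 1` is a MethodError
z = sin.(x)                  # elementwise function application
```

A Python kernel that silently assumes 0-based, row-major tiles will compile under a naive translation but walk memory in the wrong order, which is exactly the class of bug the validators are designed to catch before the kernel reaches the GPU.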

The result is a dramatically faster and more reliable path for bringing battle-tested GPU kernels into the Julia ecosystem, allowing developers in scientific computing, physics simulation, and AI to leverage NVIDIA Blackwell and Hopper architectures directly from Julia with minimal friction.

At the same time, cuTile.jl itself continues to mature rapidly. The newly released cuTile.jl v0.3 significantly improves both performance and usability, bringing tile-based GPU programming even closer to the standard Julia workflow.

One of the biggest updates is deep integration with CUDA.jl, allowing developers to launch cuTile kernels with the familiar @cuda macro simply by specifying backend=cuTile. The release also substantially reduces “time-to-first-launch” latency, bringing tile kernel compilation overhead much closer to that of traditional CUDA.jl kernels.
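
Only the backend=cuTile keyword comes from the release announcement; the rest of this launch sketch, which reuses the hypothetical vadd_kernel from above, is illustrative:

```julia
using CUDA
using cuTile

a = CUDA.rand(Float32, 4096)
b = CUDA.rand(Float32, 4096)
c = similar(a)

# Launch a tile kernel through the standard CUDA.jl entry point.
# The kernel and launch configuration are assumptions for illustration;
# only the `backend=cuTile` keyword is documented in the v0.3 release.
tile_len = 256
@cuda backend=cuTile blocks=cld(length(a), tile_len) vadd_kernel(a, b, c, tile_len)
```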

Performance Parity and Beyond

With a new dataflow framework and enhanced IR-level optimization pipeline, cuTile.jl v0.3 now matches or outperforms cuTile Python across every shipped benchmark. Several workloads demonstrate substantial performance gains, including:

  • Attention (FMHA): 63% faster than cuTile Python

  • Layer Norm (Forward): 37% faster than cuTile Python

  • Mixture of Experts: 34% faster than cuTile Python

Beyond performance improvements, v0.3 introduces production-ready features for scientific and AI workloads, including support for array slicing with @view and a new tile-vectorized Philox-based random number generator for high-performance stochastic workloads.
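
Since @view is the standard Julia slicing macro, its semantics carry over directly; here is a minimal host-side sketch of the no-copy behavior that slicing support brings to kernels (the Philox RNG API is described in the release notes and is not shown here):

```julia
A = rand(Float32, 1024, 1024)

# `@view` creates a SubArray that aliases A's memory instead of copying;
# the same no-copy slicing is what v0.3 enables inside tile kernels.
block = @view A[:, 1:128]
block[1, 1] = 0f0
@assert A[1, 1] == 0f0       # writes through the view to the parent array
```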

Together, these advances represent an important milestone for GPU computing in Julia: developers can now more easily migrate optimized kernels from Python while benefiting from a rapidly maturing runtime that delivers world-class performance directly within the Julia ecosystem.

Want to see cuTile.jl in action? Join Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) for a joint webinar on May 12, 2026, at 1 PM ET, where they will explore the design of CUDA Tile, the architecture of cuTile.jl, and practical GPU programming examples.

Read the full NVIDIA technical blog on AI-assisted kernel translation, explore the cuTile.jl v0.3 release notes, and sign up for the upcoming webinar.

Authors

JuliaHub, formerly Julia Computing, was founded in 2015 by the four co-creators of Julia (Dr. Viral Shah, Prof. Alan Edelman, Dr. Jeff Bezanson and Stefan Karpinski) together with Deepak Vinchhi and Keno Fischer. Julia is the fastest and easiest high-productivity language for scientific computing. Julia is used by over 10,000 companies and over 1,500 universities. Julia’s creators won the prestigious James H. Wilkinson Prize for Numerical Software and the Sidney Fernbach Award.

