cuTile.jl: Bringing NVIDIA's Tile-Based GPU Programming to Julia

Date Published

Mar 6, 2026

Tim Besard, Software Engineer at JuliaHub, recently announced an early release of cuTile.jl, a Julia package for programming CUDA GPUs through NVIDIA's new tile-based abstraction.

Traditionally, programming GPUs with CUDA requires developers to meticulously manage low-level hardware details like individual threads, warps, and memory hierarchies. With the introduction of CUDA Tile, you can now describe operations on tiles (chunks) of data. The compiler automatically handles the complex hardware mapping, unlocking access to specialized hardware like Tensor Cores without the boilerplate.
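To make the model concrete, here is a minimal sketch of what a tile-style vector-addition kernel could look like. The `ct.bid`/`ct.load`/`ct.store` calls mirror the matmul example below, but the 1-D signatures used here are an assumption, not confirmed API:

```julia
# Hedged sketch: tile-based vector addition. API usage is inferred from the
# matmul example in this post; the 1-D load/store forms are assumptions.
import cuTile as ct

function vadd_kernel(a::ct.TileArray{T,1}, b::ct.TileArray{T,1},
                     c::ct.TileArray{T,1}, tile::Int) where {T}
    i = ct.bid(1)                    # which tile this block processes
    ta = ct.load(a, (i,), (tile,))   # load one tile from each input
    tb = ct.load(b, (i,), (tile,))
    ct.store(c, (i,), ta .+ tb)      # elementwise add on whole tiles
    return nothing
end
```

Note that there is no per-thread index arithmetic: each block works on a whole tile, and the compiler decides how threads cooperate to move and combine the data.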

This powerful new programming model is now also available in Julia through the cuTile.jl package, greatly simplifying the development of high-performance Julia GPU kernels. For example, a matrix multiplication kernel that reaches 75% of cuBLAS performance is now as simple as:

function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2},
                       C::ct.TileArray{T,2},
                       tm::Int, tn::Int, tk::Int) where {T}
    M = size(A, 1)
    N = size(B, 2)
    K = ct.num_tiles(A, 2, (tm, tk))       # number of tiles along the K dimension
    m, n = ct.bid(1), ct.bid(2)            # this block's output tile coordinates
    # K reduction loop - accumulate partial products
    acc = ct.full((tm, tn), zero(T), T)
    k = Int32(1)
    while k <= K
        a = ct.load(A, (m, k), (tm, tk))
        b = ct.load(B, (k, n), (tk, tn))
        acc = muladd(a, b, acc)
        k += Int32(1)
    end
    ct.store(C, (m, n), convert(ct.Tile{T}, acc))
    return nothing
end

The code above is slightly simplified for the sake of brevity; for the full version please refer to the examples in the cuTile.jl repository.
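For context, launching such a kernel might look roughly like the following. The `ct.launch` entry point, its argument order, and the tile sizes are hypothetical placeholders, not the package's confirmed API; consult the cuTile.jl repository for the real invocation:

```julia
# Hypothetical launch sketch; see the cuTile.jl examples for the actual API.
# `ct.launch` and its signature are assumptions for illustration only.
using CUDA
import cuTile as ct

A = CUDA.rand(Float32, 1024, 1024)
B = CUDA.rand(Float32, 1024, 1024)
C = CUDA.zeros(Float32, 1024, 1024)

tm, tn, tk = 128, 128, 64
grid = (cld(1024, tm), cld(1024, tn))   # one block per output tile of C
ct.launch(matmul_kernel, grid, A, B, C, tm, tn, tk)
```

The key point is the grid shape: you launch one block per output tile and let the compiler map tile operations onto threads, warps, and Tensor Cores.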

Key Takeaways from the Release:

  • Idiomatic Julia Syntax: cuTile.jl allows you to write high-performance kernels using familiar Julia conventions. You can use standard functions (sum, sqrt), 1-based indexing, and native broadcast syntax (.^, .-) directly in your GPU code.

  • Python Parity: The package shares the exact same abstraction level and Tile IR backend as cuTile Python. This means code is easily portable between the two languages, and both produce the same highly optimized GPU machine code.

  • Top-Tier Performance: In early benchmarks on NVIDIA Blackwell GPUs (RTX 5080), cuTile.jl achieves performance parity with its Python counterpart on kernels such as vector addition and matrix multiplication.
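As a sketch of the first point, standard Julia functions and dot-broadcast syntax could be applied directly to tiles along these lines (illustrative only; the load/store signatures are assumed from the matmul example above):

```julia
# Illustrative sketch: plain Julia functions and broadcasting on tiles.
import cuTile as ct

function norm_kernel(x::ct.TileArray{T,1}, y::ct.TileArray{T,1},
                     tile::Int) where {T}
    i = ct.bid(1)
    t = ct.load(x, (i,), (tile,))
    ct.store(y, (i,), sqrt.(t .^ 2 .+ one(T)))  # sqrt, .^, .+ on a whole tile
    return nothing
end
```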

If you are already writing GPU code in Julia using CUDA.jl, transitioning to this tile-based abstraction is designed to be straightforward and highly rewarding for complex algorithms.

Read the full technical breakdown and learn how to get started on the NVIDIA Technical Blog.

Authors

JuliaHub, formerly Julia Computing, was founded in 2015 by the four co-creators of Julia (Dr. Viral Shah, Prof. Alan Edelman, Dr. Jeff Bezanson and Stefan Karpinski) together with Deepak Vinchhi and Keno Fischer. Julia is the fastest and easiest high productivity language for scientific computing. Julia is used by over 10,000 companies and over 1,500 universities. Julia’s creators won the prestigious James H. Wilkinson Prize for Numerical Software and the Sidney Fernbach Award.
