Tim Besard, Software Engineer at JuliaHub, recently announced an early release of cuTile.jl, a Julia package that makes it possible to program CUDA GPUs using NVIDIA's new tile-based programming abstraction.
Traditionally, programming GPUs with CUDA requires developers to meticulously manage low-level hardware details such as individual threads, warps, and memory hierarchies. With the introduction of CUDA Tile, you instead describe operations on tiles (chunks) of data, and the compiler automatically handles the complex hardware mapping, unlocking access to specialized hardware like Tensor Cores without the usual boilerplate.
This powerful new programming model is now also available in Julia with the cuTile.jl package, greatly simplifying the development of high-performance Julia GPU kernels. For example, a matrix multiplication kernel that reaches 75% of cuBLAS performance is now as simple as:
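To give a flavor of the model, here is an illustrative sketch of a tile-based matrix multiply. Every `ct.*` call, the `ct.Constant` tile-size annotation, and the 1-based block-index convention below are assumptions modeled on the cuTile Python API, not verified cuTile.jl signatures; treat this as pseudocode and consult the repository examples for the real kernel:

```julia
import cuTile as ct  # assumed alias, mirroring Python's `import cuda.tile as ct`

# Illustrative sketch: each block computes one (tm × tn) tile of C.
# All `ct.*` names are assumptions; see the cuTile.jl examples for the actual API.
function matmul_kernel(A, B, C, tm::ct.Constant, tn::ct.Constant, tk::ct.Constant)
    bi, bj = ct.bid(1), ct.bid(2)              # 1-based block indices (assumed)
    acc = ct.zeros(Float32, (tm, tn))          # accumulator tile
    for k in 1:cld(size(A, 2), tk)             # walk the K dimension tile by tile
        a = ct.load(A, (bi, k), (tm, tk))      # load one tile of A
        b = ct.load(B, (k, bj), (tk, tn))      # load one tile of B
        acc += ct.mma(a, b)                    # tile matmul-accumulate (maps to Tensor Cores)
    end
    ct.store(C, (bi, bj), acc)                 # write the result tile
end
```

Note what is absent: no thread indices, no shared-memory staging, no warp-level synchronization. The kernel is written purely in terms of tiles, and the compiler derives the hardware mapping.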
The code above is slightly simplified for the sake of brevity; for the full version please refer to the examples in the cuTile.jl repository.
Key Takeaways from the Release:
Idiomatic Julia Syntax: cuTile.jl allows you to write high-performance kernels using familiar Julia conventions. You can use standard functions (sum, sqrt), 1-based indexing, and native broadcast syntax (.^, .-) directly in your GPU code.
Python Parity: The package shares the same abstraction level and Tile IR backend as cuTile Python. This makes code easily portable between the two languages, and both compile to the same highly optimized GPU machine code.
Top-Tier Performance: In early benchmarks on NVIDIA Blackwell GPUs (RTX 5080), cuTile.jl achieves performance parity with its Python counterpart on kernels such as vector addition and matrix multiplication.
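To make the idiomatic-syntax point concrete, a row-normalization kernel might look like the following. The `ct.*` calls are again assumptions modeled on the cuTile Python API, while `sum`, `sqrt`, and the dotted broadcasts are plain Julia:

```julia
import cuTile as ct  # assumed alias

# Hypothetical RMS-normalization kernel, one tile per block: the tile math
# is written with ordinary Julia functions and broadcast syntax.
function rmsnorm_kernel(X, Y, tile::ct.Constant)
    i = ct.bid(1)                    # 1-based block index (assumed API)
    x = ct.load(X, i, tile)          # load this block's tile
    r = sqrt(sum(x .^ 2) / tile)     # standard `sum`, `sqrt`, and `.^`
    ct.store(Y, i, x ./ r)           # broadcasted divide, then store
end
```

The body reads like scalar Julia code operating on an array, which is exactly the point of the abstraction.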
If you are already writing GPU code in Julia using CUDA.jl, transitioning to this tile-based abstraction is designed to be straightforward and highly rewarding for complex algorithms.
Read the full technical breakdown and learn how to get started on the NVIDIA Technical Blog.