cuTile.jl v0.2, the first major update of the Julia package for writing GPU kernels using NVIDIA's cuTile programming model, is available now. Writing high-performance GPU kernels has traditionally required deep expertise in low-level optimization, but tools like cuTile.jl are making the process more accessible, expressive, and powerful for Julia developers.
This release marks a major step forward, bringing significant performance improvements, expanded language support, and a more intuitive programming experience for developers working with GPU workloads.
One of the most notable highlights is the introduction of native Julia for loops, eliminating the need for workarounds and making kernel code more natural to write and read. The update also introduces a new ct.@fpmode macro, giving developers fine-grained control over floating-point behavior.
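To give a flavor of what these two features look like together, here is a minimal sketch of a tiled vector-add kernel that uses a native Julia for loop and the new ct.@fpmode macro. Everything beyond the ct.@fpmode name is an assumption for illustration: the helper calls (ct.bid, ct.num_blocks, ct.load, ct.store), the grid-stride loop pattern, and the :fast option are hypothetical, not confirmed cuTile.jl signatures.

```julia
# Hypothetical sketch, not the confirmed cuTile.jl API: ct.bid, ct.num_blocks,
# ct.load, ct.store, and the :fast fpmode option are illustrative assumptions.
import cuTile as ct

function vadd_kernel(a, b, c, tile_len)
    # Native Julia for loop inside the kernel: each block strides
    # over the tiles it is responsible for.
    for i in ct.bid(1):ct.num_blocks(1):cld(length(c), tile_len)
        ta = ct.load(a, i, (tile_len,))   # load one tile of a
        tb = ct.load(b, i, (tile_len,))   # load one tile of b
        # ct.@fpmode scopes floating-point behavior for the enclosed math
        tc = ct.@fpmode :fast ta + tb
        ct.store(c, i, tc)                # write the result tile back
    end
    return
end
```

The point of the sketch is the shape of the code: the loop is ordinary Julia, with no manual unrolling or macro-based workaround, and floating-point behavior is controlled locally at the expression that needs it rather than globally per kernel.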
Beyond syntax improvements, the release pushes usability further with keyword-based APIs, enhanced debugging using standard print and println, and experimental host-level abstractions that can automatically generate tiled GPU kernels, reducing the need to write explicit kernel code.
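The debugging improvement in particular suggests usage along these lines; again a hedged sketch, where the kernel structure and the ct.bid/ct.load helpers are illustrative assumptions, while the use of standard println is the feature named in the release.

```julia
# Hypothetical sketch: ct.bid and ct.load are assumed helper names,
# shown only to illustrate in-kernel println-based debugging.
import cuTile as ct

function debug_kernel(a, tile_len)
    pid = ct.bid(1)                       # this block's index (assumed helper)
    t = ct.load(a, pid, (tile_len,))      # load this block's tile
    if pid == 1
        # Standard println now works inside kernels, so one block can
        # report a summary value without any device-specific printf setup.
        println("tile 1 sum = ", sum(t))
    end
    return
end
```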
Performance is another major focus. A redesigned multi-pass optimization pipeline dramatically improves the generated code, enabling better instruction-level parallelism, smarter memory handling, and more efficient arithmetic transformations.
To showcase these capabilities, the release includes additional advanced examples such as Mixture of Experts kernels and Flash Multi-Head Attention. Performance of these examples is on par with cuTile Python, demonstrating how cuTile.jl can handle modern AI and HPC workloads.
If you're working with GPU computing in Julia or exploring more efficient ways to build high-performance kernels, this release is worth a closer look.
Read the full blog to dive deeper into the features, benchmarks, and what’s next for cuTile.jl.