The CUDA.jl package is the main programming interface for working with NVIDIA CUDA GPUs using Julia. It features a user-friendly array abstraction, a compiler for writing CUDA kernels in Julia, and wrappers for various CUDA libraries.
The recent CUDA.jl 5.4 release features many memory-management enhancements that improve the performance of memory-heavy applications and make it easier to use heterogeneous setups, whether with multiple GPUs or with a mix of CPU and GPU.
CUDA.jl v5.4 should be compatible with existing codebases, as it bumps only the minor version. The bulk of the release focuses on memory management features. Let’s look at the highlights of this release:
Eager garbage collection
Since Julia is a garbage-collected language, GPU allocations can fail if garbage piles up, necessitating a collection cycle. Earlier versions of CUDA.jl managed this at the allocation site, detecting out-of-memory errors and triggering garbage collection. This was far from ideal, as it could lead to significant pauses and bloated memory usage.
To mitigate this issue, CUDA.jl v5.4 tracks memory usage more accurately and uses that information to trigger the GC early, for example while waiting for a kernel to finish. This has the dual advantage of distributing the cost of garbage collection over time and potentially masking it behind other operations, ultimately leading to more predictable performance.
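As a rough illustration of the allocation pattern described above, the sketch below creates many short-lived GPU arrays in a loop and then inspects memory usage. It does not show the eager collection heuristics themselves, which run inside the package. The helper name `churn` and the sizes are made up for this example, and the manual `GC.gc()` and `CUDA.reclaim()` calls are there only to make the effect of a collection visible.

```julia
using CUDA

# Hypothetical helper: allocate a fresh temporary on every iteration,
# the kind of pattern that lets GPU garbage pile up between collections.
function churn(n_iters::Integer, len::Integer)
    total = 0f0
    for _ in 1:n_iters
        tmp = CUDA.rand(Float32, len)  # temporary GPU array, unreachable after this iteration
        total += sum(tmp)              # GPU reduction returning a host scalar
    end
    return total
end

if CUDA.functional()
    churn(100, 1_000_000)
    CUDA.memory_status()  # print device memory and pool usage after the loop
    GC.gc()               # force a collection by hand, for comparison
    CUDA.reclaim()        # hand cached pool memory back to the driver
    CUDA.memory_status()  # print usage again after collecting
end
```

Comparing the two `CUDA.memory_status()` reports gives a feel for how much memory dead temporaries hold at a given point. With eager collection, the first report would typically be expected to stay lower than it otherwise would.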
Tracked memory allocations
When using multiple GPUs, it is important to differentiate between the device that memory was allocated on, and the device used to execute code.
Now, CUDA.jl 5.4 tracks the device that owns the memory, and the stream last used to access it. This allows the package to handle memory usage in kernels or library functions correctly while keeping the user in control. It is especially valuable when using multiple GPUs, or when using multiple streams to more effectively use individual GPUs.
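To make the distinction between the owning device and the executing device concrete, here is a minimal sketch that allocates an array on one GPU and then tries to use it while another GPU is active. The function name `cross_device_increment` is hypothetical, and the device indices 0 and 1 assume a machine with at least two GPUs. Whether the direct use succeeds is assumed to depend on the hardware (for instance, on peer access between the two devices), so the sketch falls back to staging the data through host memory.

```julia
using CUDA

# Hypothetical helper: allocate on device `src`, then use the array
# while device `dst` is the active one.
function cross_device_increment(src::Integer=0, dst::Integer=1)
    device!(src)
    a = CUDA.ones(Float32, 4)  # memory owned by device `src`
    host_copy = Array(a)       # downloaded while `src` is still active

    device!(dst)               # later GPU work in this task targets `dst`
    result = try
        Array(a .+ 1f0)        # direct use of memory owned by another device
    catch err
        # Assumed fallback when direct access is not possible:
        # upload the host copy to `dst` and compute there.
        @warn "Direct use failed, staging through host memory" exception = err
        Array(CuArray(host_copy) .+ 1f0)
    end
    device!(src)               # switch back to `src` before returning
    return result
end

if CUDA.functional() && length(CUDA.devices()) >= 2
    @show cross_device_increment()
end
```

The explicit fallback keeps the example usable on systems where the two GPUs cannot reach each other's memory. On a single-GPU machine the guarded call is skipped entirely.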
Unified memory iteration
In CUDA, unified memory allows memory to be accessed from both the CPU and the GPU. CUDA.jl 5.4 greatly improves the performance of using unified memory with CPU code. This feature is useful for incrementally porting code to the GPU without worrying about the performance of accessing memory from the CPU.
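The incremental-porting workflow mentioned above might look like the following sketch, where one step already runs on the GPU and another is still an ordinary Julia loop on the CPU. The helper `host_sum` is hypothetical. The array is created with `cu(...; unified=true)`, and the sketch assumes that reading the array from the CPU waits for outstanding GPU work on it.

```julia
using CUDA

# Hypothetical CPU-side helper: a plain element-by-element loop,
# standing in for code that has not been ported to the GPU yet.
function host_sum(xs)
    acc = zero(eltype(xs))
    for x in xs
        acc += x
    end
    return acc
end

if CUDA.functional()
    a = cu(Float32[1, 2, 3, 4]; unified=true)  # array backed by unified memory
    a .*= 2f0                                  # ported step: broadcast runs on the GPU
    s = host_sum(a)                            # unported step: iterates from the CPU
    @show s
end
```

Because `host_sum` is generic, the same function accepts a regular `Array`, which makes it easy to compare the CPU-only and unified-memory paths while porting.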
Other notable changes
CUDA.jl v5.4 includes many other changes, such as:
- Initial support for automatic differentiation of heterogeneous host/device code using Enzyme.jl
- CUDA.@profile now automatically detects external profilers
- Improvements in exception output
- Improved handling of cached library handles under memory pressure
- Tegra devices are now supported by our artifacts
- Support for CUDA 12.5 has been added, as well as initial support for Julia 1.12
You can access the full feature list and enhancements here.