Instance-Specific Kernel Compilation with NVIDIA CUDA

Abstract

Libraries of graphics processing unit (GPU) kernel implementations offer the potential for high performance and rapid deployment at runtime. For the widest applicability, such kernels should perform well across a wide range of problem parameters and across varying GPU hardware. However, writing highly adaptable NVIDIA CUDA kernels can be very difficult for certain classes of problems, owing to complications that arise from both the CUDA abstraction itself and the characteristics of the available NVIDIA hardware. We are examining instance-specific compilation (ISC) as a technique to mitigate these problems. With ISC, a custom CUDA GPU binary is compiled once both the problem parameters and the target hardware are known. Where applicable, ISC can provide the performance and programmability benefits associated with hard-coded kernels while offering significantly more flexibility. Adding CUDA implementation-only parameters, which are unrelated to the problem parameters, allows ISC implementations to adapt better to both problem and hardware variations. Using a custom application framework that automates the ISC process, we have applied the approach to several applications, including particle image velocimetry, large template matching in the time domain, and tomographic backprojection.
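
As a rough illustration of the idea, and not the framework described in this work, the following C++ sketch shows one way a host program might assemble an nvcc command line once a problem instance and target GPU are known. The names InstanceConfig, build_nvcc_command, MASK_WIDTH, and BLOCK_THREADS are hypothetical, as is the toy kernel; only the nvcc options -cubin, -arch, -D, and -o are standard. The problem parameter and the implementation-only parameter both reach the kernel as preprocessor constants, so the compiler can unroll loops and size the launch as it would for a hard-coded kernel.

```cpp
#include <cassert>
#include <string>

// Hypothetical kernel source: MASK_WIDTH and BLOCK_THREADS are not defined
// here; they are supplied at compile time through -D options.
static const char* kKernelSource = R"cuda(
__global__ void window_sum(const float* in, float* out, int n) {
    int i = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    if (i + MASK_WIDTH > n) return;
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < MASK_WIDTH; ++k) acc += in[i + k];
    out[i] = acc;
}
)cuda";

// Hypothetical description of one problem instance plus the target GPU.
struct InstanceConfig {
    int mask_width;     // problem parameter (assumed for illustration)
    int block_threads;  // implementation-only parameter (assumed)
    int sm_arch;        // target compute capability, e.g. 80 for sm_80
};

// Builds the nvcc invocation that would produce an instance-specific cubin.
// The caller is assumed to have written kKernelSource to the file `source`.
std::string build_nvcc_command(const InstanceConfig& cfg,
                               const std::string& source,
                               const std::string& output) {
    assert(cfg.mask_width > 0 && cfg.block_threads > 0 && cfg.sm_arch > 0);
    return "nvcc -cubin -arch=sm_" + std::to_string(cfg.sm_arch) +
           " -DMASK_WIDTH=" + std::to_string(cfg.mask_width) +
           " -DBLOCK_THREADS=" + std::to_string(cfg.block_threads) +
           " -o " + output + " " + source;
}
```

In such a scheme, the resulting string could be passed to the system shell and the generated binary loaded through the CUDA driver API; a different instance or GPU would simply yield a different command and a different binary.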