Analyzing Power Efficiency of Optimization Techniques and Algorithm Design Methods for Applications on Heterogeneous Platforms
Graphics processing units (GPUs) have become widely accepted as the computing platform of choice in many high performance computing domains. The availability of programming standards such as OpenCL are used to leverage the inherent parallelism offered by GPUs. Source code optimizations such as loop unrolling and tiling when targeted to heterogeneous applications have reported large gains in performance. However, given the power consumption of GPUs, platforms can exhaust their power budgets quickly. Better solutions are needed to effectively exploit the power-efficiency available on heterogeneous systems. In this work, we evaluate the power/performance efficiency of different optimizations used on heterogeneous applications. We analyze the power/performance trade-off by evaluating energy consumption of the optimizations. We compare the performance of different optimization techniques on four different fast Fourier transform implementations. Our study covers discrete GPUs, shared memory GPUs (APUs) and low power system-on-chip (SoC) devices, and includes hardware from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Kepler), Intel (Ivy Bridge) and Qualcomm (Snapdragon S4) as test platforms. The study identifies the architectural and algorithmic factors which can most impact power consumption. We explore a range of application optimizations which show an increase in power consumption by 27%, but result in more than 1.8 × increase in speed of performance. We observe up to an 18% reduction in power consumption due to reduced kernel calls across FFT implementations. We also observe an 11% variation in energy consumption among different optimizations. We highlight how different optimizations can improve the execution performance of a heterogeneous application, but also impact the power efficiency of the application. More importantly, we demonstrate that different algorithms implementing the same fundamental function (FFT) can perform with vast differences based on the target hardware and associated application design.
International Journal of High Performance Computing Applications (IJHPCA)