Last Thursday, I had my 4th interview for an architecture intern position at Nvidia. Three questions from it are worth sharing: first the two questions they asked me, then the question I asked them.
Questions from the interviewer:
Q1: Why does a GPU run so many threads?
A1: To hide memory latency.
This question is actually easy, but the next one he asked really made me suffer.
Q2: Assume memory access latency on a CPU (GPU) is 100 cycles, and the memory system can return data for one request every 4 cycles. What is the minimum number of threads (warps) needed to fully hide the latency?
A2: According to Little’s Law:
Capacity = Latency * Throughput,
the capacity of this memory subsystem is 100 × (1/4) = 25 outstanding requests. Therefore, the minimum number of threads to fully hide the latency is 25.
- To explain Little’s Law, we can use the water pipe example shown in the figure below. Assume the length of the pipe (L) is 8 m, the cross-section area (S) is 2 m², and the speed of the water flow (v) is 0.5 m/s. In other words, the latency equals L / v = 8 / 0.5 = 16 s. The throughput is S × v = 2 × 0.5 = 1 m³/s. Hence, according to Little’s Law, the capacity is Latency × Throughput = 16 × 1 = 16 m³ — which is just L × S, the volume of the pipe. Here capacity means the maximum amount of water the pipe can hold.
- Similar to the water pipe example above, the memory system also has a capacity: the maximum number of requests it can service at the same time. Therefore, if more threads try to send requests, they will be delayed.
- During the interview, I thought that since the latency is 100 cycles, 100 threads would be needed to fully hide it, assuming one thread issues a memory request each cycle. However, I failed to take the capacity of the memory system into account.
- According to A2, the memory system can sustain a request rate of at most 1 request per 4 cycles.
- Later I asked the interviewer whether they use this formula to decide the maximum number of warps supported per SM during the design stage. The answer was YES.
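The reasoning in A2 can be checked with a toy simulation. This is only a sketch under assumed parameters (a single memory port that accepts one request every 4 cycles, each request returning 100 cycles after acceptance, and each thread re-issuing as soon as its data comes back), not Nvidia's actual memory model:

```python
import heapq

def achieved_throughput(n_threads, latency=100, issue_interval=4, horizon=100_000):
    """Simulate threads that each keep one outstanding memory request.

    The memory port accepts at most one new request every `issue_interval`
    cycles; each request completes `latency` cycles after it is accepted.
    Returns the achieved rate in requests per cycle over `horizon` cycles.
    """
    ready = [(0, i) for i in range(n_threads)]  # (cycle thread can issue, id)
    heapq.heapify(ready)
    port_free = 0        # next cycle the memory port can accept a request
    issued = 0
    while ready:
        ready_at, tid = heapq.heappop(ready)
        issue = max(ready_at, port_free)
        if issue >= horizon:
            continue     # past the simulation window; retire this thread
        port_free = issue + issue_interval
        issued += 1
        # Thread can issue again once its data returns.
        heapq.heappush(ready, (issue + latency, tid))
    return issued / horizon

print(achieved_throughput(10))  # too few threads: port sits idle
print(achieved_throughput(25))  # exactly enough to saturate the port
print(achieved_throughput(40))  # extra threads cannot exceed the port rate
```

With 25 threads the port issues a request every 4 cycles (rate 0.25), matching the Little's Law answer; 10 threads achieve only 10/100 = 0.1, and more than 25 threads gain nothing.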
Question to the interviewer:
Q3: For a company like Nvidia, what are the most important factors to consider when designing a new generation of GPU architecture?
A3: There are primary and secondary factors.
Primary: Profile the target application to get its operational intensity, i.e. Flops/Byte.
As a company, the goal is to make products that customers want. Nvidia's customers want a GPU that runs their applications faster. To achieve that, Nvidia profiles the target application to get its computation demand (Flops/sec) and memory demand (Bytes/sec). From these two demands, Nvidia can derive the operational intensity, measured in Flops/Byte. This metric guides the design of the new GPU architecture, because it hints at whether to give priority to the compute components or to the memory subsystem.
Secondary: Due to the time limit, the interviewer didn't elaborate on these.
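To make the primary factor concrete: operational intensity is just the ratio of the profiled compute demand to the profiled memory demand. The numbers below are hypothetical, purely for illustration:

```python
def operational_intensity(flops_per_sec, bytes_per_sec):
    """Operational intensity (Flops/Byte) from profiled application demands."""
    return flops_per_sec / bytes_per_sec

# Hypothetical profile: 600 GFlop/s of compute demand, 400 GB/s of memory demand
oi = operational_intensity(600e9, 400e9)
print(oi)  # -> 1.5 Flops/Byte
```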
- The moment I heard A3, one term struck my mind immediately: the Roofline Model. Given a certain architecture, this model tells the attainable performance upper bound for applications with different operational intensities (see figure below). Using this model, programmers can easily identify whether their application is compute-bound or memory-bound. Based on that result, the model suggests different types of optimization techniques to reach the upper bound.
- Relating the roofline model to the interviewer's answer (A3), I could guess how architects at Nvidia apply this principle: if the target application is memory-bound on the old architecture, more effort would be spent improving the memory subsystem, and vice versa.
- The day after the interview, I chatted with Hamed about this, and he wondered how they handle the parallelism in the target application. Damn it! :D I should have asked this question, since Hamed and I have gone through a couple of discussions on this topic.
- On several occasions, Prof. Schirner and Hamed both revealed their dissatisfaction with companies' design methodologies being completely customer-driven. They even argued: "It's really sad, because companies would have made much greater products had they spent some effort on improving their design methodology."
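The roofline bound itself is a one-liner: attainable performance is the minimum of the machine's peak compute rate and the product of operational intensity and peak memory bandwidth. The peak numbers below are arbitrary placeholders, not any real GPU's specifications:

```python
def attainable_gflops(oi, peak_gflops=900.0, peak_gbps=300.0):
    """Roofline model: attainable performance (GFlop/s) for an application
    with operational intensity `oi` (Flops/Byte)."""
    return min(peak_gflops, oi * peak_gbps)

# The "ridge point" is the OI above which the application becomes compute-bound.
ridge_point = 900.0 / 300.0  # 3.0 Flops/Byte for these placeholder peaks

print(attainable_gflops(1.0))   # memory-bound: capped at 1.0 * 300 = 300.0
print(attainable_gflops(10.0))  # compute-bound: capped at peak, 900.0
```

An application left of the ridge point benefits most from memory-subsystem improvements, one right of it from compute improvements, which is exactly the design trade-off A3 describes.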
References:
- Aater Suleman, Clarifying Throughput vs. Latency. page link
- Ali Hussain, Little’s Law – An insight on the relation between latency and throughput. page link
- Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (April 2009), 65-76. pdf link
- Thanks Nasibeh for giving a lot of useful comments for this post.