Northeastern University's (NU) high-performance computing cluster, Discovery, provides access to 45,000+ CPU cores and 200+ GPUs. Resource requests on Discovery can range up to a few thousand CPUs per job, producing wide variation in wait (queue) times, from seconds to days. The growing number of shared resources (CPUs, GPUs, node architecture types) and the diversity of job requests are important factors shaping the cluster's resource usage patterns. To improve these patterns, it is necessary to determine the relationship between a job's requested resources and its queue time. Currently, the job scheduler can provide an estimated queue time, but only after the job has been submitted. Additionally, the scheduler works as a black box, exposing no relationship between queue times and requested resources.
We address these issues by building a machine learning model to analyze 6 million+ records generated by 1,300+ Discovery users. The model establishes a relationship between queue time and requested resources and predicts the queue time from those resources. Our findings indicate that queue time depends on the user's priority (assigned by the scheduler), the requested CPU hours, the partition to which the job is submitted, and the time of year and time of day when the job is submitted. Our prediction will help users estimate their queue time even before their job is submitted and will allow the scheduler to allocate resources more efficiently. The overall approach establishes a foundation for future modeling and effective tuning of the job scheduling system.
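To make the modeling task concrete, the sketch below fits a simple regression of queue time on the feature set named above (user priority, requested CPU hours, partition, time of year, time of day). This is a minimal illustration on synthetic data, not the paper's actual model, dataset, or feature encoding; all variable names and coefficients here are hypothetical.

```python
import numpy as np

# Synthetic stand-in for the Discovery job records described in the abstract.
rng = np.random.default_rng(0)
n = 500

priority  = rng.uniform(0, 1, n)      # scheduler-assigned user priority (hypothetical scale)
cpu_hours = rng.uniform(1, 1000, n)   # requested CPU hours
partition = rng.integers(0, 4, n)     # partition id (categorical, 4 hypothetical partitions)
month     = rng.integers(1, 13, n)    # time of year
hour      = rng.integers(0, 24, n)    # time of day at submission

# One-hot encode the categorical partition feature.
part_onehot = np.eye(4)[partition]

X = np.column_stack([np.ones(n), priority, cpu_hours, month, hour, part_onehot])

# Synthetic target: queue time (seconds) grows with requested CPU hours,
# shrinks with priority, and shifts with submission hour, plus noise.
y = 60 * cpu_hours - 2000 * priority + 50 * hour + rng.normal(0, 100, n)

# Ordinary least squares fit (a simple stand-in for the paper's ML model).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

A model of this shape can score a hypothetical job before submission, addressing the abstract's point that the scheduler only reports an estimate after a job is already queued.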