Discovery is Northeastern University's (NU) high-performance computing cluster, serving more than 1,300 users, including principal investigators, students, and research staff. Its user base is growing constantly, and it receives close to 10,000 job requests per day with varying resource requirements (memory, CPU count, time, etc.). The resulting workload is diverse, with a wide range of usage patterns, which makes it harder to utilize the cluster's resources efficiently. To improve resource utilization, it is essential to understand and classify these usage patterns for all individuals and groups. The cluster currently sets a memory request limit of 500 gigabytes (GB) per node per job, so a user can request up to 500 GB of memory on one node and up to a few terabytes (TB) across multiple nodes. However, memory utilization for a majority of jobs is less than 10%, indicating that most of the memory being requested is left unused.
To address this issue, we have developed a data analytics framework and a machine learning (ML) model that predicts the memory required for a job submitted by a user, based on that user's historical memory utilization patterns. The model takes parameters such as requested time and number of nodes as input to predict the memory. This gives users a better estimate of the resources they need while ensuring efficient usage of the cluster and keeping it easily accessible to NU's research community.
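As a rough illustration of the idea, and not the model described above, the following sketch predicts a job's memory from a user's past jobs with a simple nearest-neighbor rule over requested time and node count. The `JobRecord` fields, the `predict_memory_gb` function, the choice of `k`, the 20% headroom, and the interpretation of memory as per-node peak usage are all assumptions made for this example. Only the 500 GB per-node limit comes from the text.

```python
import math
from dataclasses import dataclass

NODE_MEMORY_LIMIT_GB = 500.0  # per-node, per-job request limit stated in the text


@dataclass
class JobRecord:
    """Hypothetical record of a finished job (field names are assumptions)."""
    user: str
    requested_hours: float
    nodes: int
    peak_memory_gb: float  # assumed: peak memory actually used per node


def predict_memory_gb(history, user, requested_hours, nodes, k=3, headroom=1.2):
    """Suggest a per-node memory request from the user's k most similar past jobs."""
    past = [job for job in history if job.user == user]
    if not past:
        # No history for this user: fall back to the cluster limit.
        return NODE_MEMORY_LIMIT_GB

    def distance(job):
        # Compare requested time on a log scale so that long jobs do not dominate.
        dt = math.log1p(job.requested_hours) - math.log1p(requested_hours)
        return math.hypot(dt, job.nodes - nodes)

    nearest = sorted(past, key=distance)[:k]
    # Take the largest observed peak among similar jobs and add safety headroom.
    estimate = headroom * max(job.peak_memory_gb for job in nearest)
    return min(estimate, NODE_MEMORY_LIMIT_GB)


history = [
    JobRecord("alice", 1.0, 1, 4.0),
    JobRecord("alice", 2.0, 1, 6.0),
    JobRecord("alice", 48.0, 4, 120.0),
    JobRecord("alice", 40.0, 4, 100.0),
]
suggested = predict_memory_gb(history, "alice", requested_hours=1.5, nodes=1, k=2)
```

Taking the maximum over the nearest jobs, instead of the mean, leans toward over-prediction. In a setting like this that is usually the safer error, since an under-predicted request can cause a job to fail for lack of memory.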