Job processing in GPU computing - an implementation of a context-aware scheduler
Bachelor Thesis
7 June 2024, by Tarik Badawy

Photo: Midjourney
The growing demand for GPU computing in artificial intelligence research has led to the need for more efficient resource management systems. Manual scheduling can result in inefficiencies such as increased job turnaround time and underutilization of resources. This thesis aims to develop and implement a context-aware scheduler for GPU-heavy jobs, addressing these inefficiencies by automating job scheduling and resource allocation. The proposed solution targets the university's existing infrastructure, which includes several underutilized CUDA-enabled GPUs, by distributing jobs across available worker nodes based on priority and resource requirements.
The scheduler is designed to integrate with the established orchestration system Kubernetes, enabling the dynamic management of GPU resources and job scheduling. Critical components of the system include a user-friendly job entry system, the orchestration of heterogeneous worker nodes, and checkpointing functionality for suspending and migrating jobs without user intervention. The design prioritizes ease of use for both researchers and administrators, providing a scalable solution that abstracts the complexity of scheduling for everyday users while maintaining flexibility for more advanced use cases.
The research explores scheduling strategies like First Come First Serve (FCFS) with priority scheduling and incorporates metrics for job throughput, turnaround time, and resource utilization. Additionally, the thesis examines existing solutions in scheduling, orchestration, and GPU checkpointing, assessing their relevance to the system's goals. The evaluation phase will focus on the effectiveness of job scheduling, the overhead introduced by checkpointing, and the system's usability. This work provides a foundation for improving GPU resource management in academic environments, enhancing the accessibility and efficiency of AI research.
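To make the scheduling strategy and its metrics concrete, the following is a minimal sketch of FCFS with priority scheduling, not the thesis implementation. It assumes that a lower priority value means a more urgent job, that all jobs are submitted at time 0, and that a single worker runs one job at a time. The names `Job`, `PriorityFcfsQueue`, and `simulate` are hypothetical and chosen only for illustration.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Job:
    priority: int                      # assumption: lower value = more urgent
    seq: int                           # submission order, breaks ties FCFS
    name: str = field(compare=False)
    runtime: float = field(compare=False)


class PriorityFcfsQueue:
    """Jobs leave the queue by priority; equal priorities leave in arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = count()

    def submit(self, name, priority, runtime):
        job = Job(priority, next(self._counter), name, runtime)
        heapq.heappush(self._heap, job)

    def pop(self):
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


def simulate(queue):
    """Run all queued jobs back to back on one worker.

    Returns (execution order, mean turnaround time, throughput in jobs per time unit).
    Turnaround equals completion time because every job is assumed to arrive at t=0.
    """
    clock = 0.0
    order, turnarounds = [], []
    while len(queue):
        job = queue.pop()
        clock += job.runtime
        order.append(job.name)
        turnarounds.append(clock)
    mean_turnaround = sum(turnarounds) / len(turnarounds) if turnarounds else 0.0
    throughput = len(order) / clock if clock else 0.0
    return order, mean_turnaround, throughput


if __name__ == "__main__":
    q = PriorityFcfsQueue()
    q.submit("train-a", priority=1, runtime=2.0)
    q.submit("urgent-b", priority=0, runtime=1.0)
    q.submit("train-c", priority=1, runtime=3.0)
    print(simulate(q))
```

In this sketch the urgent job runs first, and the two jobs with equal priority keep their submission order. A real deployment would additionally have to account for GPU requirements, multiple worker nodes, and checkpoint overhead, which the sketch deliberately leaves out.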
Supervised by:
Prof. Dr. Janick Edinger, Anton Semjonov