15-618 Project: DAG Scheduler

Summary

We propose to develop an intelligent Task Scheduler for Directed Acyclic Graphs (DAGs) that optimizes execution across heterogeneous CPU and GPU resources. The core objective is to minimize the total completion time of complex task sets by balancing computational speedups against data transfer overheads. We will implement a custom cost model that estimates task duration based on computational workload, data size, CPU/GPU processing rates, and PCIe bandwidth to make informed scheduling decisions.

Background

DAGs are the standard abstraction for managing task dependencies in high-performance computing. They appear in data analytics (e.g., Spark), model training pipelines, and scientific computing, where they organize large-scale simulation workflows. By modeling execution as a DAG, a system can identify independent tasks that can run in parallel while still respecting strict synchronization constraints.
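As a rough illustration of how a DAG exposes independent tasks, the sketch below groups tasks into levels using in-degree counting (Kahn's algorithm). The function name `parallel_levels` and the integer task IDs are assumptions made for this example and are not part of the project code.

```python
from collections import defaultdict


def parallel_levels(num_tasks, edges):
    """Group tasks 0..num_tasks-1 into levels.

    edges is a list of (pred, succ) pairs. Tasks in the same level do not
    depend on each other, so they could be dispatched concurrently.
    """
    successors = defaultdict(list)
    indegree = [0] * num_tasks
    for pred, succ in edges:
        successors[pred].append(succ)
        indegree[succ] += 1

    # Tasks with no unfinished predecessors are ready immediately.
    ready = [t for t in range(num_tasks) if indegree[t] == 0]
    levels = []
    scheduled = 0
    while ready:
        levels.append(sorted(ready))
        scheduled += len(ready)
        next_ready = []
        for task in ready:
            for succ in successors[task]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    next_ready.append(succ)
        ready = next_ready

    if scheduled != num_tasks:
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return levels
```

For a diamond-shaped graph with edges `(0, 1)`, `(0, 2)`, `(1, 3)`, `(2, 3)`, this returns `[[0], [1, 2], [3]]`: tasks 1 and 2 are independent, while task 3 must wait for both.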

Modern performance gains rely on combining the low-latency control logic of CPUs with the massive data-parallel throughput of GPUs. Heterogeneous systems provide superior Performance-per-Watt and higher peak TFLOPS, but they introduce a "communication vs. computation" trade-off. An intelligent scheduler is required to decide when the acceleration on a GPU outweighs the PCIe transfer latency of moving data from CPU memory.
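The trade-off can be made concrete with a minimal sketch, assuming a simple linear model: compute time is workload divided by processing rate, and transfer time is a fixed latency plus data size divided by PCIe bandwidth. The names `DeviceRates`, `cpu_time`, `gpu_time`, and `should_offload` are hypothetical, and the rates in the usage note are placeholders rather than measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceRates:
    """Assumed sustained rates; real values would come from profiling."""
    cpu_flops: float              # CPU throughput in FLOP/s
    gpu_flops: float              # GPU throughput in FLOP/s
    pcie_bytes_per_s: float       # host-to-device bandwidth in bytes/s
    pcie_latency_s: float = 0.0   # fixed per-transfer latency in seconds


def cpu_time(work_flops, rates):
    """Estimated seconds to run the task on the CPU (data already in host memory)."""
    return work_flops / rates.cpu_flops


def gpu_time(work_flops, bytes_to_move, rates):
    """Estimated seconds to run on the GPU, including the PCIe transfer."""
    transfer = 0.0
    if bytes_to_move > 0:
        transfer = rates.pcie_latency_s + bytes_to_move / rates.pcie_bytes_per_s
    return work_flops / rates.gpu_flops + transfer


def should_offload(work_flops, bytes_to_move, rates):
    """True when the GPU estimate, transfer included, beats the CPU estimate."""
    return gpu_time(work_flops, bytes_to_move, rates) < cpu_time(work_flops, rates)
```

With placeholder rates of 1e11 FLOP/s (CPU), 1e12 FLOP/s (GPU), and 1e10 bytes/s (PCIe), a task of 1e9 FLOP over 1e8 bytes stays on the CPU because the transfer alone costs as much as the CPU run, while a task of 1e10 FLOP over the same data is worth offloading.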

Feature

Online/Offline Cost Model

Static/Dynamic Scheduling Strategy

CUDA Stream / CUDA Graph

Opportunities and Challenges

DAG workloads expose parallelism across tasks and devices (CPU + GPU), with further gains from techniques like CUDA Graphs

Performance is often limited by data transfer and synchronization overhead between devices

Accurate cost estimation is difficult due to hardware variability, making scheduling decisions workload-dependent
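One common way to cope with hardware variability is to correct estimates online from measured runtimes. The sketch below uses an exponential moving average per (task kind, device) pair; this is an assumed technique chosen for illustration, not necessarily the project's cost model, and the class name `OnlineEstimate` and the `alpha` smoothing parameter are hypothetical.

```python
class OnlineEstimate:
    """Keeps a smoothed runtime estimate per (task_kind, device) pair."""

    def __init__(self, alpha=0.5):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self._estimates = {}  # (task_kind, device) -> estimated seconds

    def predict(self, task_kind, device, fallback):
        """Return the learned estimate, or the offline fallback if none exists yet."""
        return self._estimates.get((task_kind, device), fallback)

    def observe(self, task_kind, device, measured_s):
        """Blend a newly measured runtime into the stored estimate."""
        key = (task_kind, device)
        old = self._estimates.get(key)
        if old is None:
            # First observation replaces the offline guess outright.
            self._estimates[key] = measured_s
        else:
            self._estimates[key] = (1.0 - self.alpha) * old + self.alpha * measured_s
```

A larger `alpha` reacts faster to changes such as thermal throttling or contention, at the cost of noisier estimates; a smaller value smooths out outliers but adapts slowly.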

Architecture

Scheduler: schedules tasks based on their dependencies and the cost model.

DependencyManager: manages task dependencies and updates their status dynamically.

SchedulingPolicy: determines the device for each task.

CostModel: estimates the execution time of a task on CPU and GPU, as well as the data transfer time between CPU and GPU.

ExecutionEngine: wraps tasks into objects to run within the CPU thread pool or GPU streams.

CPUThreadPool: manages a pool of CPU threads.

SystemRAM: manages the allocation of memory on the CPU.

MemoryPool: pre-allocates large blocks of VRAM.

StreamManager: manages GPU streams.

Monitor: collects runtime statistics.

ProfileDatabase: stores profiling data for tasks, which can be used to improve the cost model over time.
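To show how the dependency tracking, the scheduling policy, and the cost model might interact, here is a small simulation sketch. It assumes one CPU slot and one GPU slot, takes per-device durations as given (standing in for CostModel output), and uses a greedy earliest-finish-time rule as a placeholder policy. The function name `simulate_schedule`, the `transfer_s` parameter, and the greedy rule are all assumptions for illustration, not the project's actual design.

```python
from collections import deque


def simulate_schedule(durations, edges, transfer_s=0.0):
    """Greedy earliest-finish-time placement on one CPU and one GPU.

    durations: {task: {"cpu": seconds, "gpu": seconds}} (cost model estimates)
    edges: list of (pred, succ) dependency pairs
    transfer_s: assumed fixed cost when a task reads output from the other device
    Returns (placement, makespan).
    """
    succs = {t: [] for t in durations}
    preds = {t: [] for t in durations}
    for pred, succ in edges:
        succs[pred].append(succ)
        preds[succ].append(pred)

    # Dependency tracking: a task becomes ready when its in-degree reaches zero.
    indegree = {t: len(preds[t]) for t in durations}
    ready = deque(sorted(t for t in durations if indegree[t] == 0))

    device_free = {"cpu": 0.0, "gpu": 0.0}  # time at which each device is idle
    finish, placement = {}, {}

    while ready:
        task = ready.popleft()
        best = None
        for device in ("cpu", "gpu"):
            # Inputs arrive when each predecessor finishes, plus a transfer
            # if that predecessor ran on the other device.
            data_ready = max(
                (finish[p] + (transfer_s if placement[p] != device else 0.0)
                 for p in preds[task]),
                default=0.0,
            )
            end = max(device_free[device], data_ready) + durations[task][device]
            if best is None or end < best[0]:
                best = (end, device)
        end, device = best
        placement[task], finish[task] = device, end
        device_free[device] = end
        for succ in succs[task]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(finish) != len(durations):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return placement, max(finish.values(), default=0.0)
```

Because the rule is greedy and considers one task at a time, it can miss better global schedules; it is meant only to show where dependency status, device choice, and cost estimates meet in the scheduling loop.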

