Description
Hello! I am currently exploring SimAI for research into data center network congestion and topology optimization. I have used SimAI successfully for earlier single-job studies, and I am now looking to simulate multi-tenant scenarios similar to those described in the Crux paper (SIGCOMM '24).
Specifically, I want to observe the inter-node network contention that occurs when two different training jobs (e.g., Job A and Job B) run simultaneously on the same cluster fabric. Judging from the current documentation and workload format, SimAI appears to be designed to simulate a single monolithic job at a time, where all ranks belong to one global communicator.
Native Support: Does SimAI currently support defining multiple independent jobs with different start times or independent communicators in a single simulation run?
Manual Workaround: If native scheduling isn't supported, is there a recommended approach to manually "merge" two workloads into a single workload file?
Context: My goal is to measure how the placement of independent jobs affects Flow Completion Time and Queue Depth at the ToR/Spine level. I want to verify whether SimAI can capture the interference between these independent traffic patterns.
Thank you for your help!