[Feature Question] Support for simulating multiple concurrent training jobs (Multi-tenancy) #213

@Hrishis2

Hello! I am exploring SimAI for research into data center network congestion and topology optimization. I have used SimAI successfully for earlier single-job studies, and I would now like to simulate multi-tenant scenarios similar to those described in the Crux paper (SIGCOMM '24).

Specifically, I want to observe the inter-node network contention that occurs when two different training jobs (e.g., Job A and Job B) run simultaneously on the same cluster fabric. Judging from the current documentation and workload format, SimAI appears to be designed to simulate a single monolithic job at a time, in which all ranks belong to one global communicator.

Native Support: Does SimAI currently support defining multiple independent jobs with different start times or independent communicators in a single simulation run?

Manual Workaround: If native scheduling isn't supported, is there a recommended approach to manually "merge" two workloads into a single workload?

Context: My goal is to measure how the placement of independent jobs affects Flow Completion Time and Queue Depth at the ToR/Spine level. I want to verify whether SimAI can capture the interference between these independent traffic patterns.

Thank you for your help!
