When running AlphaFast inference for 64 identical proteins (~450 tokens each) across 8 A800 GPUs (8 jobs per GPU), I observe a consistent pattern tied to launch order: the first job on each GPU takes ~6x longer than steady-state (~205 s vs ~36 s), and the second job ~3.6x longer (~130 s). Timings from inference_timing.jsonl, one JSON line per job:
{"name": "TEST_0", "inference_seconds": 202.777, "status": "success"}
{"name": "TEST_60", "inference_seconds": 202.819, "status": "success"}
{"name": "TEST_16", "inference_seconds": 203.013, "status": "success"}
{"name": "TEST_23", "inference_seconds": 203.193, "status": "success"}
{"name": "TEST_45", "inference_seconds": 204.145, "status": "success"}
{"name": "TEST_38", "inference_seconds": 207.325, "status": "success"}
{"name": "TEST_30", "inference_seconds": 207.503, "status": "success"}
{"name": "TEST_52", "inference_seconds": 208.323, "status": "success"}
{"name": "TEST_10", "inference_seconds": 127.831, "status": "success"}
{"name": "TEST_61", "inference_seconds": 128.208, "status": "success"}
{"name": "TEST_24", "inference_seconds": 128.659, "status": "success"}
{"name": "TEST_46", "inference_seconds": 129.326, "status": "success"}
{"name": "TEST_17", "inference_seconds": 134.414, "status": "success"}
{"name": "TEST_39", "inference_seconds": 131.029, "status": "success"}
{"name": "TEST_53", "inference_seconds": 131.546, "status": "success"}
{"name": "TEST_31", "inference_seconds": 133.127, "status": "success"}
{"name": "TEST_11", "inference_seconds": 35.991, "status": "success"}
{"name": "TEST_62", "inference_seconds": 35.671, "status": "success"}
{"name": "TEST_25", "inference_seconds": 35.692, "status": "success"}
{"name": "TEST_47", "inference_seconds": 35.766, "status": "success"}
{"name": "TEST_18", "inference_seconds": 35.647, "status": "success"}
{"name": "TEST_40", "inference_seconds": 35.446, "status": "success"}
{"name": "TEST_54", "inference_seconds": 35.336, "status": "success"}
{"name": "TEST_32", "inference_seconds": 35.688, "status": "success"}
{"name": "TEST_12", "inference_seconds": 35.476, "status": "success"}
{"name": "TEST_63", "inference_seconds": 35.805, "status": "success"}
{"name": "TEST_26", "inference_seconds": 36.004, "status": "success"}
{"name": "TEST_48", "inference_seconds": 35.582, "status": "success"}
{"name": "TEST_19", "inference_seconds": 35.776, "status": "success"}
{"name": "TEST_41", "inference_seconds": 35.204, "status": "success"}
{"name": "TEST_55", "inference_seconds": 35.839, "status": "success"}
{"name": "TEST_33", "inference_seconds": 35.711, "status": "success"}
{"name": "TEST_13", "inference_seconds": 35.447, "status": "success"}
{"name": "TEST_6", "inference_seconds": 35.555, "status": "success"}
{"name": "TEST_27", "inference_seconds": 36.377, "status": "success"}
{"name": "TEST_49", "inference_seconds": 35.813, "status": "success"}
{"name": "TEST_20", "inference_seconds": 36.003, "status": "success"}
{"name": "TEST_42", "inference_seconds": 35.897, "status": "success"}
{"name": "TEST_56", "inference_seconds": 35.436, "status": "success"}
{"name": "TEST_34", "inference_seconds": 35.526, "status": "success"}
{"name": "TEST_14", "inference_seconds": 35.443, "status": "success"}
{"name": "TEST_7", "inference_seconds": 35.334, "status": "success"}
{"name": "TEST_28", "inference_seconds": 35.61, "status": "success"}
{"name": "TEST_50", "inference_seconds": 35.972, "status": "success"}
{"name": "TEST_43", "inference_seconds": 35.364, "status": "success"}
{"name": "TEST_21", "inference_seconds": 36.21, "status": "success"}
{"name": "TEST_57", "inference_seconds": 35.628, "status": "success"}
{"name": "TEST_35", "inference_seconds": 35.69, "status": "success"}
{"name": "TEST_15", "inference_seconds": 35.604, "status": "success"}
{"name": "TEST_8", "inference_seconds": 35.592, "status": "success"}
{"name": "TEST_29", "inference_seconds": 35.692, "status": "success"}
{"name": "TEST_51", "inference_seconds": 36.135, "status": "success"}
{"name": "TEST_44", "inference_seconds": 35.455, "status": "success"}
{"name": "TEST_22", "inference_seconds": 35.851, "status": "success"}
{"name": "TEST_58", "inference_seconds": 35.377, "status": "success"}
{"name": "TEST_36", "inference_seconds": 35.837, "status": "success"}
{"name": "TEST_1", "inference_seconds": 35.344, "status": "success"}
{"name": "TEST_9", "inference_seconds": 35.371, "status": "success"}
{"name": "TEST_3", "inference_seconds": 35.845, "status": "success"}
{"name": "TEST_5", "inference_seconds": 35.987, "status": "success"}
{"name": "TEST_4", "inference_seconds": 35.401, "status": "success"}
{"name": "TEST_2", "inference_seconds": 35.755, "status": "success"}
{"name": "TEST_59", "inference_seconds": 35.712, "status": "success"}
{"name": "TEST_37", "inference_seconds": 35.723, "status": "success"}
All 64 inputs are the same protein sequence (~450 tokens), so the difference isn't data-dependent. The slowdown is strictly correlated with job launch order, not GPU identity — every GPU's first job is slow.
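For reference, the three tiers are easy to pull out of the log with a short script. This is just a sketch; the tier boundaries (100 s and 180 s) are eyeballed from the log above, not anything principled:

```python
import json

def summarize(jsonl_text, bounds=(100.0, 180.0)):
    """Bucket per-job timings into fast/middle/slow tiers.

    `bounds` are eyeballed from the log above: steady state is ~36 s,
    a middle tier sits near ~130 s, and the slow first tier near ~205 s.
    """
    tiers = {"fast": [], "middle": [], "slow": []}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        t = json.loads(line)["inference_seconds"]
        key = "fast" if t < bounds[0] else ("middle" if t < bounds[1] else "slow")
        tiers[key].append(t)
    # (job count, mean seconds) per non-empty tier
    return {k: (len(v), sum(v) / len(v)) for k, v in tiers.items() if v}

# Usage:
#   with open("inference_timing.jsonl") as f:
#       print(summarize(f.read()))
```

On the log above this gives 8 slow jobs, 8 middle jobs, and 48 fast jobs, i.e. exactly one slow and one middle job per GPU.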
This pattern is consistent with per-process first-run compilation (torch.compile warmup).

Questions:

Environment:
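One experiment I'd run to confirm the warmup hypothesis: have each process make a single untimed dummy call before the timed jobs. If that makes all timed jobs uniform at ~36 s, the cost is per-process warmup (compile/autotune), not data- or GPU-dependent. A minimal harness sketch (the function names here are placeholders, not AlphaFast's API):

```python
import time

def timed_runs(fn, n_jobs, warmup=True):
    """Run `fn` n_jobs times, optionally doing one untimed warmup call first.

    `fn` stands in for a single inference call; it is a placeholder,
    not AlphaFast's actual entry point.
    """
    if warmup:
        fn()  # pay the one-time cost (compilation/autotune) outside the timings
    times = []
    for _ in range(n_jobs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return times
```

If warmup is confirmed as the cause, persisting the Inductor compile cache across processes (PyTorch's TORCHINDUCTOR_CACHE_DIR environment variable, pointed at a shared directory) might let only the very first process pay the full cost; I haven't verified that AlphaFast's compile artifacts are cacheable this way.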