Part of Tuulbelt, a collection of zero-dependency tools.
Detect unreliable tests by running them multiple times and tracking failure rates.
Flaky tests (tests that pass sometimes and fail sometimes) are a major pain in software development. They undermine confidence in test suites, cause false alarms in CI/CD pipelines, and waste developer time investigating spurious failures.
This tool runs your test command multiple times and identifies which tests have intermittent failures, helping you target and fix the real problems.
- Zero external dependencies: uses only cli-progress-reporting (Tuulbelt tool composition)
- Works with Node.js 18+
- TypeScript support with strict mode
- Composable via CLI or library API
- Works with any test command (npm test, cargo test, pytest, etc.)
- Configurable number of test runs
- Detailed JSON output with failure statistics
- Real-time progress tracking for runs ≥ 5
- Verbose mode for debugging
Clone the repository:

```bash
git clone https://github.com/tuulbelt/test-flakiness-detector.git
cd test-flakiness-detector
npm install   # Install dev dependencies only
```

CLI names: both the short and long forms work:

- Short (recommended): `flaky`
- Long: `test-flakiness-detector`
Recommended setup: install globally for easy access:

```bash
npm link      # Enable the 'flaky' command globally
flaky --help
```

Dependencies: uses cli-progress-reporting for progress tracking (automatically fetched from GitHub during `npm install`). Zero external dependencies per PRINCIPLES.md Exception 2.
```bash
# Run npm test 10 times (default)
flaky --test "npm test"

# Run with 20 iterations
flaky --test "npm test" --runs 20

# Run cargo tests with verbose output
flaky --test "cargo test" --runs 15 --verbose

# Show help
flaky --help
```

Library usage:

```typescript
import { detectFlakiness } from './src/index.js';

const report = await detectFlakiness({
  testCommand: 'npm test',
  runs: 10,
  verbose: false
});

if (report.success) {
  console.log(`Total runs: ${report.totalRuns}`);
  console.log(`Passed: ${report.passedRuns}, Failed: ${report.failedRuns}`);

  if (report.flakyTests.length > 0) {
    console.log('\nFlaky tests detected:');
    report.flakyTests.forEach(test => {
      console.log(`  ${test.testName}: ${test.failureRate.toFixed(1)}% failure rate`);
      console.log(`    Passed: ${test.passed}/${test.totalRuns}`);
      console.log(`    Failed: ${test.failed}/${test.totalRuns}`);
    });
  } else {
    console.log('No flaky tests detected');
  }
} else {
  console.error(`Error: ${report.error}`);
}
```

Options:

- `-t, --test <command>`: test command to execute (required)
- `-r, --runs <number>`: number of times to run the test (default: 10, max: 1000)
- `-v, --verbose`: enable verbose output showing each test run
- `-h, --help`: show help message
The tool outputs a JSON report with the following structure:
```json
{
  "success": true,
  "totalRuns": 10,
  "passedRuns": 7,
  "failedRuns": 3,
  "flakyTests": [
    {
      "testName": "Test Suite",
      "passed": 7,
      "failed": 3,
      "totalRuns": 10,
      "failureRate": 30.0
    }
  ],
  "runs": [
    {
      "success": true,
      "exitCode": 0,
      "stdout": "...",
      "stderr": ""
    }
    // ... more run results
  ]
}
```

Exit codes:

- `0`: detection completed successfully and no flaky tests found
- `1`: invalid arguments, execution error, or flaky tests detected
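A downstream script can consume this report and mirror the tool's exit-code convention. The sketch below is illustrative, not part of the tool: the `FlakinessReport` interface is a hand-written subset of the report shape shown above, and `exitCodeFor` is a hypothetical helper.

```typescript
// Illustrative subset of the JSON report shape documented above
interface FlakinessReport {
  success: boolean;
  totalRuns: number;
  passedRuns: number;
  failedRuns: number;
  flakyTests: { testName: string; failureRate: number }[];
}

// Hypothetical helper mirroring the documented exit-code convention:
// 0 = clean run with no flaky tests, 1 = error or flakiness detected.
function exitCodeFor(report: FlakinessReport): number {
  if (!report.success) return 1;               // execution error
  return report.flakyTests.length > 0 ? 1 : 0; // flaky tests present?
}

const sample: FlakinessReport = {
  success: true,
  totalRuns: 10,
  passedRuns: 7,
  failedRuns: 3,
  flakyTests: [{ testName: 'Test Suite', failureRate: 30.0 }],
};

console.log(exitCodeFor(sample)); // 1: flaky tests present
```

This keeps CI gating logic in one place instead of re-deriving it from the raw counts.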
```bash
flaky --test "npm test" --runs 20
flaky --test "cargo test" --runs 15
flaky --test "pytest tests/" --runs 10
flaky --test "npm test" --runs 5 --verbose
```

With `--verbose`, the last command will show:

```
[INFO] Running test command 5 times: npm test
[INFO] Run 1/5
[RUN] Executing: npm test
[INFO] Run 2/5
...
[INFO] Completed 5 runs: 4 passed, 1 failed
[WARN] Detected flaky tests!
```
See what to expect from the tool with these real examples:
Example 1: All Tests Passing (click to expand)
```bash
flaky --test "echo 'test passed'" --runs 5
```

```json
{
  "success": true,
  "totalRuns": 5,
  "passedRuns": 5,
  "failedRuns": 0,
  "flakyTests": [],
  "runs": [
    {
      "success": true,
      "exitCode": 0,
      "stdout": "test passed\n",
      "stderr": ""
    }
    // ... 4 more successful runs
  ]
}
```

Result: No flaky tests detected. All 5 runs passed consistently.
Example 2: All Tests Failing (click to expand)
```bash
flaky --test "exit 1" --runs 3
```

```json
{
  "success": true,
  "totalRuns": 3,
  "passedRuns": 0,
  "failedRuns": 3,
  "flakyTests": [],
  "runs": [
    {
      "success": false,
      "exitCode": 1,
      "stdout": "",
      "stderr": ""
    }
    // ... 2 more failed runs
  ]
}
```

Result: No flakiness detected. Tests fail consistently (not intermittent).
Example 3: Flaky Tests Detected (click to expand)
```bash
flaky --test 'node -e "process.exit(Math.random() > 0.5 ? 0 : 1)"' --runs 20
```

```json
{
  "success": true,
  "totalRuns": 20,
  "passedRuns": 11,
  "failedRuns": 9,
  "flakyTests": [
    {
      "testName": "Test Suite",
      "passed": 11,
      "failed": 9,
      "totalRuns": 20,
      "failureRate": 45.0
    }
  ],
  "runs": [
    // Mix of passing and failing runs
  ]
}
```

Result: Flaky tests detected with a 45.0% failure rate (11 passed, 9 failed).
Example 4: Verbose Mode Output (click to expand)
```bash
flaky --test "echo 'test'" --runs 3 --verbose
```

```
[INFO] Running test command 3 times: echo 'test'
[INFO] Run 1/3
[RUN] Executing: echo 'test'
[INFO] Run 2/3
[RUN] Executing: echo 'test'
[INFO] Run 3/3
[RUN] Executing: echo 'test'
[INFO] Completed 3 runs: 3 passed, 0 failed
```

```json
{
  "success": true,
  "totalRuns": 3,
  "passedRuns": 3,
  "failedRuns": 0,
  "flakyTests": [],
  ...
}
```
Result: Shows detailed execution logs plus JSON output.
Note: example outputs are auto-generated and committed to the `examples/outputs/` directory. See `.github/workflows/create-demo.yml` for automation details.
```bash
git clone https://github.com/tuulbelt/test-flakiness-detector.git
cd test-flakiness-detector
npm install
flaky --test "npm test" --runs 10
```

Try the tool instantly in your browser without installing anything!
- The tool executes the specified test command N times
- It captures the exit code, stdout, and stderr for each run
- It tracks how many times the tests passed vs. failed
- If some runs pass and some fail, the test suite is flagged as flaky
- A detailed JSON report is generated with failure statistics
Execution Strategy:
- Validate input (the test command is a non-empty string; runs are between 1 and 1000)
- Execute the test command synchronously N times using `child_process.execSync`
- Capture exit code, stdout, and stderr for each execution
- Record a pass (exit code 0) or a fail (non-zero exit code) for each run
- Calculate flakiness: if passedRuns > 0 AND failedRuns > 0, the tests are flaky
- Generate a comprehensive JSON report with all run results and statistics
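The execution strategy can be sketched in a few lines. This is an illustrative reimplementation of the loop, not the tool's actual source; the function name and return shape are assumptions:

```typescript
import { execSync } from 'node:child_process';

// Illustrative sketch of the detection loop (not the tool's actual source).
function detectFlakinessSketch(
  testCommand: string,
  runs: number
): { success: boolean; error?: string; passed?: number; failed?: number; flaky?: boolean } {
  // Validate input: non-empty command, 1-1000 runs
  if (!testCommand.trim() || runs < 1 || runs > 1000) {
    return { success: false, error: 'invalid input' };
  }
  let passed = 0;
  let failed = 0;
  for (let i = 0; i < runs; i++) {
    try {
      // execSync throws on a non-zero exit code; cap captured output at 10MB
      execSync(testCommand, { stdio: 'pipe', maxBuffer: 10 * 1024 * 1024 });
      passed++;
    } catch {
      failed++;
    }
  }
  // Flaky = at least one pass AND at least one failure
  return { success: true, passed, failed, flaky: passed > 0 && failed > 0 };
}

console.log(detectFlakinessSketch('node -e "process.exit(0)"', 3));
```

Sequential `execSync` calls are the key design choice here: each run finishes before the next starts, so runs never compete for resources.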
Dependencies:

- Node.js standard library: uses the `child_process` module for command execution
- cli-progress-reporting: Tuulbelt tool for real-time progress tracking (PRINCIPLES.md Exception 2)
- Zero external dependencies
Key Design Choices:
- Synchronous execution: Tests run sequentially to avoid false flakiness from resource contention
- Suite-level detection: Tracks entire test command success/failure, not individual test names
- No timeout: Waits for command completion to avoid flagging slow tests as flaky
- Result pattern: Returns structured result object, never throws exceptions
- Trusted commands only: the `--test` command is executed via the shell; only run trusted commands
- Same trust model: this tool has the same security model as `npm run-script`, `make`, or any build tool
- Resource limits: 10MB buffer limit per command output, max 1000 runs
- No privilege escalation: users run their own commands with their own permissions
```bash
npm test              # Run all tests
npm test -- --watch   # Watch mode
```

The test suite includes:
- Basic functionality tests
- Input validation tests
- Flaky test detection tests
- Edge case handling tests
- Error scenario tests
- Property-based fuzzy tests
This tool demonstrates the power of composability by both USING and VALIDATING other Tuulbelt tools:
```bash
flaky --test "npm test" --runs 10 --verbose
# [INFO] Progress tracking enabled (dogfooding cli-progress-reporting)
# [INFO] Run 1/10
# [INFO] Run 2/10
# ...
```

The flakiness detector requires cli-progress-reporting as a dependency (automatically fetched from GitHub). This demonstrates Tuulbelt-to-Tuulbelt composition (PRINCIPLES.md Exception 2) and provides:
- Live run counts and pass/fail status during detection (≥5 runs)
- Better UX for long detection runs (50-100 iterations)
- Real-world validation of the progress reporting tool
- Proof that Tuulbelt tools compose naturally while preserving zero external dependencies
Output Diffing Utility: find the root cause of flaky tests:

```bash
./scripts/dogfood-diff.sh "npm test"
# Compares outputs between runs to see WHAT changes
# Helps identify: timestamps, random data, race conditions
```

Cross-Platform Path Normalizer: validate path handling reliability:

```bash
./scripts/dogfood-paths.sh 10
# NO FLAKINESS DETECTED
# 145 tests × 10 runs = 1,450 executions
```

CLI Progress Reporting: bidirectional validation:

```bash
./scripts/dogfood-progress.sh 20
# Validates the tool we USE (bidirectional relationship)
# 125 tests × 20 runs = 2,500 executions
```

Complete Phase 1 Validation Pipeline: validate all tools:

```bash
./scripts/dogfood-pipeline.sh 10
# Validates all 5 Phase 1 tools
# 602 tests × 10 runs = 6,020 total test executions
```

See DOGFOODING_STRATEGY.md for implementation details.
The tool handles various error scenarios gracefully:
- Invalid or non-existent commands
- Command syntax errors
- Commands that hang (no artificial timeout is imposed; the tool waits for completion)
- Empty or malformed input
Errors are returned in the error field of the result object, not thrown.
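The result pattern (errors returned, never thrown) can be sketched with a small wrapper around a single command execution. This `runOnce` helper is hypothetical, written to mirror the per-run fields shown in the JSON report:

```typescript
import { execSync } from 'node:child_process';

// Hypothetical result-pattern wrapper: errors are returned, never thrown.
function runOnce(command: string): { success: boolean; exitCode: number; error?: string } {
  try {
    execSync(command, { stdio: 'pipe' });
    return { success: true, exitCode: 0 };
  } catch (err: any) {
    // Both a failing command and a missing command land here;
    // err.status carries the shell's exit code when available.
    return { success: false, exitCode: err.status ?? -1, error: String(err.message ?? err) };
  }
}

const result = runOnce('definitely-not-a-real-command-xyz');
console.log(result.success); // false: the error is in the result, not thrown
```

Callers check `success` and read `error`, so a malformed command never crashes the detection loop.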
- Time complexity: O(N × T) where N = number of runs, T = time per test execution
- Space complexity: O(N × S) where N = number of runs, S = size of stdout/stderr per run
- Resource limits:
- Maximum runs: 1000 (configurable, prevents resource exhaustion)
- Maximum buffer per run: 10MB (stdout + stderr combined)
- No artificial timeout (waits for natural command completion)
- Execution: Sequential, not parallel (avoids false flakiness from resource contention)
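As a worked example of the O(N × S) space bound, using the documented limits (the numbers below come from the resource limits listed above; the calculation itself is illustrative):

```typescript
// Worst-case captured-output size from the documented limits
const maxRuns = 1000;        // documented maximum runs per detection
const maxBufferMB = 10;      // documented per-run stdout+stderr cap (MB)
const worstCaseMB = maxRuns * maxBufferMB;
console.log(`${worstCaseMB} MB`); // 10000 MB, roughly 10 GB
```

In practice most test commands emit far less output, but long detection campaigns over chatty test suites should keep run counts modest or quiet the test output.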
- Currently detects flakiness at the test suite level (entire command pass/fail)
- Does not parse individual test names from test runner output
- Maximum of 1000 runs per detection (to prevent resource exhaustion)
- stdout/stderr buffer limited to 10MB per run
Potential improvements for future versions:
- Parse individual test names from popular test runners (Jest, Mocha, pytest, cargo test)
- Track flakiness per individual test, not just test suite
- Calculate statistical confidence intervals for failure rates
- Support for parallel test execution to speed up detection
- Integration with CI/CD systems (GitHub Actions, GitLab CI)
See SPEC.md for detailed technical specification.
▶ View interactive recording on asciinema.org
MIT β see LICENSE
See CONTRIBUTING.md for contribution guidelines.
Part of the Tuulbelt collection:
- CLI Progress Reporting: concurrent-safe progress updates
- More tools coming soon...
