Part of Tuulbelt, a collection of zero-dependency tools.

Test Flakiness Detector / flaky


Detect unreliable tests by running them multiple times and tracking failure rates.

Problem

Flaky tests, which pass on some runs and fail on others, are a persistent problem in software development. They undermine confidence in test suites, cause false alarms in CI/CD pipelines, and waste developer time investigating spurious failures.

This tool runs your test command multiple times and identifies which tests have intermittent failures, helping you target and fix the real problems.

Features

  • Zero external dependencies: uses only cli-progress-reporting, another Tuulbelt tool (tool composition)
  • Works with Node.js 18+
  • TypeScript support with strict mode
  • Composable via CLI or library API
  • Works with any test command (npm test, cargo test, pytest, etc.)
  • Configurable number of test runs
  • Detailed JSON output with failure statistics
  • Real-time progress tracking for 5 or more runs
  • Verbose mode for debugging

Installation

Clone the repository:

git clone https://github.com/tuulbelt/test-flakiness-detector.git
cd test-flakiness-detector
npm install  # Install dev dependencies only

CLI names - both short and long forms work:

  • Short (recommended): flaky
  • Long: test-flakiness-detector

Recommended setup - install globally for easy access:

npm link  # Enable the 'flaky' command globally
flaky --help

Dependencies: Uses cli-progress-reporting for progress tracking (automatically fetched from GitHub during npm install). Zero external dependencies per PRINCIPLES.md Exception 2.

Usage

As a CLI

# Run npm test 10 times (default)
flaky --test "npm test"

# Run with 20 iterations
flaky --test "npm test" --runs 20

# Run cargo tests with verbose output
flaky --test "cargo test" --runs 15 --verbose

# Show help
flaky --help

As a Library

import { detectFlakiness } from './src/index.js';

const report = await detectFlakiness({
  testCommand: 'npm test',
  runs: 10,
  verbose: false
});

if (report.success) {
  console.log(`Total runs: ${report.totalRuns}`);
  console.log(`Passed: ${report.passedRuns}, Failed: ${report.failedRuns}`);

  if (report.flakyTests.length > 0) {
    console.log('\nFlaky tests detected:');
    report.flakyTests.forEach(test => {
      console.log(`  ${test.testName}: ${test.failureRate.toFixed(1)}% failure rate`);
      console.log(`    Passed: ${test.passed}/${test.totalRuns}`);
      console.log(`    Failed: ${test.failed}/${test.totalRuns}`);
    });
  } else {
    console.log('No flaky tests detected');
  }
} else {
  console.error(`Error: ${report.error}`);
}

CLI Options

  • -t, --test <command> - Test command to execute (required)
  • -r, --runs <number> - Number of times to run the test (default: 10, max: 1000)
  • -v, --verbose - Enable verbose output showing each test run
  • -h, --help - Show help message

Output Format

The tool outputs a JSON report with the following structure:

{
  "success": true,
  "totalRuns": 10,
  "passedRuns": 7,
  "failedRuns": 3,
  "flakyTests": [
    {
      "testName": "Test Suite",
      "passed": 7,
      "failed": 3,
      "totalRuns": 10,
      "failureRate": 30.0
    }
  ],
  "runs": [
    {
      "success": true,
      "exitCode": 0,
      "stdout": "...",
      "stderr": ""
    }
    // ... more run results
  ]
}
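
For reference, the report maps naturally onto TypeScript shapes like the following. This is an illustrative sketch only: the field names are taken from the JSON above, but the type names are assumptions, not necessarily the tool's exported API.

interface RunResult {
  success: boolean;      // true when the run exited with code 0
  exitCode: number;
  stdout: string;
  stderr: string;
}

interface FlakyTest {
  testName: string;
  passed: number;
  failed: number;
  totalRuns: number;
  failureRate: number;   // percentage, e.g. 30.0
}

interface FlakinessReport {
  success: boolean;      // true when detection itself completed
  totalRuns: number;
  passedRuns: number;
  failedRuns: number;
  flakyTests: FlakyTest[];
  runs: RunResult[];
  error?: string;        // populated when success is false
}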

Exit Codes

  • 0 - Detection completed successfully and no flaky tests were found
  • 1 - Invalid arguments, an execution error, or flaky tests were detected
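
As a rough illustration of wiring these exit codes into CI with the library API, something like the following could work (a minimal sketch; the gating logic here is an assumption layered on the documented report fields, not part of the tool itself):

import { detectFlakiness } from './src/index.js';

const report = await detectFlakiness({ testCommand: 'npm test', runs: 20, verbose: false });

if (!report.success) {
  console.error(`Error: ${report.error}`);
  process.exitCode = 1;                 // execution error
} else if (report.flakyTests.length > 0) {
  console.error(`Flaky tests detected across ${report.totalRuns} runs`);
  process.exitCode = 1;                 // flakiness found
} else {
  process.exitCode = 0;                 // all runs behaved consistently
}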

Examples

Detect Flaky npm Tests

flaky --test "npm test" --runs 20

Detect Flaky Rust Tests

flaky --test "cargo test" --runs 15

Detect Flaky Python Tests

flaky --test "pytest tests/" --runs 10

With Verbose Output

flaky --test "npm test" --runs 5 --verbose

This will show:

[INFO] Running test command 5 times: npm test
[INFO] Run 1/5
[RUN] Executing: npm test
[INFO] Run 2/5
...
[INFO] Completed 5 runs: 4 passed, 1 failed
[WARN] Detected flaky tests!

Example Outputs

See what to expect from the tool with these real examples:

📊 Example 1: All Tests Passing
flaky --test "echo 'test passed'" --runs 5
{
  "success": true,
  "totalRuns": 5,
  "passedRuns": 5,
  "failedRuns": 0,
  "flakyTests": [],
  "runs": [
    {
      "success": true,
      "exitCode": 0,
      "stdout": "test passed\n",
      "stderr": ""
    }
    // ... 4 more successful runs
  ]
}

Result: ✅ No flaky tests detected. All 5 runs passed consistently.

📊 Example 2: All Tests Failing
flaky --test "exit 1" --runs 3
{
  "success": true,
  "totalRuns": 3,
  "passedRuns": 0,
  "failedRuns": 3,
  "flakyTests": [],
  "runs": [
    {
      "success": false,
      "exitCode": 1,
      "stdout": "",
      "stderr": ""
    }
    // ... 2 more failed runs
  ]
}

Result: ✅ No flakiness detected. The tests fail consistently, not intermittently.

🔴 Example 3: Flaky Tests Detected
flaky --test 'node -e "process.exit(Math.random() > 0.5 ? 0 : 1)"' --runs 20
{
  "success": true,
  "totalRuns": 20,
  "passedRuns": 11,
  "failedRuns": 9,
  "flakyTests": [
    {
      "testName": "Test Suite",
      "passed": 11,
      "failed": 9,
      "totalRuns": 20,
      "failureRate": 45.0
    }
  ],
  "runs": [
    // Mix of passing and failing runs
  ]
}

Result: ⚠️ Flaky test detected! 45% failure rate (9 failures out of 20 runs). Action: this test needs investigation and fixing.

💬 Example 4: Verbose Mode Output
flaky --test "echo 'test'" --runs 3 --verbose
[INFO] Running test command 3 times: echo 'test'
[INFO] Run 1/3
[RUN] Executing: echo 'test'
[INFO] Run 2/3
[RUN] Executing: echo 'test'
[INFO] Run 3/3
[RUN] Executing: echo 'test'
[INFO] Completed 3 runs: 3 passed, 0 failed
{
  "success": true,
  "totalRuns": 3,
  "passedRuns": 3,
  "failedRuns": 0,
  "flakyTests": [],
  ...
}

Result: Shows detailed execution logs plus JSON output.

Note: Example outputs are auto-generated and committed to the examples/outputs/ directory. See .github/workflows/create-demo.yml for automation details.

Try It Yourself

Quick Start

git clone https://github.com/tuulbelt/test-flakiness-detector.git
cd test-flakiness-detector
npm install
npm link              # Make the 'flaky' command available
flaky --test "npm test" --runs 10

One-Click Playground

Open in StackBlitz

Try the tool instantly in your browser without installing anything!

How It Works

  1. The tool executes the specified test command N times
  2. It captures the exit code, stdout, and stderr for each run
  3. It tracks how many times the tests passed vs. failed
  4. If some runs pass and some fail, the test suite is flagged as flaky
  5. A detailed JSON report is generated with failure statistics
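
For example, if 7 of 10 runs pass and 3 fail, the suite is flagged as flaky with a failure rate of 3 / 10 × 100 = 30%, matching the sample report shown in the Output Format section.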

Architecture

Execution Strategy:

  1. Validate input (the test command is a non-empty string; runs is between 1 and 1000)
  2. Execute the test command synchronously N times using child_process.execSync (see the sketch after this list)
  3. Capture exit code, stdout, and stderr for each execution
  4. Record pass (exit code 0) or fail (non-zero exit code) for each run
  5. Calculate flakiness: if passedRuns > 0 AND failedRuns > 0, tests are flaky
  6. Generate comprehensive JSON report with all run results and statistics
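
A minimal sketch of this strategy is shown below (illustrative only; the actual implementation lives in src/index.ts and also handles progress reporting and report assembly):

import { execSync } from 'node:child_process';

interface RunOutcome {
  success: boolean;
  exitCode: number;
  stdout: string;
  stderr: string;
}

function runOnce(testCommand: string): RunOutcome {
  try {
    const stdout = execSync(testCommand, {
      encoding: 'utf8',
      maxBuffer: 10 * 1024 * 1024,      // 10MB output limit per run
      stdio: ['ignore', 'pipe', 'pipe'],
    });
    return { success: true, exitCode: 0, stdout, stderr: '' };
  } catch (err: any) {
    // execSync throws on a non-zero exit code; capture what we can
    return {
      success: false,
      exitCode: err.status ?? 1,
      stdout: err.stdout?.toString() ?? '',
      stderr: err.stderr?.toString() ?? '',
    };
  }
}

function isFlaky(runs: RunOutcome[]): boolean {
  const passed = runs.filter((r) => r.success).length;
  const failed = runs.length - passed;
  return passed > 0 && failed > 0;      // mixed results indicate flakiness
}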

Dependencies:

  • Node.js standard library - Uses the child_process module for command execution
  • cli-progress-reporting - Tuulbelt tool for real-time progress tracking (PRINCIPLES.md Exception 2)
  • Zero external dependencies

Key Design Choices:

  • Synchronous execution: Tests run sequentially to avoid false flakiness from resource contention
  • Suite-level detection: Tracks entire test command success/failure, not individual test names
  • No timeout: Waits for command completion to avoid flagging slow tests as flaky
  • Result pattern: Returns structured result object, never throws exceptions

Security

  • Trusted commands only: The --test command is executed via the shell, so only run commands you trust
  • Same trust model: This tool has the same security model as npm run-script, make, or any build tool
  • Resource limits: 10MB buffer limit per command output, max 1000 runs
  • No privilege escalation: User runs their own commands with their own permissions

Testing

npm test              # Run all tests
npm test -- --watch   # Watch mode

The test suite includes:

  • Basic functionality tests
  • Input validation tests
  • Flaky test detection tests
  • Edge case handling tests
  • Error scenario tests
  • Property-based fuzzy tests

Dogfooding: Tool Composition

This tool demonstrates the power of composability by both USING and VALIDATING other Tuulbelt tools:

1. Uses CLI Progress Reporting (Library Integration)

flaky --test "npm test" --runs 10 --verbose
# [INFO] Progress tracking enabled (dogfooding cli-progress-reporting)
# [INFO] Run 1/10
# [INFO] Run 2/10
# ...

The flakiness detector requires cli-progress-reporting as a dependency (automatically fetched from GitHub). This demonstrates Tuulbelt-to-Tuulbelt composition (PRINCIPLES.md Exception 2) and provides:

  • Live run counts and pass/fail status during detection (5 or more runs)
  • Better UX for long detection runs (50-100 iterations)
  • Real-world validation of the progress reporting tool
  • Proof that Tuulbelt tools compose naturally while preserving zero external dependencies

2. High-Value Composition Scripts

Output Diffing Utility - Find the ROOT CAUSE of flaky tests:

./scripts/dogfood-diff.sh "npm test"
# Compares outputs between runs to see WHAT changes
# Helps identify: timestamps, random data, race conditions

Cross-Platform Path Normalizer - Validate path handling reliability:

./scripts/dogfood-paths.sh 10
# ✅ NO FLAKINESS DETECTED
# 145 tests × 10 runs = 1,450 executions

CLI Progress Reporting - Bidirectional validation:

./scripts/dogfood-progress.sh 20
# Validates the tool we USE (bidirectional relationship)
# 125 tests × 20 runs = 2,500 executions

Complete Phase 1 Validation Pipeline - Validate all tools:

./scripts/dogfood-pipeline.sh 10
# Validates all 5 Phase 1 tools
# 602 tests × 10 runs = 6,020 total test executions

See DOGFOODING_STRATEGY.md for implementation details.

Error Handling

The tool handles various error scenarios gracefully:

  • Invalid or non-existent commands
  • Command syntax errors
  • Commands that hang or time out
  • Empty or malformed input

Errors are returned in the error field of the result object, not thrown.

Performance

  • Time complexity: O(N × T) where N = number of runs, T = time per test execution
  • Space complexity: O(N × S) where N = number of runs, S = size of stdout/stderr per run
  • Resource limits:
    • Maximum runs: 1000 (configurable, prevents resource exhaustion)
    • Maximum buffer per run: 10MB (stdout + stderr combined)
    • No artificial timeout (waits for natural command completion)
  • Execution: Sequential, not parallel (avoids false flakiness from resource contention)

Limitations

  • Currently detects flakiness at the test suite level (entire command pass/fail)
  • Does not parse individual test names from test runner output
  • Maximum of 1000 runs per detection (to prevent resource exhaustion)
  • stdout/stderr buffer limited to 10MB per run

Future Enhancements

Potential improvements for future versions:

  • Parse individual test names from popular test runners (Jest, Mocha, pytest, cargo test)
  • Track flakiness per individual test, not just test suite
  • Calculate statistical confidence intervals for failure rates
  • Support for parallel test execution to speed up detection
  • Integration with CI/CD systems (GitHub Actions, GitLab CI)

Specification

See SPEC.md for detailed technical specification.

Demo

▶ View interactive recording on asciinema.org

Try it online: Open in StackBlitz

License

MIT - see LICENSE

Contributing

See CONTRIBUTING.md for contribution guidelines.

Related Tools

Part of the Tuulbelt collection:
