Filtering Specification

This document describes how pull requests and reviews are filtered before analysis.

include-bots (default: false)

The include-bots input controls whether bot accounts are included in statistics. An account is treated as a bot if its GitHub user type is Bot or its login ends in [bot] or -bot.

When include-bots is false

Two independent filters are applied:

  1. Author filter - PRs where authorIsBot is true are skipped entirely. All reviews on that PR are also excluded, even if the reviewers are human.
  2. Reviewer filter - Individual reviews where reviewerIsBot is true are excluded from metrics, even on human-authored PRs.

Both filters must pass for a review to be counted. A human review on a bot-authored PR is not counted.
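The two filters can be sketched as follows. This is a minimal illustration, not the project's code: the User, Review, and PullRequest shapes and the isBot and countableReviews helpers are hypothetical names, and the case-sensitive suffix match is an assumption. Only the detection rule and the order of the two filters come from the text above.

```typescript
// Hypothetical data shapes, for illustration only.
interface User {
  login: string;
  type: string; // e.g. "User" or "Bot"
}

interface Review {
  reviewer: User;
  state: string;
}

interface PullRequest {
  number: number;
  author: User;
  reviews: Review[];
}

// Bot detection as described: GitHub user type, or a login suffix.
// Assumption: the suffix comparison is case-sensitive.
function isBot(user: User): boolean {
  return (
    user.type === "Bot" ||
    user.login.endsWith("[bot]") ||
    user.login.endsWith("-bot")
  );
}

// Returns the reviews that survive both filters.
function countableReviews(prs: PullRequest[], includeBots: boolean): Review[] {
  const kept: Review[] = [];
  for (const pr of prs) {
    // Author filter: a bot-authored PR is skipped together with all its reviews.
    if (!includeBots && isBot(pr.author)) continue;
    for (const review of pr.reviews) {
      // Reviewer filter: bot reviews are dropped even on human-authored PRs.
      if (!includeBots && isBot(review.reviewer)) continue;
      kept.push(review);
    }
  }
  return kept;
}
```

With include-bots set to false, a human review on a Dependabot PR is dropped by the author filter before the reviewer filter is ever consulted.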

When include-bots is true

In modules that honor the flag, no include-bots filtering is applied. The ai-patterns module keeps its documented split regardless of the flag. Its bot observability metrics ignore include-bots and use the full dataset, except that aiCoAuthoredPRs counts only PRs with observable commit metadata. Its humanReviewBurden comparison cohort still excludes traditional bot-authored PRs and PRs whose AI classification is not observable at the cutoff.

Per-module behavior

| Module | Author filter | Reviewer filter | Notes |
| --- | --- | --- | --- |
| per-user-stats | Yes | Yes | Skips entire PR if author is bot; skips individual bot reviews |
| bias-detector | Yes | Yes | Same as per-user-stats |
| merge-correlation | Yes | Yes | Bot reviews excluded from review counts on merged PRs |
| ai-patterns | Mixed | Mixed | Top-level bot observability metrics ignore include-bots and use the full dataset; only aiCoAuthoredPRs is limited to PRs with observable commit metadata. humanReviewBurden always excludes traditional bot-authored PRs, PRs whose AI classification is not observable at the cutoff, and bot reviews from the comparison metrics |
| html-report KPIs: Pull Requests, PR Authors | Yes | N/A | Uses the author-filtered PR list; reviewer identities are not part of these counts |
| html-report KPIs: Unique PR Reviews, Active Reviewers | Yes | Yes | Derived from userStats; when include-bots is false, bot-authored PRs are skipped entirely and bot reviewer reviews are excluded. PENDING reviews are always excluded; self-reviews are excluded only when both identities are known (ghost is exempt). |
| html-report KPI: Avg Reviewers/PR | Mixed | Mixed | Numerator is Unique PR Reviews from userStats; denominator is Pull Requests from the author-filtered PR list |
| html-report KPI: Gini Coefficient | Yes | Yes | Derived from bias-detector |
| html-report KPI: Data Completeness | N/A | N/A | Reports collection completeness, not a post-filtering count |
| time-series | Yes | N/A | Receives the pre-filtered PR list from html-report. When include-bots is false, bot-authored PRs are excluded there; when true, all PRs are included. Bot reviews and self-reviews are not excluded, so the review count reflects all non-PENDING review activity on that input list. |

For ai-patterns, this split is intentional. The bot observability metrics (botReviewers, botReviewPercentage, aiCoAuthoredPRs, totalPRs) ignore include-bots and use the full dataset, with aiCoAuthoredPRs further limited to PRs with observable commit metadata. humanReviewBurden uses a comparison cohort that, regardless of include-bots, excludes traditional bot-authored PRs and PRs whose AI classification is not observable at the cutoff.

Rationale

Bot-authored PRs (e.g., Dependabot) are excluded entirely because:

  • They do not reflect human team review workload.
  • Including human reviews on bot PRs would inflate reviewer counts and distort bias detection.
  • The ai-patterns module separately tracks bot activity for observability while excluding traditional bot-authored PRs and unobservable AI classifications from the AI-vs-human burden comparison.

Additional filters (always applied)

PENDING reviews

Reviews with state === "PENDING" are draft/unsubmitted reviews. They are excluded from the following modules:

  • per-user-stats
  • bias-detector
  • merge-correlation
  • time-series
  • ai-patterns (human review burden metrics only - getQualifyingHumanReviews excludes PENDING)

Note

The ai-patterns module's top-level metrics (totalReviews, botReviewPercentage) intentionally include PENDING reviews to capture the full scope of bot activity. See statistics.md for details. As a result, botReviewPercentage has a different denominator than metrics in other modules.
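To make the denominator difference concrete, here is a small sketch. The Review shape, the sample data, and the submittedReviews helper are hypothetical. The botReviewPercentage formula shown (bot reviews over all reviews, PENDING included) is an assumption for illustration only; statistics.md holds the actual definition.

```typescript
// Hypothetical review shape, for illustration only.
interface Review {
  reviewerLogin: string;
  reviewerIsBot: boolean;
  state: string; // "PENDING", "APPROVED", "COMMENTED", ...
}

// Filter used by modules that exclude draft/unsubmitted reviews.
function submittedReviews(reviews: Review[]): Review[] {
  return reviews.filter((r) => r.state !== "PENDING");
}

// Assumed formula: share of bot reviews among ALL reviews, PENDING included.
function botReviewPercentage(reviews: Review[]): number {
  if (reviews.length === 0) return 0;
  const botCount = reviews.filter((r) => r.reviewerIsBot).length;
  return (botCount / reviews.length) * 100;
}

const sample: Review[] = [
  { reviewerLogin: "alice", reviewerIsBot: false, state: "APPROVED" },
  { reviewerLogin: "bob", reviewerIsBot: false, state: "PENDING" },
  { reviewerLogin: "lint[bot]", reviewerIsBot: true, state: "COMMENTED" },
  { reviewerLogin: "lint[bot]", reviewerIsBot: true, state: "PENDING" },
];

// Other modules would work from 2 submitted reviews;
// the top-level ai-patterns metrics would work from all 4.
const submittedCount = submittedReviews(sample).length;
const topLevelCount = sample.length;
```

Because the two counts differ whenever PENDING reviews exist, percentages from ai-patterns should not be compared directly with review counts from the other modules.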

Self-reviews

Reviews where the reviewer is the PR author are excluded from:

  • per-user-stats
  • bias-detector
  • merge-correlation
  • ai-patterns (human review burden metrics)

Self-reviews do not represent peer review activity. In merge-correlation specifically, self-reviews must not count toward avgReviewsBeforeMerge or affect zeroReviewMerges, as these metrics measure whether a PR received independent peer review before merging.

Exception: ghost placeholder - When GraphQL returns null for a deleted user account, the normalizer substitutes the shared placeholder ghost. The self-review exclusion is skipped when either the reviewer or the author login is ghost, to avoid incorrectly collapsing two unrelated deleted users onto the same identity. This guard applies to all modules listed above.

In bias-detector, this same exception also applies when constructing the Gini matrix domain: the ghost -> ghost diagonal remains an eligible cell and is not subtracted as a structurally impossible self-review pair.
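The self-review rule and its ghost guard can be sketched as follows. The function and type names are hypothetical, and comparing logins by exact string equality is an assumption; only the rule itself (skip the exclusion when either login is ghost) comes from the text above.

```typescript
// Shared placeholder login substituted for deleted accounts.
const GHOST = "ghost";

// A review is excluded as a self-review only when both identities are known.
function isExcludedSelfReview(authorLogin: string, reviewerLogin: string): boolean {
  if (authorLogin === GHOST || reviewerLogin === GHOST) {
    // At least one side is a deleted account, so two "ghost" logins may be
    // different people. Keep the review rather than treat it as a self-review.
    return false;
  }
  return authorLogin === reviewerLogin;
}

// Hypothetical shapes, for illustration only.
interface Review {
  reviewerLogin: string;
  state: string;
}

interface MergedPR {
  authorLogin: string;
  reviews: Review[];
}

// Reviews that count as independent peer review on a merged PR:
// submitted (non-PENDING) and not an excluded self-review.
function peerReviews(pr: MergedPR): Review[] {
  return pr.reviews.filter(
    (r) => r.state !== "PENDING" && !isExcludedSelfReview(pr.authorLogin, r.reviewerLogin)
  );
}
```

Under this sketch, a PR authored by ghost and reviewed by ghost keeps that review, which matches the ghost -> ghost diagonal remaining an eligible cell in the Gini matrix domain.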