This document defines the metrics computed by review-insights and their mathematical foundations.
The analysis dataset is a censored snapshot:
- A PR is included when `since <= pr.createdAt <= until`.
- For included PRs, reviews with `review.createdAt > until` are excluded.
- PR merge/close state is evaluated as of `until`; a PR merged or closed after `until` is treated as still open.
- Current-snapshot commit and PR size fields (`commitMessages`, `additions`, `deletions`) are treated as unobserved for PRs not merged at `until`; the commit-trailer-dependent `aiCategory` is also unobserved for those PRs. `ai-authored` is retained because it is determined from the PR author, not from mutable commit metadata.
This makes historical reruns stable instead of letting later review activity or later pushes leak into an older window.
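The censoring rules above can be sketched as a pure filter over the collected data. This is a minimal illustration; the interface shapes and epoch-millisecond timestamps are assumptions for the sketch, not the tool's actual types:

```typescript
interface Review { createdAt: number }
interface PR {
  createdAt: number;
  mergedAt: number | null;
  closedAt: number | null;
  reviews: Review[];
}

// Apply the snapshot rules: keep PRs created inside [since, until],
// drop reviews submitted after `until`, and treat merges/closes that
// happened after `until` as if they had not happened yet.
function snapshot(prs: PR[], since: number, until: number): PR[] {
  return prs
    .filter((pr) => since <= pr.createdAt && pr.createdAt <= until)
    .map((pr) => ({
      ...pr,
      reviews: pr.reviews.filter((r) => r.createdAt <= until),
      mergedAt: pr.mergedAt !== null && pr.mergedAt <= until ? pr.mergedAt : null,
      closedAt: pr.closedAt !== null && pr.closedAt <= until ? pr.closedAt : null,
    }));
}
```

Because `snapshot` depends only on its inputs, rerunning it for the same `[since, until]` window always reproduces the same dataset.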
Source: per-user-stats.ts
Number of unique PRs a user reviewed (not total review submissions). If a reviewer submits multiple reviews on the same PR, it counts as 1.
The active reviewer population is the subset of users with at least one qualifying reviewed PR:

$$R^+ = \{u \mid \texttt{reviewsGiven}(u) > 0\}$$

The top-reviewer statistic is defined as the full argmax set rather than a single login:

$$\texttt{maxReviewsGiven} = \max_{u \in R^+} \texttt{reviewsGiven}(u)$$

and

$$\texttt{topReviewers} = \{u \in R^+ \mid \texttt{reviewsGiven}(u) = \texttt{maxReviewsGiven}\}$$

If $R^+ = \emptyset$, then `topReviewers = []` and `maxReviewsGiven = null`.
For deterministic serialization, topReviewers is sorted in ascending code-unit order of the login strings.
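The argmax-set and tie-handling rules above can be sketched as follows (the `Map<login, count>` input shape is an assumption for the sketch, not the actual `per-user-stats.ts` signature):

```typescript
// Compute the full argmax set of reviewsGiven over active reviewers,
// sorted ascending by code units for deterministic serialization.
function computeTopReviewers(reviewsGiven: Map<string, number>): {
  topReviewers: string[];
  maxReviewsGiven: number | null;
} {
  // Restrict to the active reviewer population (reviewsGiven > 0).
  const active = [...reviewsGiven].filter(([, n]) => n > 0);
  if (active.length === 0) return { topReviewers: [], maxReviewsGiven: null };
  const max = Math.max(...active.map(([, n]) => n));
  const top = active.filter(([, n]) => n === max).map(([login]) => login);
  top.sort(); // default String sort is ascending code-unit order
  return { topReviewers: top, maxReviewsGiven: max };
}
```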
Total review submissions received across all PRs authored by a user. Multiple reviews on the same PR each count separately.
Count of review submissions by state for each reviewer. These are per-submission counts (not per-PR), so:

$$\sum_{\text{state}} \texttt{count}_{\text{state}} \ge \texttt{reviewsGiven}$$

The inequality holds because the left side counts every submission while the right counts unique PRs.
For each PR authored by a user, the time from PR creation to the earliest qualifying review, averaged across all PRs that received at least one review:

$$\texttt{firstReviewLatency}(pr) = \min_{r \in \text{reviews}(pr)} r.\texttt{createdAt} - pr.\texttt{createdAt}$$

where only reviews with `review.createdAt >= pr.createdAt` qualify.
The median of the same per-PR first-review latencies used by avgTimeToFirstReview. The median is more robust to outliers (e.g., a single PR left unreviewed for days) and better represents the typical review experience.
For an even number of PRs, the median is the arithmetic mean of the two middle values. Returns null when the user has no PRs with a qualifying first review.
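The median rule above (mean of the two middle values for even-length input, `null` when there is nothing to aggregate) is the standard one; a sketch:

```typescript
// Median of per-PR first-review latencies. For an even-length input,
// returns the arithmetic mean of the two middle values; returns null
// when no PR has a qualifying first review.
function median(values: number[]): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b); // numeric, not lexicographic
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

Note the explicit numeric comparator: the default `Array.prototype.sort` compares code units and would mis-sort latencies.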
Source: bias-detector.ts
A matrix $M$ where $M_{ij}$ counts the qualifying review submissions by reviewer $i$ on PRs authored by $j$.
Bias detection conditions on both reviewer activity and author activity.
Let:

- $R$ = reviewers with at least one qualifying review submission
- $A^+$ = authors with at least one qualifying received review submission
- $S = \{(i, j) \in R \times A^+ \mid M_{ij} > 0\}$, the observed reviewer-author interaction support

Explicit zero-valued matrix entries are treated as absent support and do not enter $S$.
The detector fits a quasi-independence model on $S$, with row and column margins matched to the observed review matrix:

$$E_{ij} = \alpha_i \beta_j \quad \text{for } (i, j) \in S$$

The parameters $\alpha_i$ and $\beta_j$ are estimated by iterative proportional fitting (IPF). Convergence is checked on the fitted row and column margins using the maximum relative margin error:

$$\varepsilon = \max\left( \max_{i} \frac{\bigl| \sum_{j:(i,j) \in S} (E_{ij} - M_{ij}) \bigr|}{\sum_{j:(i,j) \in S} M_{ij}},\ \max_{j} \frac{\bigl| \sum_{i:(i,j) \in S} (E_{ij} - M_{ij}) \bigr|}{\sum_{i:(i,j) \in S} M_{ij}} \right)$$

iterating until $\varepsilon$ drops below a fixed tolerance, with a hard cap of 10,000 IPF iterations.
This means a high-volume reviewer paired with a high-volume author is compared against its activity-adjusted expected count $E_{ij}$, not against a global average cell count.
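The IPF fit described above can be sketched compactly with the support stored sparsely under `"reviewer|author"` keys. The key encoding, uniform starting table, and default tolerance are illustrative assumptions, not the `bias-detector.ts` implementation:

```typescript
type Support = Map<string, number>; // "reviewer|author" -> count
const rowOf = (k: string) => k.split("|")[0];
const colOf = (k: string) => k.split("|")[1];

// Sum cell values grouped by row or column key.
function margins(T: Support, keyOf: (k: string) => string): Map<string, number> {
  const m = new Map<string, number>();
  for (const [k, v] of T) m.set(keyOf(k), (m.get(keyOf(k)) ?? 0) + v);
  return m;
}

// Fit E_ij = alpha_i * beta_j on the observed support by iterative
// proportional fitting, matching row and column margins of M.
function fitQuasiIndependence(M: Support, tol = 1e-8, maxIter = 10_000): Support {
  const rows = margins(M, rowOf);
  const cols = margins(M, colOf);
  const E: Support = new Map();
  for (const k of M.keys()) E.set(k, 1); // uniform start on the support
  for (let iter = 0; iter < maxIter; iter++) {
    // Row scaling step: match each fitted row margin to the observed one.
    const rSum = margins(E, rowOf);
    for (const [k, e] of E) E.set(k, (e * rows.get(rowOf(k))!) / rSum.get(rowOf(k))!);
    // Column scaling step.
    const cSum = margins(E, colOf);
    for (const [k, e] of E) E.set(k, (e * cols.get(colOf(k))!) / cSum.get(colOf(k))!);
    // Convergence: maximum relative margin error over rows and columns.
    const r2 = margins(E, rowOf);
    const c2 = margins(E, colOf);
    let err = 0;
    for (const [i, m] of rows) err = Math.max(err, Math.abs(r2.get(i)! - m) / m);
    for (const [j, m] of cols) err = Math.max(err, Math.abs(c2.get(j)! - m) / m);
    if (err < tol) return E;
  }
  return E; // hit the iteration cap; callers may treat the fit as failed
}
```

Zero cells never enter `M`, so they never enter the fitted support, matching the quasi-independence domain described above.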
For each observed reviewer-author pair, the detector computes the Pearson residual:

$$r_{ij} = \frac{M_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
A pair is flagged when both of the following hold:

- $M_{ij} > E_{ij}$
- $r_{ij} > t$, where $t$ is the `bias-threshold` input (default: 2.0)
The output for each flagged pair includes:
- `count` = $M_{ij}$
- `expectedCount` = $E_{ij}$
- `pearsonResidual` = $r_{ij}$
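The flagging rule can be sketched directly from the definitions above (the sparse `"reviewer|author"` map layout is an assumption for the sketch):

```typescript
interface BiasFlag {
  pair: string;
  count: number;
  expectedCount: number;
  pearsonResidual: number;
}

// Flag pairs where the observed count exceeds the activity-adjusted
// expectation AND the Pearson residual (M - E) / sqrt(E) exceeds the
// bias-threshold input t (default 2.0).
function flagPairs(
  M: Map<string, number>,
  E: Map<string, number>,
  threshold = 2.0,
): BiasFlag[] {
  const flags: BiasFlag[] = [];
  for (const [pair, count] of M) {
    const expected = E.get(pair);
    if (expected === undefined || expected <= 0) continue; // outside fitted support
    const r = (count - expected) / Math.sqrt(expected);
    if (count > expected && r > threshold) {
      flags.push({ pair, count, expectedCount: expected, pearsonResidual: r });
    }
  }
  return flags;
}
```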
If the quasi-independence model cannot be fit numerically, no reviewer-author pair is flagged. In that case the reports surface bias warnings as unavailable rather than claiming that no pair exceeded the threshold. The review matrix and Gini coefficient are still reported because they do not depend on the fitted model.
**Note: Interpretation**
The Pearson residual is a model diagnostic, not a multiplicity-adjusted significance test.
It answers "how much larger was the observed cell than the activity-adjusted expectation?" rather than "what is the family-wise false positive rate across all pairs?".
Measures inequality of review distribution. Computed from the sorted array of all matrix cell values $x_1 \le x_2 \le \dots \le x_n$ using the standard discrete Gini formula:

$$G = \frac{2 \sum_{k=1}^{n} k\, x_k}{n \sum_{k=1}^{n} x_k} - \frac{n+1}{n}$$
- $G = 0$: perfectly equal distribution (every pair has the same review count)
- $G \to 1$: maximally unequal (all reviews concentrated in one pair)
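A sketch of the standard sorted-array Gini formula over the flattened cell values (the zero-total convention of returning 0 is an assumption for the sketch):

```typescript
// Gini coefficient over all matrix cell values, zeros included:
// G = (2 * sum(k * x_k)) / (n * sum(x_k)) - (n + 1) / n
// with x sorted ascending and k being the 1-based rank.
function gini(values: number[]): number {
  const x = [...values].sort((a, b) => a - b);
  const n = x.length;
  const total = x.reduce((s, v) => s + v, 0);
  if (n === 0 || total === 0) return 0; // degenerate: nothing to concentrate
  const weighted = x.reduce((s, v, k) => s + (k + 1) * v, 0);
  return (2 * weighted) / (n * total) - (n + 1) / n;
}
```

With all counts concentrated in one cell of an $n$-cell matrix, this yields $(n-1)/n$, approaching 1 as the matrix grows.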
**Note on structural zeros**
The Pearson residual detector and Gini coefficient use different matrix domains, because they answer different questions:
Pearson residual detector (active interaction submatrix)
The quasi-independence model is fit over the observed interaction support:
- rows are reviewers with at least one qualifying review
- columns are authors with at least one qualifying received review
- only reviewer-author pairs with at least one observed qualifying review enter the fitted support
- explicit zero-valued cells are treated the same as absent pairs and do not enter the fitted support
- unobserved reviewer-author pairs are excluded because the dataset does not record whether they were genuine review opportunities
Authors whose PRs received zero qualifying reviews do not enter this model because their column margin is zero and they cannot contribute to a positive flag.
Gini coefficient (full matrix including zeros)
The Gini coefficient is computed over the full observed-participant reviewer-author matrix, including zero cells. The total number of cells is $|R| \cdot |A| - |D|$, where:

- $R$ = the set of users who submitted at least one qualifying review (i.e., the row keys of the review matrix)
- $A$ = the set of all PR authors in the filtered PR set, including authors whose PRs received zero qualifying reviews
- $D = \{u \in R \cap A \mid u \ne \texttt{ghost}\}$, the set of genuine identity overlaps whose diagonal cells are excluded as self-reviews
Zero cells represent observed reviewer/author identities with no qualifying reviews in this matrix; they should not be read as proven review-assignment opportunities because the dataset does not contain the full review-assignment opportunity graph. Including these zeros is essential for measuring concentration across observed participants. Without zeros, a reviewer who reviews only one author would yield $G = 0$, misleadingly signaling perfect equality.
For the shared `ghost` placeholder, the diagonal is retained instead of being subtracted from the Gini matrix domain. `ghost -> ghost` may represent two different deleted accounts that GitHub exposed as `null`, so treating it as a guaranteed self-review would bias the denominator downward.
Source: merge-correlation.ts
For each author, the average number of qualifying review submissions on their merged PRs.
where a merged PR is one with `pr.mergedAt <= until`, and reviews are filtered by the same bot/PENDING/self-review rules and must satisfy `review.createdAt <= pr.mergedAt`.
If an author has no merged PRs, the value is `null` in machine-readable outputs and N/A in the HTML report.
The median of per-PR review counts across merged PRs for each author. Like medianTimeToFirstReview, this is more robust to outliers (e.g., a single PR with an unusually high number of review submissions) and better represents the typical merge experience.
For an even number of merged PRs, the median is the arithmetic mean of the two middle values. Returns null when the author has no merged PRs.
Count of merged PRs by an author that had zero qualifying reviews.
Source: ai-patterns.ts
This module splits into two populations:
- Bot observability metrics (`botReviewers`, `botReviewPercentage`, `aiCoAuthoredPRs`, `totalPRs`) ignore `include-bots` and operate on the full unfiltered dataset. `aiCoAuthoredPRs` only counts PRs with observable commit metadata, so it remains a lower-bound estimate when commit metadata is censored by the observation window.
- `humanReviewBurden` uses a comparison cohort that excludes traditional bot-authored PRs (`authorIsBot === true`) and PRs whose `aiCategory` is unobservable at the cutoff, regardless of `include-bots`.
$$\texttt{botReviewPercentage} = 100 \cdot \frac{\texttt{botReviews}}{\texttt{totalReviews}}$$

where `totalReviews` includes all reviews (including PENDING and bot reviews) across all PRs.
**Note on PENDING review counting**
This module intentionally counts PENDING reviews in totalReviews, unlike per-user-stats.ts, bias-detector.ts, and merge-correlation.ts which exclude them. The purpose of this module is to observe the full scope of bot activity, and PENDING bot reviews (e.g., automated checks in progress) are part of that picture. As a result, botReviewPercentage has a different denominator than metrics in other modules — direct cross-module comparison of review counts should account for this difference.
Count of PRs where any observable commit message contains an AI co-author trailer as defined in ai-human-review-burden.md. Only the last commit per PR is inspected (GraphQL limitation), and PRs with observation-window-censored commit metadata are not counted, so this is a lower-bound estimate.
reviewRounds counts distinct reviewed revisions per PR from qualifying human reviews observed at or after PR creation, using the commit SHA attached to each review. PRs are excluded from this metric when an observed review is missing a commit SHA or when the per-PR review list is truncated.
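The `reviewRounds` rule can be sketched as follows (the interface shape and the `truncated` flag are assumptions for the sketch, not the `ai-patterns.ts` types):

```typescript
interface HumanReview {
  createdAt: number;
  commitSha: string | null; // SHA of the revision the review was attached to
}

// Count distinct reviewed revisions for one PR: unique commit SHAs across
// qualifying human reviews observed at or after PR creation. Returns null
// (PR excluded from the metric) when any observed review is missing a SHA
// or when the per-PR review list is truncated.
function reviewRounds(
  reviews: HumanReview[],
  prCreatedAt: number,
  truncated: boolean,
): number | null {
  if (truncated) return null;
  const observed = reviews.filter((r) => r.createdAt >= prCreatedAt);
  if (observed.some((r) => r.commitSha === null)) return null;
  return new Set(observed.map((r) => r.commitSha)).size;
}
```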
See ai-human-review-burden.md for the full specification of PR classification (ai-authored / ai-assisted / human-only) and per-group human review burden metrics: humanReviewsPerPR, firstReviewLatencyMs, and reviewRounds are reported as distribution statistics (median, p90, mean); unreviewedRate is a single per-group rate; and changeRequestRate is reported as per-PR macro-average statistics (median, mean). These metrics are computed only on the comparison cohort of non-traditional-bot PRs whose AI category is observable at the cutoff.
Source: html-report.ts
| KPI | Definition |
|---|---|
| Pull Requests | `filteredPRs.length` - total PRs after author bot filtering |
| Unique PR Reviews | From `userStats`; uses the per-user qualifying-review filters. When `include-bots` is false, bot-authored PRs are skipped entirely and bot reviewer reviews are excluded. PENDING reviews are always excluded; self-reviews are excluded when both identities are known, with the shared ghost/UNKNOWN_USER placeholder exempt. |
| Active Reviewers | Count of users in `userStats` with `reviewsGiven > 0`; uses the same qualifying-review filters as Unique PR Reviews |
| PR Authors | Count of distinct `pr.author` values in `filteredPRs`, after author bot filtering |
| Avg Reviewers/PR | Unique PR Reviews divided by Pull Requests; the numerator is `userStats`-derived, the denominator is `filteredPRs`-derived |
| Gini Coefficient | From bias detection; uses the bias-detector qualifying-review filters |
| Data Completeness | Collection completeness label, not a post-filtering count |
Source: html-report.ts
This card reports a descriptive ranking over the observed active reviewer population. It does not perform hypothesis testing or claim inferential significance.
| Field | Definition |
|---|---|
| Top reviewers | `topReviewers` - the full argmax set of `reviewsGiven` over users with `reviewsGiven > 0` |
| Max reviews given | `maxReviewsGiven` - the maximum `reviewsGiven` among active reviewers |
| Active reviewer population | `reviewerCount` - number of users with `reviewsGiven > 0` |
| Tie size | Size of the `topReviewers` argmax set; a value greater than 1 indicates a tie |
The Reviews Given bar chart in the HTML report is also restricted to the active reviewer population so the visual ranking does not include zero-review authors.
Source: burden-chart.ts, rendered in html-report.ts
This section visualizes the human review burden metrics from ai-human-review-burden.md. It appears after the AI & Bot Patterns card.
Traditional bot-authored PRs and PRs whose AI classification is unobservable at the cutoff are excluded from this comparison section even when include-bots is true. The report notes those excluded counts when present. Size-stratified cells also exclude PRs whose size at the cutoff is unobservable.
| Component | Content |
|---|---|
| PR count cards | Sample size (n) and percentage for each AI category (ai-authored, ai-assisted, human-only) |
| Grouped bar charts | One chart per metric — bars show median, whisker lines extend to p90 (where available). Metrics: Reviews/PR, Time to 1st Review, Change Request Rate (median-only, no p90 — see rationale), Review Rounds |
| Detailed metrics table | Median and p90 columns per category, plus Unreviewed Rate (highlighted in red when > 20%) |
| Size-stratified table | Median values per (AI category × size tier) cell, with sample sizes. Cells with < 3 PRs show "—" while retaining n |
- Median over mean — Review counts and latencies follow right-skewed distributions. The median represents typical burden; the mean is inflated by outliers.
- P90 whiskers — Show upper-tail burden at the 90th percentile without requiring box plots; they do not show the maximum or worst case.
- Sample sizes everywhere — Small-n comparisons are misleading; displaying n= lets readers judge statistical reliability.
- Unreviewed Rate alongside latency — Latency is only computed for PRs that received reviews. A high unreviewed rate means the latency metric suffers from survivorship bias.
- Size stratification — PR size can confound review burden. The table compares categories within the same coarse size tier (S/M/L/Empty), which avoids direct cross-tier comparisons but does not isolate AI causality or adjust for other confounders.