Add real admission data pipeline and model calibration engine #1

Merged
YichengYang-Ethan merged 1 commit into main from claude/analyze-project-uGJQp on Mar 15, 2026
@YichengYang-Ethan

  • core/admission_data.py: CSV loader with GPA normalization (4/4.3/5/100
    scales), background tier classification, internship scoring, and
    per-program statistics with feature importance analysis
  • core/calibrator.py: Calibration engine that computes data-driven GPA
    thresholds, predicts outcomes, evaluates model accuracy, and generates
    school_ranker overrides
  • Updated school_ranker to accept calibration overrides for data-driven
    reach/target/safety classification
  • CLI: added 'stats' and 'calibrate' commands
  • data/admissions/sample.csv: 30 sample records across 11 programs
  • 45 new tests (218 total), all passing; ruff clean

https://claude.ai/code/session_014dkZ9Eq3DPVaUfRTeN2HXp

YichengYang-Ethan merged commit 9c534e3 into main on Mar 15, 2026
5 checks passed

Copilot AI left a comment

Pull request overview

This PR introduces a real admissions data pipeline and a calibration engine to derive data-driven thresholds/feature weights from historical outcomes, and integrates those thresholds into the school ranking flow via optional overrides.

Changes:

  • Added core/admission_data.py to load/normalize admissions CSVs and compute per-program statistics (including feature importance).
  • Added core/calibrator.py to compute calibrated program thresholds, evaluate prediction accuracy, and generate school_ranker override dictionaries.
  • Extended CLI and school_ranker to consume/show calibration outputs; added sample/template CSVs and new tests for the new modules.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| core/admission_data.py | CSV loader + GPA normalization + background tiering + internship scoring + stats/feature importance. |
| core/calibrator.py | Calibration engine (thresholds, accuracy evaluation, recommendations) + ranker override generation. |
| core/school_ranker.py | Adds optional calibration_overrides and override-aware reach/target/safety classification. |
| cli/main.py | Adds stats and calibrate CLI commands to summarize data and run calibration. |
| data/admissions/template.csv | Adds admissions CSV header template. |
| data/admissions/sample.csv | Adds sample admissions dataset for demonstration/testing. |
| tests/test_admission_data.py | New unit tests for CSV loading, normalization, scoring, and stats computation. |
| tests/test_calibrator.py | New unit tests for calibration, prediction, and override generation. |

Comment on lines +80 to +108
    if scale == 4:
        return min(4.0, gpa)

    breakpoints = _GPA_SCALE_TO_4.get(scale)
    if breakpoints is None:
        # Unknown scale: attempt linear conversion
        return min(4.0, gpa * 4.0 / scale)

    for threshold, mapped_lo, mapped_hi in breakpoints:
        if gpa >= threshold:
            # Find the top of this segment
            # For the highest segment, cap at max GPA
            seg_top = scale if breakpoints[0] == (threshold, mapped_lo, mapped_hi) else threshold
            # Use previous segment's threshold as the top
            idx = breakpoints.index((threshold, mapped_lo, mapped_hi))
            if idx == 0:
                seg_top = scale
            else:
                seg_top = breakpoints[idx - 1][0]

            if seg_top == threshold:
                return mapped_hi

            frac = (gpa - threshold) / (seg_top - threshold)
            return mapped_lo + frac * (mapped_hi - mapped_lo)

    return 0.0
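
The excerpt above assigns `seg_top` twice and looks up the segment index on every iteration. A minimal sketch of the same piecewise-linear idea follows, assuming breakpoints are `(threshold, mapped_lo, mapped_hi)` tuples ordered from highest to lowest, as the excerpt implies. The function name and the breakpoint values are hypothetical; the real contents of `_GPA_SCALE_TO_4` are not shown in this view.

```python
from __future__ import annotations

# Hypothetical breakpoints for a 100-point scale, highest segment first:
# (raw threshold, mapped low, mapped high). Not the project's real table.
EXAMPLE_BREAKPOINTS_100 = [
    (90.0, 3.7, 4.0),
    (80.0, 3.0, 3.7),
    (60.0, 1.0, 3.0),
]


def normalize_piecewise(
    gpa: float,
    scale: float,
    breakpoints: list[tuple[float, float, float]],
) -> float:
    """Map a raw GPA onto a 4.0 scale by linear interpolation within segments."""
    seg_top = scale  # the highest segment ends at the scale maximum
    for threshold, mapped_lo, mapped_hi in breakpoints:
        if gpa >= threshold:
            if seg_top <= threshold:
                # Degenerate segment: avoid dividing by zero.
                return mapped_hi
            # Clamp so a GPA above the scale maximum cannot exceed mapped_hi.
            frac = min(1.0, (gpa - threshold) / (seg_top - threshold))
            return mapped_lo + frac * (mapped_hi - mapped_lo)
        seg_top = threshold  # the next (lower) segment ends where this one starts
    return 0.0  # below the lowest threshold
```

Carrying `seg_top` through the loop removes the `breakpoints.index(...)` lookup and keeps one assignment per segment.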


Comment on lines +187 to +198
    # Count internships (Chinese: 段)
    for char in "段":
        count = desc.count(char)
        if count > 0:
            # Extract number before 段
            for i, c in enumerate(desc):
                if c == "段":
                    if i > 0 and desc[i - 1].isdigit():
                        n = int(desc[i - 1])
                        score += min(n * 1.5, 5.0)
                        break
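
The excerpt above reads only the single character before 段 (a measure word, as in "3段实习", "3 internships"), so a count such as "12段" would be read as 2. A minimal sketch using a regular expression handles multi-digit counts. The scoring rule (1.5 points per internship, capped at 5.0) is taken from the excerpt; the function name is hypothetical, and Chinese numerals such as "三段" are not handled here, just as in the excerpt.

```python
import re

# One or more ASCII digits immediately before the measure word 段.
_COUNT_BEFORE_DUAN = re.compile(r"([0-9]+)段")


def internship_count_score(desc: str) -> float:
    """Return the count-based score for the first 'N段' found, else 0.0."""
    match = _COUNT_BEFORE_DUAN.search(desc)
    if match is None:
        return 0.0
    n = int(match.group(1))
    # Same rule as the excerpt: 1.5 points per internship, capped at 5.0.
    return min(n * 1.5, 5.0)
```

For example, `internship_count_score("3段量化实习")` gives 4.5, and a description with no digit before 段 gives 0.0.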

        for feat in feature_sums
    }
    total = sum(raw.values()) or 1.0
    return {feat: round(val / total, 3) for feat, val in sorted(raw.items(), key=lambda x: -x[1])}
Comment on lines +101 to +124
    threshold = ProgramThreshold(
        program_id=stats.program_id,
        sample_size=stats.total_records,
        confidence=_confidence_level(stats.total_records),
        observed_acceptance_rate=stats.observed_acceptance_rate,
    )

    if accepted:
        # GPA floor: minimum GPA among accepted applicants
        gpas_accepted = [r.gpa_normalized for r in accepted]
        threshold.gpa_floor = min(gpas_accepted)
        threshold.gpa_target = sum(gpas_accepted) / len(gpas_accepted)
        # Safe threshold: 90th percentile of accepted
        sorted_gpas = sorted(gpas_accepted)
        p90_idx = int(len(sorted_gpas) * 0.9)
        threshold.gpa_safe = sorted_gpas[min(p90_idx, len(sorted_gpas) - 1)]

        # Background tier
        threshold.max_bg_tier_accepted = max(r.bg_tier for r in accepted)

        # Intern score
        intern_scores = [r.intern_score for r in accepted]
        threshold.min_intern_score_accepted = min(intern_scores)
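
With the index rule in the excerpt above, `int(len(sorted_gpas) * 0.9)` selects the last element whenever there are ten or fewer accepted records, so `gpa_safe` equals the maximum accepted GPA for small samples. A minimal sketch of a linearly interpolated percentile is shown for comparison; both function names are hypothetical and this is one common percentile definition, not necessarily the one the project should adopt.

```python
from __future__ import annotations


def percentile(values: list[float], pct: float) -> float:
    """Linearly interpolated percentile (pct in [0, 100]) of a non-empty list."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0  # fractional index into ordered
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def index_based_p90(values: list[float]) -> float:
    """The index rule from the excerpt, reproduced for comparison."""
    ordered = sorted(values)
    return ordered[min(int(len(ordered) * 0.9), len(ordered) - 1)]
```

On five accepted GPAs from 3.0 to 3.8 in steps of 0.2, the index rule returns the maximum, while the interpolated version lands between the two highest values.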

Comment on lines +215 to +218
result_entry["calibrated"] = True
result_entry["confidence"] = prog_overrides.get("confidence", "low")
result_entry["sample_size"] = prog_overrides.get("sample_size", 0)

from __future__ import annotations

import csv
import tempfile
Comment on lines +199 to +215
    # Quality keywords (Chinese + English)
    quality_keywords = {
        "顶级": 2.0, "top": 1.5, "百亿": 1.5, "头部": 1.5,
        "一线": 1.0, "知名": 0.8, "大型": 0.5,
    }
    for kw, pts in quality_keywords.items():
        if kw in desc:
            score += pts

    # Type keywords
    type_keywords = {
        "量化": 1.5, "quant": 1.5, "投行": 1.5, "ib": 1.0,
        "对冲": 1.5, "hedge": 1.5, "私募": 1.0, "qr": 1.0,
        "trading": 1.0, "研究": 0.8, "金工": 0.8,
        "三中一华": 2.0, "高盛": 2.0, "goldman": 2.0,
        "摩根": 2.0, "morgan": 1.5, "kaggle": 1.5,
    }
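
Plain substring checks such as `kw in desc` also match short ASCII keywords inside longer words: "ib" occurs in "contribution" and "library". A minimal sketch of a stricter matcher follows, assuming case-insensitive matching is acceptable (whether the project lowercases `desc` beforehand is not visible here). The function name is hypothetical.

```python
from __future__ import annotations

import re


def keyword_score(desc: str, keywords: dict[str, float]) -> float:
    """Sum the points of every keyword present in desc."""
    text = desc.lower()
    score = 0.0
    for kw, pts in keywords.items():
        if kw.isascii():
            # ASCII keywords must stand alone: no ASCII letter or digit on
            # either side. Lookarounds are used instead of \b because \b
            # treats Chinese characters as word characters, so "ib实习"
            # would have no boundary after "ib".
            pattern = r"(?<![a-z0-9])" + re.escape(kw) + r"(?![a-z0-9])"
            if re.search(pattern, text):
                score += pts
        elif kw in text:
            # Chinese keywords have no word boundaries; substring match.
            score += pts
    return score
```

With this matcher, "contribution to library" scores 0.0 for the keyword "ib", while "IB实习" still scores its points.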
Comment on lines +184 to +190
    # Classification (with optional data-driven overrides).
    prog_overrides = overrides.get(prog.id)
    category = _classify(
        user_gpa=profile.gpa,
        program_avg_gpa=prog.avg_gpa,
        acceptance_rate=prog.acceptance_rate,
        overrides=prog_overrides,
Comment on lines +779 to +785
    console.print(Panel("Ranker Overrides (Applied)", border_style="green"))
    for pid, ov in sorted(overrides.items()):
        console.print(
            f" {pid}: reach<{ov['reach_gpa_threshold']:.2f} "
            f"safe>={ov['safety_gpa_threshold']:.2f} "
            f"[dim](n={ov['sample_size']}, {ov['confidence']})[/dim]"
        )
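
The `reach<…` and `safe>=…` labels printed above suggest how an override entry could drive classification. A minimal sketch under that reading follows: the keys `reach_gpa_threshold` and `safety_gpa_threshold` come from the excerpt, while the function name, the `margin` parameter, and the fallback rule used when no override exists are assumptions, since the body of `_classify` is not shown.

```python
from __future__ import annotations


def classify_with_overrides(
    user_gpa: float,
    program_avg_gpa: float,
    overrides: dict | None = None,
    margin: float = 0.2,
) -> str:
    """Return 'reach', 'target', or 'safety' for one program."""
    if overrides:
        reach_below = overrides["reach_gpa_threshold"]
        safety_from = overrides["safety_gpa_threshold"]
    else:
        # Hypothetical fallback: a fixed band around the program's average GPA.
        reach_below = program_avg_gpa - margin
        safety_from = program_avg_gpa + margin
    if user_gpa < reach_below:
        return "reach"
    if user_gpa >= safety_from:
        return "safety"
    return "target"
```

Keeping the override lookup in one place means programs without calibration data fall through to the same comparison logic with default bounds.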
Comment on lines +7 to +16
from core.admission_data import AdmissionRecord
from core.calibrator import (
    CalibrationResult,
    ProgramThreshold,
    calibrate_all,
    calibrate_program,
    generate_ranker_overrides,
    predict_outcome,
)
from core.admission_data import compute_program_stats