Merged

Changes from all commits
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,7 +1,6 @@
input/
output/
data/
db/
notebooks/
simulated_data/
templates/
@@ -18,3 +17,6 @@ build
*.egg-info/
*.csv
db/old/
static/
templates/
data/old_20250718/
102 changes: 40 additions & 62 deletions README.md
@@ -2,113 +2,91 @@

This project is a comprehensive pipeline designed to match mentees with suitable mentors based on their professional profiles and research interests. It leverages Large Language Models (LLMs) for summarization, evaluation, and vector embeddings to find the best possible matches from a corpus of mentor CVs.

## Dataflow and Caching

The pipeline is designed to be robust and efficient, using a single `data/mentor_data.csv` file as the source of truth for all mentor information. It checks the state of this file to avoid re-running expensive processing steps.

```mermaid
flowchart LR
subgraph "Mentor Data Pipeline (Runs only when needed)"
A["--mentors dir (PDFs/DOCX)"] --> B{"main.py"};
B -- "1. Load/Update" --> C["data/mentor_data.csv"];
C -- "2. Check for 'Mentor_Summary' column" --> B;
B -- "3. Summarize (if needed)" --> C;
C -- "4. Check for 'Rank' column" --> B;
B -- "5. Rank (if needed)" --> C;
C -- "6. Check for FAISS index" --> B;
B -- "7. Build Index (if needed)" --> D[("db/embedding-model/index.faiss")];
end

subgraph "Mentee Matching Pipeline"
E["--mentees dir (JSON + CV)"] --> F{"main.py"};
D --> F;
F --> G[/"output/best_matches.json"/];
end
```
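The gated flow above can be sketched as a driver that runs only the missing steps. This is an illustrative sketch, not the project's actual code: `run_mentor_pipeline` and the injected `summarize`, `rank`, and `build_index` callables are hypothetical stand-ins for the real summarization, ranking, and FAISS indexing steps.

```python
import os

import pandas as pd

def run_mentor_pipeline(csv_path: str, index_path: str,
                        summarize, rank, build_index) -> None:
    """Run only the pipeline steps whose outputs are missing.

    `summarize`, `rank`, and `build_index` are injected callables that
    stand in for the project's real steps; the column checks mirror the
    numbered edges in the diagram.
    """
    df = pd.read_csv(csv_path)

    if "Mentor_Summary" not in df.columns:   # steps 2-3: summarize if needed
        df["Mentor_Summary"] = summarize(df)
        df.to_csv(csv_path, index=False)

    if "Rank" not in df.columns:             # steps 4-5: rank if needed
        df["Rank"] = rank(df)
        df.to_csv(csv_path, index=False)

    if not os.path.exists(index_path):       # steps 6-7: build index if needed
        build_index(df, index_path)
```

On a second invocation with a fully populated CSV and an existing index file, none of the callables fire, which is the cheap-resume behavior the diagram describes.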

- **Intelligent Caching**: The pipeline checks for the existence of `data/mentor_data.csv` and its columns (`Mentor_Summary`, `Rank`) to determine which steps to run. For example, if the `Mentor_Summary` column is already present, the summarization step is skipped.
- **Atomic Writes**: All updates to `data/mentor_data.csv` are performed atomically to prevent data corruption if the script is interrupted.
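The atomic-write pattern can be sketched as follows (the helper name is illustrative; the project's actual implementation may differ):

```python
import os
import tempfile

import pandas as pd

def atomic_write_csv(df: pd.DataFrame, path: str) -> None:
    """Write df to a temp file in the same directory, then rename it.

    os.replace() is an atomic rename on POSIX, so an interrupted run can
    never leave a half-written mentor_data.csv behind: either the old
    file survives intact or the new one replaces it completely.
    """
    fd, tmp_path = tempfile.mkstemp(
        dir=os.path.dirname(os.path.abspath(path)), suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", newline="") as f:
            df.to_csv(f, index=False)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The temp file must live in the same directory as the target, because `os.replace` is only guaranteed atomic within a single filesystem.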

## How to Use the Pipeline

### Mentee Input Data Structure

Before running the matching process, you must structure the mentee input data correctly inside the `input/` directory.

1. **Create a subdirectory for each mentee.** The name of the subdirectory should be the mentee's email address (e.g., `input/john.doe@email.com/`).

2. **Inside each mentee's subdirectory, add their CV file(s)** (e.g., `.pdf`, `.docx`).

3. **Add a JSON file containing the mentee's information.** The script uses the first JSON file it finds in the directory. The content must follow this structure:

```json
{
  "first_name": "Katelyn",
  "last_name": "Senkus",
  "role": "Mentee",
  "research_Interest": [
    "Team Science",
    "Translational Research",
    "Lab-based Research"
  ],
  "submissions_files": ["Senkus_CV_3-26-25.docx"]
}
```
- `first_name`: The mentee's first name.
- `last_name`: The mentee's last name.
- `research_Interest`: A list of strings representing the mentee's research interests, ranked in order of preference.
- `submissions_files`: A list containing the filename of the CV to be used for matching. The script will find this file within the same directory, even if it has a timestamp prefix (e.g., `1743173574187_Senkus_CV_3-26-25.docx`).
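The prefix-tolerant CV lookup described above could be implemented along these lines (a sketch; `resolve_cv` is an illustrative name, not the project's actual function):

```python
from pathlib import Path
from typing import Optional

def resolve_cv(mentee_dir: str, filename: str) -> Optional[Path]:
    """Find `filename` in `mentee_dir`, tolerating a timestamp prefix.

    Matches 'Senkus_CV_3-26-25.docx' exactly, or a prefixed variant
    such as '1743173574187_Senkus_CV_3-26-25.docx'.
    """
    directory = Path(mentee_dir)
    exact = directory / filename
    if exact.is_file():
        return exact
    # Fall back to any file whose name ends with '_<filename>'.
    for candidate in sorted(directory.iterdir()):
        if candidate.is_file() and candidate.name.endswith(f"_{filename}"):
            return candidate
    return None
```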

### Running the Pipeline

The entire pipeline is executed from the root directory via the `main.py` script.

#### Command-Line Arguments
- `--mentees`: **(Required)** Path to the root directory containing mentee subdirectories (e.g., `input/`).
- `--num_mentors`: **(Required)** The number of initial candidates to retrieve from the similarity search for each mentee.
- `--mentors`: **(Optional)** Path to the root directory containing mentor CVs. This is **only required** if `data/mentor_data.csv` does not exist or if you are running with the `--overwrite` flag.
- `--overwrite`: **(Optional)** A flag to force the script to re-run the entire data processing pipeline from scratch, deleting all cached data.
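A plausible `argparse` declaration for this interface (a sketch based on the flags above, not necessarily `main.py`'s exact code; `build_parser` is an illustrative name):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI described above; help strings mirror the README."""
    parser = argparse.ArgumentParser(
        description="Mentor-mentee matching pipeline")
    parser.add_argument("--mentees", required=True,
                        help="Root directory containing mentee subdirectories")
    parser.add_argument("--num_mentors", type=int, required=True,
                        help="Number of candidates to retrieve per mentee")
    # Optional here so that cached runs can omit it entirely.
    parser.add_argument("--mentors", default=None,
                        help="Root directory of mentor CVs "
                             "(first run or --overwrite only)")
    parser.add_argument("--overwrite", action="store_true",
                        help="Re-run the data processing pipeline from scratch")
    return parser
```

With this declaration, omitting `--mentors` on a cached run parses cleanly, matching the two usage examples.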

#### Examples

**First-time run or complete re-processing:**
```bash
uv run main.py --mentees input/ --mentors data/pdfs/ --num_mentors 10 --overwrite
```

**Run matching when mentor data is already processed:**
If `data/mentor_data.csv` and the FAISS index are already built, you can run matching for new mentees without providing the `--mentors` directory.
```bash
uv run main.py --mentees input/ --num_mentors 10
```

### Output Format

The results are saved in `output/best_matches.json`. The output is a list, where each item represents a mentee and their ranked list of mentor matches.
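The file can also be consumed programmatically. A minimal sketch (the helper name is illustrative; it relies only on the `mentee_email` and `matches` fields shown in the example below):

```python
import json

def top_match_per_mentee(path: str) -> dict:
    """Map each mentee's email to their first (highest-ranked) match."""
    with open(path) as f:
        results = json.load(f)
    return {
        entry["mentee_email"]: (entry["matches"][0] if entry["matches"] else None)
        for entry in results
    }
```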

```json
[
  {
    "mentee_name": "Individual A",
    "mentee_email": "Email",
    "mentee_preferences": [
      "Team Science",
      "Translational Research",
      "Lab-based Research"
    ],
    "matches": [
      { ... }
    ]
  }
]
```
4 changes: 2 additions & 2 deletions data/mentor_data.csv
Git LFS file not shown
3 changes: 0 additions & 3 deletions data/mentor_data_summaries_ranks.csv

This file was deleted.

3 changes: 0 additions & 3 deletions data/mentor_data_with_summaries.csv

This file was deleted.

Binary file added db/text-embedding-3-large/faiss_index/index.faiss
Binary file not shown.
Binary file added db/text-embedding-3-large/faiss_index/index.pkl