Merged

Changes from all commits
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,7 +1,6 @@
input/
output/
data/
db/
notebooks/
simulated_data/
templates/
@@ -18,3 +17,6 @@ build
*.egg-info/
*.csv
db/old/
static/
templates/
data/old_20250718/
102 changes: 40 additions & 62 deletions README.md
@@ -2,113 +2,91 @@

This project is a comprehensive pipeline designed to match mentees with suitable mentors based on their professional profiles and research interests. It leverages Large Language Models (LLMs) for summarization, evaluation, and vector embeddings to find the best possible matches from a corpus of mentor CVs.

## Dataflow and Caching

The pipeline is designed to be robust and efficient, using a single `data/mentor_data.csv` file as the source of truth for all mentor information. It checks the state of this file to avoid re-running expensive processing steps.

```mermaid
flowchart LR
subgraph "Mentor Data Pipeline (Runs only when needed)"
A["--mentors dir (PDFs/DOCX)"] --> B{"main.py"};
B -- "1. Load/Update" --> C["data/mentor_data.csv"];
C -- "2. Check for 'Mentor_Summary' column" --> B;
B -- "3. Summarize (if needed)" --> C;
C -- "4. Check for 'Rank' column" --> B;
B -- "5. Rank (if needed)" --> C;
C -- "6. Check for FAISS index" --> B;
B -- "7. Build Index (if needed)" --> D[("db/embedding-model/index.faiss")];
end

subgraph "Mentee Matching Pipeline"
E["--mentees dir (JSON + CV)"] --> F{"main.py"};
D --> F;
F --> G[/"output/best_matches.json"/];
end
```
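The gated flow above can be sketched as a driver that runs only the missing steps. This is an illustrative sketch, not the project's actual code: `run_mentor_pipeline` and the injected `summarize`, `rank`, and `build_index` callables are hypothetical stand-ins for the real summarization, ranking, and FAISS indexing steps.

```python
import os

import pandas as pd

def run_mentor_pipeline(csv_path: str, index_path: str,
                        summarize, rank, build_index) -> None:
    """Run only the pipeline steps whose outputs are missing.

    `summarize`, `rank`, and `build_index` are injected callables that
    stand in for the project's real steps; the column checks mirror the
    numbered edges in the diagram.
    """
    df = pd.read_csv(csv_path)

    if "Mentor_Summary" not in df.columns:   # steps 2-3: summarize if needed
        df["Mentor_Summary"] = summarize(df)
        df.to_csv(csv_path, index=False)

    if "Rank" not in df.columns:             # steps 4-5: rank if needed
        df["Rank"] = rank(df)
        df.to_csv(csv_path, index=False)

    if not os.path.exists(index_path):       # steps 6-7: build index if needed
        build_index(df, index_path)
```

On a second invocation with a fully populated CSV and an existing index file, none of the callables fire, which is the cheap-resume behavior the diagram describes.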

- **Intelligent Caching**: The pipeline checks for the existence of `data/mentor_data.csv` and its columns (`Mentor_Summary`, `Rank`) to determine which steps to run. For example, if the `Mentor_Summary` column is already present, the summarization step is skipped.
- **Atomic Writes**: All updates to `data/mentor_data.csv` are performed atomically to prevent data corruption if the script is interrupted.
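The atomic-write pattern can be sketched as follows (the helper name is illustrative; the project's actual implementation may differ):

```python
import os
import tempfile

import pandas as pd

def atomic_write_csv(df: pd.DataFrame, path: str) -> None:
    """Write df to a temp file in the same directory, then rename it.

    os.replace() is an atomic rename on POSIX, so an interrupted run can
    never leave a half-written mentor_data.csv behind: either the old
    file survives intact or the new one replaces it completely.
    """
    fd, tmp_path = tempfile.mkstemp(
        dir=os.path.dirname(os.path.abspath(path)), suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", newline="") as f:
            df.to_csv(f, index=False)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The temp file must live in the same directory as the target, because `os.replace` is only guaranteed atomic within a single filesystem.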

## How to Use the Pipeline

### Mentee Input Data Structure

Before running the matching process, you must structure the mentee input data correctly inside the `input/` directory.

1. **Create a subdirectory for each mentee.** The name of the subdirectory should be the mentee's email address (e.g., `input/john.doe@email.com/`).

2. **Inside each mentee's subdirectory, add their CV file(s)** (e.g., `.pdf`, `.docx`).

3. **Add a JSON file containing the mentee's information.** The script uses the first JSON file it finds in the directory. The content must follow this structure:

```json
{
  "first_name": "Katelyn",
  "last_name": "Senkus",
  "role": "Mentee",
  "research_Interest": [
    "Team Science",
    "Translational Research",
    "Lab-based Research"
  ],
  "submissions_files": ["Senkus_CV_3-26-25.docx"]
}
```
- `first_name`: The mentee's first name.
- `last_name`: The mentee's last name.
- `research_Interest`: A list of strings representing the mentee's research interests, ranked in order of preference.
- `submissions_files`: A list containing the filename of the CV to be used for matching. The script will find this file within the same directory, even if it has a timestamp prefix (e.g., `1743173574187_Senkus_CV_3-26-25.docx`).
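The prefix-tolerant CV lookup described above could be implemented along these lines (a sketch; `resolve_cv` is an illustrative name, not the project's actual function):

```python
from pathlib import Path
from typing import Optional

def resolve_cv(mentee_dir: str, filename: str) -> Optional[Path]:
    """Find `filename` in `mentee_dir`, tolerating a timestamp prefix.

    Matches 'Senkus_CV_3-26-25.docx' exactly, or a prefixed variant
    such as '1743173574187_Senkus_CV_3-26-25.docx'.
    """
    directory = Path(mentee_dir)
    exact = directory / filename
    if exact.is_file():
        return exact
    # Fall back to any file whose name ends with '_<filename>'.
    for candidate in sorted(directory.iterdir()):
        if candidate.is_file() and candidate.name.endswith(f"_{filename}"):
            return candidate
    return None
```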

### Running the Pipeline

The entire pipeline is executed from the root directory via the `main.py` script.

#### Command-Line Arguments
- `--mentees`: **(Required)** Path to the root directory containing mentee subdirectories (e.g., `input/`).
- `--num_mentors`: **(Required)** The number of initial candidates to retrieve from the similarity search for each mentee.
- `--mentors`: **(Optional)** Path to the root directory containing mentor CVs. This is **only required** if `data/mentor_data.csv` does not exist or if you are running with the `--overwrite` flag.
- `--overwrite`: **(Optional)** A flag to force the script to re-run the entire data processing pipeline from scratch, deleting all cached data.
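A plausible `argparse` declaration for this interface (a sketch based on the flags above, not necessarily `main.py`'s exact code; `build_parser` is an illustrative name):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI described above; help strings mirror the README."""
    parser = argparse.ArgumentParser(
        description="Mentor-mentee matching pipeline")
    parser.add_argument("--mentees", required=True,
                        help="Root directory containing mentee subdirectories")
    parser.add_argument("--num_mentors", type=int, required=True,
                        help="Number of candidates to retrieve per mentee")
    # Optional here so that cached runs can omit it entirely.
    parser.add_argument("--mentors", default=None,
                        help="Root directory of mentor CVs "
                             "(first run or --overwrite only)")
    parser.add_argument("--overwrite", action="store_true",
                        help="Re-run the data processing pipeline from scratch")
    return parser
```

With this declaration, omitting `--mentors` on a cached run parses cleanly, matching the two usage examples.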

#### Examples

**First-time run or complete re-processing:**
```bash
uv run main.py --mentees input/ --mentors data/pdfs/ --num_mentors 10 --overwrite
```

**Run matching when mentor data is already processed:**
If `data/mentor_data.csv` and the FAISS index are already built, you can run matching for new mentees without providing the `--mentors` directory.
```bash
uv run main.py --mentees input/ --num_mentors 10
```

### Output Format

The results are saved in `output/best_matches.json`. The output is a list, where each item represents a mentee and their ranked list of mentor matches.
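The file can also be consumed programmatically. A minimal sketch (the helper name is illustrative; it relies only on the `mentee_email` and `matches` fields shown in the example below):

```python
import json

def top_match_per_mentee(path: str) -> dict:
    """Map each mentee's email to their first (highest-ranked) match."""
    with open(path) as f:
        results = json.load(f)
    return {
        entry["mentee_email"]: (entry["matches"][0] if entry["matches"] else None)
        for entry in results
    }
```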

```json
[
  {
    "mentee_name": "Individual A",
    "mentee_email": "Email",
    "mentee_preferences": [
      "Team Science",
      "Translational Research",
      "Lab-based Research"
    ],
    "matches": [
      { ... }
    ]
  }
]
```
4 changes: 2 additions & 2 deletions data/mentor_data.csv
Git LFS file not shown
3 changes: 0 additions & 3 deletions data/mentor_data_summaries_ranks.csv

This file was deleted.

3 changes: 0 additions & 3 deletions data/mentor_data_with_summaries.csv

This file was deleted.

Binary file added db/text-embedding-3-large/faiss_index/index.faiss
Binary file not shown.
Binary file added db/text-embedding-3-large/faiss_index/index.pkl