diff --git a/.gitignore b/.gitignore
index 47e68ff..3966f47 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,7 +1,6 @@
input/
output/
data/
-db/
notebooks/
simulated_data/
templates/
@@ -18,3 +17,6 @@ build
*.egg-info/
*.csv
db/old/
+static/
+templates/
+data/old_20250718/
diff --git a/README.md b/README.md
index 0e649b6..78b2491 100644
--- a/README.md
+++ b/README.md
@@ -2,52 +2,33 @@
This project is a comprehensive pipeline designed to match mentees with suitable mentors based on their professional profiles and research interests. It leverages Large Language Models (LLMs) for summarization, evaluation, and vector embeddings to find the best possible matches from a corpus of mentor CVs.
-## Dataflow Diagrams
+## Dataflow and Caching
-### 1. Data Processing and Indexing (One-Time Setup)
-
-This initial pipeline processes raw mentor CVs, summarizes them, and builds a searchable FAISS vector index. This only needs to be run once or when the mentor pool changes.
+The pipeline uses a single `data/mentor_data.csv` file as the source of truth for all mentor information. On each run it inspects this file (whether it exists and which columns it contains) and skips any expensive processing step whose results are already present.
```mermaid
flowchart LR
- A["Mentor CVs (PDFs/DOCX)"] --> B["io_utils: load_documents()"];
- B --> C["text_utils: clean_and_validate_text()"];
- C --> D["main.py: Caches to mentor_data.csv"];
- D --> E["batch.py: summarize_cvs()"];
- E --> F["main.py: Caches to mentor_data_with_summaries.csv"];
- F --> G["build_index.py: build_index()"];
- G --> H["utils.py: find_professor_type() & rank_professors()"];
- H --> I["main.py: Caches to mentor_data_summaries_ranks.csv"];
- I --> J["build_index.py: Creates FAISS Index"];
- J --> K[("db/embedding-model-name/index.faiss")];
-```
-
-### 2. Mentee Matching (Per-Mentee Execution)
-
-This pipeline runs for each new mentee to find the best matches from the pre-built index.
-
-```mermaid
-flowchart LR
- subgraph "Mentee Input"
- L["Mentee Info (JSON)"] --> M["main.py: Parses JSON"];
- M --> N["Mentee CV Path & Preferences"];
- end
-
- subgraph "Candidate Retrieval & Evaluation"
- O[("FAISS Index")] --> P["search_candidate_mentors.py"];
- N --> P;
- P --> Q["Top-K Similarity Search"];
+ subgraph "Mentor Data Pipeline (Runs only when needed)"
+ A["--mentors dir (PDFs/DOCX)"] --> B{"main.py"};
+ B -- "1. Load/Update" --> C["data/mentor_data.csv"];
+ C -- "2. Check for 'Mentor_Summary' column" --> B;
+ B -- "3. Summarize (if needed)" --> C;
+ C -- "4. Check for 'Rank' column" --> B;
+ B -- "5. Rank (if needed)" --> C;
+ C -- "6. Check for FAISS index" --> B;
+ B -- "7. Build Index (if needed)" --> D[("db/embedding-model/index.faiss")];
end
- subgraph "LLM-based Re-ranking"
- Q --> R["evaluate_matches.py: evaluate_pair_with_llm()"];
- R --> S["evaluate_matches.py: extract_eval_scores_with_llm()"];
- S --> T["main.py: Sorts by 'Overall Match Quality'"];
+ subgraph "Mentee Matching Pipeline"
+ E["--mentees dir (JSON + CV)"] --> F{"main.py"};
+ D --> F;
+ F --> G[/"output/best_matches.json"/];
end
-
- T --> U[/"output/best_matches.json"/];
```
+- **Intelligent Caching**: The pipeline checks for the existence of `data/mentor_data.csv` and its columns (`Mentor_Summary`, `Rank`) to determine which steps to run. For example, if the `Mentor_Summary` column is already present, the summarization step is skipped.
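+- **Atomic Writes**: Every update to `data/mentor_data.csv` is first written to a temporary file and then moved into place with `os.replace`, so an interrupted run does not leave a partially written file. Despite the `.csv` extension, the file is tab-separated.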
+
## How to Use the Pipeline
### Mentee Input Data Structure
@@ -55,30 +36,21 @@ flowchart LR
Before running the matching process, you must structure the mentee input data correctly inside the `input/` directory.
1. **Create a subdirectory for each mentee.** The name of the subdirectory should be the mentee's email address (e.g., `input/john.doe@email.com/`).
-
2. **Inside each mentee's subdirectory, add their CV file(s)** (e.g., `.pdf`, `.docx`).
-
-3. **Add a JSON file containing the mentee's information.** The script will automatically detect and use the first JSON file it finds in the directory. The filename can be anything, but the content must follow this structure:
+3. **Add a JSON file containing the mentee's information.** The script uses the first JSON file it finds in the directory. The content must follow this structure:
```json
{
"first_name": "Katelyn",
"last_name": "Senkus",
- "role": "Mentee",
"research_Interest": [
- "Team Science (laboratory and clinical collaborations)",
- "Translational Research (bench-to-bedside)",
- "Lab-based/Bench Research"
+ "Team Science",
+ "Translational Research",
+ "Lab-based Research"
],
- "submissions_files": [
- "Senkus_CV_3-26-25.docx"
- ]
+ "submissions_files": ["Senkus_CV_3-26-25.docx"]
}
```
- - `first_name`: The mentee's first name.
- - `last_name`: The mentee's last name.
- - `research_Interest`: A list of strings representing the mentee's research interests, ranked in order of preference.
- - `submissions_files`: A list containing the filename of the CV to be used for matching. The script will find this file within the same directory, even if it has a timestamp prefix (e.g., `1743173574187_Senkus_CV_3-26-25.docx`).
### Running the Pipeline
@@ -86,29 +58,35 @@ The entire pipeline is executed from the root directory via the `main.py` script
#### Command-Line Arguments
- `--mentees`: **(Required)** Path to the root directory containing mentee subdirectories (e.g., `input/`).
-- `--mentors`: **(Required)** Path to the root directory containing mentor CVs. The script will search this directory and all its subdirectories.
- `--num_mentors`: **(Required)** The number of initial candidates to retrieve from the similarity search for each mentee.
-- `--overwrite`: **(Optional)** A flag to force the script to ignore all cached files and re-run the entire data processing pipeline from scratch.
+- `--mentors`: **(Optional)** Path to the root directory containing mentor CVs. This is **only required** if `data/mentor_data.csv` does not exist or if you are running with the `--overwrite` flag.
+- `--overwrite`: **(Optional)** A flag to force the script to re-run the entire data processing pipeline from scratch, regenerating `data/mentor_data.csv` and rebuilding the FAISS index. Requires `--mentors`.
+
+#### Examples
-#### Example
-To run the matching process for all mentees in the `input/` directory:
+**First-time run or complete re-processing:**
```bash
-uv run main.py --mentees input/ --mentors data/pdfs/ --num_mentors 10
+uv run main.py --mentees input/ --mentors data/pdfs/ --num_mentors 10 --overwrite
+```
+
+**Run matching when mentor data is already processed:**
+If `data/mentor_data.csv` and the FAISS index are already built, you can run matching for new mentees without providing the `--mentors` directory.
+```bash
+uv run main.py --mentees input/ --num_mentors 10
```
### Output Format
The results are saved in `output/best_matches.json`. The output is a list, where each item represents a mentee and their ranked list of mentor matches.
-
```json
[
{
- "mentee_name": "Mentee",
- "mentee_email": "Mentee Email",
+ "mentee_name": "Individual A",
+ "mentee_email": "Email",
"mentee_preferences": [
- "Team Science (laboratory and clinical collaborations)",
- "Translational Research (bench-to-bedside)",
- "Lab-based/Bench Research"
+ "Team Science",
+ "Translational Research",
+ "Lab-based Research"
],
"matches": [
{
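
Note on the caching rules described in the README changes above: the step-skipping logic can be summarized as a small pure function. This is a minimal sketch and not the code in `main.py`; the function name `pending_steps` and its step labels are hypothetical, introduced only for illustration, and the rules are read off the `main.py` changes in this diff.

```python
def pending_steps(csv_exists, columns, index_exists, overwrite=False):
    """Return the mentor-pipeline steps that still need to run.

    csv_exists:   whether data/mentor_data.csv is present
    columns:      set of column names currently in that file
    index_exists: whether the FAISS index directory is present
    overwrite:    mirrors the --overwrite flag
    """
    steps = []
    if overwrite or not csv_exists:
        # Reloading from the raw CVs yields only the two base columns,
        # so summaries and ranks have to be recomputed afterwards.
        steps.append("load")
        columns = {"Mentor_Profile", "Mentor_Data"}
    if "Mentor_Summary" not in columns:
        steps.append("summarize")
    if "Rank" not in columns:
        steps.append("rank")
    if overwrite or not index_exists:
        steps.append("build_index")
    return steps


if __name__ == "__main__":
    complete = {"Mentor_Profile", "Mentor_Data", "Mentor_Summary", "Rank"}
    print(pending_steps(True, complete, True))
    print(pending_steps(True, complete - {"Rank"}, True))
    print(pending_steps(False, set(), False))
```

Under these rules, regenerating summaries or ranks does not by itself trigger an index rebuild; the index is rebuilt only when it is missing or `--overwrite` is passed.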
diff --git a/data/mentor_data.csv b/data/mentor_data.csv
index 9a97f33..2231085 100644
--- a/data/mentor_data.csv
+++ b/data/mentor_data.csv
@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
-oid sha256:0e5e112432c4014964a570ffef4a017174b2911bfcd2d4071cde242566b85614
-size 18878055
+oid sha256:254e34123ded9f49af830b6c616a8498b9a92c06658433cc03c31cdd95aa67cc
+size 20414289
diff --git a/data/mentor_data_summaries_ranks.csv b/data/mentor_data_summaries_ranks.csv
deleted file mode 100644
index ed14e96..0000000
--- a/data/mentor_data_summaries_ranks.csv
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d7905d93828945519aaadf5ff8071e4d0f5fb664872292050338e2a9edee3e6a
-size 20410571
diff --git a/data/mentor_data_with_summaries.csv b/data/mentor_data_with_summaries.csv
deleted file mode 100644
index ee9d886..0000000
--- a/data/mentor_data_with_summaries.csv
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:f43438831a92f25fa27026e54ac141c3fa2a4964891c6510ca9bae58ef3848cc
-size 20379187
diff --git a/db/text-embedding-3-large/faiss_index/index.faiss b/db/text-embedding-3-large/faiss_index/index.faiss
new file mode 100644
index 0000000..1df94d2
Binary files /dev/null and b/db/text-embedding-3-large/faiss_index/index.faiss differ
diff --git a/db/text-embedding-3-large/faiss_index/index.pkl b/db/text-embedding-3-large/faiss_index/index.pkl
new file mode 100644
index 0000000..137ed35
Binary files /dev/null and b/db/text-embedding-3-large/faiss_index/index.pkl differ
diff --git a/db/text-embedding-3-large/index_summary_above_assistant/index.faiss b/db/text-embedding-3-large/index_summary_above_assistant/index.faiss
new file mode 100644
index 0000000..8cf4378
Binary files /dev/null and b/db/text-embedding-3-large/index_summary_above_assistant/index.faiss differ
diff --git a/db/text-embedding-3-large/index_summary_above_assistant/index.pkl b/db/text-embedding-3-large/index_summary_above_assistant/index.pkl
new file mode 100644
index 0000000..66d2ae7
Binary files /dev/null and b/db/text-embedding-3-large/index_summary_above_assistant/index.pkl differ
diff --git a/db/text-embedding-3-large/index_summary_assistant_and_above/index.faiss b/db/text-embedding-3-large/index_summary_assistant_and_above/index.faiss
new file mode 100644
index 0000000..d6e252d
Binary files /dev/null and b/db/text-embedding-3-large/index_summary_assistant_and_above/index.faiss differ
diff --git a/db/text-embedding-3-large/index_summary_assistant_and_above/index.pkl b/db/text-embedding-3-large/index_summary_assistant_and_above/index.pkl
new file mode 100644
index 0000000..adcca70
Binary files /dev/null and b/db/text-embedding-3-large/index_summary_assistant_and_above/index.pkl differ
diff --git a/db/text-embedding-3-large/index_summary_with_metadata/index.faiss b/db/text-embedding-3-large/index_summary_with_metadata/index.faiss
new file mode 100644
index 0000000..530ad94
Binary files /dev/null and b/db/text-embedding-3-large/index_summary_with_metadata/index.faiss differ
diff --git a/db/text-embedding-3-large/index_summary_with_metadata/index.pkl b/db/text-embedding-3-large/index_summary_with_metadata/index.pkl
new file mode 100644
index 0000000..96bfff2
Binary files /dev/null and b/db/text-embedding-3-large/index_summary_with_metadata/index.pkl differ
diff --git a/main.py b/main.py
index 2182947..eefc832 100644
--- a/main.py
+++ b/main.py
@@ -2,8 +2,6 @@
import asyncio
import json
import os
-import time
-
import pandas as pd
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
@@ -12,8 +10,6 @@
from src.config.paths import (
INDEX_SUMMARY_WITH_METADATA,
PATH_TO_MENTOR_DATA,
- PATH_TO_MENTOR_DATA_RANKED,
- PATH_TO_SUMMARY,
ROOT_DIR,
)
from src.eval.evaluate_matches import (
@@ -24,6 +20,14 @@
from src.processing.io_utils import load_document, load_documents
from src.retrieval.build_index import build_index
from src.retrieval.search_candidate_mentors import search_candidate_mentors
+from src.utils import find_professor_type, rank_professors
+
+
+def safe_save_csv(df, path):
+ """Saves a DataFrame to a CSV file atomically using tabs as separators."""
+ temp_path = path + ".tmp"
+ df.to_csv(temp_path, index=False, sep="\t")
+ os.replace(temp_path, path)
async def process_single_mentee(
@@ -63,13 +67,11 @@ async def process_single_mentee(
}
)
- # Sort the matches based on the 'Overall Match Quality' score in descending order
evaluated_matches.sort(
key=lambda x: x["Criterion Scores"].get("Overall Match Quality", 0),
reverse=True,
)
- # Construct the full mentee result object
mentee_name = f"{mentee_data.get('first_name')} {mentee_data.get('last_name')}"
mentee_email = os.path.basename(os.path.dirname(mentee_cv_path))
@@ -81,62 +83,68 @@ async def process_single_mentee(
}
-async def main(mentee_dir, mentor_resume_dir, num_mentors, overwrite=False):
- # --- Step 1: Process raw mentor resumes into a CSV file ---
+async def main(
+ mentee_dir, mentor_resume_dir, num_mentors, overwrite=False, output_dir=None
+):
+ # --- Step 1: Initial Data Loading ---
if overwrite or not os.path.exists(PATH_TO_MENTOR_DATA):
- print("Step 1: Processing mentor resumes into CSV...")
- if not os.path.exists(mentor_resume_dir):
- raise FileNotFoundError(
- f"Mentor resume directory not found: {mentor_resume_dir}"
+ print("Step 1: Processing mentor resumes from source...")
+ if not mentor_resume_dir or not os.path.exists(mentor_resume_dir):
+ raise ValueError(
+ "--mentors must point to an existing directory when running with --overwrite or when mentor_data.csv does not exist."
)
-
docs = load_documents(mentor_resume_dir)
if not docs:
raise ValueError(
f"No documents (PDF, DOCX, TXT) found in {mentor_resume_dir}"
)
-
df = pd.DataFrame(docs, columns=["Mentor_Profile", "Mentor_Data"])
- df.to_csv(PATH_TO_MENTOR_DATA, index=False)
+ safe_save_csv(df, PATH_TO_MENTOR_DATA)
print(f"Successfully created raw mentor data CSV at: {PATH_TO_MENTOR_DATA}")
else:
- print(
- f"Skipping Step 1: Raw mentor data CSV already exists at {PATH_TO_MENTOR_DATA}"
- )
+ print(f"Skipping Step 1: Using existing mentor data at {PATH_TO_MENTOR_DATA}")
+ df = pd.read_csv(PATH_TO_MENTOR_DATA, sep="\t")
- # --- Step 2: Summarize the mentor data ---
- if overwrite or not os.path.exists(PATH_TO_SUMMARY):
+ # --- Step 2: Summarize Mentor Data ---
+ if "Mentor_Summary" not in df.columns:
print("\nStep 2: Summarizing mentor data...")
- await summarize_cvs(PATH_TO_MENTOR_DATA, PATH_TO_SUMMARY)
+ df = await summarize_cvs(df)
+ safe_save_csv(df, PATH_TO_MENTOR_DATA)
+ print(f"Successfully added summaries to {PATH_TO_MENTOR_DATA}")
else:
- print(
- f"Skipping Step 2: Summarized mentor data already exists at {PATH_TO_SUMMARY}"
- )
+ print("Skipping Step 2: Mentor summaries already exist.")
+
+ # --- Step 3: Rank Mentors ---
+ if "Rank" not in df.columns:
+ print("\nStep 3: Ranking mentors...")
+ df["Professor_Type"] = [
+ find_professor_type(text) for text in df["Mentor_Data"].fillna("")
+ ]
+ df = rank_professors(df)
+ safe_save_csv(df, PATH_TO_MENTOR_DATA)
+ print(f"Successfully added ranks to {PATH_TO_MENTOR_DATA}")
+ else:
+ print("Skipping Step 3: Mentor ranks already exist.")
- # --- Step 3: Build the FAISS index and ranked data file ---
- # This step runs if the index itself is missing, ensuring it's created
- # even if the intermediate ranked data file exists.
+ # --- Step 4: Build FAISS Index ---
if overwrite or not os.path.exists(INDEX_SUMMARY_WITH_METADATA):
- print("\nStep 3: Building FAISS index and ranking mentors...")
- build_index()
+ print("\nStep 4: Building FAISS index...")
+ build_index(df)
else:
- print(
- f"Skipping Step 3: FAISS index already exists at {INDEX_SUMMARY_WITH_METADATA}"
- )
+ print("Skipping Step 4: FAISS index already exists.")
- # --- Step 4: Load the FAISS index for matching ---
+ # --- Step 5: Load FAISS Index ---
print("\nLoading FAISS index for matching...")
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
if not os.path.exists(INDEX_SUMMARY_WITH_METADATA):
raise FileNotFoundError(
- f"FAISS index not found at {INDEX_SUMMARY_WITH_METADATA}. Please run the script with --overwrite."
+ f"FAISS index not found at {INDEX_SUMMARY_WITH_METADATA}. The index build may have failed; re-run with --overwrite and --mentors."
)
-
vector_store = FAISS.load_local(
INDEX_SUMMARY_WITH_METADATA, embeddings, allow_dangerous_deserialization=True
)
- # --- Step 5: Process each mentee ---
+ # --- Step 6: Process Mentees ---
print("\nProcessing mentees...")
all_matches = []
for mentee_subdir in os.listdir(mentee_dir):
@@ -146,35 +154,24 @@ async def main(mentee_dir, mentor_resume_dir, num_mentors, overwrite=False):
f for f in os.listdir(mentee_subdir_path) if f.lower().endswith(".json")
]
if not json_files:
- print(
- f"No JSON file found for mentee in {mentee_subdir_path}. Skipping."
- )
continue
- # Use the first JSON file found
mentee_json_path = os.path.join(mentee_subdir_path, json_files[0])
-
with open(mentee_json_path, "r") as f:
mentee_data = json.load(f)
mentee_preferences = mentee_data.get("research_Interest", [])
cv_filename_base = mentee_data.get("submissions_files", [None])[0]
-
if not cv_filename_base:
- print(f"No CV filename found in {mentee_json_path}. Skipping.")
continue
- # Find the actual CV file in the directory, ignoring the timestamp prefix
mentee_cv_path = None
for f in os.listdir(mentee_subdir_path):
if f.endswith(cv_filename_base):
mentee_cv_path = os.path.join(mentee_subdir_path, f)
- break # Use the first match
+ break
if not mentee_cv_path:
- print(
- f"CV file '{cv_filename_base}' not found in {mentee_subdir_path}. Skipping."
- )
continue
print(
@@ -190,38 +187,52 @@ async def main(mentee_dir, mentor_resume_dir, num_mentors, overwrite=False):
if mentee_results:
all_matches.append(mentee_results)
- # --- Step 6: Save the final JSON output ---
- output_dir = os.path.join(ROOT_DIR, "output")
+ # --- Step 7: Save Final Output ---
+ if output_dir is None:
+ output_dir = os.path.join(ROOT_DIR, "output")
os.makedirs(output_dir, exist_ok=True)
json_output_path = os.path.join(output_dir, "best_matches.json")
-
with open(json_output_path, "w") as f:
json.dump(all_matches, f, indent=4)
-
print(f"\nAll mentee matches saved to {json_output_path}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Mentor Matching Pipeline")
parser.add_argument(
- "--mentees", required=True, help="Path to the directory containing mentee CVs."
+ "--mentees",
+ required=True,
+ help="Path to the directory containing mentee subdirectories.",
)
parser.add_argument(
"--mentors",
- required=True,
- help="Path to the directory containing mentor resumes.",
+ required=False,
+ help="Path to the directory containing mentor resumes. Only required if mentor_data.csv doesn't exist or --overwrite is used.",
)
parser.add_argument(
"--num_mentors",
type=int,
required=True,
- help="Number of desired matches (length of table)",
+ help="Number of candidate mentors to retrieve per mentee for evaluation.",
)
parser.add_argument(
"--overwrite",
action="store_true",
- help="Overwrite existing cached files and re-run the full data processing pipeline.",
+ help="Force re-processing of all mentor data from scratch.",
+ )
+ parser.add_argument(
+ "--output_dir",
+ default=None,
+ help="Directory to save the final JSON output. Defaults to 'output/' in the project root.",
)
args = parser.parse_args()
- asyncio.run(main(args.mentees, args.mentors, args.num_mentors, args.overwrite))
+ asyncio.run(
+ main(
+ args.mentees,
+ args.mentors,
+ args.num_mentors,
+ args.overwrite,
+ args.output_dir,
+ )
+ )
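
Note on `safe_save_csv` in the `main.py` changes above: it writes to `path + ".tmp"` and then calls `os.replace`. The same write-then-replace pattern can be sketched with the standard library alone. This is an illustrative variant, not the function from this diff: `atomic_write_tsv` is a hypothetical name, it takes plain rows instead of a DataFrame, and it swaps in `tempfile.mkstemp` plus `os.fsync`, which the original does not use.

```python
import csv
import os
import tempfile


def atomic_write_tsv(rows, header, path):
    """Write rows to `path` as tab-separated text via a temp file and os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so os.replace stays on
    # one filesystem; a cross-filesystem move would not be atomic.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(header)
            writer.writerows(rows)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before the swap
        os.replace(temp_path, path)  # readers see the old file or the new one
    except BaseException:
        # Clean up the temp file if anything failed before the replace.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "mentor_data.csv")
    atomic_write_tsv(
        [["example.pdf", "raw CV text"]],
        ["Mentor_Profile", "Mentor_Data"],
        target,
    )
    with open(target, encoding="utf-8") as handle:
        print(handle.read())
```

A unique temp name avoids two concurrent runs clobbering the same `.tmp` file, which the fixed `path + ".tmp"` name would allow; whether that matters depends on how the pipeline is invoked.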
diff --git a/notebooks/measure_accuracy_per_apporach.ipynb b/notebooks/measure_accuracy_per_apporach.ipynb
deleted file mode 100644
index 2d92e4b..0000000
--- a/notebooks/measure_accuracy_per_apporach.ipynb
+++ /dev/null
@@ -1,492 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "3513ec81",
- "metadata": {},
- "source": [
- "# Notebook for k selection\n",
- "\n",
- "This notebook report an exploratory data analysis to get the best k-value"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 77,
- "id": "ae51390f-8684-495f-893a-b9f4fa94ccd1",
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "from langchain_community.vectorstores import FAISS\n",
- "from langchain_openai import OpenAIEmbeddings\n",
- "from dotenv import load_dotenv\n",
- "import pandas as pd\n",
- "from tqdm import tqdm\n",
- "import numpy as np\n",
- "import seaborn as sns\n",
- "import pandas as pd\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "load_dotenv()\n",
- "OPENAI_KEY = os.getenv(\"OPENAI_API_KEY\")\n",
- "MODEL_NAME = \"gpt-3.5-turbo-0125\" # will change it :)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0ecb6064",
- "metadata": {},
- "source": [
- "## Loading data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 78,
- "id": "55683d9d-a2d5-4aab-9d22-34aa05b7023f",
- "metadata": {},
- "outputs": [],
- "source": [
- "db = FAISS.load_local(\"../db/index_summary/\", OpenAIEmbeddings(), allow_dangerous_deserialization=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 79,
- "id": "f69e9d38-6443-4072-a951-30563666cd31",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "1174"
- ]
- },
- "execution_count": 79,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "db.index.ntotal"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 80,
- "id": "a1aff13b-61e1-4e00-adcb-f49b5a43e0aa",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Index(['Mentor Profile', 'Mock Student CV', 'PDF Text', 'Mentor_Summary',\n",
- " 'Mentee_Summary'],\n",
- " dtype='object')"
- ]
- },
- "execution_count": 80,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "df = pd.read_csv(\"../simulated_data/mentor_student_cvs_with_summaries_final.csv\")\n",
- "df.columns"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 81,
- "id": "70780e52-c3c9-42fe-b29f-69bfaea00657",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Johnathan A. Doe is a graduate of Houston University where he obtained his Bachelor's Science in Biology. He served as a Research Assistant in the Department of Pathology of the same university, where he displayed exceptional expertise in molecular biology and microbiota studies. Particularly, Doe has made significant contributions in microbial culture research.\n",
- "\n",
- "During his tenure, Doe has conducted experiments on bacterial culture pH readouts using UV-Vis absorption spectrophotometry, showcasing his skills in Molecular & Cellular Biology Techniques as well as data analysis using statistical software like R and SPSS. His research has resulted in notable publications such as \"Analyzing Microbial Culture pH through UV-Vis Absorption Spectrophotometry,\" \"Effectiveness of Molecular Models in Understanding Protein-Ligand Interactions,\" and \"Investigating the Role of Microbiota on Human Immune Responses.\" \n",
- "\n",
- "More than his research profile, Doe is actively involved in community health activities, volunteering in Houston Community Health Clinic and coordinating bioresearch events in his university. His diverse skillset, community involvement, and noteworthy publishing record highlight his suitability for collaboration or mentorship.\n"
- ]
- }
- ],
- "source": [
- "print(df['Mentee_Summary'][0])"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6e82f58c",
- "metadata": {},
- "source": [
- "## Example of FAISS similarity query"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 82,
- "id": "64d18699-242f-423f-af58-e525e9f17efe",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[(Document(page_content=\"8183325.pdf\\n=====\\nAnthony Haag is an Assistant Professor at Baylor College of Medicine, working within the Department of Pathology & Immunology. His main research interests encompass the areas of Tandem Mass Spectrometry, Liquid Chromatography, Gastrointestinal Microbiome, and Sterilization, among others. Some of his notable achievements and significant contributions to the scientific community are evident in his numerous published research papers. His work revolves around the exploration of the gut-brain axis using various LC-MS/MS-based targeted metabolomics, as well as the investigation of the mammalian gut microbiome and its influence on various physiological functions. Haag's research also includes the study of neurotransmitter profiles and their alterations in relation to the presence of specific microbiota such as Bifidobacterium dentium. His work is widely cited and discussed, reflecting its relevance and impact in the field of Pathology and Immunology. This professional profile positions him as a suitable candidate for collaboration or mentorship in his areas of expertise, particularly concerning gut microbiota, the gut-brain axis, and relevant analytical methodologies.\"),\n",
- " 0.31570554),\n",
- " (Document(page_content=\"267007.pdf\\n=====\\nKendal Hirschi is a distinguished professor affiliated with Baylor College of Medicine's Department of Pediatrics and the Department of Molecular & Human Genetics. Dr. Hirschi's primary research interests lie in exploring the influence of plant exosomes and MicroRNAs on the regulation of the microbiome and intestinal homeostasis, nutritional impacts of modifying calcium partitioning, and plant ion homeostasis. He has extensively published research on these subjects and made notable contributions, such as his investigations into the nutritional control with regulatory RNAs and the role of genetically modified plants. His work has added substantially to the understanding of diet and immunity interactions. Given his research experience and results, Dr. Hirschi would be an excellent candidate for collaboration or mentorship, especially in the fields of molecular genetics, microbial-immune interactions, and nutritional science.\"),\n",
- " 0.33454758),\n",
- " (Document(page_content=\"43521560.pdf\\n=====\\nIsaiah Gonzalez is an Assistant Professor at the Baylor College of Medicine, specifically within the Department of Pediatrics. His main research interests and areas of expertise revolve around mass spectrometry, virulence factors, anti-infective agents, and notably, Methicillin-Resistant Staphylococcus aureus (MRSA). His work is primarily related to the field of Molecular Cell Proteomics. One significant publication of his involves the usage of mass spectrometry-based molecular networking to capture Phenol soluble modulin (PSM) variants of community-associated MRSA. This showcases Gonzalez's proficiency in cutting-edge molecular techniques and his contributions towards unravelling the behaviour of complex drug-resistant bacteria. His research could serve as foundation for collaborations or mentorships in fields involving infectious diseases, antibacterial resistance, and clinical microbiology.\"),\n",
- " 0.33617258),\n",
- " (Document(page_content=\"24845306.pdf\\n=====\\nMichael Curtis is currently serving as an Assistant Professor at the Baylor College of Medicine, specifically within the Department of Pediatrics. His main research interests are concentrated on Borrelia burgdorferi and Lyme Disease, as well as Relapsing Fever. Curtis's work significantly contributes to the understanding of the tick-mammalian transmission cycle and the characterization of various immunological responses related to these diseases. He has published extensively, with notable works found in Ticks Tick Borne Dis, Front Cell Infect Microbiol, Microbiol Spectr, and Infect Immun among other journals. Among his achievements is the identification of important amino acid domains of Borrelia burgdorferi P66 and the characterization of the Immunological Responses to Borrelia Immunogenic Protein A (BipA). Curtis has also discovered and studied factors influencing disease resistance and stress in wild-type and Delta p66 Borrelia burgdorferi. His significant contributions to the field make him an ideal candidate for collaboration and mentorship in microbiology and pediatrics.\"),\n",
- " 0.3480829),\n",
- " (Document(page_content=\"268217.pdf\\n=====\\nRuth Luna is an associate professor in the Department of Pathology & Immunology at Baylor College of Medicine. Her principal areas of research lie in the study of microbiota, the gastrointestinal microbiome, and Pseudomonas aeruginosa; her work often relates to gastrointestinal tract health and diseases. Luna's prominent publications and studies highlight her expertise in these areas and its implications for conditions like cystic fibrosis and functional abdominal pain in children. She demonstrated her research aptitude through her study of the upper airway microbiome in Hispanic children with cystic fibrosis and by examining the long-term sex-dependent effects of neonatal antibiotics on the enteric nervous system. Moreover, Luna's research into factors influencing the abundance of intestinal short chain fatty acids and neurotransmitters is noteworthy. Her interest in the impact of the gut microbiome on conditions such as adolescent depression and Rett syndrome further exemplify her extensive contributions to the field. Luna's pertinent research makes her a compelling candidate for collaboration and mentorship opportunities in her field.\\n\"),\n",
- " 0.3481189),\n",
- " (Document(page_content='24845167.pdf\\n=====\\nDr. Denver Niles is an Assistant Professor at Baylor College of Medicine under the Department of Pathology & Immunology. His research work revolves around metagenomics, high-throughput nucleotide sequencing, communicable diseases, and specific conditions such as suppurative thyroiditis and Candida tropicalis. Niles has several notable publications which revolve around using plasma cell-free metagenomic next-generation sequencing in diagnosing infectious diseases, and has contributed to systematic review and meta-analysis in this field. He has also conducted research on pediatric infectious disease, specifically, the clinical impact of plasma metagenomics on a large pediatric cohort. Furthermore, he has worked on a retrospective review related to respiratory syncytial virus infection in pediatric patients. His collaborations with a variety of researchers, including those from his department, like Cameron Brown, Ashley Holloman, Niveen Issaq, and Haley Streff, indicate that he may be suitable for partnership or mentorship opportunities.\\n'),\n",
- " 0.34998876),\n",
- " (Document(page_content=\"40674831.pdf\\n=====\\nDuc Nguyen is an Assistant Professor in the Department of Pediatrics at Baylor College of Medicine. His primary research interests lie in the field of Transplantation, Pediatrics, and Infectious diseases, specifically focusing on Tuberculosis and Kidney Transplantation. He has an extensive list of publications indicative of his contributions in this area; some of his most notable work revolves around graft microthrombus formation in post-reperfusion biopsies, myocardial remodeling in chronic isolated aortic and mitral regurgitation, and cardiac remodeling influences on clinical outcomes in patients with aortic regurgitation. A major part of his research is also targeted towards artificial intelligence in digital pathology, the effectiveness of plasma cell therapy, desensitization regimens in heart transplant candidates, and pediatric palliative care telehealth. These contributions highlight his suitability for collaboration or mentorship, particularly for those interested in medical research involving transplantation, pediatrics, infectious diseases, and innovative technologies in pathology. Dr. Nguyen's collaborative work with numerous co-authors signifies his ability to work in team-oriented environments, another positive attribute for potential collaborations or mentorships.\"),\n",
- " 0.3527155),\n",
- " (Document(page_content=\"267606.pdf\\n=====\\nDr. Numan Oezguen is an Assistant Professor at Baylor College of Medicine in the Department of Pathology & Immunology. He focuses his research on various aspects of molecular biosciences and immunology, with extensive experience in areas such as Multiple Sclerosis, B-Lymphocyte Subsets, Microbiota, and Behcet Syndrome. Dr. Oezguen's body of work contains numerous noteworthy publications, demonstrating a deep commitment to innovative research. One highlight includes his research on identifying region-specific allergy sensitization clusters to optimize diagnosis and reduce medical costs. Dr. Oezguen has also significantly contributed to the field of microbial genomics by studying the systems biology impact on Clostridioides difficile and the effect of dietary microRNAs on the gut microbiome. Another striking publication revolves around the role of a ClC transporter in modulating histidine catabolism in Lactobacillus reuteri by altering the intracellular pH and membrane potential. His innovation and expertise make him a strong candidate for collaboration and mentorship in molecular biosciences and immunology.\\n\"),\n",
- " 0.3540028),\n",
- " (Document(page_content='266659.pdf\\n=====\\nThe individual\\'s name is Mary Paul. She is a Professor at Baylor College of Medicine, specifically in the Department of Pediatrics. Her main areas of research interest and expertise include HIV infections, Anti-HIV agents, Infectious Disease Transmission, Vertical Drug Resistance, and Viral HIV-1. As a Principal Investigator, Paul made notable contributions in Adolescent Medicine Trials at Baylor College of Medicine, sponsored by NIH. \\n\\nPaul\\'s notable achievements include several publications in the research field, especially in the study of HIV. Some of her notable works include \"A Case of in utero Transmission of Drug-resistant HIV in the United States\", \"High Prevalence of Anal High-Grade Squamous Intraepithelial Lesions, and Prevention Through Human Papillomavirus Vaccination, in Young Men Who Have Sex With Men Living With Human Immunodeficiency Virus\", and \"Pharmacokinetics of darunavir and cobicistat in pregnant and postpartum women with HIV\". \\n\\nApart from her major focus on HIV research, Paul has also worked on community-based, nurse-led HIV prevention trials with homeless youth. This places her as a probable candidate for collaboration under youth-oriented HIV prevention and mentorship programs. Her efforts and contributions have significantly advanced knowledge in these key areas, making her an expert in her respective field.\\n'),\n",
- " 0.35407656),\n",
- " (Document(page_content='41156379.pdf\\n=====\\nMichelle Nguyen is an assistant professor in the Department of Pediatrics at Baylor College of Medicine. Her primary research interests include drug discovery, specifically in the realm of small molecule libraries and ubiquitin-protein ligases. Nguyen has a distinct focus on the epigenetic factor UHRF1 and endometriosis resection using robotic single-site surgery and Firefly technology. Notable achievements include her work on the discovery of small molecules targeting the tandem tudor domain, as well as research into the accuracy of postoperative risk scores for survival prediction in Mechanically Assisted Circulatory Support Profile 1 continuous-flow left ventricular assist device recipients. As such, Nguyen brings a wide breadth of expertise to her field, and would be an excellent candidate for collaboration or mentorship opportunities in pediatric medicine and drug discovery.'),\n",
- " 0.35428685),\n",
- " (Document(page_content='45080804.pdf\\n=====\\nMing Jiang is currently an Assistant Professor at the Department of Pathology & Immunology, Baylor College of Medicine. His main research interests include the study of osteogenesis imperfecta, the role of nitric oxide in bone development, and various genetic disorders related to skeletal development. He has made substantial contributions in these areas, having been a part of several collaborations that resulted in multiple publications, including work on mouse models for genetic conditions like lysinuric protein intolerance and osteogenesis imperfecta, investigations on the role of nitric oxide in bone anabolism and lung alveolarization, and studies on how molecular alterations affect classic Ehlers-Danlos syndrome. He has also been involved in a notable study investigating argininosuccinate lyase deficiency leading to chronic liver disease. In addition to his scientific research, Jiang is recognized for his mentorship and collaboration skills based on his extensive network of co-authors and collaborators. His proficiency in addressing complex scientific questions and his dedication to advancing our understanding of bone disorders make him an excellent candidate for research collaboration or mentorship.'),\n",
- " 0.35565436),\n",
- " (Document(page_content='265805.pdf\\n=====\\nMichael Scheurer is a distinguished professor at the Baylor College of Medicine. He holds a multitude of roles across several departments including Pediatrics, Medicine, Molecular Virology & Microbiology, and the Duncan Cancer Center. His primary research interests revolve around understanding cancer and its various aspects, with specific focus on Pediatric HIV/AIDS, Infection-Related Malignancies, acute lymphoblastic leukemia, and cervical dysplasia. A significant contribution is his role as the Principal Investigator on numerous NIH-funded projects studying various subjects including survivorship and access to care for Latinos (SALUD), admixture analysis of acute lymphoblastic leukemia in African American children (ADMIRAL Study), and molecular epidemiology of Langerhans Cell Histiocytosis. He has also been involved as a Co-Principal Investigator in studies examining phenotype-genotype associations with symptoms during childhood leukemia treatment, and the impact of infection and inflammation in adult Gliomas. He has numerous publications related to these research areas and has made significant contributions to understanding these complex medical conditions.'),\n",
- " 0.35601702),\n",
- " (Document(page_content='268785.pdf\\n=====\\nJonathon McNeil is an associate professor at the Baylor College of Medicine, operating from the Department of Pediatrics. His main research interests are centered on the impact of healthcare exposure on colonization with antiseptic tolerant staphylococci in children, and resistance of Staphylococcus aureus to topical antimicrobial agents. He\\'s held the position of Principal Investigator for related NIH-funded projects. His notable publications cover topics like \"The Impact of Healthcare Exposure on Colonization with Antiseptic Tolerant Staphylococci in Children\", \"Staphylococcus aureus resistance to topical antimicrobial agents\" and \"Predictive Factors to Guide Empiric Antimicrobial Therapy of Acute Hematogenous Osteomyelitis in Children\". Other significant contributions include his research into Methicillin-Resistant Staphylococcus aureus (MRSA), bacteremia, and numerous other infections. He holds a prominent role within the research field, having close collaboration with a considerable number of co-authors. His work has added notable depth to the understanding of Staphylococcal infections and MRSA.'),\n",
- " 0.36245987),\n",
- " (Document(page_content=\"266540.pdf\\n=====\\nProfessor Roger Rossen is associated with the Baylor College of Medicine, affiliated with both the Department of Pathology & Immunology and the Department of Medicine Division. His main research interests and areas of expertise are indicated as being deeply involved in studies of innate immunity in HIV infection, monocyte HIV 1 infection and neurological function, complement and leukocytes in myocardial infarction, and antigen-antibody complexes in cancer patients. \\n\\nRossen's notable achievements include multiple roles as a principal investigator in NIH research activities and fundings. His publication topics range from exploring the relationship between smoking and emphysema, the benefits of antifungal therapy in asthmatic patients, to the development of IgM anti-cocaine antibodies in habitual cocaine users. Additionally, he has been part of a double-blind, placebo-controlled efficacy trial for a cocaine vaccine treatment.\\n\\nRossen's work contributes significantly to our understanding in immunology, allergy, and rheumatology, and positions him as a thought leader in these specialty areas, demonstrating his suitability for collaboration or mentorship prospects. His research findings have the potential to immensely impact the future direction of biomedical research and treatment development.\"),\n",
- " 0.3646026),\n",
- " (Document(page_content='1826853.pdf\\n=====\\nDr. Cameron Brown is an Assistant Professor at Baylor College of Medicine, specializing in the Department of Pathology & Immunology and the Department of Pediatrics Division Pediatrics-Tropical Medicine. His primary research interests lie in the study of Communicable Diseases, Beta-Lactam Resistance, and Escherichia Coli Proteins, with a special focus on matters related to wound infection, infectious diseases management, and antibiotic prescription in wound care settings. \\n\\nDr. Brown has an impressive range of publications under his belt, including significant contributions to research on the identification of filamentous fungi, the RNA detection systems in SARS-CoV-2, and the molecular detection of SARS-CoV-2 in children, among others. He has also worked notably on studies centered around bacterial drug resistance, specifically the evolution of beta-lactamases, which form a key part of his research. \\n\\nCollaborating with other experts like Timothy Palzkill and Yuriko Fukuta, Dr. Brown expands his professional network, promoting a multidisciplinary approach in his research. He is also involved in various research teams, such as the CDC Severe Monkeypox Investigations Team, further demonstrating his commitment to pediatric and infectious disease research. These contributions and his expansive professional profile make Dr. Brown a valuable asset in these specialized research areas, making him an excellent candidate for collaboration or mentorship.'),\n",
- " 0.3663271),\n",
- " (Document(page_content='16943340.pdf\\n=====\\nHeather Moore is an Assistant Professor at the Baylor College of Medicine, specifically in the Department of Pediatrics. Her main area of interest and expertise lies in the field of Pediatric Medicine, with a heavy focus on studying Staphylococcus aureus persistent in pediatric patients, exploring the gene expression profiles of peripheral blood mononuclear cells. She showcases notable achievements like her contribution to Rudolph\\'s Pediatrics 23rd Edition Self-Assessment and Board Review in 2022. Other significant recognition includes her involvement in Characterization of peripheral blood mononuclear cells gene expression profiles of pediatric Staphylococcus aureus persistent and non-carriers in Microbe Infect journal, and her contribution to the articles “Maltreatment of Children and Youth with Special Healthcare Needs” and \"Threats to the Medically Complex Child\". Moore\\'s contributions to the field of Pediatrics, notably to the understanding and treatment of complex medical conditions in children, makes her an excellent potential collaborator or mentor in related fields.'),\n",
- " 0.36791736),\n",
- " (Document(page_content='270201.pdf\\n=====\\nThe individual is Professor Mohan Pammi, based at the Baylor College of Medicine in the Department of Pediatrics. Trained in India, the United Kingdom, and the US, Pammi has developed expertise in the field of Neonatology, with a specific focus on neonatal infections and the developing microbiome in preterm infants. His research interests lie in combining multiomics, clinical data, and machine learning to derive predictive models for mortality and morbidity in very preterm infants. His research has resulted in multiple funded projects, including predictive models and biomarker discovery in preterm infants, and the evaluation of the microbiome induced epigenetic changes in intestinal inflammation and necrotizing enterocolitis. Prof. Pammi has also led studies in metagenomics of the blood microbiome in preterm infants with sepsis. A recipient of several awards, like the Arnold J. Rudolph award and Norton Rose Fulbright Award, Pammi is dedicated to advancing evidence-based medicine and practice in Pediatrics. His publications include research on hematological biomarkers in surgical necrotizing enterocolitis and leptin deficiency in uteroplacental insufficiency.'),\n",
- " 0.36823577),\n",
- " (Document(page_content='9596410.pdf\\n=====\\nDr. Shilpa Jain is an Associate Professor at Baylor College of Medicine in the Department of Pathology & Immunology. She has been involved in extensive research and her primary interests lie in pathologies such as Menorrhagia, von Willebrand diseases, and other areas within nuclear physics and hematology. She has partnered with various researchers such as Daniel Curry, Maya Balakrishnan, and Ramya Masand, among others, leading to significant contributions within her field. Jain has published notable research papers, including studies on the association between dietary patterns and metabolic dysfunction-associated steatotic liver disease in Hispanic patients and Higgs Boson Pair Production in proton-proton collisions. This wealth of experience and extensive expertise in her field demonstrates her suitability for collaboration and mentorship.'),\n",
- " 0.36881453),\n",
- " (Document(page_content=\"20201893.pdf\\n=====\\nKenneth Muldrew is an Associate Professor at Baylor College of Medicine in both the Department of Pathology & Immunology and the Department of Medicine, specifically within the Medicine-Infectious Disease Division. His primary research interests and areas of expertise include molecular diagnostic techniques, infectious diseases, vaginal smears, and the study of placenta and gammaproteobacteria. He has numerous publications, with a recurrent focus on antibiotic resistance, identifying causative organisms in infections, and developments in diagnostic testing for infectious diseases. He has also contributed to the area of placental testing for cases of suspected congenital Zika Syndrome. Working closely with other experts such as Barbara Trautner and Anthony Maresso, Professor Muldrew's work has largely revolved around improving our understanding and management of infectious diseases, making him a suitable candidate for collaborations or mentorship in this field.\\n\"),\n",
- " 0.36975446),\n",
- " (Document(page_content=\"269718.pdf\\n=====\\nJames Versalovic is a Professor at Baylor College of Medicine where he serves in the Department of Pathology & Immunology, Department of Molecular & Human Genetics, and Department of Molecular Virology & Microbiology, among others. His main research interests include gut L-Histidine metabolism and histamine signaling in colonic neoplasia, the human microbiome in pediatric abdominal pain and intestinal inflammation, NF-kB signaling modulation, and the effect of probiotics in pediatric Crohn's Disease. He has notably secured various NIH grants as Principal Investigator for these research areas. Additionally, James Versalovic has contributed to publications in prominent scientific journals such as Nature Protocols and Pediatric Research, and his work has been cited multiple times within the scientific community. His roles at the Baylor College of Medicine and contributions in the field of microbiome and pediatric health make him a valuable asset as a mentor or collaborator.\\n\"),\n",
- " 0.37079132)]"
- ]
- },
- "execution_count": 82,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "db.similarity_search_with_score(df['Mentee_Summary'][0], k=20)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ac7dd44c",
- "metadata": {},
- "source": [
- "## Running similarity search for the analysis "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 83,
- "id": "f821cab1",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1174/1174 [03:05<00:00, 6.32it/s]\n"
- ]
- }
- ],
- "source": [
- "sim_res = [db.similarity_search_with_score(mentee_cv, k=df.shape[0], fetch_k=df.shape[0]) for mentee_cv in tqdm(df['Mentee_Summary'].values)]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 84,
- "id": "321c6015",
- "metadata": {},
- "outputs": [],
- "source": [
- "res_acc = []\n",
- "for k in range(1, 201):\n",
- "    search_successful = []\n",
- "    rankings = []\n",
- "    for i, sim in enumerate(sim_res):\n",
- "        ground_truth = df[\"Mentor Profile\"][i]\n",
- "\n",
- "        found = False\n",
- "        for rnk, res in enumerate(sim[:k]):\n",
- "            res_mentor_id = res[0].page_content.split(\"=====\")[0].strip()\n",
- "            if ground_truth in res_mentor_id:\n",
- "                found = True\n",
- "                rankings.append(rnk + 1)\n",
- "                break\n",
- "        search_successful.append(found)\n",
- "\n",
- "    accuracy = sum(search_successful) / len(search_successful)\n",
- "    # Guard against k values with no hits at all\n",
- "    avg_rank = sum(rankings) / len(rankings) if rankings else float(\"nan\")\n",
- "    res_acc.append({\"k\": k, \"accuracy\": accuracy, \"hits\": sum(search_successful), \"avg_rank\": avg_rank})"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 85,
- "id": "20be69f7",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
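- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr style=\"text-align: right;\">\n",
- "      <th></th>\n",
- "      <th>k</th>\n",
- "      <th>accuracy</th>\n",
- "      <th>hits</th>\n",
- "      <th>avg_rank</th>\n",
- "    </tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr>\n",
- "      <th>195</th>\n",
- "      <td>196</td>\n",
- "      <td>0.940375</td>\n",
- "      <td>1104</td>\n",
- "      <td>17.593297</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>196</th>\n",
- "      <td>197</td>\n",
- "      <td>0.940375</td>\n",
- "      <td>1104</td>\n",
- "      <td>17.593297</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>197</th>\n",
- "      <td>198</td>\n",
- "      <td>0.940375</td>\n",
- "      <td>1104</td>\n",
- "      <td>17.593297</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>198</th>\n",
- "      <td>199</td>\n",
- "      <td>0.941227</td>\n",
- "      <td>1105</td>\n",
- "      <td>17.757466</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>199</th>\n",
- "      <td>200</td>\n",
- "      <td>0.941227</td>\n",
- "      <td>1105</td>\n",
- "      <td>17.757466</td>\n",
- "    </tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"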
- ],
- "text/plain": [
- " k accuracy hits avg_rank\n",
- "195 196 0.940375 1104 17.593297\n",
- "196 197 0.940375 1104 17.593297\n",
- "197 198 0.940375 1104 17.593297\n",
- "198 199 0.941227 1105 17.757466\n",
- "199 200 0.941227 1105 17.757466"
- ]
- },
- "execution_count": 85,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "pd_res = pd.DataFrame(res_acc)\n",
- "pd_res.tail()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6d502b3a",
- "metadata": {},
- "source": [
- "## Top-k accuracy"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 86,
- "id": "cf1410e0",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 86,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "sns.lineplot(x='k', y='accuracy', data=pd_res)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 87,
- "id": "aa2d9ca6",
- "metadata": {},
- "outputs": [],
- "source": [
- "search_successful = []\n",
- "rankings = []\n",
- "for i, sim in enumerate(sim_res):\n",
- "    ground_truth = df[\"Mentor Profile\"][i]\n",
- "\n",
- "    found = False\n",
- "    for rnk, res in enumerate(sim):\n",
- "        res_mentor_id = res[0].page_content.split(\"=====\")[0].strip()\n",
- "        if ground_truth in res_mentor_id:\n",
- "            found = True\n",
- "            rankings.append(rnk + 1)\n",
- "            break"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5238743a",
- "metadata": {},
- "source": [
- "## Similarity rank distribution\n",
- "\n",
- "The rank visualizations can guide the choice of a reasonable $k$. The cumulative plot suggests that $k = 36$ might be a reasonable choice: it gives decent accuracy for candidate selection while keeping the candidate pool small, since a greater $k$ makes the final selection harder."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 88,
- "id": "eed58ab7",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 88,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "sns.displot(np.array(rankings), log_scale=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 89,
- "id": "26b4f40e",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjcAAAGnCAYAAABLpnZwAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy80BEi2AAAACXBIWXMAAA9hAAAPYQGoP6dpAABUKklEQVR4nO3dd3hTZf8G8DtN96aU7pZWQDatgPRX2VIpQ5ClCChDQEFQoMpSoQyhCoqgoPiCLF94WVZUplLLkClTUTaFQhejdNOVnN8fD00IHbRp2jSn9+e6cjU55zkn3zShuXme55yjkCRJAhEREZFMmBm7ACIiIiJDYrghIiIiWWG4ISIiIllhuCEiIiJZYbghIiIiWWG4ISIiIllhuCEiIiJZYbghIiIiWWG4ISIiIllhuCEiIiJZMWq4OXDgAHr16gUvLy8oFAps27btidvs27cPLVu2hJWVFerXr481a9ZUep1ERERkOowabrKyshAYGIhly5aVqX1sbCx69uyJzp0748yZM5g4cSJGjRqFPXv2VHKlREREZCoU1eXCmQqFAj/++CP69OlTYpupU6dix44dOHfunGbZq6++itTUVOzevbsKqiQiIqLqztzYBZTHkSNHEBoaqrMsLCwMEydOLHGb3Nxc5Obmah6r1WqkpKSgdu3aUCgUlVUqERERGZAkScjIyICXlxfMzEofeDKpcJOUlAR3d3edZe7u7khPT8eDBw9gY2NTZJvIyEjMnj27qkokIiKiSnTz5k34+PiU2sakwo0+pk+fjvDwcM3jtLQ0+Pn54ebNm3B0dDRiZURERFUrO68AKZl5OBV3H/ez8zTLo07Fw9HaAoUDGqfiUvV+jjr2lojo3RSdGrpVsFpd6enp8PX1hYODwxPbmlS48fDwQHJyss6y5ORkODo6FttrAwBWVlawsrIqstzR0ZHhhoiIZCuvQI01h2ORlavCv4npOHj5DnLy1SVvkK4NO2ZWtsU2mRzWEN7ONmhb3xWW5kWHhqwtzGBlrqxw7aUpy5QSkwo3ISEh2Llzp86y3377DSEhIUaqiIiIqPIVqNT4Kz4N+QVq7Lt0B3kF2pCSnafC/47HwaeW7n/yb91/UOL+bCyUeJCvQt9nvDXL8lRqvNjcU/O4bm07NPIQvSQKRdlCRXVh1HCTmZmJK1euaB7HxsbizJkzcHFxgZ+fH6ZPn474+HisW7cOADBmzBgsXboUU6ZMwRtvvIHff/8dmzdvxo4dO4z1EoiIiAzu2p1MxN7Nwo6/ErHj70TkFpTS4/JQaWHm9f+ri6y8ArRv4IoXmnjA3sqk+jbKzaiv7sSJE+jcubPmceHcmGHDhmHNmjVITExEXFycZn1AQAB27NiBSZMmYcmSJfDx8cHKlSsRFhZW5bUTERGVV75KjQKV7hlYzt5KxZStfyEuJRvmZgoUqEs/Q0u9OnZITMvBsOf8NctUagn+te3QxEt3uoVPLRu42hedmiF31eY8N1UlPT0dTk5OSEtL45wbIiKqVPkqtWbi7p5/kjFj27knbKEr0NcZkCRM6dYIDT0camRQKVSe729590sREREZQVZuAS7fzkSfZYfKvE19N3ssf60lHKwtAABuDlYmNc+lOmG4ISIi/Q0fDqSmAmW4NqAcqdQSlv5+BafW/Yi130/FSx//gjtmNnjuj+2YGb0CmLgJAGD2MKOoJeCrQc/g+Ubaw6QVCsDWkl/HhsSrghMRyV1GBjBxIlC3LmBjAzz3HPDnn7ptJAmYORPw9BRtQkOBy5e1669fF9/CZ85UYeHVR75KjeOxKVhx4BquNXsWR3q9jjbz9qLeBzvxxd5LOOz+NJ4d9z3OpgMJaTma7ZRmCrzYwhPXInviWmRPXP+kJ3oFesHOylxzq7Rgk58PzJkD1KsHWFsDgYHA45cqUqmAGTOAgADxvterB8ydKz4PJoxRkYhI7kaNAs6dA77/HvDyAv77XxFe/v0X8H54KPCCBcCX
XwJr14ovuhkzgLAw0cba2rj1l1V+PmBhUeHd3MvMxU9nEnD02j0cvHwXeSo1VI9M8m2enoPziem4naG9tE++0gJjX22Lxp5iLkidH2LhcMgcV+f3qHA9evvoI/Fer1gBNGoE7NkD9O0LHD4MPPOMaPPpp8A334j3vWlT4MQJYMQIwMkJePdd49VeUVINk5aWJgGQ0tLSjF0KEVHly86WJKVSkrZv113esqUkffihuK9WS5KHhyQtXKhdn5oqSVZWkvS//4nH4v/y2lvHjmL5sGGS9NJLYlsPD0lycZGkt9+WpLy8kmuKiJCkwEBJWr5cknx8JMnGRpJeflk856NWrJCkRo1EHQ0bStKyZdp1sbGijo0bJalDB9Fm9Wqx7rvvJKlJE0mytBQ1jRun3e7+fUkaOVLKreUiZVvbSecbt5ZmzV0vtf74N6nu1O3SF20HSf+4BUgTe4ZLNx3dpDRLW+nnRu2lJhM3S3Wnbpe2NOtS5HeRcOZfSYqJEY/v3xfPs3q1JDk56b6ebdsk6ZlnRK0BAZI0a5Yk5eeX/HuqKE9PSVq6VHdZv36SNGSI9nHPnpL0xhult6kmyvP9zWEpIiI5KygQQw+P977Y2AB//CHux8YCSUmiN6eQkxMQHAwcOSIeHz8ufu7dCyQmAlFR2rYxMcDVq+Ln2rXAmjXiVporV4DNm4FffhFDJadPA2+/rV2/fr0YJps3Dzh/Hpg/X/QmrV2ru59p04AJE0SbsDDRCzFuHPDmm8DffwM//wzUr69t//LLUCUlo3+vGej++iIcsPfFO/PHIu/2XU0Tv9QkdL18FJOHf4z1Hy1Dl9sXcFA6jjMzX8CAP34AQkKA0aPF7yExEZ7Nni79tQLAwYPA0KGi1n//Bb79VvyO5s0reZv16wF7+9JvBw+WvH1ubunvOyCGKKOjgUuXxOOzZ8X67t2f/JqqMQ5LERHJmYOD+DKeOxdo3Bhwdwf+9z8RWgq/9JOSxM/HLkwMd3ftujp1xM/atQEPD912tWoBS5cCSqUY/ujZU3xhjh5dcl05OcC6ddphsa++Ett9/rnYf0SEuN+vn1gfEKANBcOGafczcaK2DQB8/DHw3nsiRBR69lkAQPKOvah15BiajVmHPHMxfHUn4mNYjDmF/9hex7W+g9Fd7Q+70wp0OboT3Ws5ie0z/oXtgQOArSUAS8DSErC1Lfp7KM3s2SKIFdb+1FPiPZkyRbzW4vTuLQJmaby9S14XFgYsWgR06CDm0kRHi1CqUmnbTJsGpKeL902pFOvmzQOGDCn7a6uGGG6IiOTu+++BN94QX4RKJdCyJTBoEHDypGH237Sp2G8hT0/Ra1IaPz/dL+aQEECtBi5eFIHs6lVg5EjdgFRQIHqUHtW6tfb+7dtAQgLQpYt2UUYOxv73FE7euI/XT23HrOwsnP5yEADATKGAzVIz4MEDBKvvI7iNH7DTEvD3h2WtR57H01PsuyLOngUOHdLtqVGpRMjLzhZh6XEODuKmryVLxO+vUSMxGbxePTGfZtUqbZvNm0UP0YYN4n08c0YERi8v3RBpYhhuiIjkrl49YP9+ICtL/C/d0xMYOFD0HgDaHojkZLGuUHIyEBT05P0/PolXoRBBRV+ZmeLnihVFey6Uj12U0c5Oe/+RCygnpD7AS8sO4c4jk37t8nJw264WXh0ciQldGqBfSx/tts7O2vuGfj2AeE2zZ+v2MhUqacL2+vXAW2+Vvt9du4D27YtfV6eOOEQ/Jwe4d08ElmnTtO87AEyeLJa9+qp43Lw5cOMGEBnJcENERCbAzk7c7t8XR84sWCCWBwSIgBMdrQ0z6enAsWPA2LHisaWl+PnokEZFxMWJXhYvL/H46FHAzAxo2FAMh3l5AdeulXl4RJIkfH44AQOd3LHtw2/xeYdsnfVeTtboPeJFeP7xPfZPfwHw99e/dkvL8v8eWrYUvVKPzv95kooOSxWythbt8vOBH34AXnlFuy47W/zeH6VUVjzMGRnD
DRGR3O3ZI47radhQTOSdPFkMVYwYIdYrFGIo4uOPgQYNtIeCe3kBffqINm5uomdk927Ax0d8YT4+RFQe1taiZ+Czz0SQevdd8aVb2Is0e7ZY5uQEdOsmJseeOCGC2cPrEBbacuImJm/9CwCQ2HYw5v26DPdsnbDvqVaoa6nG0roP4DB5kvgdrAwRr2nBAuDpp0XA2rFDHCL96BBXafz9RfC7fl1M6nVxefI2M2cCL74ohuMGDBCB4uxZcYj+xx8Xv01Fh6WOHQPi40VgjY8HZs0SoWXKFG2bXr3EUJmfnxiWOn1azNN54w39n7caYLghIpK7tDRg+nTg1i3xRdy/v/hCe3T4ZcoUMWz15pvijMPt2okgUzhkYm4uzoMzZ474om7fHti3T/+a6tcXQzQ9egApKeKL/+uvtetHjRLzUBYuFGHMzg5S8+a4P3osslKykZGQjiYAPttzEUvv22s2+6F5F7zRygMfbvwOEQdWQ+HqKsIEIELczp3Ahx+KYHfnjghTHToUnUxdmvffF8GsSRPgwQNxtNmThIUB27eL39+nn4rffaNG4nVWlpwcca6ba9dECOvRQ8y/enQI7quvRJB9+20xr8jLSwyFzZxZeXVVAV44k4iIqtasWWIuSBnOdpyTr8I3+67iz+spOHz1XqltP+zRGK+09oWTbcVP5EfVDy+cSUREJkmSJFxMzsCV25kYv+F0ie1sLZXIzlOhkYcD/Fxs8VbHemhVt1YVVkrVGcMNEREZ1T8Jaej55R+awFKSfs94Y3CwH1r7l2GOC9VoDDdERFSl4t6ZgsmeYai9/iR2/p2kWf54sHG2tUDrurUQ0aspfF2KOQ8MUQkYboiIqFIVqNT45a8EZOWqcOrGfUSdji+2XeeGdTC9R2PYWirhU4thhvTHcENERAaRnVeAQSuOAZIEMzMFAECtlnD2VpqmjWvWfbxy5U9cqe0LdUgI+rX0ho2FEv1a+kD5cBuiimK4ISKiCpu341+sOFj8IdH+KfHoevkoul4+ipYJFwAocOu9D+A3rm3VFkk1BsMNERGVy7U7mXh7/SmkZudrliWl52gbSBI2PmMOt5jdcIveBfurlyDZ2EARFgb0mQr07Ak/V1cjVE41BcMNERGV6EGeCoev3kW+SsLpm/eRkJqDX84mFGlnocpHcNw5rHC6BZud24EF8eIK4r16AZ8vgOKFF4q/OCRRJWC4ISIiHSlZedh/6TbWHLquM1/mcZ29rDHb6hYcd++Aw+97oExPB+rWBV5+WVzioG1bcWZjoirGTx0REWlIkoSWc38rstzOUolGno54cDMBkx5cQLM/Y+D55x9AXp64dlF4uAg0LVqIyxwQGRHDDRFRTXb9uuhtUSh0LkAJiIzSrakH5ja1guveXcB/twFHjogVHTqIi0++9FLFrrBNVAkYboiIaqply4Dx45EWcxCBux8ZfpIktEi6jCiv2zCP/Bn4919xRfCwMGD1aqBnT4ATgqkaY7ghIpIjlQo4eBBITAQ8PcVVvJVKzWpp4UIopkzBmv/rh1m7UmGhLkBw3Dl0vXwUL8efhM3tJO2E4PnzAU4IJhPCcENEJDdRUcCECcCtW9plPj7AkiVA377A3LlQRERgeZt++MutARZv/wzPXz0Bx9wsSHXrQjH4VU4IJpOmkCRJMnYRVak8l0wnIjI5UVHAgAHAY3/a85QWSLeyRXpwWzwVsxNXa3nBJ/02rFQFSHqqEWoNfhlWA/pzQjBVW+X5/mYkJyKSC5VK9Ng8DDYSgB+bdsZVFx8se24gPvj9O7wZ8yMkAHb5OYiu1wbmL/dH13GDAA8PhhqSDYYbIiK5OHgQuHULaigQXf9ZjO4/U2f1luZd4JOWDHO1Ck1y7qJb7J8wm3cYmPce0KYNcOyYkQonMiyGGyIimciJT8TYARGIqfdskXV9z/2Ovv/EoN31MzDbsB4YNAgoKACuXRNHQ1lZGaFiosrBcENEZOK2nY7HxE1nADgCjwWbj35f
iVF/btPdwNNT/DQ3B55+WtyIZIThhojIBOUWqHDoyl18u/8ajsWmFFm/bV04ghIv6S5UKMRRU+3bV1GVRMbBcENEZEJyC1RYvu8avth7qci6BQNaIOzan3B6dUDRDQsnCy9erHO+GyI5YrghIqrm0rLzMf5/pxCXko0b97KLrO/7jDcGB/vhWX8XoLUvYL61+PPcLF4M9OtXdYUTGQnDDRFRNfbXrVT0XnqoyHIHK3OseeNZtKrrUnSjfv3ENZ9KOUMxkZwx3BARVUMZOfkIifwdmbkFOsuXv9YS//dUbTjbWpa+A6US6NSp8gokqsYYboiIqpErtzPx4lcHkZOv1lk+oUsDTAxtAAVPtEf0RGbGLoCIyKQNHy6uw1QBkiTh4OU7mLTpDEIX7dcJNm38XXB5XndMeuHp6h1s9u0Tk5ZTU8XjNWsAZ2fj1UM1GsMNEcmfSgXMmAEEBAA2NkC9esDcubrXX5IkYOZMMT/FxgYIDQUuX9auv35dfHmfOWPQ0mIu3kbA9J14/bvj+PF0vGZ592YeODc7DJvHhMBCWc3+VHfqBEycqLvsuefE/B4nJ2NUVLyoKKB1axGy7OyAoCDg+++Ltjt/HujdW9RuZwc8+ywQF1fV1ZIBcViKiOTv00+Bb74B1q4FmjYFTpwARowQX2bvvivaLFgAfPmlaBMQIMJQWJg4e6+1tUHLeZCnQlxKNsb+9ySu3c3SWfd8IzeEv/A0mnnrERLy8wELCwNVWU6WluL6VNWJiwvw4YdAo0aivu3bxfvu5ibeWwC4ehVo1w4YORKYPRtwdAT++cfg7zlVMamGSUtLkwBIaWlpxi6FiKpKz56S9MYbusv69ZOkIUPEfbVakjw8JGnhQu361FRJsrKSpP/9TzwWfTvaW8eOYvmwYZL00ktiWw8PSXJxkaS335akvDzNrtRqtRS+6YzUaWGM1O7TaOmLtoOkf9wCpOlh46R4B1cp29xK+uu5rlJq4h3dGleskKRGjUQdDRtK0rJl2nWxsaKOjRslqUMH0Wb1arHuu+8kqUkTSbK0FDWNG6fd7v59SRo5UpJcXSXJwUGSOneWpDNntOsjIiQpMFCS1q2TpLp1JcnRUZIGDpSk9HTt6338dxEbK0kxMeL+/fui3erVkuTkpPt6tm2TpGeeEbUGBEjSrFmSlJ//+LtVuZ55RpI++kj7eOBASXrttaqtgfRSnu/vatbXSURUCZ57DoiOBi49PPHd2bPAH38A3buLx7GxQFKSGIoq5OQEBAcDR46Ix8ePi59794rhl6gobduYGNEDEBMjen7WrBE3AGq1hN5LD+GHU7cQezcLN1MeAADq3k/ES5cOYe6bn0C1cyea374Gp/cmaPe5fr0YJps3TwybzJ8vepPWrtV9bdOmiXPanD8veiO++QYYNw54803g77+Bn38G6tfXtn/5ZeD2bWDXLuDkSaBlS6BLFyDlkbMcX70KbNsmejq2bwf27wc++USsW7IECAkBRo8Wv4fERMDX98nvwcGDwNChotZ//wW+/Vb8jubNK3mb9esBe/vSbwcPPvm5ARHDoqOBixeBDh3EMrUa2LFDXH4iLEz06AQHi9dOpq0Kwla1wp4bohpIpZKkqVMlSaGQJHNz8XP+fO36Q4dEr0NCgu52L78sSa+8Iu4X9pScPq3bZtgw0cNRUKC73cCB0tGrd6W6U7fr3P6MvSfFT5giqZVKSbp1S7vNrl2SZGYmSYmJ4nG9epK0YYPuc82dK0khIbr1LF6s28bLS5I+/LD438PBg6InJidHd3m9epL07bfifkSEJNnaantqJEmSJk+WpOBg7eOOHSVpwgTdfTyp56ZLF93fuSRJ0vffS5KnZ/G1SpKo4fLl0m/Z2SVvL0miB87OTrzvVlaiV6tQYqKo2dZWkhYtEu9tZKT4fOzbV/p+qcqV5/ubc26ISP42bxa9ABs2iDk3Z86ICbFeXsCwYRXff9OmOifIkzw8kPDHnxj4n6OaZbXtLHFo
2vOwtlACzjaAnx/g7a3dR0iI6Em4eBFwcBC9JyNHih6SQgUFRSfstm6tvX/7NpCQIHpiinP2LJCZCdSurbv8wQPxfIX8/UUNhTw9xb4r4uxZ4NAh3Z4alQrIyQGyswFb26LbODjo1qEPBwfxfmdmip6b8HDgqafEpGj1w6PSXnoJmDRJ3A8KAg4fBpYvBzp2rNhzk9Ew3BCR/E2eLIZvXn1VPG7eHLhxA4iMFOGmcCJscrL2itmFj4OCnrz/h5N407Lz8fX+K3A/fANN7mZqVk8KfRrjn68PpVkZD+XOfLjtihVimORRj59l2M5Oe9/G5sn79fQUh20/7tHDth+flKxQaIOAvjIzxYTd4i7/UNLk3fXrgbfeKn2/u3aVfiFQMzPtsFxQkBi+i4wU4cbVVVwZvUkT3W0aNxbDlmSyGG6ISP6ys8WX3KOUSu0XdkCACDjR0dowk54OHDsGjB0rHls+PCOwSlXsU8TezULnz/YBAGY+snz1iGfRuaFb0Q3i4kQvi5eXeHz0qKixYUPA3V0sv3YNGDKk7K/TwUH0ukRHA507F13fsqWYW2RuLtrpy9KyxN9DiVq2FL1Sj87/eZLevYuGu8c92vtVFmo1kJsr7ltaisO+L17UbXPpElC3bvn2S9UKww0RyV+vXmI4xM9PDCGdPg0sWgS88YZYr1CIYaqPPwYaNNAeCu7lpT1Bn5ub6BnZvVtchNLaWjNEVKCW0Psr3f/p+9W2wz+zw2BnVcKfWWtr0Wv02WciSL37LvDKK9pepNmzxTInJ6BbN/GFfOIEcP++GFopyaxZwJgxot7u3YGMDDEc9M47YsJ0SIh4TQsWiIm0CQliUm3fvrpDXKXx9xfB7/p1ManXpZjrWz1u5kzgxRfFezBggAhyZ88C586J33txKjosFRkpXlO9euL3t3OnOM/NN99o20yeDAwcKCYZd+4s3t9ffim+d4tMBo+WIiL5++or8YX69ttiyOH998Vwx9y52jZTpogA8Oab4n/zmZnii65wyMTcXJwH59tvReh56SUAQGZuAX6/cBsZD68B1cDNHkOC/eDlZF1ysAFED0a/fkCPHkDXrkCLFsDXX2vXjxoFrFwJrF4thtE6dhRHFwUElP5ahw0TV//++msR5F58UXsyQoVCfMF36CDO9/L002Ko7sYN0VtUVu+/L3q+mjQB6tQp2wnvwsLEkVe//ip+v//3f8AXX1RuD0lWlnjPmzYF2rYFfvgB+O9/xe+2UN++Yn7NggXi97xypWjXrl3l1UWVTiFJj56iU/7S09Ph5OSEtLQ0ODo6GrscIjJB/yakY8ZP53Dyxn2d5T61bLD9nXZPvqjlrFnicGMDn+2YSM7K8/3NYSkiolJIkoR7WXlQSxJO3UjFhI2nkVtQdHLtoDZ+mN+3WfW+/hNRDcFwQ0RUgszcAoxeewJHrt0rdn3XJu54t0sDNPF0hFlZj4QiokrHYSkiosdIkoQTN+7j5eVHil0//Dl/TOveSJyzhoiqBIeliIj0dC8zF+9vOYuYi3c0y5p4OmLjW/8HR2sjXZSSiMqF4YaI6CGVWkK3JQdxJyNXs2xyWEOM61yOc7M8SXw88NNP4iihnj0Nt18i0mC4ISICcOt+NubtOK8JNk29HPHloGdQr459xXYsSeKsuNu2iduff4rDyqdNY7ghqiQMN0RU4y3ffxWf7LqgeexgZY5t49rCQqnnqcDUanGSu8JAc+mSuExCt27ixHw9ewK1ahmkdiIqiuGGiGqstOx8bD5xUyfY+NSywX9eb13+YJObC/z+uwgzP/0krktVp464hMDnn4uLWT7p2k9EZBAMN0RUY606FIsl0Zc1j78c9AzCmrrDyryMR0GlpYkz/m7bJn5mZoorTr/2mrjEQUhI0QtdElGlY7ghohpHpZbwxpo/cSxWnL+msacjRrcPQO9ArydvnJAA/Pwz8OOPQEwMkJ8vLgo5daoINE2bisscEJHRMNwQUY1y8kYK5u+8oLl0
gn9KPMb2bYzeLX1K3ujCBe38mWPHRG9Mx45iuOmll8TFIImo2mC4IaIa4W5mLiJ3XsAPp25plvU99zsW7VoMRddNQPBT2sZqNXD8uDbQXLwI2NqKCcHr1okJwWW5EjYRGQXDDRHJk0oFHDwIKSERc9NcsOpGgc7qeXePYPDOL6B44w1xZei8PN0JwUlJgKurmBC8cCEQGsoJwUQmQs/jHA1n2bJl8Pf3h7W1NYKDg3H8+PFS2y9evBgNGzaEjY0NfH19MWnSJOTk5FRRtURkEqKiAH9/5HcJxR8fLtAJNo7W5thrfhZDvpsHxahR4iim114TRzZ17w78+isweDBw4IAION99B/TqxWBDZEKM2nOzadMmhIeHY/ny5QgODsbixYsRFhaGixcvws3NrUj7DRs2YNq0aVi1ahWee+45XLp0CcOHD4dCocCiRYuM8AqIqNqJigIGDAAkCSNfno0DT7UCAFjl5+Lrnz9Fhxa+sIj6AfD3B1avBlasEBOC339fTAhu1owTgolMnFHDzaJFizB69GiMGDECALB8+XLs2LEDq1atwrRp04q0P3z4MNq2bYvBgwcDAPz9/TFo0CAcO3asSusmompKpQImTBBnBQZwxdUXAOCSnYb+56LR5cpx4MrD3uEHD4CuXcWwU4cOQP36gAWvHUUkB0YblsrLy8PJkycRGhqqLcbMDKGhoThypPgr8T733HM4efKkZujq2rVr2LlzJ3r06FHi8+Tm5iI9PV3nRkQydfAgcOsWMi1tsLNhW2RbWAMA1m6eiQ9jVmnbvfCCCDNHjgBjxgBNmgA+PuKwbiIyeUbrubl79y5UKhXc3d11lru7u+PChQvFbjN48GDcvXsX7dq1gyRJKCgowJgxY/DBBx+U+DyRkZGYPXu2QWsnomoqMREAENlpBNY/o/1Pj7lapdtuxAhg0CDRw3P7trj204MH7LkhkgmjTyguj3379mH+/Pn4+uuvcerUKURFRWHHjh2YO3duidtMnz4daWlpmtvNmzersGIiqkoX7d3x+itz8EvjDgCAevduYtjJX9Dwzg3dhp6e4qdCAbi7A506icnERCQLRuu5cXV1hVKpRHJyss7y5ORkeHh4FLvNjBkz8Prrr2PUqFEAgObNmyMrKwtvvvkmPvzwQ5iZFc1qVlZWsLKyMvwLIKJq5ca9LPwnuxYOBrTULPvo95XofO2ktpFCIYaf2rc3QoVEVFWM1nNjaWmJVq1aITo6WrNMrVYjOjoaISEhxW6TnZ1dJMAoH163RXo4gZCIap69/yaj48J9+OF0AgAg7NJhRH3/Pjo9HmwAYPFiXu+JSOaMOiwVHh6OFStWYO3atTh//jzGjh2LrKwszdFTQ4cOxfTp0zXte/XqhW+++QYbN25EbGwsfvvtN8yYMQO9evXShBwiqjkkScKf11Ow61wSAMDGQomn3e3xVp/WaGmWCZ0Dun18gK1bgX79jFIrEVUdox4KPnDgQNy5cwczZ85EUlISgoKCsHv3bs0k47i4OJ2emo8++ggKhQIfffQR4uPjUadOHfTq1Qvz5s0z1ksgIiO4nZGDz/ZcxOYTt3SWv9jCEwtfDhQPhvQWR08lJoo5Nu3bs8eGqIZQSDVsPCc9PR1OTk5IS0uDo6OjscshonK4cjsDn+y6iL3nk4usa1u/Nt7r2hAt/WoZoTIiqmzl+f7mtaWIqNr7/UIyvvr9Ck7Hpeosd7KxwIwXm+CFJu5wsuFh3EQkMNwQUbW37sgNnWDTxt8Fo9oHoMPTdWBtwaEmItLFcENE1dasn//B9r8SkfYgDwAwql0Awpp5oJVfLZiZ8fpPRFQ8hhsiqnZOxd1HYmoO1h+7gXyVmBZopgB6BXoh0NfZuMURUbXHcENE1UJqdh6iz9/G2VupWHdE94zCG0YFo767PdwcrI1UHRGZEoYbIjKqrNwCpGTl4b0tZ3E8NkVnXXCAC4J8nfFcfVcjVUdEpojhhoiMJiUrDx0XxCAjt0Bnedcm
7ng9pC7aN6hjpMqIyJQx3BCR0Vy/l6UJNraWSrjaW+G/I4PhV9vWyJURkSljuCGiKnc8NgU/nr6FOxm5AAA/F1scmNLZyFURkVww3BBRlckrUEMtSZiz/R+ci0/XLOcJ+IjIkBhuiKhK/PpPEsb/7zTyCtSaZa8+6wu/2rZ4obG7ESsjIrlhuCGiKnEsNkUn2NRxsMKUbo3gYmdpxKqISI4Yboio0iSkPsCkTWeQkpWHO5lifs2odgGY9MLTsDI3g7nSzMgVEpEcMdwQUaXZf+kOjj127pqn6tjDzop/eoio8vAvDBEZVGZuAb6Kvow7mbm4dicLANC6bi2817UhHKzN0dTL0cgVEpHcMdwQkUFFn0/Gtweu6Sx7qo4dQurVNlJFRFTTMNwQkUHlPpw0XK+OHQY+6wsLpRl6tvA0clVEVJMw3BBRpfBzscWbHeoZuwwiqoEYboiowtKy8/HfYzeQnpOPC4kZxi6HiGo4hhsiqrDNJ25i4Z6LOst4RBQRGQv/+hCR3gpUYn5NRk4+AKCxpyPa1a8NC6UZXm7ta8zSiKgGY7ghIr0s2XsZi6MvQZK0y1rXrYUPezYxXlFERAB4elAi0kvMxds6wUZppkBr/1rGK4iI6CH23BBRhSx5NQidnnaDhbkCtpb8k0JExse/RERUJpIkYd2RG4hLyQYA3Lr/AABgZ2kOJ1sLY5ZGRKSD4YaIyuSfhHRE/PxPkeX21vwzQkTVC/8qEVGZZOUWAACcbS0wqI0fAMDLyRrP+rsYsywioiIYboioXGrbWWJqt0bGLoOIqEQMN0RUov8evYEfTt0CAGTmFBi5GiKismG4IaISLf39CpLSc3SWeTrZGKkaIqKyYbghIo0Nx+KwLOYK1A9PYJOcIYLNjBebwLeWDcwUCjwbwDk2RFS9MdwQkcaWkzcRn/pAZ5mNhRL9W3rD2dbSSFUREZUPww1RDfbL2QT8cjZB8/jq7UwAoqcm+GEPjbezDYMNEZkUhhuiGmz+zvNITMspsrylnzOaeTsZoSIioopjuCGqwfIfXtX73S4N4OFoDQDwdLJGkK+zEasiIqoYhhsiQs/mnmjo4WDsMoiIDIJXBSciIiJZYc8NUQ2RlJaDgf85guRHzluTk682YkVERJWD4YaohjgVdx837mUXWV7bzhLetXhiPiKSD4Ybohom0McJy4a01Dx2tbeCtYXSiBURERkWww1RDWNlroRPLVtjl0FEVGkYbohkKiUrD1cenpQPgM59IiI5Y7ghkqG8AjVeWLQf97LyiqxTKIxQEBFRFWK4IZKh7LwCTbB5ytUOeBhozM0UGBzsZ8TKiIgqH8MNkcz9OqkDzJU8pRUR1Rz8i0dERESywp4bIhm4l5mL+NQHmscZOQVGrIaIyLgYbohMXEpWHtp++jvPNkxE9BDDDZGJS0h9gJx8NcwU0FzZu9Dzjd0434aIahyGGyKZcHOwxuHpXYxdBhGR0fG/dERERCQr7LkhMiG7zyXh2wNXoVJLmmXZeSojVkREVP0w3BCZkNWHYnE6LrXYdV7O1sUuJyKqaRhuiEyIWhI9Nm93qodn/V20KxRAS79aRqqKiKh6YbghMkHNvZ3QuZGbscsgIqqWGG6Iqqm07HxIkHSW5aukEloTEVEhhhuiamjK1rPYfOKWscsgIjJJPBScqBo6FptS4ro6DlZo4etcdcUQEZkY9twQVWNbxoQUmShspgAUCoWRKiIiqv4YboiqMTOFAkozBhkiovLgsBQRERHJCntuiIwkIycfv/6TjAf5Rc8wnJFTYISKiIjkgeGGyEiW77+KZTFXS21jZc7OVSKi8jL6X85ly5bB398f1tbWCA4OxvHjx0ttn5qainHjxsHT0xNWVlZ4+umnsXPnziqqlshwUrLyAAD13ewR1tS9yO2tjk+hiaejkaskIjI9Ru252bRpE8LDw7F8+XIEBwdj8eLFCAsLw8WLF+HmVvTsq3l5eXjhhRfg5uaGrVu3wtvbGzdu
3ICzs3PVF09kIH2CvDD++QbGLoOISDaMGm4WLVqE0aNHY8SIEQCA5cuXY8eOHVi1ahWmTZtWpP2qVauQkpKCw4cPw8LCAgDg7+9flSUTERFRNWe0cJOXl4eTJ09i+vTpmmVmZmYIDQ3FkSNHit3m559/RkhICMaNG4effvoJderUweDBgzF16lQolcpit8nNzUVubq7mcXp6umFfCFEJ8lVq9Fl2CP8k8DNHRFSVjDbn5u7du1CpVHB3d9dZ7u7ujqSkpGK3uXbtGrZu3QqVSoWdO3dixowZ+Pzzz/Hxxx+X+DyRkZFwcnLS3Hx9fQ36OohKEn//wRODjYVSgUCebZiIyKBM6mgptVoNNzc3/Oc//4FSqUSrVq0QHx+PhQsXIiIiothtpk+fjvDwcM3j9PR0BhyqUnaWSuyf0rnYdTYWSthZmdQ/QyKias9of1VdXV2hVCqRnJysszw5ORkeHh7FbuPp6QkLCwudIajGjRsjKSkJeXl5sLS0LLKNlZUVrKysDFs8UTmYKRRwtednkIioqhhtWMrS0hKtWrVCdHS0ZplarUZ0dDRCQkKK3aZt27a4cuUK1Gq1ZtmlS5fg6elZbLAhIiKimseo57kJDw/HihUrsHbtWpw/fx5jx45FVlaW5uipoUOH6kw4Hjt2LFJSUjBhwgRcunQJO3bswPz58zFu3DhjvQQiIiKqZvQallKpVFizZg2io6Nx+/ZtnZ4UAPj999/LtJ+BAwfizp07mDlzJpKSkhAUFITdu3drJhnHxcXBzEybv3x9fbFnzx5MmjQJLVq0gLe3NyZMmICpU6fq8zKIiIhIhvQKNxMmTMCaNWvQs2dPNGvWDAqF/lctHj9+PMaPH1/sun379hVZFhISgqNHj+r9fERERCRveoWbjRs3YvPmzejRo4eh6yEyGQmpD/Dj6Xjkq9TFrk/Nzq/iioiICNAz3FhaWqJ+/fqGroXIpHz260VEnYrXPN64YRr+dXsKc0Lf1GlnY1n8CSZla80aYOJEIDVVPJ41C9i2DThzxlgVEVENo9eE4vfeew9LliyBJEmGroeocs2aBSgUurdGjXTb5OQA48YBtWsD9vZA//7Ao6cs2LcPUChQcO8+AKBNgAte+z8/uDtao5GnA177Pz+d24IBLars5VU5f39g8WLdZQMHApcuGaOakpXlfSci2dCr5+aPP/5ATEwMdu3ahaZNm2qu81QoKirKIMURVYqmTYG9e7WPzR/7ZzBpErBjB7BlC+DkBIwfD/TrBxw6VOzu+gR5Y3CwH7DYDgH1XPFcn+aVWHwZSBKgUhV9XVXFxkbcqpsnve9EJBt69dw4Ozujb9++6NixI1xdXXUub+Dk5GToGokMy9wc8PDQ3lxdtevS0oDvvgMWLQKefx5o1QpYvRo4fBg4ehS4fh3oLM42/OXbz+P6py/i/+a8p91erQamTAFcXMS+Z80qvZbhw4E+fYDZs4E6dQBHR2DMGCAvT3efkZFAQIAIDYGBwNat2vUPe5Kwa5eo18oK+OMPsd2CBUD9+mKZnx8wb552u5s3gVdeAZydRb0vvSRe3+O1ffYZ4OkperLGjQPyH84l6tQJuHFDhMHC3hBADEs5O5f+uleuBBo3BqytRQ/K11+X3t4QSnvfiUhW9Pqvy+rVqw1dB1HVuXwZ8PISX6whISI4+PmJdSdPii/v0FBt+0aNxPojR4B33wV++AHo3x8ffboVe+IeYErfIDxV2HbtWiA8HDh2TLQfPhxo2xZ44YWS64mOFrXs2yfCxYgRIkgUBpHISOC//wWWLwcaNAAOHABee02EoY4dtfuZNk0EkaeeAmrVAqZPB1asAL74AmjXDkhMBC5cEG3z84GwMPH6Dx4UX/wffwx06wb89RdQeFLMmBgRbGJigCtXxJBTUBAwejQQFSWC1ptvisdltX49MHMmsHQp8MwzwOnTYns7O2DYsOK3mT9f3Erz77/a97E4pb3vRCQrFeqXvXPnDi5evAgAaNiw
IerUqWOQoogqTXCw6Flo2FB82c+eDbRvD5w7Bzg4AElJ4ov98Z4Hd3exTqkUvRwA0h1ccMc+G/n2jtp2LVoAhdc5a9BAfIFHR5cebiwtgVWrAFtbMXQyZw4weTIwd64IIfPni+GUwjN3P/WU6Jn59lvdcDNnjvZ5MjKAJUvE8xcGhnr1RMgBgE2bRM/OypXaHpfVq8Xr3rcP6NpVLKtVS+xDqRQhr2dP8XpGjxa/B6VS/N5KuGRKsSIigM8/F0N9gOiR+vdf8XpKCjdjxoheptJ4eZW87knvOxHJil7hJisrC++88w7WrVunOYGfUqnE0KFD8dVXX8HW1tagRRIZTPfu2vstWogvvbp1gc2bgZEjK77/Fo9NHvb0BG7fLn2bwEARbAqFhACZmWLYKDMTyM4uGo7y8kSvx6Nat9beP38eyM0FunQp/jnPnhU9MY9/sefkAFevah83bSoCzKOv5++/S389pcnKEvsfOVK3t6egQMxvKomLiyZU6qWy33ciqlb0Cjfh4eHYv38/fvnlF7Rt2xaAmGT87rvv4r333sM333xj0CKJKo2zM/D00+KLHhA9EHl54jDmR3tvkpPL1jvx2OR6KBSih0RfmZni544dgLe37rrHLwhrZ6e9/6QJvZmZYn7O+vVF1z3aA1tZr2fFChEwHqUs5ZB5QwxLPerx952IZEWvcPPDDz9g69at6NSpk2ZZjx49YGNjg1deeYXhhkxHZqboSXj9dfG4VSvAwgJ5e35D7kt9AACKSxdhHxeHrJbPQp2TDzPJDHYA1AUFhqnh7FngwQNtIDl6VByC7usreiusrIC4ON0hqCdp0EDsLzoaGDWq6PqWLcXQlJubmMSsL0tLcWRWWbm7i+Gja9eAIUPKvl1Fh6Ue9/j7TkSyole4yc7O1lz/6VFubm7Izs6ucFFEleb994FevcSQREKCmP+hVAKDBon1Tk6I6zsIyjfH4f1tV5BhZYvZvy0HvBqh/640YNevcM+4iyNQwGr3LrjUaw3z7KyK1ZSXJ4ZGPvpITCiOiBCHn5uZiWGj998XRySp1WLOTFqaOCzd0bHkOSrW1sDUqeLILUtLMan5zh3gn3/Ecw0ZAixcKI6QmjMH8PERRz5FRYltfHzKVru/v5jg/OqrIoSV5Qik2bPFxGwnJzGBOTcXOHECuH9fTMYuTkWHpZ70vhORrOgVbkJCQhAREYF169bB2toaAPDgwQPMnj0bIYWTHomqo1u3xBfavXti+KVdO9FT8shQzPqBE+F9+R6+2TYflqp8HAhoiRkvvK1Zn+zgii/aDcbU/WuwcOdipCcPAjoXM7xTVl26iJ6WDh3EF/2gQbqHkM+dK+qLjBQ9Hs7Oouflgw9K3++MGeIoqJkzxRe6p6foAQHEHJ8DB0QA6tdPTED29ha1lKcnZ84c4K23xGTl3Fxxjp0nGTVKPP/ChWLitJ0d0Ly5OKtxZSnD+05E8qGQ9DjN8Llz5xAWFobc3FwEBgYCAM6ePQtra2vs2bMHTZs2NXihhpKeng4nJyekpaXBsSLd8SRbH2//Fyv/iMXo9gF4r2vDUtuamylgrtTrdFHC8OFifs+2bfrvg4ioBijP97dePTfNmjXD5cuXsX79elx4eN6MQYMGYciQIbCpjmcmJdKD0swM1hY17LpQREQyoPd5bmxtbTG6PCfuIiIiIqoCZQ43P//8M7p37w4LCwv8/PPPpbbt3bt3hQsjqhHWrDF2BUREslPmcNOnTx8kJSXBzc0Nffr0KbGdQqGAqjyHhhIREREZUJnDjfqRE3epK3ISL6KaLCFBXJtKoRCHexMRkcHpdZjHunXrkJubW2R5Xl4e1q1bV+GiiGQlIQH46itxqLePD/Dee+LyCEREVCn0CjcjRoxAWlpakeUZGRkYMWJEhYsiMnnFBRoHB3GBzORkYNkyY1dIRCRbeh0tJUkSFIVXEn7ErVu34FTaxe+I5KxwyGnLFnHVbnNzccHLVavEmYBr1TJ2hURE
NUK5ws0zzzwDhUIBhUKBLl26wNxcu7lKpUJsbCy6detm8CKJqq3ERBFoNm9moCEiqibKFW4Kj5I6c+YMwsLCYG9vr1lnaWkJf39/9O/f36AFEhnC7nNJuH6vbNeAOnsrtfQGDDRERNVaucJNREQEVCoV/P390bVrV3h6elZWXUQGc/VOJsb892SpbereT0DbG2exIai7Zpm1xSNT0goDzZYtwMGDDDRERNVYuefcKJVKvPXWWzjPoz3IRKQ9yAcA2Fkq0b150UBeJ+E6xvznQ6TVdkfeSHHWbXsrcwz2tQCWLtUGGqWSgYaIyATofW2pa9euISAgwND1EBmeSpyXqTby8Vmd+0D79iKoAMDFi8CEsYBnHTj9vhefqdW6PTSFgea770SgcXEx4gshIqKy0OtQ8I8//hjvv/8+tm/fjsTERKSnp+vciKqNqCigb19xPzkZ6NwZ8PcXyy9eFI8dHIAhQ4BXXgG8vYFJkwA7OxFokpOBnTuBESMYbIiITIRePTc9evQAIK4h9egh4YWHiPPyC1QtREUBAwYAnk/rLo+PB/r3B2xtxePERCAiAggNZQ8NEZEM6BVuYmJiDF0HkWGpVMCECYAkFV1XuCw7GzAzA7y8gMBAoEkTIC8PyMxkuCEiMmF6hZuOHTsaug4iwzp4ELh168ntxo4V13m6cgXYtg24fh0YOlRMGiYiIpOkV7gBgNTUVHz33Xeao6aaNm2KN954g2copuohMbFs7dq2BQYN0j7Oz9dONiYiIpOk14TiEydOoF69evjiiy+QkpKClJQULFq0CPXq1cOpU6cMXSNR+ZX1HEyPt7OwEENVRERksvTquZk0aRJ69+6NFStWaC7BUFBQgFGjRmHixIk4cOCAQYskKrf27cUFK+Pji1+vUIj17dtXbV1ERFTp9O65mTp1qs61pczNzTFlyhScOHHCYMUR6U2pBJYsefjgsYu8Fh7ht3gxh6CIiGRIr3Dj6OiIuLi4Istv3rwJBweHChdFZBD9+gFbtwJ16ugu9/ERy/v1M05dRERUqfQKNwMHDsTIkSOxadMm3Lx5Ezdv3sTGjRsxatQoDHp0ciaRsfXrB/z4o7jv7g7ExACxsQw2REQyptecm88++wwKhQJDhw5FQUEBAMDCwgJjx47FJ598YtACiQrdycjFt/uvIiOnoFzb3cvKFXfs7YFOnQxfGBERVSt6hRtLS0ssWbIEkZGRuHr1KgCgXr16sC084ytRJdh68hZW/hGr9/aONnqf+YCIiExIhf7a29rawtnZWXOfqDLl5IvLegT5OuOFJu7l3l6fbYiIyPToFW4KCgowe/ZsfPnll8jMzAQA2Nvb45133kFERAQsLCwMWiTRo5p7O2Fc5/rGLoOIiKopvcLNO++8g6ioKCxYsAAhISEAgCNHjmDWrFm4d+8evvnmG4MWSURERFRWeoWbDRs2YOPGjejevbtmWYsWLeDr64tBgwYx3BAREZHR6HUouJWVFfz9/YssDwgIgKWlZUVrIiIiItKbXuFm/PjxmDt3LnJzczXLcnNzMW/ePIwfP95gxRERERGVl17DUqdPn0Z0dDR8fHwQGBgIADh79izy8vLQpUsX9HvkBGlRUVGGqZSIiIioDPQKN87Ozujfv7/OMl9fX4MURERERFQReoWb1atXG7oOIiIiIoOo0En87ty5g4sXLwIAGjZsiDqPX6CQiIiIqIrpNaE4KysLb7zxBjw9PdGhQwd06NABXl5eGDlyJLKzsw1dIxEREVGZ6RVuwsPDsX//fvzyyy9ITU1FamoqfvrpJ+zfvx/vvfeeoWskIiIiKjO9hqV++OEHbN26FZ0eucJyjx49YGNjg1deeYUn8SMiIiKj0SvcZGdnw9296EUI3dzcOCxFRfx1KxW/X7hd4f0cvXbPANUQEZHc6RVuQkJCEBERgXXr1sHa2hoA8ODBA8yePVtzrSmiQhM2nkHs3SyD7c/WUmmwfRERkfzoFW4WL16Mbt26FTmJ
n7W1Nfbs2WPQAsn0ZeTkAwB6tvBELduKXTHe1tIcw57zN0BVREQkV3qFm+bNm+Py5ctYv349Lly4AAAYNGgQhgwZAhsbG4MWSPLx7vMN0NDDwdhlEBGRzJU73OTn56NRo0bYvn07Ro8eXRk1EREREemt3IeCW1hYICcnpzJqISIiIqowvc5zM27cOHz66acoKCgwdD1EREREFaLXnJs///wT0dHR+PXXX9G8eXPY2dnprOeVwImIiMhYDHZVcCIiIqLqoFzhRq1WY+HChbh06RLy8vLw/PPPY9asWTxCioiIiKqNcs25mTdvHj744APY29vD29sbX375JcaNG1dZtRERERGVW7nCzbp16/D1119jz5492LZtG3755ResX78earW6suojIiIiKpdyhZu4uDj06NFD8zg0NBQKhQIJCQkVKmLZsmXw9/eHtbU1goODcfz48TJtt3HjRigUCvTp06dCz09ERETyUa5wU1BQoLmWVCELCwvk5+frXcCmTZsQHh6OiIgInDp1CoGBgQgLC8Pt26VfaPH69et4//330b59e72fm4iIiOSnXBOKJUnC8OHDYWVlpVmWk5ODMWPG6BwOXp5DwRctWoTRo0djxIgRAIDly5djx44dWLVqFaZNm1bsNiqVCkOGDMHs2bNx8OBBpKamludlEBERkYyVK9wMGzasyLLXXntN7yfPy8vDyZMnMX36dM0yMzMzhIaG4siRIyVuN2fOHLi5uWHkyJE4ePBgqc+Rm5uL3NxczeP09HS96yUiIqLqr1zhZvXq1QZ98rt370KlUsHd3V1nubu7u+aCnI/7448/8N133+HMmTNleo7IyEjMnj27oqUSERGRidDr8gvGkpGRgddffx0rVqyAq6trmbaZPn060tLSNLebN29WcpVERERkTHqdodhQXF1doVQqkZycrLM8OTkZHh4eRdpfvXoV169fR69evTTLCg9DNzc3x8WLF1GvXj2dbaysrHTmCBEREZG8GbXnxtLSEq1atUJ0dLRmmVqtRnR0NEJCQoq0b9SoEf7++2+cOXNGc+vduzc6d+6MM2fOwNfXtyrLJyIiomrIqD03ABAeHo5hw4ahdevWaNOmDRYvXoysrCzN0VNDhw6Ft7c3IiMjYW1tjWbNmuls7+zsDABFlhMREVHNZPRwM3DgQNy5cwczZ85EUlISgoKCsHv3bs0k47i4OJiZmdTUICIiIjIihSRJkrGLqErp6elwcnJCWloaHB0djV2OUanVEuJSslHZH4C+Xx9CanY+9kzsgIYeDpX8bEREJEfl+f42es8NGc/4/53Czr+TjF0GERGRQTHc1GD/JIgTGtpaKqE0U1TqczV0d0CAq92TGxIREVUQww3h+5HBaFW3lrHLICIiMgjO1CUiIiJZYbghIiIiWWG4ISIiIllhuCEiIiJZYbghIiIiWWG4ISIiIllhuCEiIiJZYbghIiIiWWG4ISIiIllhuCEiIiJZYbghIiIiWWG4ISIiIllhuCEiIiJZYbghIiIiWWG4ISIiIllhuCEiIiJZYbgxNZ06ARMnGrsK41uzBnB21j6eNQsICjJOLUREVK0w3FSGyEjg2WcBBwfAzQ3o0we4eFG3TadOgEKhexszRrt+3z6xLDW16uqurvz9gcWLdZcNHAhcumSMakp24ADQqxfg5SXeu23bjF0REVGNxHBTGfbvB8aNA44eBX77DcjPB7p2BbKydNuNHg0kJmpvCxYYp159SBJQUGC857exEcGxOsnKAgIDgWXLjF0JEVGNxnBTGXbvBoYPB5o2FV92a9YAcXHAyZO67WxtAQ8P7c3RUSy/fh3o3Fncr1VL9AIMH67dTq0GpkwBXFzEdrNmlV7P8OGi92j2bKBOHfE8Y8bAvCBfd5+RkUBAgAgOgYHA1q3a9YU9Sbt2Aa1aAVZWwB9/iO0WLADq1xfL/PyAefO02928CbzyihhCcnEBXnpJvL7Ha/vsM8DTE6hdWwTD/Ie1deoE3LgB
TJqk7eECig5LFWflSqBxY8DaGmjUCPj669LbV1T37sDHHwN9+1bu8xARUakYbqpCWpr46eKiu3z9esDVFWjWDJg+HcjOFst9fYEffhD3L14UvTpLlmi3W7sWsLMDjh0TwWLOHNFDVJroaOD8eRFS/vc/ICoKw/eu066PjATWrQOWLwf++UeEiddeE71Qj5o2DfjkE7GvFi1E3Z98AsyYAfz7L7BhA+DuLtrm5wNhYWJ47uBB4NAhwN4e6NYNyMvT7jMmBrh6Vfxcu1YElzVrxLqoKMDHR7zGwh6usli/Hpg5UwSt8+eB+fNFjWvXlrzN/PmivtJucXFle34iIjIac2MXIHtqtZgA3LatCDGFBg8G6tYV8zP++guYOlUEmagoQKnUBiE3t6I9FC1aABER4n6DBsDSpSK8vPBCyXVYWgKrVoneoqZNgTlz0H9COGa2HghFbq74Yt+7FwgJEe2fekr0zHz7LdCxo3Y/c+ZonycjQ4SupUuBYcPEsnr1gHbtxP1Nm8TrX7lS2+OyerV4Pfv2iaE6QPROLV0qXnejRkDPnuL1jB4tfg9KpQhIHh5l/71HRACffw706yceBwSI8PXtt9paHzdmjOhlKo2XV9lrICIio2C4qWzjxgHnzomg8Kg339Teb95cDMl06SJ6MOrVK32fLVroPvb0BG7fLn2bwEARbAqFhMA27wG80u/C6sY10Wv0eDjKywOeeUZ3WevW2vvnzwO5uaLu4pw9C1y5IoLJo3JyxOss1LSpCDCPvp6//y799ZQmK0vsf+RIEZAKFRQATk4lb+fiUrR3jYiITA7DTWUaPx7Yvl0cRePjU3rb4GDx88qVJ4cbCwvdxwqF6CHRk1nhROcdOwBvb92VVla6j+3stPdtbErfcWammJ+zfn3RdXXqaO8b+PUgM1P8XLFC+3st9GiIetz8+eJWmn//FfOKiIio2mK4qQySBLzzDvDjj2L4JSDgyducOSN+enqKn5aW4qdKZZiazp4FHjzQBpKjR5FtaYMER1fkNGgoQkxcnO4Q1JM0aCD2Fx0NjBpVdH3LlmJoys1NO1laH5aW5fs9uLuL4aNr14AhQ8q+HYeliIhkgeGmMowbJybW/vSTGJJJShLLnZxEGLh6Vazv0UMcHfTXX2ICb4cO2iGnunVFD8b27aKdjY2Y0KqvvDwxTPPRR+JopYgIRLXtA0lhBrW9A/D++6IGtVrMmUlLExOAHR1LnqNibS3mCk2ZIgJI27bAnTtiQvLIkSJYLFwojpCaM0f0Xt24IeYVTZny5N6sQv7+ovfr1VdFCHN1ffI2s2cD774rfufduonhsxMngPv3gfDw4rep6LBUZqboeSsUGytCq4sLe3uIiKoQw01l+OYb8bNTJ93lq1eLQ58tLcXk3cWLxfwQX1+gf38RPAp5e4sv6GnTgBEjgKFDtUcQ6aNLF9HT0qGD+KIfNAhrAvoDaQ/PVTN3rhgqiowUPR7OzqLn5YMPSt/vjBmAubk4MikhQfQ8FZ6M0NZWhJKpU8XE3owM8bq6dClfT86cOcBbb4nhutxc0TP2JKNGiedfuBCYPFkMpzVvXrlndz5xQnsIP6ANUcOGVey9IyKiclFIUlm+KeQjPT0dTk5OSEtLg2NFhkpMyfDh4kzHj50xt+PCGNy4l40fxj6HVnVrGaMyIiKiMinP9zfPc0NERESywnBDREREssI5NzUB53sQEVENwp4bIiIikhWGG7nKyADi441dBRERUZVjuJGTjAxxUcy+fcVh3aVda4qIiEimGG5M3eOBZvBgceXsefOAX381dnVERERVjhOKTVFGhjhz8ebNwK5d4sR2wcEi0AwYIM5uTEREVEMx3BjQjr8Scf1eVqXs2yI7E/5HYlB/3y74Hd8P8/w8JDUOxNURk3C1QzdkeDy84OW1fODaldJ39lBqdn6l1EpERGRMDDcGcuV2JsZtOGXQfdrlZqPL1T/R88JBdLp2ElaqfJz2bIhP2r2GXQ3bId7J
TTQ8mwmcvaj389hYlHKlbCIiIhPDcGMg6TmiF8TOUolegSVcOVqSYJWdiVw7hxL3Y/kgC41PHkDzw7+i4ek/YJGfh7gGzbF3yDv4O+QFpNYR+25voLr9Xe3Q2LPkeoiIiEwNw42B1ba3wif9WxRdoVKJazzt3w/ExemuK5xDs2WLmEOTkyPm0ETOBwYMgF/duvAD0LMqXgAREZGJY7gxFJVa/MzMBPbtA9q3B5QPh3sKg82GDeIGlBxoPv6Yk4KJiIgqgIeCG0JUlDgUGwCSk4HOnQF/f7H80WDz3XeAWg306we4uYnDthMSRKC5fh04ehR47z0GGyIiogpgz01FRUWJnhbPp3WXx8cD/fsDbdsChw8DrVsDY8eyh4aIiKiSMdxUhEoFTJgASFLRdYXLDh0SP2/fBoYMAV57TQQeC4uqq5OIiKgG4bBURRw8CNy69eR23t5AUpIYlurcGbCxAaZPr/z6iIiIaiD23FREYmLZ2i1cCAwcKILQ5cviFhRUqaURERHVVAw3FeHpWfZ2ZmaAn5+4delSuXURERHVYByWqoj27QEfH0ChKH69QgH4+op2REREVCUYbipCqQSWLHn44LGAUxh4Fi/Wnu+GiIiIKh3DTUX16wds3QrUqaO73MdHLO/Xzzh1ERER1VAMN4bQrx/w44/ivrs7EBMDxMYy2BARERkBJxQbivJhTrS3Bzp1MmopRERENRl7boiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVqpFuFm2bBn8/f1hbW2N4OBgHD9+vMS2K1asQPv27VGrVi3UqlULoaGhpbYnIiKimsXo4WbTpk0IDw9HREQETp06hcDAQISFheH27dvFtt+3bx8GDRqEmJgYHDlyBL6+vujatSvi4+OruHIiIiKqjowebhYtWoTRo0djxIgRaNKkCZYvXw5bW1usWrWq2Pbr16/H22+/jaCgIDRq1AgrV66EWq1GdHR0FVdORERE1ZFRw01eXh5OnjyJ0NBQzTIzMzOEhobiyJEjZdpHdnY28vPz4eLiUuz63NxcpKen69yIiIhIvowabu7evQuVSgV3d3ed5e7u7khKSirTPqZOnQovLy+dgPSoyMhIODk5aW6+vr4VrpuIiIiqL6MPS1XEJ598go0bN+LHH3+EtbV1sW2mT5+OtLQ0ze3mzZtVXCURERFVJXNjPrmrqyuUSiWSk5N1licnJ8PDw6PUbT/77DN88skn2Lt3L1q0aFFiOysrK1hZWRmkXiIiIqr+jNpzY2lpiVatWulMBi6cHBwSElLidgsWLMDcuXOxe/dutG7duipKJSIiIhNh1J4bAAgPD8ewYcPQunVrtGnTBosXL0ZWVhZGjBgBABg6dCi8vb0RGRkJAPj0008xc+ZMbNiwAf7+/pq5Ofb29rC3tzfa6yAiIqLqwejhZuDAgbhz5w5mzpyJpKQkBAUFYffu3ZpJxnFxcTAz03YwffPNN8jLy8OAAQN09hMREYFZs2ZVZelERERUDRk93ADA+PHjMX78+GLX7du3T+fx9evXK78gIiIiMlkmfbQUERER0eMYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhW
GG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVhhuiIiISFYYboiIiEhWGG6IiIhIVqpFuFm2bBn8/f1hbW2N4OBgHD9+vNT2W7ZsQaNGjWBtbY3mzZtj586dVVQpERERVXdGDzebNm1CeHg4IiIicOrUKQQGBiIsLAy3b98utv3hw4cxaNAgjBw5EqdPn0afPn3Qp08fnDt3roorJyIiourI6OFm0aJFGD16NEaMGIEmTZpg+fLlsLW1xapVq4ptv2TJEnTr1g2TJ09G48aNMXfuXLRs2RJLly6t4sqJiIioOjI35pPn5eXh5MmTmD59umaZmZkZQkNDceTIkWK3OXLkCMLDw3WWhYWFYdu2bcW2z83NRW5uruZxWloaACA9Pb2C1evKzEiHOjcbBTmSwfdNRERU0xV+t0qS9MS2Rg03d+/ehUqlgru7u85yd3d3XLhwodhtkpKSim2flJRUbPvIyEjMnj27yHJfX189qy7dTQBOEZWyayIiohovIyMDTk5OpbYxaripCtOnT9fp6VGr1UhJSUHt2rWhUChK3O7ZZ5/Fn3/+Wa516enp8PX1xc2bN+Ho6Fjx4qtAaa+zuj6Pvvsqz3ZlbVuWdjXlswRUzefJ1D5L5Wn/pHb6rjfFz5Op/W2qyH74t6lsJElCRkYGvLy8ntjWqOHG1dUVSqUSycnJOsuTk5Ph4eFR7DYeHh7lam9lZQUrKyudZc7Ozk+sTalUlvjGlbYOABwdHU3mD8iTXkt1fB5991We7cratiztaspnCaiaz5OpfZbK0/5J7Sq63pQ+T6b2t6ki++HfprJ7Uo9NIaNOKLa0tESrVq0QHR2tWaZWqxEdHY2QkJBitwkJCdFpDwC//fZbie31NW7cOL3WmZqqei2GfB5991We7cratiztaspnCaia12Nqn6XytH9Su4quNyWm9repIvvh3ybDU0hlmZlTiTZt2oRhw4bh22+/RZs2bbB48WJs3rwZFy5cgLu7O4YOHQpvb29ERkYCEIeCd+zYEZ988gl69uyJjRs3Yv78+Th16hSaNWtmzJeC9PR0ODk5IS0tzWT+d0TVEz9LZEj8PJGhmMpnyehzbgYOHIg7d+5g5syZSEpKQlBQEHbv3q2ZNBwXFwczM20H03PPPYcNGzbgo48+wgcffIAGDRpg27ZtRg82gBgCi4iIKDIMRlRe/CyRIfHzRIZiKp8lo/fcEBERERmS0U/iR0RERGRIDDdEREQkKww3REREJCsMN0RERCQrDDdVZPv27WjYsCEaNGiAlStXGrscMnF9+/ZFrVq1MGDAAGOXQibs5s2b6NSpE5o0aYIWLVpgy5Ytxi6JTFRqaipat26NoKAgNGvWDCtWrDBqPTxaqgoUFBSgSZMmiImJgZOTE1q1aoXDhw+jdu3axi6NTNS+ffuQkZGBtWvXYuvWrcYuh0xUYmIikpOTERQUhKSkJLRq1QqXLl2CnZ2dsUsjE6NSqZCbmwtbW1tkZWWhWbNmOHHihNG+59hzUwWOHz+Opk2bwtvbG/b29ujevTt+/fVXY5dFJqxTp05wcHAwdhlk4jw9PREUFARAXNrG1dUVKSkpxi2KTJJSqYStrS0AIDc3F5Iklenq3ZWF4aYMDhw4gF69esHLywsKhQLbtm0r0mbZsmXw9/eHtbU1goODcfz4cc26hIQEeHt7ax57e3sjPj6+KkqnaqiinyeiQob8LJ08eRIqlQq+vr6VXDVVR4b4LKWmpiIwMBA+Pj6YPHkyXF1dq6j6ohhuyiArKwuBgYFYtmxZses3bdqE8PBwRERE4NSpUwgMDERYWBhu375dxZWSKeDniQzFUJ+llJQUDB06FP/5z3+qomyqhgzxWXJ2dsbZs2cRGxuLDRs2FLnIdZWS
qFwASD/++KPOsjZt2kjjxo3TPFapVJKXl5cUGRkpSZIkHTp0SOrTp49m/YQJE6T169dXSb1UvenzeSoUExMj9e/fvyrKJBOg72cpJydHat++vbRu3bqqKpWquYr8XSo0duxYacuWLZVZZqnYc1NBeXl5OHnyJEJDQzXLzMzMEBoaiiNHjgAA2rRpg3PnziE+Ph6ZmZnYtWsXwsLCjFUyVWNl+TwRlUVZPkuSJGH48OF4/vnn8frrrxurVKrmyvJZSk5ORkZGBgAgLS0NBw4cQMOGDY1SL1ANLpxp6u7evQuVSqW50Gchd3d3XLhwAQBgbm6Ozz//HJ07d4ZarcaUKVN4pBQVqyyfJwAIDQ3F2bNnkZWVBR8fH2zZsgUhISFVXS5VY2X5LB06dAibNm1CixYtNHMsvv/+ezRv3ryqy6VqrCyfpRs3buDNN9/UTCR+5513jPo5YripIr1790bv3r2NXQbJxN69e41dAslAu3btoFarjV0GyUCbNm1w5swZY5ehwWGpCnJ1dYVSqSwycSo5ORkeHh5GqopMFT9PZCj8LJGhmOJnieGmgiwtLdGqVStER0drlqnVakRHR3OYgMqNnycyFH6WyFBM8bPEYakyyMzMxJUrVzSPY2NjcebMGbi4uMDPzw/h4eEYNmwYWrdujTZt2mDx4sXIysrCiBEjjFg1VVf8PJGh8LNEhiK7z5LRjtMyITExMRKAIrdhw4Zp2nz11VeSn5+fZGlpKbVp00Y6evSo8Qqmao2fJzIUfpbIUOT2WeK1pYiIiEhWOOeGiIiIZIXhhoiIiGSF4YaIiIhkheGGiIiIZIXhhoiIiGSF4YaIiIhkheGGiIiIZIXhhoiIiGSF4YaIiIhkheGGiIiIZIXhhoiIiGSF4YaIiIhkheGGiIiIZOX/AVVLl4cRM2ZJAAAAAElFTkSuQmCC",
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "sns.ecdfplot(np.array(rankings), log_scale=True)\n",
- "\n",
- "# Calculate desired percentiles\n",
- "percentiles = [25, 50, 80, 90]\n",
- "percentile_values = np.percentile(rankings, percentiles)\n",
- "\n",
- "# Add points for the specified percentiles\n",
- "for percentile, value in zip(percentiles, percentile_values):\n",
- " plt.scatter(value, percentile / 100, color='red', label=f'{percentile}th percentile')\n",
- " plt.annotate(f'{percentile}th percentile = {value:.0f}', \n",
- " xy=(value, percentile / 100), \n",
- " xytext=(value, (percentile / 100) + 0.05), # Position the text slightly above the point\n",
- " arrowprops=dict(arrowstyle='->', color='red'), \n",
- " color='red')\n"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.12.2"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/src/config/paths.py b/src/config/paths.py
index 20ada7a..4964adb 100644
--- a/src/config/paths.py
+++ b/src/config/paths.py
@@ -2,24 +2,21 @@
from src.config.model import EMBEDDING_MODEL
# Project root directory
-# Assumes the script is in src/config/paths.py
+# Assumes this file is in src/config/
ROOT_DIR = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-# File paths
-PATH_TO_SUMMARY = os.path.join(ROOT_DIR, "data/mentor_data_with_summaries.csv")
-PATH_TO_MENTOR_DATA = os.path.join(ROOT_DIR, "data/mentor_data.csv")
-PATH_TO_SUMMARY_DATA = os.path.join(ROOT_DIR, "data/summary_data.csv")
-PATH_TO_MENTOR_DATA_RANKED = os.path.join(
- ROOT_DIR, "data/mentor_data_summaries_ranks.csv"
-)
-PROFESSOR_TYPES_PATH = os.path.join(ROOT_DIR, "data/professor_types.txt")
+# --- Primary Data Paths ---
+DATA_DIR = os.path.join(ROOT_DIR, "data")
+DB_DIR = os.path.join(ROOT_DIR, "db")
-# FAISS index paths (dynamic based on embedding model)
-INDEX_DIR = os.path.join(ROOT_DIR, "db", EMBEDDING_MODEL)
+# The single, canonical CSV file for all mentor data.
+# This file is progressively enriched by the pipeline.
+PATH_TO_MENTOR_DATA = os.path.join(DATA_DIR, "mentor_data.csv")
+
+# --- FAISS Index Path ---
+# The path is dynamic based on the embedding model to avoid mismatches.
+INDEX_DIR = os.path.join(DB_DIR, EMBEDDING_MODEL)
os.makedirs(INDEX_DIR, exist_ok=True)
-INDEX_SUMMARY_WITH_METADATA = os.path.join(INDEX_DIR, "index_summary_with_metadata")
-INDEX_SUMMARY_ASSISTANT_AND_ABOVE = os.path.join(
- INDEX_DIR, "index_summary_assistant_and_above"
-)
-INDEX_SUMMARY_ABOVE_ASSISTANT = os.path.join(INDEX_DIR, "index_summary_above_assistant")
+# The primary FAISS index used for matching.
+INDEX_SUMMARY_WITH_METADATA = os.path.join(INDEX_DIR, "faiss_index")
diff --git a/src/processing/batch.py b/src/processing/batch.py
index 1a3e135..6f8cbe7 100644
--- a/src/processing/batch.py
+++ b/src/processing/batch.py
@@ -1,4 +1,3 @@
-import argparse
import asyncio
import json
import os
@@ -8,35 +7,33 @@
import tiktoken
from src.config.paths import ROOT_DIR
from src.config.client import get_async_openai_client
-from src.config.prompts import mentor_instructions, mentee_instructions
+from src.config.prompts import mentor_instructions
from src.config.model import LLM_MODEL
def truncate_text(text, max_tokens=3000):
+ """Truncates text to a maximum number of tokens."""
+ if not isinstance(text, str):
+ return ""
enc = tiktoken.encoding_for_model(LLM_MODEL)
tokens = enc.encode(text)
if len(tokens) > max_tokens:
truncated_tokens = tokens[:max_tokens]
- truncated_text = enc.decode(truncated_tokens)
- return truncated_text
+ return enc.decode(truncated_tokens)
return text
-def prepare_batch_input(data, instructions, column_name):
- if column_name not in data.columns:
- raise ValueError(
- f"Column '{column_name}' not found in DataFrame with columns: {data.columns}"
- )
-
+def prepare_batch_input(data):
+ """Prepares a list of batch requests for the OpenAI API."""
batch_input = []
for i, row in data.iterrows():
custom_id = f"request-{uuid.uuid4()}"
- message = truncate_text(row[column_name])
+ message = truncate_text(row["Mentor_Data"])
body = {
"model": LLM_MODEL,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
- {"role": "user", "content": f"{instructions}\n{message}"},
+ {"role": "user", "content": f"{mentor_instructions}\n{message}"},
],
"max_tokens": 1000,
}
@@ -47,154 +44,82 @@ def prepare_batch_input(data, instructions, column_name):
"body": body,
}
batch_input.append(request)
-
return batch_input
-def save_batch_input(batch_input, file_path):
- with open(file_path, "w") as f:
+async def submit_and_wait_for_batch(client, batch_input):
+ """Submits a batch job and waits for its completion."""
+ # Save batch input to a temporary file
+ input_file_path = os.path.join(ROOT_DIR, "data", "mentor_batch_input.jsonl")
+ with open(input_file_path, "w") as f:
for item in batch_input:
f.write(json.dumps(item) + "\n")
+ # Upload the file by passing an open binary file handle to the client
+ with open(input_file_path, "rb") as f:
+ batch_input_file = await client.files.create(file=f, purpose="batch")
-async def submit_batch_job(client, input_file_path):
- async with aiofiles.open(input_file_path, "rb") as file:
- batch_input_file = await client.files.create(
- file=await file.read(), purpose="batch"
- )
- batch_input_file_id = batch_input_file.id
-
- return await client.batches.create(
- input_file_id=batch_input_file_id,
+ # Create the batch job
+ batch = await client.batches.create(
+ input_file_id=batch_input_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
- metadata={"description": "test batch job"},
)
+ print(f"Submitted batch job with ID: {batch.id}")
-
-async def check_batch_status(client, batch_id):
+ # Wait for the batch to complete
while True:
- status = await client.batches.retrieve(batch_id)
+ status = await client.batches.retrieve(batch.id)
print(f"Current batch status: {status.status}")
- if status.status in ["completed", "failed"]:
+ if status.status in ["completed", "failed", "cancelled", "expired"]:
break
await asyncio.sleep(30)
+
+ os.remove(input_file_path) # Clean up temp input file
return status
-async def download_batch_results(client, status, output_file_path):
- if hasattr(status, "output_file_id") and status.output_file_id:
- file_response = await client.files.content(status.output_file_id)
- async with aiofiles.open(output_file_path, "w") as json_file:
- await json_file.write(file_response.text)
- else:
- if hasattr(status, "error_file_id") and status.error_file_id:
- error_response = await client.files.content(status.error_file_id)
- async with aiofiles.open(
- output_file_path.replace(".jsonl", "_error.jsonl"), "w"
- ) as json_file:
- await json_file.write(error_response.text)
- raise ValueError("Batch job failed. Error details saved to the error file.")
- else:
- raise ValueError(
- "Batch job did not produce an output file ID. Check the batch job status and input data."
- )
+async def get_batch_results(client, status):
+ """Downloads and processes batch results."""
+ if status.status != "completed" or not status.output_file_id:
+ raise ValueError(f"Batch job did not complete successfully or produced no output file. Status: {status.status}")
+ file_response = await client.files.content(status.output_file_id)
-async def process_batch_results(file_path):
summaries = []
- async with aiofiles.open(file_path, "r") as f:
- async for line in f:
- result = json.loads(line)
- if (
- "response" in result
- and "body" in result["response"]
- and "choices" in result["response"]["body"]
- ):
- summary = result["response"]["body"]["choices"][0]["message"][
- "content"
- ].strip()
- summaries.append(summary)
- else:
- error_info = {
- "id": result.get("id"),
- "custom_id": result.get("custom_id"),
- "error": result.get("error", "No choices key in response body"),
- }
- print(f"Error processing result: {error_info}")
- summaries.append("Error: Unable to generate summary for this entry.")
+ results_data = file_response.text.strip().split("\n")
+ for line in results_data:
+ result = json.loads(line)
+ if (
+ "response" in result
+ and "body" in result["response"]
+ and "choices" in result["response"]["body"]
+ ):
+ summary = result["response"]["body"]["choices"][0]["message"][
+ "content"
+ ].strip()
+ summaries.append(summary)
+ else:
+ summaries.append("Error: Unable to generate summary.")
return summaries
-async def summarize_cvs(
- input_file_path,
- output_file_path,
- role="mentor",
- column_name="Mentor_Data",
-):
+async def summarize_cvs(df: pd.DataFrame) -> pd.DataFrame:
"""
- Summarizes CVs from an input CSV file and saves them to an output file.
+ Adds a 'Mentor_Summary' column to the DataFrame by summarizing 'Mentor_Data'.
"""
client = get_async_openai_client()
- data = pd.read_csv(input_file_path)
- instructions = mentor_instructions if role == "mentor" else mentee_instructions
- batch_input = prepare_batch_input(data, instructions, column_name)
+ batch_input = prepare_batch_input(df)
- batch_input_file_path = os.path.join(ROOT_DIR, "data", f"{role}_batch_input.jsonl")
- save_batch_input(batch_input, batch_input_file_path)
+ status = await submit_and_wait_for_batch(client, batch_input)
- batch = await submit_batch_job(client, batch_input_file_path)
+ summaries = await get_batch_results(client, status)
- print(f"Batch ID: {batch.id}")
-
- status = await check_batch_status(client, batch.id)
-
- if not hasattr(status, "output_file_id") or not status.output_file_id:
- print(f"Batch details: {status}")
+ if len(summaries) != len(df):
raise ValueError(
- "Batch job did not produce an output file ID. Check the batch job status and input data."
+ f"Number of summaries ({len(summaries)}) does not match number of mentors ({len(df)})."
)
- batch_output_file_path = os.path.join(
- ROOT_DIR, "data", f"{role}_batch_output.jsonl"
- )
- await download_batch_results(client, status, batch_output_file_path)
-
- summaries = await process_batch_results(batch_output_file_path)
-
- # Create a new DataFrame for summaries
- summary_df = pd.DataFrame(summaries, columns=[f"{role.capitalize()}_Summary"])
-
- # Merge the original data with the summaries
- # This assumes the summaries are in the same order as the original data
- merged_df = pd.concat([data, summary_df], axis=1)
-
- merged_df.to_csv(output_file_path, sep="\t", index=False)
- print(f"Summarized CVs saved to {output_file_path}")
-
-
-def main():
- parser = argparse.ArgumentParser(
- description="Preprocess and summarize documents in batch."
- )
- parser.add_argument(
- "--in", dest="input_file", required=True, help="Input CSV file to process"
- )
- parser.add_argument("--out", required=True, help="Output CSV file for summaries")
- parser.add_argument(
- "--role", choices=["mentor", "mentee"], default="mentor", help="Summary type"
- )
- parser.add_argument(
- "--col",
- dest="column_name",
- default="Mentor_Data",
- help="Name of the column to summarize",
- )
- args = parser.parse_args()
-
- asyncio.run(summarize_cvs(args.input_file, args.out, args.role, args.column_name))
-
-
-if __name__ == "__main__":
- main()
+ df["Mentor_Summary"] = summaries
+ return df
diff --git a/src/retrieval/build_index.py b/src/retrieval/build_index.py
index dcc0cf0..7d3f3eb 100644
--- a/src/retrieval/build_index.py
+++ b/src/retrieval/build_index.py
@@ -1,116 +1,38 @@
import pandas as pd
-import os
-from dotenv import load_dotenv
-from langchain_openai import OpenAIEmbeddings, ChatOpenAI
+from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
-from src.utils import find_professor_type, rank_professors
from src.config import paths
-from src.config.model import LLM_MODEL, EMBEDDING_MODEL
+from src.config.model import EMBEDDING_MODEL
-def build_index():
+def build_index(df: pd.DataFrame):
"""
- Builds and saves FAISS vector stores from mentor data.
- If the ranked data file already exists, it uses it. Otherwise, it creates it first.
+ Builds and saves a FAISS vector store from the mentor DataFrame.
"""
- load_dotenv()
- llm = ChatOpenAI(model=LLM_MODEL)
-
- if os.path.exists(paths.PATH_TO_MENTOR_DATA_RANKED):
- print(f"Loading existing ranked data from {paths.PATH_TO_MENTOR_DATA_RANKED}")
- merged_df = pd.read_csv(paths.PATH_TO_MENTOR_DATA_RANKED, sep="\t")
- else:
- print("Ranked data not found. Creating it from summaries...")
- summary_df = pd.read_csv(paths.PATH_TO_SUMMARY, sep="\t")
-
- # Add Professor_Type
- summary_df["Professor_Type"] = [
- find_professor_type(text) for text in summary_df["Mentor_Data"].fillna("")
- ]
-
- # Add Rank
- merged_df = rank_professors(summary_df)
-
- # Save the ranked data
- merged_df.to_csv(paths.PATH_TO_MENTOR_DATA_RANKED, sep="\t", index=False)
- print(f"Saved ranked mentor data to {paths.PATH_TO_MENTOR_DATA_RANKED}")
-
- # Ensure we have only the required columns
- merged_df = merged_df[
- ["Mentor_Data", "Mentor_Profile", "Mentor_Summary", "Professor_Type", "Rank"]
- ]
-
- # Create documents for assistant professors and above (Rank >= 1)
- docs_assistant_and_above = [
- p + "\n=====\n" + s
- for p, s, r in zip(
- merged_df["Mentor_Profile"].values,
- merged_df["Mentor_Summary"].values,
- merged_df["Rank"].values,
- )
- if r >= 1
- ]
-
- # Create documents for ranks higher than assistant professor (Rank > 1)
- docs_above_assistant = [
- p + "\n=====\n" + s
- for p, s, r in zip(
- merged_df["Mentor_Profile"].values,
- merged_df["Mentor_Summary"].values,
- merged_df["Rank"].values,
+ # Ensure required columns are present
+ required_cols = ["Mentor_Summary", "Mentor_Profile", "Professor_Type", "Rank"]
+ if not all(col in df.columns for col in required_cols):
+ raise ValueError(
+ f"DataFrame must contain the following columns: {required_cols}"
)
- if r > 1
- ]
+ # Create documents with metadata
docs_with_metadata = []
- # Create documents with metadata for both collections
- for _, row in merged_df.iterrows():
- # Create metadata dictionary
+ for _, row in df.iterrows():
doc_metadata = {
"Mentor_Profile": row["Mentor_Profile"],
"Professor_Type": row["Professor_Type"],
"Rank": row["Rank"],
}
-
- # Create document with page_content as Mentor_Summary and the metadata
doc = Document(page_content=row["Mentor_Summary"], metadata=doc_metadata)
- # Append to the list
docs_with_metadata.append(doc)
- # Create vector stores
+ # Create and save the vector store
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
- vector_store_docs_with_metadata = FAISS.from_documents(
+ vector_store = FAISS.from_documents(
documents=docs_with_metadata, embedding=embeddings
)
- vector_store_assistant_and_above = FAISS.from_texts(
- texts=docs_assistant_and_above, embedding=embeddings
- )
- vector_store_above_assistant = FAISS.from_texts(
- texts=docs_above_assistant, embedding=embeddings
- )
-
- # Create retrievers
- retriever_docs_with_metadata = vector_store_docs_with_metadata.as_retriever()
- retriever_assistant_and_above = vector_store_assistant_and_above.as_retriever()
- retriever_above_assistant = vector_store_above_assistant.as_retriever()
-
- # Save vector stores
- vector_store_docs_with_metadata.save_local(paths.INDEX_SUMMARY_WITH_METADATA)
- vector_store_assistant_and_above.save_local(paths.INDEX_SUMMARY_ASSISTANT_AND_ABOVE)
- vector_store_above_assistant.save_local(paths.INDEX_SUMMARY_ABOVE_ASSISTANT)
-
- print("Vector stores created and saved successfully.")
-
- return (
- vector_store_assistant_and_above,
- retriever_assistant_and_above,
- vector_store_above_assistant,
- retriever_above_assistant,
- vector_store_docs_with_metadata,
- retriever_docs_with_metadata,
- )
-
+ vector_store.save_local(paths.INDEX_SUMMARY_WITH_METADATA)
-if __name__ == "__main__":
- build_index()
+ print("Vector store created and saved successfully.")
diff --git a/src/utils.py b/src/utils.py
index 70841ad..16814a2 100644
--- a/src/utils.py
+++ b/src/utils.py
@@ -1,5 +1,6 @@
import re
import pandas as pd
+import os
def extract_and_format_name(mentor_data):
@@ -16,30 +17,18 @@ def clean_summary(summary):
return cleaned.strip()
-# use this to add a Professor_Type metadata column in the .csv file; allows us to search for
-# only professors of a specific typke
-import os
-from src.config.paths import PROFESSOR_TYPES_PATH
-
-
def get_professor_titles():
- """Reads a list of professor titles from the configuration file."""
- if not os.path.exists(PROFESSOR_TYPES_PATH):
- print(
- f"Warning: Professor types file not found at {PROFESSOR_TYPES_PATH}. Using default list."
- )
- return [
- "Chair",
- "Distinguished Professor",
- "Professor",
- "Associate Professor",
- "Assistant Professor",
- "Adjunct Professor",
- "Instructor",
- "Clinical Professor",
- ]
- with open(PROFESSOR_TYPES_PATH, "r") as f:
- return [line.strip() for line in f if line.strip()]
+ """Returns a hardcoded list of professor titles for ranking."""
+ return [
+ "Chair",
+ "Distinguished Professor",
+ "Professor",
+ "Associate Professor",
+ "Assistant Professor",
+ "Adjunct Professor",
+ "Instructor",
+ "Clinical Professor",
+ ]
def find_professor_type(mentor_data):
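Review note: several titles in the hardcoded list contain one another ("Professor" is a substring of "Associate Professor"), so a lookup built on the list probably needs to try longer titles first. A minimal sketch of that idea follows. `match_title` is a hypothetical helper for illustration only; it is not the repository's `find_professor_type`, whose body is not shown in this hunk.

```python
import re

# Same titles as the hardcoded list returned by get_professor_titles().
PROFESSOR_TITLES = [
    "Chair",
    "Distinguished Professor",
    "Professor",
    "Associate Professor",
    "Assistant Professor",
    "Adjunct Professor",
    "Instructor",
    "Clinical Professor",
]


def match_title(text, titles=PROFESSOR_TITLES):
    """Return the longest title found in `text`, or None (hypothetical helper)."""
    # Longest first, so "Assistant Professor" wins over the bare "Professor".
    for title in sorted(titles, key=len, reverse=True):
        if re.search(rf"\b{re.escape(title)}\b", text, flags=re.IGNORECASE):
            return title
    return None
```

Sorting by length is one simple way to avoid the substring problem; an explicit priority order would work equally well if the list is meant to encode seniority.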
diff --git a/static/css/main.css b/static/css/main.css
deleted file mode 100644
index 592deba..0000000
--- a/static/css/main.css
+++ /dev/null
@@ -1,27 +0,0 @@
-.gradio-container {
- max-width: 100% !important;
-}
-
-h1 {
- text-align: center;
- margin-bottom: 20px;
-}
-
-
-.input-column, .summary-column {
- min-height: 300px;
-}
-
-.mentor-table {
- margin-top: 20px;
- margin-bottom: 20px;
- max-height: 400px;
- overflow-y: auto;
-}
-
-.download-row {
- display: flex;
- justify-content: flex-start;
- align-items: center;
- margin-top: 10px;
-}
\ No newline at end of file
diff --git a/static/css/mentor_table_styles.css b/static/css/mentor_table_styles.css
deleted file mode 100644
index 0675e7e..0000000
--- a/static/css/mentor_table_styles.css
+++ /dev/null
@@ -1,177 +0,0 @@
-/* Base styles for common elements */
-.table-container {
- max-height: 500px;
- overflow-y: auto;
-}
-
-.mentor-table {
- width: 100%;
- border-collapse: collapse;
- font-family: Arial, sans-serif;
-}
-
-.mentor-table th, .mentor-table td {
- padding: 12px;
- text-align: left;
- border: 1px solid;
-}
-
-.mentor-table th {
- font-weight: bold;
- position: sticky;
- top: 0;
- z-index: 1;
-}
-
-.mentor-name {
- width: 15%;
- font-weight: bold;
- cursor: pointer;
- position: relative;
-}
-
-.mentor-name:hover::after {
- content: attr(data-score);
- position: absolute;
- top: 100%;
- left: 0;
- padding: 5px 8px;
- border-radius: 4px;
- z-index: 2;
- white-space: nowrap;
- font-size: 0.9em;
- box-shadow: 0 2px 4px rgba(0,0,0,0.4);
-}
-
-.mentor-summary, .evaluation-summary {
- width: 25%;
-}
-
-.summary-content {
- max-height: 150px;
- overflow-y: auto;
- padding-right: 10px;
-}
-
-.criterion-score {
- width: 7%;
- text-align: center;
-}
-
-.overall-score {
- font-weight: bold;
-}
-
-/* Light mode styles */
-@media (prefers-color-scheme: light) {
- .table-container {
- background-color: #ffffff;
- }
-
- .mentor-table {
- color: #333333;
- }
-
- .mentor-table th, .mentor-table td {
- border-color: #dddddd;
- }
-
- .mentor-table th {
- background-color: #f5f5f5;
- }
-
- .mentor-name:hover::after {
- background-color: #f0f0f0;
- color: #333333;
- }
-
- .mentor-table tr:nth-child(even) {
- background-color: #f9f9f9;
- }
-
- .mentor-table tr:hover {
- background-color: #e9e9e9;
- }
-
- .table-container::-webkit-scrollbar-track,
- .summary-content::-webkit-scrollbar-track {
- background: #f0f0f0;
- }
-
- .table-container::-webkit-scrollbar-thumb,
- .summary-content::-webkit-scrollbar-thumb {
- background: #cccccc;
- }
-
- .table-container::-webkit-scrollbar-thumb:hover,
- .summary-content::-webkit-scrollbar-thumb:hover {
- background: #bbbbbb;
- }
-}
-
-/* Dark mode styles */
-@media (prefers-color-scheme: dark) {
- .table-container {
- background-color: #1e1e1e;
- }
-
- .mentor-table {
- color: #e0e0e0;
- }
-
- .mentor-table th, .mentor-table td {
- border-color: #333333;
- }
-
- .mentor-table th {
- background-color: #2c2c2c;
- }
-
- .mentor-name:hover::after {
- background-color: #4a4a4a;
- color: #ffffff;
- }
-
- .mentor-table tr:nth-child(even) {
- background-color: #252525;
- }
-
- .mentor-table tr:hover {
- background-color: #303030;
- }
-
- .table-container::-webkit-scrollbar-track,
- .summary-content::-webkit-scrollbar-track {
- background: #2c2c2c;
- }
-
- .table-container::-webkit-scrollbar-thumb,
- .summary-content::-webkit-scrollbar-thumb {
- background: #555555;
- }
-
- .table-container::-webkit-scrollbar-thumb:hover,
- .summary-content::-webkit-scrollbar-thumb:hover {
- background: #666666;
- }
-}
-
-/* Common scrollbar styles for Firefox */
-.table-container,
-.summary-content {
- scrollbar-width: thin;
-}
-
-@media (prefers-color-scheme: light) {
- .table-container,
- .summary-content {
- scrollbar-color: #cccccc #f0f0f0;
- }
-}
-
-@media (prefers-color-scheme: dark) {
- .table-container,
- .summary-content {
- scrollbar-color: #555555 #2c2c2c;
- }
-}
\ No newline at end of file
diff --git a/templates/mentor_table_template.html b/templates/mentor_table_template.html
deleted file mode 100644
index 0ebc9be..0000000
--- a/templates/mentor_table_template.html
+++ /dev/null
@@ -1,11 +0,0 @@
-
-
-
-
-
- Matching Mentors
-
-
- {table_content}
-
-
\ No newline at end of file
diff --git a/tests/integration/test_integration.py b/tests/integration/test_integration.py
index 7591710..2f8f6f9 100644
--- a/tests/integration/test_integration.py
+++ b/tests/integration/test_integration.py
@@ -16,49 +16,43 @@
@pytest.fixture
def setup_test_environment(tmp_path):
- """Creates a temporary directory structure and patches all file paths for sandboxing."""
+ """Creates a temporary directory structure and patches paths for sandboxing."""
# Define and create temporary paths
mentors_dir = tmp_path / "mentors"
mentees_dir = tmp_path / "mentees"
- output_dir = tmp_path / "output"
data_dir = tmp_path / "data"
db_dir = tmp_path / "db"
index_dir = db_dir / "test-embedding-model"
- for d in [mentors_dir, mentees_dir, output_dir, data_dir, db_dir, index_dir]:
+ for d in [mentors_dir, mentees_dir, data_dir, db_dir, index_dir]:
d.mkdir(exist_ok=True)
# Create dummy input files with text long enough to pass validation
- mentor1_text = "PRIYA PATEL Title: Assistant Professor. Her research focuses on the application of machine learning to surgical outcomes and developing new AI-driven diagnostic tools. She has extensive experience in Python, TensorFlow, and clinical data analysis."
- mentor2_text = "SOPHIA HALL Title: Professor. Her lab works on natural language processing and large language models. They are particularly interested in ethical AI and developing fair and unbiased algorithms. Looking for students with strong programming skills."
+ mentor1_text = "PRIYA PATEL, Title: Professor. Her research focuses on the application of machine learning to surgical outcomes and developing new AI-driven diagnostic tools. She has extensive experience in Python, TensorFlow, and clinical data analysis. Seeking motivated students."
+ mentor2_text = "SOPHIA HALL, Title: Assistant Professor. Her lab works on natural language processing and large language models. They are particularly interested in ethical AI and developing fair and unbiased algorithms. Looking for students with strong programming skills and a passion for NLP."
(mentors_dir / "mentor1.txt").write_text(mentor1_text)
(mentors_dir / "mentor2.txt").write_text(mentor2_text)
- (mentees_dir / "mentee1.txt").write_text("A mentee interested in AI.")
- # Patch all path variables
+ # Setup mentee directory
+ mentee1_dir = mentees_dir / "mentee1@test.com"
+ mentee1_dir.mkdir()
+ (mentee1_dir / "mentee1_cv.txt").write_text("A mentee interested in AI and NLP.")
+ (mentee1_dir / "mentee1.json").write_text(
+ json.dumps(
+ {
+ "first_name": "Test",
+ "last_name": "Mentee",
+ "research_Interest": ["AI", "NLP"],
+ "submissions_files": ["mentee1_cv.txt"],
+ }
+ )
+ )
+
+ # Patch path variables
paths_to_patch = {
"main.PATH_TO_MENTOR_DATA": str(data_dir / "mentor_data.csv"),
- "main.PATH_TO_SUMMARY": str(data_dir / "mentor_data_with_summaries.csv"),
- "main.PATH_TO_MENTOR_DATA_RANKED": str(
- data_dir / "mentor_data_summaries_ranks.csv"
- ),
- "main.INDEX_SUMMARY_WITH_METADATA": str(
- index_dir / "index_summary_with_metadata"
- ),
- "main.ROOT_DIR": str(tmp_path),
- "src.retrieval.build_index.paths.PATH_TO_SUMMARY": str(
- data_dir / "mentor_data_with_summaries.csv"
- ),
- "src.retrieval.build_index.paths.PATH_TO_MENTOR_DATA_RANKED": str(
- data_dir / "mentor_data_summaries_ranks.csv"
- ),
+ "main.INDEX_SUMMARY_WITH_METADATA": str(index_dir / "faiss_index"),
"src.retrieval.build_index.paths.INDEX_SUMMARY_WITH_METADATA": str(
- index_dir / "index_summary_with_metadata"
- ),
- "src.retrieval.build_index.paths.INDEX_SUMMARY_ASSISTANT_AND_ABOVE": str(
- index_dir / "index_summary_assistant_and_above"
- ),
- "src.retrieval.build_index.paths.INDEX_SUMMARY_ABOVE_ASSISTANT": str(
- index_dir / "index_summary_above_assistant"
+ index_dir / "faiss_index"
),
}
patchers = [patch(p, v) for p, v in paths_to_patch.items()]
@@ -68,7 +62,7 @@ def setup_test_environment(tmp_path):
"mentors_dir": str(mentors_dir),
"mentees_dir": str(mentees_dir),
"data_dir": str(data_dir),
- "db_dir": str(db_dir),
+ "mentor_data_path": paths_to_patch["main.PATH_TO_MENTOR_DATA"],
}
for p in patchers:
p.stop()
@@ -80,23 +74,22 @@ def mock_openai_embeddings(*args, **kwargs):
@pytest.mark.asyncio
@patch("main.summarize_cvs", new_callable=AsyncMock)
-@patch("src.retrieval.build_index.find_professor_type", return_value="Professor")
-@patch("src.retrieval.build_index.OpenAIEmbeddings", mock_openai_embeddings)
@patch("main.OpenAIEmbeddings", mock_openai_embeddings)
-async def test_data_pipeline_creates_files(
- mock_find_professor_type, mock_summarize_cvs, setup_test_environment
+@patch("src.retrieval.build_index.OpenAIEmbeddings", mock_openai_embeddings)
+async def test_data_pipeline_creates_and_enriches_single_csv(
+ mock_summarize_cvs, setup_test_environment
):
- """Tests that the data processing pipeline creates all the necessary intermediate files."""
+ """Tests that the pipeline creates and enriches a single mentor_data.csv."""
env = setup_test_environment
- async def mock_summarize_impl(input_path, output_path):
- df = pd.read_csv(input_path)
+ # Mock the summarization to add the 'Mentor_Summary' column
+ async def mock_summarize_impl(df):
df["Mentor_Summary"] = "Mocked Summary"
- df.to_csv(output_path, index=False, sep="\t")
+ return df
mock_summarize_cvs.side_effect = mock_summarize_impl
- # Run the pipeline up to the point of matching
+ # Run the full pipeline
await main_pipeline(
mentee_dir=env["mentees_dir"],
mentor_resume_dir=env["mentors_dir"],
@@ -104,14 +97,17 @@ async def mock_summarize_impl(input_path, output_path):
overwrite=True,
)
- # Assert that the key data files were created in the temp directory
- assert os.path.exists(os.path.join(env["data_dir"], "mentor_data.csv"))
- assert os.path.exists(
- os.path.join(env["data_dir"], "mentor_data_with_summaries.csv")
- )
- assert os.path.exists(
- os.path.join(env["data_dir"], "mentor_data_summaries_ranks.csv")
- )
+ # Assert that the single CSV was created and enriched
+ mentor_data_path = env["mentor_data_path"]
+ assert os.path.exists(mentor_data_path)
+
+ # Check the content of the final CSV
+ df = pd.read_csv(mentor_data_path, sep="\t")
+ assert "Mentor_Summary" in df.columns
+ assert "Professor_Type" in df.columns
+ assert "Rank" in df.columns
+ assert df.shape[0] == 2 # Two mentors were processed
+ assert pd.api.types.is_numeric_dtype(df["Rank"]) # Check for any numeric type
@pytest.mark.asyncio
@@ -131,10 +127,9 @@ async def test_matching_logic(
):
"""Tests the matching and evaluation logic with a fake, in-memory FAISS index."""
env = setup_test_environment
- mentee_cv_path = os.path.join(env["mentees_dir"], "mentee1", "mentee1.txt")
- os.makedirs(os.path.dirname(mentee_cv_path), exist_ok=True)
- with open(mentee_cv_path, "w") as f:
- f.write("A mentee interested in AI.")
+ mentee_cv_path = os.path.join(
+ env["mentees_dir"], "mentee1@test.com", "mentee1_cv.txt"
+ )
# Create a fake in-memory vector store
documents = [
@@ -162,5 +157,5 @@ async def test_matching_logic(
assert result["mentee_name"] == "Test Mentee"
assert len(result["matches"]) == 1
assert result["matches"][0]["Criterion Scores"]["Overall Match Quality"] == 9.5
- assert result["mentee_email"] == "mentee1"
+ assert result["mentee_email"] == "mentee1@test.com"
assert result["mentee_preferences"] == mentee_preferences
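Review note: the fixture now models each mentee as a directory named after the mentee's email, holding a JSON metadata file plus the files listed under `submissions_files`. The sketch below shows one way such a directory could be read, assuming exactly that layout. `load_mentee` and the keys of its return value are hypothetical names chosen to mirror the assertions above; the real parsing lives in `main.py` and is not shown in this diff.

```python
import json
from pathlib import Path


def load_mentee(mentee_dir):
    """Read one mentee directory laid out like the test fixture (hypothetical helper)."""
    mentee_dir = Path(mentee_dir)
    json_files = sorted(mentee_dir.glob("*.json"))
    if not json_files:
        raise FileNotFoundError(f"No mentee JSON found in {mentee_dir}")

    info = json.loads(json_files[0].read_text())

    # Concatenate every submitted file listed in the JSON metadata.
    cv_text = "\n".join(
        (mentee_dir / name).read_text() for name in info.get("submissions_files", [])
    )
    return {
        "mentee_name": f"{info['first_name']} {info['last_name']}",
        # Assumption: the directory name is the mentee's email address.
        "mentee_email": mentee_dir.name,
        "research_interest": info.get("research_Interest", []),
        "cv_text": cv_text,
    }
```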
diff --git a/tests/unit/test_build_index.py b/tests/unit/test_build_index.py
index 3c3733f..9904f6b 100644
--- a/tests/unit/test_build_index.py
+++ b/tests/unit/test_build_index.py
@@ -1,100 +1,100 @@
import pytest
-import os
import pandas as pd
-from unittest.mock import MagicMock, patch, mock_open
+from unittest.mock import MagicMock, patch
import sys
+import os
# Add the project root to the Python path
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..")))
+from src.retrieval.build_index import build_index
+from langchain_core.documents import Document
+
-# Mock the paths module before importing build_index
@pytest.fixture
-def mock_paths_fixture(tmp_path):
- mock_root_dir = tmp_path
- mock_data_dir = mock_root_dir / "data"
- mock_db_dir = mock_root_dir / "db"
- mock_data_dir.mkdir()
- mock_db_dir.mkdir()
-
- paths_dict = {
- "ROOT_DIR": str(mock_root_dir),
- "PATH_TO_SUMMARY": str(mock_data_dir / "mentor_data_with_summaries.csv"),
- "PATH_TO_MENTOR_DATA": str(mock_data_dir / "mentor_data.csv"),
- "PATH_TO_MENTOR_DATA_RANKED": str(
- mock_data_dir / "mentor_data_summaries_ranks.csv"
- ),
- "PROFESSOR_TYPES_PATH": str(mock_data_dir / "professor_types.txt"),
- "INDEX_SUMMARY_WITH_METADATA": str(mock_db_dir / "index_summary_with_metadata"),
- "INDEX_SUMMARY_ASSISTANT_AND_ABOVE": str(
- mock_db_dir / "index_summary_assistant_and_above"
- ),
- "INDEX_SUMMARY_ABOVE_ASSISTANT": str(
- mock_db_dir / "index_summary_above_assistant"
- ),
- }
-
- with patch(
- "src.retrieval.build_index.paths", MagicMock(**paths_dict)
- ) as mock_paths:
- yield mock_paths
-
-
-@patch("src.retrieval.build_index.load_dotenv")
-@patch("src.retrieval.build_index.ChatOpenAI")
-@patch("src.retrieval.build_index.OpenAIEmbeddings")
-@patch("src.retrieval.build_index.FAISS")
-@patch("src.retrieval.build_index.pd.read_csv")
-@patch("src.retrieval.build_index.os.path.exists")
-@patch("builtins.open", new_callable=mock_open)
-def test_main_build_index_flow_with_existing_ranked_data(
- mock_open_file,
- mock_os_path_exists,
- mock_read_csv,
- mock_faiss,
- mock_embeddings,
- mock_chat_openai,
- mock_load_dotenv,
- mock_paths_fixture,
-):
- # Arrange
- mock_os_path_exists.return_value = True
+def mock_paths(tmp_path):
+ """Fixture to mock the paths used in the build_index module."""
+ with patch("src.retrieval.build_index.paths") as mock_paths_patch:
+ mock_paths_patch.INDEX_SUMMARY_WITH_METADATA = str(tmp_path / "test_index")
+ yield mock_paths_patch
- ranked_df = pd.DataFrame(
+
+@pytest.fixture
+def sample_mentor_df():
+ """Fixture to create a sample mentor DataFrame for testing."""
+ return pd.DataFrame(
{
- "Mentor_Data": ["mentor1", "mentor2", "mentor3"],
- "Mentor_Profile": ["profile1", "profile2", "profile3"],
- "Mentor_Summary": ["summary1", "summary2", "summary3"],
- "Professor_Type": [
- "Professor",
- "Associate Professor",
- "Assistant Professor",
+ "Mentor_Summary": [
+ "Summary of a great mentor.",
+ "Summary of another mentor.",
],
- "Rank": [3, 2, 1],
+ "Mentor_Profile": ["profile1.pdf", "profile2.pdf"],
+ "Professor_Type": ["Professor", "Assistant Professor"],
+ "Rank": [3.0, 1.0],
}
)
- mock_read_csv.return_value = ranked_df
- mock_faiss_instance = MagicMock()
- mock_faiss.from_documents.return_value = mock_faiss_instance
- mock_faiss.from_texts.return_value = mock_faiss_instance
- # Act
- from src.retrieval.build_index import build_index
+@patch("src.retrieval.build_index.FAISS")
+@patch("src.retrieval.build_index.OpenAIEmbeddings")
+def test_build_index_creates_and_saves_vector_store(
+ mock_openai_embeddings, mock_faiss, sample_mentor_df, mock_paths
+):
+ """
+ Tests that build_index correctly processes a DataFrame, creates Documents,
+ initializes an embedding model, and creates and saves a FAISS vector store.
+ """
+ # Arrange
+ mock_embedding_instance = MagicMock()
+ mock_openai_embeddings.return_value = mock_embedding_instance
+
+ mock_vector_store_instance = MagicMock()
+ mock_faiss.from_documents.return_value = mock_vector_store_instance
- build_index()
+ # Act
+ build_index(sample_mentor_df)
# Assert
- mock_load_dotenv.assert_called_once()
- mock_chat_openai.assert_called_once()
- mock_os_path_exists.assert_called_once_with(
- mock_paths_fixture.PATH_TO_MENTOR_DATA_RANKED
+ # 1. Check if OpenAIEmbeddings was initialized correctly
+ mock_openai_embeddings.assert_called_once()
+
+ # 2. Check if FAISS.from_documents was called
+ mock_faiss.from_documents.assert_called_once()
+
+ # 3. Verify the structure of the documents passed to FAISS
+ call_args = mock_faiss.from_documents.call_args
+ passed_documents = call_args.kwargs["documents"]
+ assert len(passed_documents) == 2
+ assert isinstance(passed_documents[0], Document)
+ assert passed_documents[0].page_content == "Summary of a great mentor."
+ assert passed_documents[0].metadata["Rank"] == 3.0
+ assert passed_documents[1].page_content == "Summary of another mentor."
+ assert passed_documents[1].metadata["Professor_Type"] == "Assistant Professor"
+
+ # 4. Verify the correct embedding model was used
+ assert call_args.kwargs["embedding"] == mock_embedding_instance
+
+ # 5. Check if the vector store was saved to the correct path
+ mock_vector_store_instance.save_local.assert_called_once_with(
+ mock_paths.INDEX_SUMMARY_WITH_METADATA
)
- mock_read_csv.assert_called_once_with(
- mock_paths_fixture.PATH_TO_MENTOR_DATA_RANKED, sep="\t"
+
+
+def test_build_index_raises_error_on_missing_columns():
+ """
+ Tests that build_index raises a ValueError if the input DataFrame
+ is missing any of the required columns.
+ """
+ # Arrange
+ incomplete_df = pd.DataFrame(
+ {
+ "Mentor_Summary": ["A summary"],
+ # Missing "Mentor_Profile", "Professor_Type", "Rank"
+ }
)
- mock_embeddings.assert_called_once()
- assert mock_faiss.from_documents.call_count == 1
- assert mock_faiss.from_texts.call_count == 2
- assert mock_faiss_instance.save_local.call_count == 3
+ # Act & Assert
+ with pytest.raises(ValueError) as excinfo:
+ build_index(incomplete_df)
+
+ assert "must contain the following columns" in str(excinfo.value)
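Review note: the two tests above pin down a contract for `build_index`: reject input that lacks the required columns, then turn each row into a document whose content is the summary and whose metadata carries the remaining fields. The sketch below illustrates that contract with the standard library only. `Doc` is a stand-in for `langchain_core.documents.Document`, `rows_to_docs` is a hypothetical name, and the exact error wording beyond "must contain the following columns" is an assumption.

```python
from dataclasses import dataclass, field

# Columns named in the tests for build_index.
REQUIRED_COLUMNS = ["Mentor_Summary", "Mentor_Profile", "Professor_Type", "Rank"]


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (illustration only)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(rows):
    """Validate rows (a list of dicts) and convert them to Doc objects."""
    for row in rows:
        missing = [col for col in REQUIRED_COLUMNS if col not in row]
        if missing:
            raise ValueError(
                f"Input must contain the following columns: {REQUIRED_COLUMNS} "
                f"(missing: {missing})"
            )
    # The summary becomes the embedded text; everything else is metadata.
    return [
        Doc(
            page_content=row["Mentor_Summary"],
            metadata={col: row[col] for col in REQUIRED_COLUMNS if col != "Mentor_Summary"},
        )
        for row in rows
    ]
```

Validating before building any document keeps the failure cheap: nothing is embedded or written to disk when the input is malformed.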
diff --git a/tests/unit/test_paths.py b/tests/unit/test_paths.py
index eb007f6..0d9beca 100644
--- a/tests/unit/test_paths.py
+++ b/tests/unit/test_paths.py
@@ -9,56 +9,36 @@
from src.config.model import EMBEDDING_MODEL
-def test_root_dir():
- # This test assumes that the ROOT_DIR is correctly set to the project root
- # which is two levels up from the src/config directory.
+def test_root_dir_is_correct():
+ """Tests that ROOT_DIR is correctly pointing to the project's root."""
+ # This assumes the test is run from within the project structure.
+ # The project root is two levels up from tests/unit.
expected_root_dir = os.path.abspath(
os.path.join(os.path.dirname(__file__), "..", "..")
)
assert paths.ROOT_DIR == expected_root_dir
-def test_path_to_summary():
- expected_path = os.path.join(paths.ROOT_DIR, "data/mentor_data_with_summaries.csv")
- assert paths.PATH_TO_SUMMARY == expected_path
+def test_data_dir_is_correct():
+ """Tests that DATA_DIR is correctly constructed."""
+ expected_path = os.path.join(paths.ROOT_DIR, "data")
+ assert paths.DATA_DIR == expected_path
-def test_path_to_mentor_data():
- expected_path = os.path.join(paths.ROOT_DIR, "data/mentor_data.csv")
+def test_path_to_mentor_data_is_correct():
+ """Tests that PATH_TO_MENTOR_DATA points to the correct file."""
+ expected_path = os.path.join(paths.DATA_DIR, "mentor_data.csv")
assert paths.PATH_TO_MENTOR_DATA == expected_path
-def test_path_to_summary_data():
- expected_path = os.path.join(paths.ROOT_DIR, "data/summary_data.csv")
- assert paths.PATH_TO_SUMMARY_DATA == expected_path
+def test_index_dir_is_dynamic():
+ """Tests that the INDEX_DIR is correctly created based on the embedding model."""
+ expected_path = os.path.join(paths.DB_DIR, EMBEDDING_MODEL)
+ assert paths.INDEX_DIR == expected_path
+ assert os.path.exists(paths.INDEX_DIR) # Should be created on import
-def test_path_to_mentor_data_ranked():
- expected_path = os.path.join(paths.ROOT_DIR, "data/mentor_data_summaries_ranks.csv")
- assert paths.PATH_TO_MENTOR_DATA_RANKED == expected_path
-
-
-def test_professor_types_path():
- expected_path = os.path.join(paths.ROOT_DIR, "data/professor_types.txt")
- assert paths.PROFESSOR_TYPES_PATH == expected_path
-
-
-def test_index_summary_with_metadata():
- expected_path = os.path.join(
- paths.ROOT_DIR, "db", EMBEDDING_MODEL, "index_summary_with_metadata"
- )
+def test_primary_faiss_index_path_is_correct():
+ """Tests that INDEX_SUMMARY_WITH_METADATA points to the correct index path."""
+ expected_path = os.path.join(paths.INDEX_DIR, "faiss_index")
assert paths.INDEX_SUMMARY_WITH_METADATA == expected_path
-
-
-def test_index_summary_assistant_and_above():
- expected_path = os.path.join(
- paths.ROOT_DIR, "db", EMBEDDING_MODEL, "index_summary_assistant_and_above"
- )
- assert paths.INDEX_SUMMARY_ASSISTANT_AND_ABOVE == expected_path
-
-
-def test_index_summary_above_assistant():
- expected_path = os.path.join(
- paths.ROOT_DIR, "db", EMBEDDING_MODEL, "index_summary_above_assistant"
- )
- assert paths.INDEX_SUMMARY_ABOVE_ASSISTANT == expected_path
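Review note: taken together, the rewritten path tests describe a small layout: `data/mentor_data.csv` under the root, and a per-model index directory under `db/` that is created on import. The sketch below reproduces that layout as a function so it can be exercised against a temporary root. `build_paths` is a hypothetical helper; the real `src/config/paths.py` defines these values as module-level constants and is not shown in this section.

```python
import os


def build_paths(root_dir, embedding_model):
    """Derive the path layout the tests expect (hypothetical helper)."""
    data_dir = os.path.join(root_dir, "data")
    db_dir = os.path.join(root_dir, "db")
    # One index directory per embedding model, so switching models
    # does not reuse an index built with different vectors.
    index_dir = os.path.join(db_dir, embedding_model)
    os.makedirs(index_dir, exist_ok=True)
    return {
        "ROOT_DIR": root_dir,
        "DATA_DIR": data_dir,
        "DB_DIR": db_dir,
        "PATH_TO_MENTOR_DATA": os.path.join(data_dir, "mentor_data.csv"),
        "INDEX_DIR": index_dir,
        "INDEX_SUMMARY_WITH_METADATA": os.path.join(index_dir, "faiss_index"),
    }
```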