A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy
Pathfinder-Lite, a lightweight variant of our original Pathfinder, is a modular, retrieval-augmented generation (RAG) tool for semantic search and question answering over a large corpus of astronomy papers. It leverages modern Large Language Models (LLMs), vector embeddings, and optional re-ranking to deliver relevant papers and concise answers to astronomy-related queries.
- Project Overview
- Features
- Project Structure
- Quickstart
- Detailed Explanation
- Using Custom Datasets
- Contributing
- License
- Recent Improvements
- Acknowledgments
Pathfinder is a modular, retrieval-augmented generation (RAG) system designed for astronomy research. It enables natural-language queries over a large corpus of astronomy papers, retrieves semantically relevant documents, and uses LLMs to generate context-aware answers.
Key capabilities:
- Semantic Search: Find papers based on meaning, not just keywords
- Advanced Retrieval: HyDE (Hypothetical Document Embeddings) for improved search accuracy
- Adaptive Weighting: Weight results by keywords, recency, or citation count
- Reranking: Improve retrieval quality using Cohere's reranking model
- Multiple RAG Modes: Choose appropriate prompt types for different question types
- Deep Research: Break complex questions into sub-questions for comprehensive answers
- Consensus Evaluation: Assess agreement level among retrieved papers
- Question Type Detection: Automatically detect the question type and select the optimal processing strategy
- Multi-Model Support: Flexible support for different models like GPT-4o and various Azure OpenAI offerings
- Robust Error Handling: Fallback mechanisms to ensure reliability when services fail
Pathfinder complements existing tools like NASA ADS or arXiv search by allowing free-form queries and advanced summarization of results. Its modular architecture makes it adaptable to other domains beyond astronomy.
- Modular code: Separated into retrieval, pipeline, and UI layers.
- Single config: All credentials and environment variables kept in config.yml.
- Flexible: Swap in different LLMs, embedding models, or re-rankers.
- Multiple frontends: Gradio, Slack bots, etc., can hook into the same underlying library.
- Extensible: Add scripts to build or update FAISS indexes, advanced pipelines, or custom ranking logic.
- Advanced retrieval methods: HyDE (Hypothetical Document Embeddings), Cohere reranking, and weighted scoring by keywords, publication date, and citation count.
A recommended layout:
pathfinder/
├── LICENSE
├── README.md
├── requirements.txt
├── config.yml
├── data/
│ ├── ...
│ └── (local data files, e.g. FAISS indexes)
├── src/
│ ├── __init__.py
│ ├── config.py
│ ├── providers.py
│ ├── nlp_utils.py
│ ├── run_pathfinder.py
│ ├── retrieval/
│ │ ├── __init__.py
│ │ ├── retrieval_system.py
│ ├── prompts.py
│ └── pipeline/
│ ├── __init__.py
│ ├── rag_pipeline.py
├── app/
│ ├── app_gradio.py
│ └── ...
├── scripts/
│ ├── embeddings.py
│ ├── pathfinder_dataset_loader.py
│ ├── useful_api_calls.py
│ └── any_other_preprocessing.py
└── tests/
├── test_providers.py
├── test_embeddings.py
└── ...
- Clone this repo:
git clone git@github.com:UniverseTBD/Pathfinder.git
cd pathfinder
- Create and activate a virtual environment (optional but recommended):
python3 -m venv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
Copy config.yml.template to config.yml and edit it with your credentials:
# config.yml
embedding_base_url: "https://your-azure-endpoint.openai.azure.com"
embedding_api_key: "your-api-key"
embedding_deployment: "text-embedding-3-small"
embedding_deployment_name: "text-embedding-3-small"
embedding_api_version: "2023-05-15"
# GPT-4o-mini Configuration
chat_base_url_4omini: "https://your-gpt4o-azure-endpoint.openai.azure.com"
chat_api_key_4omini: "your-gpt4o-api-key"
chat_deployment_name_4omini: "gpt-4o-mini"
chat_api_version_4omini: "2025-01-01-preview"

You can use Pathfinder directly from the command line:
# Basic usage
python -m src.run_pathfinder "What is dark matter?"
# Specify number of papers to retrieve
python -m src.run_pathfinder "What is dark matter?" --top-k 5
# Specify additional keywords to guide the search
python -m src.run_pathfinder "What is dark matter?" --keywords "galaxy,rotation"
# Specify the prompt type
python -m src.run_pathfinder "What is dark matter?" --prompt-type "Multi-paper"
# Specify the retrieval method
python -m src.run_pathfinder "What is dark matter?" --rag-type "Semantic Search"
# Full example with all options
python -m src.run_pathfinder "What is dark matter?" --top-k 10 --keywords "galaxy,rotation" --toggles Keywords Time --prompt-type "Multi-paper" --rag-type "Semantic + HyDE + CoHERE"

Available options:
- --top-k: Number of papers to retrieve (default: 10)
- --keywords: Additional keywords to guide the search, comma-separated
- --toggles: Weight toggles; can be "Keywords", "Time", or "Citations"
- --prompt-type: Type of prompt to use. Options: "Auto", "Single-paper", "Multi-paper", "Bibliometric", "Broad but nuanced", "Deep Research (BETA)"
- --rag-type: Type of retrieval method to use. Options: "Semantic Search", "Semantic + HyDE", "Semantic + CoHERE", "Semantic + HyDE + CoHERE"
- Gradio version:
python -m src.app.app_gradio
The configuration system consists of:
- config.yml: Stores all secrets and environment variables
- src/config.py: Loads and exposes the configuration
Example config.py:
import yaml, os
CONFIG_PATH = os.path.join(os.path.dirname(__file__), "..", "config.yml")
def load_config():
with open(CONFIG_PATH, "r") as f:
return yaml.safe_load(f)
config = load_config()

src/providers.py centralizes LLM and embedding model initialization, focusing on gpt-4o-mini for all LLM operations:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from src.config import config
def get_openai_chat_llm(deployment_name=None, temperature=0.0):
"""Initialize Azure OpenAI Chat model (always using gpt-4o-mini)."""
# Always use gpt-4o-mini for all operations
llm = AzureChatOpenAI(
azure_endpoint=config["chat_base_url_4omini"],
azure_deployment=config["chat_deployment_name_4omini"],
api_version=config["chat_api_version_4omini"],
api_key=config["chat_api_key_4omini"],
temperature=temperature,
)
return llm
def get_openai_embeddings():
return AzureOpenAIEmbeddings(
azure_endpoint=config["embedding_base_url"],
deployment=config["embedding_deployment_name"],
api_key=config["embedding_api_key"],
api_version=config["embedding_api_version"],
)

src/embeddings.py handles vector embeddings and FAISS operations:
from src.providers import get_openai_embeddings
class EmbeddingService:
def __init__(self):
self.embeddings = get_openai_embeddings()
def embed_text(self, text: str):
return self.embeddings.embed_query(text)

The retrieval system (src/retrieval/retrieval_system.py) manages semantic search and document ranking:
from src.retrieval.retrieval_system import RetrievalSystem
# Initialize the retrieval system
retrieval = RetrievalSystem()
# Basic retrieval
results, papers_df = retrieval.retrieve(
query="How are exoplanets detected?",
top_k=10
)
# Advanced retrieval with HyDE and reranking
results, papers_df = retrieval.retrieve(
query="What is the current understanding of dark energy?",
top_k=10,
toggles=["Keywords", "Time", "Citations"], # Weight by keywords, recency, and citation count
use_hyde=True, # Use hypothetical document embeddings
use_rerank=True, # Use Cohere reranking
hyde_temperature=0.5, # Control HyDE generation creativity
rerank_top_k=250, # Number of candidates for reranking
max_doclen=250, # Max length of generated HyDE document
generate_n=1, # Number of HyDE documents to generate
)

The RetrievalSystem supports multiple retrieval methods and weighting options:
- HyDE (Hypothetical Document Embeddings): Generates a hypothetical document that answers the query, and uses its embedding to find similar papers.
- Cohere Reranking: Re-ranks initial retrieval results using Cohere's reranking model.
- Weighting Options:
- Keywords: Weights results by keyword match with query
- Time: Weights results by recency
- Citations: Weights results by citation count
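The weighting toggles can be pictured as multiplicative adjustments on top of the raw semantic similarity score. The sketch below is illustrative only — the weight formulas and parameter names are assumptions, not Pathfinder's actual implementation:

```python
# Illustrative sketch of toggle-based score weighting.
# All formulas and constants here are hypothetical, not Pathfinder's exact code.
import math

def weighted_score(semantic, keyword_overlap, year, citations,
                   toggles, current_year=2025):
    score = semantic
    if "Keywords" in toggles:
        score *= 1.0 + keyword_overlap            # boost strong keyword matches
    if "Time" in toggles:
        age = max(current_year - year, 0)
        score *= math.exp(-age / 10.0)            # exponentially decay older papers
    if "Citations" in toggles:
        score *= 1.0 + math.log1p(citations) / 10 # mild boost for highly cited work
    return score

# At equal semantic similarity, a recent highly cited paper outranks
# an old uncited one when all three toggles are active.
recent = weighted_score(0.8, 0.5, 2024, 300, ["Keywords", "Time", "Citations"])
old = weighted_score(0.8, 0.5, 1995, 0, ["Keywords", "Time", "Citations"])
```

With no toggles active, the function simply returns the unmodified semantic score, so the toggles are strictly opt-in adjustments.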
The pipeline module contains the reasoning systems built on top of retrieval:
# RAG Pipeline (simple question answering)
from src.pipeline.rag_pipeline import run_rag_qa
result = run_rag_qa(
query="What is the Hubble constant?",
papers_df=papers_dataframe,
question_type="Multi-paper" # or "Single-paper", "Bibliometric", "Broad but nuanced"
)
answer = result["answer"]
# Deep Research Pipeline (complex questions)
from src.pipeline.deep_research_pipeline import deep_research
papers_df, result = deep_research(
question="How does dark matter affect galaxy formation and evolution?",
top_k=10, # papers per sub-question
retrieval_system=retrieval_system
)
complex_answer = result["answer"]

The Deep Research pipeline breaks down complex questions into atomic sub-questions, researches each one, and then synthesizes a comprehensive answer.
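The decompose-research-synthesize loop can be sketched in a few lines. This is a structural illustration only — the callables below are stubs standing in for real LLM and retrieval calls, and none of the names come from Pathfinder's API:

```python
# Structural sketch of a deep-research loop; the callables are stand-ins
# for real LLM/retrieval calls, so this runs without any external service.
from typing import Callable, List

def deep_research_sketch(question: str,
                         decompose: Callable[[str], List[str]],
                         answer_one: Callable[[str], str],
                         synthesize: Callable[[str, List[str]], str]) -> str:
    sub_questions = decompose(question)                # 1. split into atomic parts
    partials = [answer_one(q) for q in sub_questions]  # 2. research each part
    return synthesize(question, partials)              # 3. merge into one answer

# Stub callables so the sketch is runnable end to end
answer = deep_research_sketch(
    "How does dark matter affect galaxy formation?",
    decompose=lambda q: ["What is dark matter?", "How do galaxies form?"],
    answer_one=lambda q: f"[answer to: {q}]",
    synthesize=lambda q, parts: " ".join(parts),
)
```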
The consensus evaluation module analyzes the agreement level between retrieved papers and the query:
from src.consensus_evaluation import evaluate_overall_consensus
# Get consensus evaluation from abstracts
consensus = evaluate_overall_consensus(
query="Are exoplanets common in our galaxy?",
abstracts=["Abstract 1...", "Abstract 2...", "Abstract 3..."]
)
print(f"Consensus: {consensus.consensus}")
print(f"Explanation: {consensus.explanation}")
print(f"Relevance score: {consensus.relevance_score}")

The main run_pathfinder function combines all components:
from src.run_pathfinder import run_pathfinder
# Run the complete Pathfinder system
result = run_pathfinder(
query="What is the evidence for dark matter?",
top_k=10, # Number of papers to retrieve
extra_keywords="galaxy,rotation", # Optional keywords to guide search
toggles=["Keywords", "Time"], # Weight by keywords and recency
prompt_type="Auto", # Auto-detect question type or choose specific type
rag_type="Semantic + HyDE + CoHERE" # Retrieval method
)
# Access results
papers_df = result["papers"] # Retrieved papers
answer = result["answer"] # Generated answer
question_type = result["question_type"] # Detected question type
consensus = result["consensus"] # Consensus evaluation

The app layer provides user interfaces, such as the Gradio demo below:
# app/app_gradio.py
import gradio as gr
import pandas as pd
from src.run_pathfinder import run_pathfinder
def process_query(query):
result = run_pathfinder(query=query)
consensus = result["consensus"] if result["consensus"] else ""
return result["answer"], result["papers"].to_html(), consensus
def main():
with gr.Blocks() as demo:
gr.Markdown("# Pathfinder Demo")
with gr.Row():
query = gr.Textbox(label="Ask a question:")
with gr.Row():
submit_btn = gr.Button("Submit")
with gr.Row():
answer_box = gr.Markdown(label="Answer")
with gr.Row():
papers_html = gr.HTML(label="Retrieved Papers")
with gr.Row():
consensus_box = gr.Markdown(label="Consensus")
submit_btn.click(process_query, inputs=query, outputs=[answer_box, papers_html, consensus_box])
demo.launch()
if __name__ == "__main__":
main()

The scripts/ directory contains utility scripts for tasks like building FAISS indexes:
python scripts/build_faiss_index.py

The project includes a comprehensive test suite using pytest:
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_retrieval_system.py
# Run with verbose output
pytest tests/ -v

Each component has dedicated test files that verify functionality and handle edge cases. The test suite includes:
- Unit tests for individual functions and classes
- Integration tests for component interactions
- Fixtures for consistent test environments
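To illustrate the testing style, here is a minimal, self-contained pytest module. It is hypothetical: it stubs the embedding service with a fake so the tests run without Azure credentials, and `FakeEmbeddingService` is not part of the real codebase:

```python
# Hypothetical pytest module; FakeEmbeddingService stands in for the real
# EmbeddingService so the tests need no Azure credentials or network access.
import numpy as np

class FakeEmbeddingService:
    """Stub mirroring EmbeddingService's embed_text contract."""
    EMBEDDING_DIM = 1536  # dimension of text-embedding-3-small vectors

    def embed_text(self, text: str) -> np.ndarray:
        if not text:
            return np.zeros(self.EMBEDDING_DIM, dtype=np.float32)
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.random(self.EMBEDDING_DIM).astype(np.float32)

def test_embed_text_returns_fixed_dimension():
    svc = FakeEmbeddingService()
    vec = svc.embed_text("dark matter halos")
    assert vec.shape == (svc.EMBEDDING_DIM,)
    assert vec.dtype == np.float32

def test_empty_text_returns_zero_vector():
    svc = FakeEmbeddingService()
    assert not svc.embed_text("").any()
```

In the real suite, a pytest fixture (or monkeypatching) would swap the fake in for the Azure-backed service, keeping unit tests fast and deterministic.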
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License.
Pathfinder can be adapted to work with your own dataset. Here's a guide to integrating a custom corpus:
Your dataset should follow the structure expected by Hugging Face's dataset library, with the following fields:
- title: Document title
- abstract: Document content or summary
- year: Publication year (numeric)
- authors: List of authors
- citations: Citation count (optional, for citation-based weighting)
- embed: Pre-computed embeddings (will be added in step 2)
You can start with a CSV or JSON file containing these fields (except for embeddings which will be computed later).
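For example, a starting CSV with the required fields could be built with pandas (the file name and contents here are illustrative):

```python
# Illustrative: assemble a tiny corpus with the required fields.
# The column names match the schema above; the data and file name are made up.
import io
import pandas as pd

df = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "abstract": ["First abstract ...", "Second abstract ..."],
    "year": [2021, 2023],
    "authors": [["Doe, J."], ["Roe, R.", "Poe, E."]],
    "citations": [12, 3],  # optional; only needed for citation weighting
})

# Write the CSV (a real path such as "my_corpus.csv" would be used in practice)
buf = io.StringIO()
df.to_csv(buf, index=False)
```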
Use the standalone script in the scripts/ directory to process your custom dataset:
# scripts/embeddings.py
import os
import yaml
import faiss
import numpy as np
import pandas as pd
from typing import List, Optional
from datasets import Dataset
from tqdm import tqdm
from pathlib import Path
from numpy.typing import NDArray
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
# Load config for embedding model
def load_config():
config_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "config.yml")
with open(config_path, "r") as f:
return yaml.safe_load(f)
# Get embedding model from config
def get_openai_embeddings():
config = load_config()
return AzureOpenAIEmbeddings(
azure_endpoint=config["embedding_base_url"],
deployment=config["embedding_deployment_name"],
api_key=config["embedding_api_key"],
api_version=config["embedding_api_version"],
)
class EmbeddingService:
EMBEDDING_DIM = 1536 # OpenAI's embedding dimension
def __init__(self):
self.embeddings = get_openai_embeddings()
self.index: Optional[faiss.Index] = None
def embed_text(self, text: str) -> NDArray[np.float32]:
"""Generate embedding for a single text."""
if not text:
return np.zeros(self.EMBEDDING_DIM, dtype=np.float32)
embedding = self.embeddings.embed_query(text)
return np.array(embedding, dtype=np.float32)
def embed_text_batch(self, texts: List[str]) -> List[NDArray[np.float32]]:
"""Generate embeddings for multiple texts."""
if not texts:
return []
embeddings = self.embeddings.embed_documents(texts)
return [np.array(emb, dtype=np.float32) for emb in embeddings]
def create_index(self, dimension: int = EMBEDDING_DIM) -> faiss.Index:
"""Create a new FAISS index."""
self.index = faiss.IndexFlatL2(dimension)
return self.index
# Functions for processing custom datasets
def load_custom_data(data_path):
"""Load data from CSV, JSON, or other format."""
if data_path.endswith('.csv'):
df = pd.read_csv(data_path)
elif data_path.endswith('.json'):
df = pd.read_json(data_path)
else:
raise ValueError(f"Unsupported file format: {data_path}")
return df
def process_dataset(df):
"""Process and validate the dataset."""
required_columns = ['title', 'abstract', 'year', 'authors']
for col in required_columns:
if col not in df.columns:
raise ValueError(f"Missing required column: {col}")
# Convert to dataset
dataset = Dataset.from_pandas(df)
return dataset
def create_embeddings(dataset, batch_size=32):
"""Create embeddings for all documents in the dataset."""
embedding_service = EmbeddingService()
all_embeddings = []
# Process in batches to avoid memory issues
for i in tqdm(range(0, len(dataset), batch_size)):
        # Slicing a Hugging Face Dataset returns a dict of columns, not rows
        batch = dataset[i:i + batch_size]
        texts = batch['abstract']
# Generate embeddings for this batch
embeddings = embedding_service.embed_text_batch(texts)
all_embeddings.extend(embeddings)
# Add embeddings to dataset
dataset = dataset.add_column('embed', all_embeddings)
return dataset
def build_and_save_index(dataset, output_dir):
    """Build FAISS index and save dataset."""
    os.makedirs(output_dir, exist_ok=True)
    # Save the dataset first: datasets cannot serialize an attached FAISS index
    dataset.save_to_disk(os.path.join(output_dir, 'dataset'))
    # Build the FAISS index and save it as a separate file
    dataset.add_faiss_index(column='embed')
    dataset.save_faiss_index('embed', os.path.join(output_dir, 'faiss.index'))
    # Create metadata for retrieval
    import pickle
    metadata = {
        'dataset_name': 'custom',
        'count': len(dataset),
        'created_date': pd.Timestamp.now().isoformat()
    }
    with open(os.path.join(output_dir, 'metadata.pkl'), 'wb') as f:
        pickle.dump(metadata, f)
    print(f"Dataset and index saved to {output_dir}")
    return dataset
def main():
import argparse
parser = argparse.ArgumentParser(description="Build a custom dataset with FAISS index")
parser.add_argument("--input", required=True, help="Path to input data file (CSV, JSON)")
parser.add_argument("--output", required=True, help="Directory to save processed dataset and index")
parser.add_argument("--batch-size", type=int, default=32, help="Batch size for embedding generation")
args = parser.parse_args()
print(f"Loading data from {args.input}")
df = load_custom_data(args.input)
print(f"Processing dataset with {len(df)} documents")
dataset = process_dataset(df)
print("Creating embeddings (this may take a while)...")
dataset_with_embeds = create_embeddings(dataset, batch_size=args.batch_size)
print("Building and saving FAISS index...")
build_and_save_index(dataset_with_embeds, args.output)
print("Done!")
if __name__ == "__main__":
main()

Run the script to process your data and create the necessary FAISS index:
# Create a virtual environment if you haven't already
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Process your custom dataset
python scripts/embeddings.py --input your_data.csv --output data/custom_dataset

This will:
- Load your custom dataset
- Generate embeddings for each document using the configured embedding model
- Build a FAISS index for fast similarity search
- Save the dataset and index to the specified output location
Create a custom version of the retrieval system that works with your local dataset:
# scripts/custom_retrieval.py
import os
import yaml
import numpy as np
import pandas as pd
from datasets import load_from_disk
import cohere
# Load config for API keys
def load_config():
config_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), "config.yml")
with open(config_path, "r") as f:
return yaml.safe_load(f)
config = load_config()
class CustomRetrievalSystem:
"""Retrieval system for custom datasets saved with the embeddings.py script."""
def __init__(
self,
dataset_path,
column_name="embed",
):
"""Initialize with a local dataset instead of from Hugging Face."""
# Dataset info
self.dataset_path = dataset_path
self.column_name = column_name
# Load dataset from disk
self.dataset = load_from_disk(self.dataset_path)
self.dataset.add_faiss_index(column=self.column_name)
print(f"Loaded custom dataset from '{self.dataset_path}' with FAISS index.")
# Set up cohere client for reranking if available
self.cohere_key = config.get("cohere_api_key", os.environ.get("cohere_key", ""))
if self.cohere_key:
print("Initialized Cohere client for reranking")
self.cohere_client = cohere.Client(self.cohere_key)
else:
print("No Cohere API key found, reranking will not be available")
self.cohere_client = None
# Get embedding model
from scripts.embeddings import get_openai_embeddings
self.embedding_model = get_openai_embeddings()
# Weight toggles
self.weight_keywords = False
self.weight_date = False
self.weight_citation = False
def retrieve(self, query, top_k=10, use_rerank=False):
"""Retrieve documents relevant to the query."""
# Generate embedding for the query
query_embedding = self.embedding_model.embed_query(query)
        # Perform FAISS search: returns (scores, examples), where examples
        # is a dict mapping column names to lists of values
        scores, examples = self.dataset.get_nearest_examples(
            self.column_name, np.array(query_embedding, dtype=np.float32), k=top_k
        )
        # Convert to dataframe for easier handling
        results = []
        for i in range(len(scores)):
            results.append({
                "title": examples["title"][i],
                "abstract": examples["abstract"][i],
                "year": examples["year"][i],
                "authors": examples["authors"][i],
                "score": float(scores[i]),
                # Add other fields as available in your dataset
            })
results_df = pd.DataFrame(results)
# Optional reranking with Cohere
if use_rerank and self.cohere_client:
rerank_results = self.cohere_client.rerank(
query=query,
documents=results_df["abstract"].tolist(),
top_n=top_k,
model="rerank-english-v2.0"
)
# Update scores based on reranking
reranked_indices = [r.index for r in rerank_results]
reranked_df = results_df.iloc[reranked_indices].copy()
reranked_df["score"] = [r.relevance_score for r in rerank_results]
return reranked_df
return results_df

Create a simple script to run queries against your custom dataset:
# scripts/run_custom_query.py
import argparse
import pandas as pd
from custom_retrieval import CustomRetrievalSystem
def main():
parser = argparse.ArgumentParser(description="Query a custom Pathfinder dataset")
parser.add_argument("query", help="The query to search for")
parser.add_argument("--dataset", default="data/custom_dataset/dataset",
help="Path to the dataset directory")
parser.add_argument("--top-k", type=int, default=10,
help="Number of results to return")
parser.add_argument("--use-rerank", action="store_true",
help="Use Cohere reranking")
args = parser.parse_args()
# Initialize retrieval system
retrieval = CustomRetrievalSystem(dataset_path=args.dataset)
# Retrieve results
results = retrieval.retrieve(
query=args.query,
top_k=args.top_k,
use_rerank=args.use_rerank
)
# Display results
print(f"\nTop {len(results)} results for query: '{args.query}'\n")
for i, row in results.iterrows():
print(f"{i+1}. {row['title']} ({row['year']})")
print(f" Authors: {row['authors']}")
print(f" Score: {row['score']:.4f}")
print(f" Abstract: {row['abstract'][:200]}...")
print()
if __name__ == "__main__":
main()

Now you can use your custom dataset with a simple command:
# Run a simple query
python scripts/run_custom_query.py "Your query here"
# Use more options
python scripts/run_custom_query.py "Your query here" --dataset data/your_custom_dataset/dataset --top-k 5 --use-rerank

If you want to integrate your custom dataset with the existing Pathfinder system:
- Add your custom dataset path to your
config.yml:
# Add to your config.yml
custom_dataset:
path: "data/custom_dataset/dataset" # Path to the saved dataset
use_custom: true # Flag to use custom dataset

- Create a custom retrieval system in the src directory:
# src/retrieval/custom_retrieval_system.py
from src.retrieval.retrieval_system import RetrievalSystem
from datasets import load_from_disk
class CustomRetrievalSystem(RetrievalSystem):
def __init__(
self,
dataset_path,
column_name="embed",
):
"""Initialize with a local dataset."""
# Call parent __init__ but we'll override the dataset
super().__init__()
# Override dataset loading
self.dataset = load_from_disk(dataset_path)
self.dataset.add_faiss_index(column=column_name)
print(f"Loaded custom dataset from '{dataset_path}'")

- Modify the main run script:
# In src/run_pathfinder.py
from src.retrieval.custom_retrieval_system import CustomRetrievalSystem
def run_pathfinder(query, use_custom_dataset=False, **kwargs):
if use_custom_dataset:
dataset_path = config.get("custom_dataset", {}).get("path")
if dataset_path:
retrieval_system = CustomRetrievalSystem(dataset_path=dataset_path)
else:
retrieval_system = RetrievalSystem() # Fallback to default
else:
retrieval_system = RetrievalSystem()
# Continue with existing code...

- Run Pathfinder with your custom dataset:
python -m src.run_pathfinder "Your query here" --use-custom-dataset

The latest version of Pathfinder includes several significant improvements:
- Streamlined Model Integration: Now exclusively using gpt-4o-mini for all language model operations, for consistent performance and behavior.
- Enhanced UI: Completely redesigned Gradio interface with a modern dark theme for better readability.
- Improved Project Structure: Reorganized repository with cleaner separation of concerns and a more intuitive file layout.
- Enhanced Retrieval: Refined Cohere reranking to improve search result quality and relevance.
- Structured Output Generation: Robust structured output with fallback mechanisms for consistent results.
- More Robust Error Handling: Comprehensive error handling throughout the system to gracefully recover from failures.
- Command-line Interface: Streamlined command-line interface with clear documentation for all available options.
- Custom Dataset Support: Added support for integrating and searching custom document collections.
- Thanks to the UniverseTBD team and the JHU JSALT 2024 "Evals of LLMs for Astro" team for development support
- Thanks to all open-source libraries used in this project
- Special thanks to early testers and contributors
Disclaimer: Pathfinder is meant to complement, not replace, services like arXiv or NASA ADS. Always validate LLM-generated text against primary sources.