This project is designed as part of the Round 1B: Persona-Driven Document Intelligence challenge in the Adobe Hackathon 2025. The goal is to build a system that acts as an intelligent document analyst, extracting and prioritizing the most relevant sections from a collection of PDFs based on a specific persona and their job-to-be-done.
“Connect What Matters — For the User Who Matters”
Given a document collection, persona definition, and a job-to-be-done, build a generalizable system that identifies and ranks the most relevant sections and sub-sections from the documents.
- Document Collection: 3–10 related PDF files
- Persona: A role description with domain-specific expertise
- Job-to-be-Done: A concrete task aligned with the persona
Documents can span domains including research papers, educational content, business reports, and more. The solution must be generic enough to handle this variety.
- Researcher → Review methodologies from academic papers
- Investment Analyst → Analyze R&D trends from annual reports
- Student → Prepare for exams using chemistry chapters
The system outputs a structured JSON containing:
- Metadata
  - Input documents
  - Persona
  - Job to be done
  - Processing timestamp
- Extracted Sections
  - Document name
  - Page number
  - Section title
  - Importance rank
- Sub-section Analysis
  - Document name
  - Refined text
  - Page number
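
The output structure listed above might be assembled as in the following minimal sketch. The key names (`metadata`, `extracted_sections`, `subsection_analysis`, and the field names inside them) are assumptions derived from the labels in the list, not a confirmed schema, and all sample values are placeholders.

```python
import json
from datetime import datetime, timezone


def build_output(documents, persona, job, sections, subsections):
    """Assemble the result dictionary. Key names are assumed, not confirmed."""
    return {
        "metadata": {
            "input_documents": documents,
            "persona": persona,
            "job_to_be_done": job,
            "processing_timestamp": datetime.now(timezone.utc).isoformat(),
        },
        "extracted_sections": sections,
        "subsection_analysis": subsections,
    }


result = build_output(
    documents=["example.pdf"],  # placeholder file name
    persona="Student",
    job="Prepare for exams using chemistry chapters",
    sections=[
        {
            "document": "example.pdf",
            "page_number": 1,
            "section_title": "Placeholder title",
            "importance_rank": 1,
        }
    ],
    subsections=[
        {"document": "example.pdf", "refined_text": "Placeholder text.", "page_number": 1}
    ],
)
print(json.dumps(result, indent=2))
```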
- Must run offline (no internet)
- Must use CPU only
- Model size ≤ 1 GB
- Total processing time ≤ 60 seconds for 3–5 PDFs
This solution processes documents using a hybrid of heading-aware parsing and semantic similarity matching. Key features:
- Embedding-Based Relevance Ranking: Uses `multi-qa-MiniLM-L6-cos-v1` from SentenceTransformers to score document sections against the persona and job description.
- Batch Inference for Speed: Optimized with batch embedding to meet timing constraints.
- Generalization: Works across domains like academic papers, textbooks, and business reports.
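
The ranking step described above can be illustrated with a small sketch. To keep it runnable without downloading a model, it uses a toy bag-of-words embedder and a matching cosine function as stand-ins for the real `multi-qa-MiniLM-L6-cos-v1` embeddings. The function names, the `text` and `title` keys, and the way the persona and job are joined into one query string are all assumptions made for illustration.

```python
import math
from collections import Counter


def toy_embed(texts):
    """Stand-in embedder: word counts per text (NOT the MiniLM model)."""
    return [Counter(t.lower().split()) for t in texts]


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sections(sections, persona, job, embed=toy_embed, similarity=cosine):
    """Return copies of the sections ordered by similarity to the persona and job."""
    query = f"{persona}. {job}"  # assumed way of combining the two inputs
    # A single batched call embeds the query and every section together.
    vectors = embed([query] + [s["text"] for s in sections])
    query_vec, section_vecs = vectors[0], vectors[1:]
    scored = sorted(
        zip(sections, section_vecs),
        key=lambda pair: similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [dict(s, importance_rank=i) for i, (s, _) in enumerate(scored, start=1)]


sections = [
    {"title": "Company history", "text": "The company was founded decades ago"},
    {"title": "Reaction kinetics", "text": "Chemistry exam topics include reaction kinetics"},
]
ranked = rank_sections(sections, "Student", "Prepare for chemistry exam")
for s in ranked:
    print(s["importance_rank"], s["title"])
```

With the real model, `embed` and `similarity` would presumably be replaced together, for example by `SentenceTransformer("multi-qa-MiniLM-L6-cos-v1").encode` (which accepts a list of texts and a `batch_size` argument) and `sentence_transformers.util.cos_sim`.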
- Python 3.9
- PyMuPDF (for PDF parsing)
- `sentence-transformers` for semantic embeddings
- Docker for isolated execution
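
One way heading-aware parsing could be done with PyMuPDF is to compare each text span's font size against the document's typical body size. The sketch below is an assumption about the approach, not the project's confirmed logic: the helper names, the `ratio` threshold, and the span dictionary layout are hypothetical. PyMuPDF is imported lazily so the heuristic itself can be tried on plain data.

```python
from statistics import median


def iter_spans(pdf_path):
    """Yield non-empty text spans with font size and 1-based page number."""
    import fitz  # PyMuPDF; imported here so find_headings works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no "lines"
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            yield {"text": text, "size": span["size"], "page": page.number + 1}


def find_headings(spans, ratio=1.2):
    """Treat spans at least `ratio` times the median font size as headings."""
    spans = list(spans)
    if not spans:
        return []
    body_size = median(s["size"] for s in spans)
    return [s for s in spans if s["size"] >= ratio * body_size]


# Hand-written sample spans standing in for iter_spans("some.pdf") output.
sample = [
    {"text": "Introduction", "size": 18, "page": 1},
    {"text": "Body text on the first page.", "size": 10, "page": 1},
    {"text": "More body text.", "size": 10, "page": 1},
    {"text": "Body text on the second page.", "size": 10, "page": 2},
]
headings = find_headings(sample)
print([h["text"] for h in headings])
```

A size-only heuristic is simple and fast on CPU, though documents that mark headings with bold text at body size would need an additional signal.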
project/
├── input_pdfs/ # Folder containing input PDFs
├── input.json # Persona + job-to-be-done definition
├── output/ # Output folder for result JSON
├── main.py
├── requirements.txt
├── Dockerfile
- Build
docker build -t pak .
- Run
docker run --rm -v "${PWD}/input.json:/app/input/input.json" -v "${PWD}/input_pdfs:/app/input" -v "${PWD}/output:/app/output" pak python main.py --input_json /app/input/input.json --input_dir /app/input --output /app/output/output.json
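
The three flags passed to `main.py` in the run command above could be parsed with `argparse` roughly as follows. This is a sketch of the command-line interface only; the help strings and the `parse_args` helper are assumptions, and the actual `main.py` may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the docker run command."""
    parser = argparse.ArgumentParser(description="Persona-driven document analysis")
    parser.add_argument("--input_json", required=True, help="Persona and job-to-be-done definition")
    parser.add_argument("--input_dir", required=True, help="Directory containing the input PDFs")
    parser.add_argument("--output", required=True, help="Path of the result JSON file")
    return parser.parse_args(argv)


args = parse_args(
    [
        "--input_json", "/app/input/input.json",
        "--input_dir", "/app/input",
        "--output", "/app/output/output.json",
    ]
)
print(args.input_json, args.input_dir, args.output)
```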