Persona-Driven Document Intelligence – Adobe Hackathon Round 1B

This project was built for Round 1B (Persona-Driven Document Intelligence) of the Adobe Hackathon 2025. The goal is a system that acts as an intelligent document analyst: given a collection of PDFs, it extracts and prioritizes the sections most relevant to a specific persona and that persona's job-to-be-done.

Problem Statement

Theme:

“Connect What Matters — For the User Who Matters”

Challenge Overview:

Given a document collection, persona definition, and a job-to-be-done, build a generalizable system that identifies and ranks the most relevant sections and sub-sections from the documents.

Input Specification:

Document Collection: 3–10 related PDF files
Persona: A role description with domain-specific expertise
Job-to-be-Done: A concrete task aligned with the persona

Documents can span domains including research papers, educational content, business reports, and more. The solution must be generic enough to handle this variety.
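The three inputs above could be loaded and validated roughly as follows. This is a minimal sketch, not the project's confirmed schema: the key names `documents`, `persona`, and `job_to_be_done`, and the `AnalysisRequest` and `parse_request` names, are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class AnalysisRequest:
    documents: List[str]   # PDF filenames in the collection
    persona: str           # role description with domain expertise
    job_to_be_done: str    # concrete task for that persona


def parse_request(raw: str) -> AnalysisRequest:
    """Parse and validate the request JSON.

    The key names used here are hypothetical; adapt them to the
    actual layout of input.json.
    """
    data = json.loads(raw)
    documents = list(data["documents"])
    # The input specification calls for 3-10 related PDFs.
    if not 3 <= len(documents) <= 10:
        raise ValueError(f"expected 3-10 documents, got {len(documents)}")
    persona = str(data["persona"]).strip()
    job = str(data["job_to_be_done"]).strip()
    if not persona or not job:
        raise ValueError("persona and job_to_be_done must be non-empty")
    return AnalysisRequest(documents, persona, job)
```

Rejecting a malformed request before any PDF is opened keeps failures cheap, which matters under a fixed time budget.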

Sample Use Cases:

Researcher → Review methodologies from academic papers
Investment Analyst → Analyze R&D trends from annual reports
Student → Prepare for exams using chemistry chapters

Output Format

The system writes a structured JSON file containing:

Metadata
- Input documents
- Persona
- Job to be done
- Processing timestamp
Extracted Sections
- Document name
- Page number
- Section title
- Importance rank
Sub-section Analysis
- Document name
- Refined text
- Page number
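The layout above might be assembled as in the sketch below. The exact key names are assumptions chosen to mirror the list (the challenge's official schema may spell them differently), and `build_output` is a hypothetical helper. It assumes `sections` arrives already sorted with the most relevant section first.

```python
import json
from datetime import datetime, timezone


def build_output(documents, persona, job, sections, subsections):
    """Assemble the result document.

    All key names are illustrative assumptions. `sections` must be
    ordered most-relevant first; the rank is derived from position.
    """
    return {
        "metadata": {
            "input_documents": list(documents),
            "persona": persona,
            "job_to_be_done": job,
            "processing_timestamp": datetime.now(timezone.utc).isoformat(),
        },
        "extracted_sections": [
            {
                "document": s["document"],
                "page_number": s["page_number"],
                "section_title": s["section_title"],
                "importance_rank": rank,  # 1 = most important
            }
            for rank, s in enumerate(sections, start=1)
        ],
        "subsection_analysis": [
            {
                "document": s["document"],
                "refined_text": s["refined_text"],
                "page_number": s["page_number"],
            }
            for s in subsections
        ],
    }


def to_json(result):
    """Serialize with non-ASCII characters preserved."""
    return json.dumps(result, indent=2, ensure_ascii=False)
```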

Constraints

Must run offline (no internet)
Must use CPU only
Model size ≤ 1 GB
Total processing time ≤ 60 seconds for 3–5 PDFs
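Two of these constraints can be checked programmatically before a run. The sketch below is one possible guard, not part of the project as described: `check_model_size` and `force_offline` are hypothetical helpers, and reading "1 GB" as 1024³ bytes is an assumption. `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are environment variables honored by the Hugging Face libraries; they need to be set before those libraries are imported.

```python
import os
from pathlib import Path

# Assumption: "1 GB" is interpreted as 1024**3 bytes.
MAX_MODEL_BYTES = 1024 ** 3


def directory_size(path):
    """Total size in bytes of all regular files under `path`."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def check_model_size(model_dir, limit=MAX_MODEL_BYTES):
    """Raise if the model files on disk exceed the size limit."""
    size = directory_size(model_dir)
    if size > limit:
        raise RuntimeError(f"model is {size} bytes, limit is {limit}")
    return size


def force_offline():
    """Ask the Hugging Face libraries not to touch the network."""
    os.environ["HF_HUB_OFFLINE"] = "1"
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
```

With the offline flags set, the model has to be present in the image already, for example downloaded during the Docker build.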

Solution Overview

This solution processes documents using a hybrid of heading-aware parsing and semantic similarity matching. Key features:

Embedding-Based Relevance Ranking: Uses multi-qa-MiniLM-L6-cos-v1 from SentenceTransformers to score document sections against the persona and job description.
Batch Inference for Speed: Embeds candidate sections in batches rather than one at a time, to stay within the 60-second processing limit.
Generalization: Works across domains like academic papers, textbooks, and business reports.
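The ranking step can be sketched as follows. This is an illustration under stated assumptions rather than the code in main.py: `embed_texts`, `cosine`, and `rank_sections` are hypothetical names, the batch size of 32 is arbitrary, and how the persona and job description are combined into a single query is not specified here. The cosine ranking is written in plain Python so it can be read and tested without the model.

```python
import math


def embed_texts(texts, model_name="multi-qa-MiniLM-L6-cos-v1", batch_size=32):
    """Embed all texts in batches on CPU.

    sentence-transformers is imported lazily so the ranking helpers
    below work without it installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name, device="cpu")
    vectors = model.encode(texts, batch_size=batch_size, normalize_embeddings=True)
    return vectors.tolist()


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_sections(query_vec, section_vecs):
    """Return section indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, vec) for vec in section_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

One plausible usage, assuming the query is the persona and job joined into one string: embed `[query] + section_texts` in a single `embed_texts` call, then pass the first vector as `query_vec` and the rest as `section_vecs`. A single call loads the model once and lets the library batch the work.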

Tech Stack

Python 3.9
PyMuPDF for PDF parsing
sentence-transformers for semantic embeddings
Docker for isolated execution
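For the heading-aware parsing mentioned in the Solution Overview, one common heuristic is to treat text set noticeably larger than the body font as a heading. The sketch below shows that heuristic with PyMuPDF. It is an assumption about how such parsing could work, not a description of main.py, and the names `extract_spans` and `find_headings` and the thresholds `ratio` and `max_chars` are hypothetical.

```python
from collections import Counter


def extract_spans(pdf_path):
    """Yield (page_number, font_size, text) for each text span in a PDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so default to empty.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            yield page_number, round(span["size"], 1), text


def find_headings(spans, ratio=1.15, max_chars=120):
    """Return (page_number, text) for spans that look like headings.

    The body font size is taken to be the size covering the most
    characters; a span counts as a heading when its size is at least
    `ratio` times that and its text is short.
    """
    spans = list(spans)
    if not spans:
        return []
    chars_per_size = Counter()
    for _, size, text in spans:
        chars_per_size[size] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]
    return [
        (page, text)
        for page, size, text in spans
        if size >= body_size * ratio and len(text) <= max_chars
    ]
```

Font size alone misses headings that are only bold or numbered, so a fuller version would likely combine several signals. The split into two functions keeps the heuristic testable without a PDF on hand.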

Run Instructions (via Docker)

Directory Structure

project/
├── input_pdfs/        # Folder containing input PDFs
├── input.json         # Persona + job-to-be-done definition
├── output/            # Output folder for result JSON
├── main.py
├── requirements.txt
└── Dockerfile

Build Docker Image

docker build -t pak .

Run

docker run --rm -v "${PWD}/input.json:/app/input/input.json" -v "${PWD}/input_pdfs:/app/input" -v "${PWD}/output:/app/output" pak python main.py --input_json /app/input/input.json --input_dir /app/input --output /app/output/output.json
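The command above passes three flags to main.py. A command-line interface accepting them could look like the sketch below. The flag names come from the command itself; making all three required, the help texts, and the `build_parser` name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the three flags used in the docker run command."""
    parser = argparse.ArgumentParser(
        description="Persona-driven document analysis"
    )
    parser.add_argument(
        "--input_json", required=True,
        help="Path to the persona and job-to-be-done definition",
    )
    parser.add_argument(
        "--input_dir", required=True,
        help="Directory containing the input PDFs",
    )
    parser.add_argument(
        "--output", required=True,
        help="Path of the result JSON to write",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.input_json, args.input_dir, args.output)
```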
