YugVarshney/Adobe_Hackathon_Round_1B

Persona-Driven Document Intelligence – Adobe Hackathon Round 1B

This project was built for the Round 1B challenge, Persona-Driven Document Intelligence, in the Adobe Hackathon 2025. The goal is a system that acts as an intelligent document analyst: given a collection of PDFs, it extracts and prioritizes the sections most relevant to a specific persona and their job-to-be-done.


Problem Statement

Theme:

“Connect What Matters — For the User Who Matters”

Challenge Overview:

Given a document collection, persona definition, and a job-to-be-done, build a generalizable system that identifies and ranks the most relevant sections and sub-sections from the documents.

Input Specification:

  1. Document Collection: 3–10 related PDF files
  2. Persona: A role description with domain-specific expertise
  3. Job-to-be-Done: A concrete task aligned with the persona

Documents can span domains including research papers, educational content, business reports, and more. The solution must be generic enough to handle this variety.

Sample Use Cases:

  • Researcher → Review methodologies from academic papers
  • Investment Analyst → Analyze R&D trends from annual reports
  • Student → Prepare for exams using chemistry chapters

Output Format

The system outputs a structured JSON containing:

  1. Metadata

    • Input documents
    • Persona
    • Job to be done
    • Processing timestamp
  2. Extracted Sections

    • Document name
    • Page number
    • Section title
    • Importance rank
  3. Sub-section Analysis

    • Document name
    • Refined text
    • Page number
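
To make the three-part layout above concrete, here is a minimal sketch of how such a JSON document could be assembled. The key names (`input_documents`, `extracted_sections`, `subsection_analysis`, and so on) and the helper `build_output` are assumptions chosen for illustration; the README lists the fields but not their exact spelling in the output file.

```python
import json
from datetime import datetime, timezone


def build_output(documents, persona, job, ranked_sections, subsections):
    """Assemble the result dictionary (hypothetical key names).

    ranked_sections: list of (document, page_number, section_title),
                     already ordered from most to least relevant.
    subsections:     list of (document, refined_text, page_number).
    """
    return {
        "metadata": {
            "input_documents": list(documents),
            "persona": persona,
            "job_to_be_done": job,
            "processing_timestamp": datetime.now(timezone.utc).isoformat(),
        },
        "extracted_sections": [
            {
                "document": doc,
                "page_number": page,
                "section_title": title,
                "importance_rank": rank,  # 1 = most relevant
            }
            for rank, (doc, page, title) in enumerate(ranked_sections, start=1)
        ],
        "subsection_analysis": [
            {"document": doc, "refined_text": text, "page_number": page}
            for doc, text, page in subsections
        ],
    }


if __name__ == "__main__":
    result = build_output(
        ["report.pdf"],
        "Investment Analyst",
        "Analyze R&D trends",
        [("report.pdf", 4, "R&D Spending")],
        [("report.pdf", "R&D spending is discussed in this section.", 4)],
    )
    print(json.dumps(result, indent=2))
```

The ranks are derived from list order, so whatever produces `ranked_sections` is responsible for sorting by relevance first.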

Constraints

  • Must run offline (no internet)
  • Must use CPU only
  • Model size ≤ 1 GB
  • Total processing time ≤ 60 seconds for 3–5 PDFs

Solution Overview

This solution processes documents using a hybrid of heading-aware parsing and semantic similarity matching. Key features:

  • Embedding-Based Relevance Ranking: Uses multi-qa-MiniLM-L6-cos-v1 from SentenceTransformers to score document sections against the persona and job description.
  • Batch Inference for Speed: Sections are embedded in batches rather than one at a time, which helps keep processing within the time limit.
  • Generalization: Works across domains like academic papers, textbooks, and business reports.
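
The relevance-ranking idea can be sketched as cosine similarity between one query vector (persona plus job description) and one vector per section. The sketch below works on plain lists of floats so it runs without the model; in the project the vectors would presumably come from the sentence-transformers model named above, for example via `SentenceTransformer.encode` called once on the whole list of section texts. The function names `cosine` and `rank_sections` are hypothetical, not taken from the repository.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sections(query_vec, section_vecs, titles, top_k=5):
    """Return up to top_k (rank, title, score) tuples, most similar first."""
    scored = [(cosine(query_vec, vec), title)
              for vec, title in zip(section_vecs, titles)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(rank, title, score)
            for rank, (score, title) in enumerate(scored[:top_k], start=1)]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real sentence embeddings.
    query = [1.0, 0.0]
    vectors = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    names = ["Appendix", "Methodology", "Results"]
    for rank, title, score in rank_sections(query, vectors, names):
        print(rank, title, round(score, 3))
```

Embedding all sections in a single batched call and scoring them in one pass is what keeps this step cheap on CPU; the per-section cost is then dominated by the model's forward pass rather than Python overhead.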

Tech Stack

  • Python 3.9
  • PyMuPDF for PDF parsing
  • sentence-transformers for semantic embeddings
  • Docker for isolated execution
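
The README does not say how heading-aware parsing is implemented. One common heuristic, shown here purely as an assumption, treats text set noticeably larger than the page's dominant body font size as a heading. The sketch takes the dictionary shape returned by PyMuPDF's `page.get_text("dict")` (blocks, then lines, then spans carrying `size` and `text`) so it can be tried without a PDF; `find_headings` and the `ratio` threshold are hypothetical.

```python
from collections import Counter


def find_headings(page_dict, ratio=1.15):
    """Return the text of spans whose font size is at least ratio times the body size.

    page_dict follows the layout of PyMuPDF's page.get_text("dict").
    The body size is estimated as the font size covering the most characters.
    """
    spans = [
        span
        for block in page_dict.get("blocks", [])
        if block.get("type", 0) == 0          # 0 = text block, 1 = image block
        for line in block.get("lines", [])
        for span in line.get("spans", [])
        if span.get("text", "").strip()
    ]
    if not spans:
        return []

    # Weight each font size by the number of characters set in it.
    chars_per_size = Counter()
    for span in spans:
        chars_per_size[round(span["size"], 1)] += len(span["text"])
    body_size = chars_per_size.most_common(1)[0][0]

    return [span["text"].strip() for span in spans
            if span["size"] >= body_size * ratio]


if __name__ == "__main__":
    sample = {"blocks": [
        {"type": 0, "lines": [{"spans": [{"size": 16.0, "text": "Introduction"}]}]},
        {"type": 0, "lines": [{"spans": [{"size": 10.0, "text": "Body text " * 20}]}]},
    ]}
    print(find_headings(sample))
```

A font-size rule alone will miss bold headings set at body size and can misfire on title pages, so a real parser would likely combine it with other cues such as font flags or line position.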

Run Instructions (via Docker)

Directory Structure

project/
├── input_pdfs/        # Folder containing input PDFs
├── input.json         # Persona + job-to-be-done definition
├── output/            # Output folder for result JSON
├── main.py
├── requirements.txt
└── Dockerfile

Build Docker Image

docker build -t pak .

Run Docker Container

docker run --rm -v "${PWD}/input.json:/app/input/input.json" -v "${PWD}/input_pdfs:/app/input" -v "${PWD}/output:/app/output" pak python main.py --input_json /app/input/input.json --input_dir /app/input --output /app/output/output.json
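
The command above passes three flags to `main.py`: `--input_json`, `--input_dir`, and `--output`. A minimal sketch of how an entry point might read them with `argparse` follows. The flag names come from the command itself; marking them as required, the help strings, and the `parse_args` helper are assumptions, since the actual `main.py` is not shown here.

```python
import argparse


def parse_args(argv=None):
    """Parse the three flags used in the docker run command."""
    parser = argparse.ArgumentParser(
        description="Rank PDF sections for a persona and job-to-be-done."
    )
    parser.add_argument("--input_json", required=True,
                        help="Path to the persona and job-to-be-done JSON file")
    parser.add_argument("--input_dir", required=True,
                        help="Directory containing the input PDFs")
    parser.add_argument("--output", required=True,
                        help="Path of the JSON file to write")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Example invocation with the container paths from the command above.
    args = parse_args([
        "--input_json", "/app/input/input.json",
        "--input_dir", "/app/input",
        "--output", "/app/output/output.json",
    ])
    print(args.input_json, args.input_dir, args.output)
```

Passing `argv` explicitly, instead of always reading `sys.argv`, makes the parser easy to exercise outside the container.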
