Persona-Based Intelligent Section Extractor for PDFs
Develop a document intelligence system that:
- Accepts user-defined folders (collections) of PDFs.
- Reads a persona and job-to-be-done from a JSON file.
- Extracts and ranks the most relevant sections based on the persona/job.
- Runs fully offline and is Dockerized.
This system:
- Iterates over every CollectionX folder present (e.g., Collection1, Collection2, ...).
- Parses all PDFs inside the PDFs/ subfolder.
- Extracts titles/headings/subsections that are contextually relevant.
- Ranks sections and outputs results to challenge1b_output.json.
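
The ranking step can be illustrated with a small sketch. This is not the logic in src/processor.py; it assumes a deliberately simple keyword-overlap score between each section title and the persona/job text, and the function names `tokenize` and `rank_sections` are hypothetical. It only shows how an `importance_rank` could be assigned fully offline, with no model download.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_sections(sections, persona, job):
    """Order sections by how many title words also occur in the persona/job text.

    Assumption: each section is a dict with at least a "section_title" key.
    Ties keep their input order because Python's sort is stable.
    """
    query = set(tokenize(persona + " " + job))
    scored = []
    for section in sections:
        overlap = len(set(tokenize(section["section_title"])) & query)
        scored.append((overlap, section))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Copy each section and attach a 1-based rank.
    return [
        dict(section, importance_rank=rank)
        for rank, (_, section) in enumerate(scored, start=1)
    ]


if __name__ == "__main__":
    candidates = [
        {"document": "report1.pdf", "page_number": 7, "section_title": "Board Members"},
        {"document": "report1.pdf", "page_number": 3, "section_title": "Revenue Growth Breakdown"},
    ]
    ranked = rank_sections(
        candidates,
        "Investment Analyst",
        "Analyze revenue trends, R&D investments, and market positioning strategies",
    )
    for item in ranked:
        print(item["importance_rank"], item["section_title"])
```

A real implementation would likely score the section body as well as the title, but the overall shape (score, sort, number) stays the same.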
project_root/
├── main.py # Entry point — runs processing for all collections
├── src/
│ └── processor.py # Persona-based extraction and ranking logic
├── Collection1/
│ ├── PDFs/ # Folder containing input PDFs
│ ├── challenge1b_input.json # Persona & job definition
│ └── challenge1b_output.json # Output written here after processing
├── Collection2/
│ ├── PDFs/
│ ├── challenge1b_input.json
│ └── challenge1b_output.json
├── Dockerfile # Offline Docker image
├── requirements.txt # Python dependencies
└── README.md # This file

✅ You can create as many collections as needed; the script auto-detects and processes each one.
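
As a rough illustration of the auto-detection described above, the sketch below finds collection folders by name and lists their PDFs. It is an assumption about how main.py might do this, not a copy of it; `find_collections` and `list_pdfs` are hypothetical helper names.

```python
from pathlib import Path


def find_collections(project_root):
    """Return every Collection* folder that has a PDFs/ subfolder and an input JSON."""
    root = Path(project_root)
    collections = []
    # Note: sorting is lexicographic, so Collection10 sorts before Collection2.
    for folder in sorted(root.glob("Collection*")):
        has_input = (folder / "challenge1b_input.json").is_file()
        has_pdfs = (folder / "PDFs").is_dir()
        if folder.is_dir() and has_input and has_pdfs:
            collections.append(folder)
    return collections


def list_pdfs(collection):
    """Return the PDF files inside a collection's PDFs/ subfolder, sorted by name."""
    pdf_dir = Path(collection) / "PDFs"
    return sorted(p for p in pdf_dir.iterdir() if p.suffix.lower() == ".pdf")


if __name__ == "__main__":
    for collection in find_collections("."):
        print(collection.name, [p.name for p in list_pdfs(collection)])
```

Folders that lack either the PDFs/ subfolder or challenge1b_input.json are skipped rather than treated as errors, which keeps half-prepared collections from stopping the run.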
{
  "persona": "Investment Analyst",
  "job": "Analyze revenue trends, R&D investments, and market positioning strategies"
}

{
  "metadata": {
    "input_documents": ["report1.pdf", "report2.pdf"],
    "persona": "Investment Analyst",
    "job_to_be_done": "Analyze revenue trends, R&D investments, and market positioning strategies",
    "processing_timestamp": "2025-07-15T12:00:00Z"
  },
  "extracted_sections": [
    {
      "document": "report1.pdf",
      "page_number": 3,
      "section_title": "Revenue Growth Breakdown",
      "importance_rank": 1
    }
  ],
  "subsection_analysis": [
    {
      "document": "report1.pdf",
      "page_number": 3,
      "refined_text": "Revenue increased due to product diversification..."
    }
  ]
}

docker build -t persona-section-extractor .

docker run --rm -v $(pwd):/app --network none persona-section-extractor

docker run --rm -v "${PWD}:/app" --network none persona-section-extractor

✅ This command:
- Mounts the current project directory into the container.
- Disables all internet access (--network none).
- Automatically processes all CollectionX/ folders (e.g., Collection1, Collection2, ...).
- Writes output JSONs back into each respective collection directory.
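
The last step, writing results back into each collection, could look roughly like the sketch below. The field names follow the sample output shown earlier; the helper names `build_output` and `write_output` are hypothetical, and the UTC timestamp format is an assumption based on that sample.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_output(pdf_names, persona, job, sections, subsections):
    """Assemble the output document using the same keys as the sample output."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "metadata": {
            "input_documents": list(pdf_names),
            "persona": persona,
            "job_to_be_done": job,
            "processing_timestamp": timestamp,
        },
        "extracted_sections": list(sections),
        "subsection_analysis": list(subsections),
    }


def write_output(collection_dir, output):
    """Write the output as challenge1b_output.json inside the collection folder."""
    path = Path(collection_dir) / "challenge1b_output.json"
    path.write_text(json.dumps(output, indent=2, ensure_ascii=False), encoding="utf-8")
    return path
```

Because the project directory is mounted into the container at /app, a file written this way inside the container appears in the matching collection folder on the host.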
Everything is already included in the Docker image.
If you want to install dependencies locally:
pip install -r requirements.txt

- Add your own CollectionX/ folder (e.g., Collection4/).
- Add a few PDFs inside the PDFs/ subfolder.
- Add a challenge1b_input.json defining the persona/job.
- Run the Docker container again.
- Check challenge1b_output.json for results.
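
Before running the container on a new collection, it can help to confirm that its input file parses and contains both fields. The sketch below assumes the input uses the "persona" and "job" keys from the sample input shown earlier; `load_persona_job` is a hypothetical helper, not part of the project.

```python
import json
from pathlib import Path


def load_persona_job(collection_dir):
    """Read challenge1b_input.json and return (persona, job).

    Raises ValueError if either key is missing or empty.
    """
    path = Path(collection_dir) / "challenge1b_input.json"
    data = json.loads(path.read_text(encoding="utf-8"))
    missing = [key for key in ("persona", "job") if not data.get(key)]
    if missing:
        raise ValueError(f"{path} is missing: {', '.join(missing)}")
    return data["persona"], data["job"]
```

For example, `load_persona_job("Collection4")` would return the persona and job strings for that folder, or raise an error that names the missing key.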
Collection1/ # Travel Planning
├── PDFs/
├── challenge1b_input.json
└── challenge1b_output.json
Collection2/ # Adobe Acrobat Learning
├── PDFs/
├── challenge1b_input.json
└── challenge1b_output.json
Collection3/ # Recipes
├── PDFs/
├── challenge1b_input.json
└── challenge1b_output.json

Salusha, Participant, Adobe Hackathon 2025
GitHub: https://github.com/salusha
Snehal Taori, Participant, Adobe Hackathon 2025
GitHub: https://github.com/snehaltaori
Deepanshi Verma, Participant, Adobe Hackathon 2025
GitHub: https://github.com/DeepanshiiVerma