Challenge 1B: Multi-Document Analysis with Persona-Based Intelligence

Intelligent document analysis that connects what matters for the user who matters

Project Overview

The solution addresses the "Connect What Matters — For the User Who Matters" challenge by implementing a persona-driven document intelligence system.

The system analyzes collections of PDFs and extracts, ranks, and presents the most relevant content from each collection. Relevance is judged against two inputs:

Persona Definition: Specific role with expertise and focus areas
Job-to-be-Done: Concrete task the persona needs to accomplish

The system handles diverse document types, formats, and content structures while maintaining high relevance accuracy and performance.

For a detailed explanation of the approach, see approach_explanation.md.

Technical Architecture

The implementation follows a modular design with three primary components:

PDF Section Extractor (pdf_section_extractor.py)
- Document type classification (text-based, image-based, hybrid)
- Section extraction with hierarchical structure detection
- Content cleaning and normalization
Semantic Analyzer (semantic_analyzer.py)
- Sentence embeddings for semantic similarity analysis
- TF-IDF vectorization as fallback mechanism
- Multi-tiered extraction strategy based on document complexity
Persona-Driven Analyzer (process_persona.py)
- Collection processing with multi-document intelligence
- Relevance ranking and prioritization
- Output generation in standardized JSON format
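
The TF-IDF fallback listed under the Semantic Analyzer can be illustrated with a short, standard-library-only sketch. It is not the code in semantic_analyzer.py: the function names, the smoothed IDF formula, and the "text" key on each section are assumptions made for illustration. It scores each section against a query built from the persona and task, then sorts by cosine similarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict) per input text."""
    tokenized = [Counter(tokenize(t)) for t in texts]
    n_docs = len(texts)
    doc_freq = Counter()
    for counts in tokenized:
        doc_freq.update(counts.keys())
    # Smoothed IDF (an assumption): terms found in every text keep a positive weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in counts.items()} for counts in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sections(query, sections):
    """Sort sections by similarity to the query, most relevant first."""
    vectors = tfidf_vectors([query] + [s["text"] for s in sections])
    query_vec, section_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, v), s) for v, s in zip(section_vecs, sections)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [dict(s, score=round(score, 4)) for score, s in scored]
```

Sections that share no vocabulary with the query score 0.0, which is the main weakness of a purely lexical fallback compared with sentence embeddings.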

Key Features

Adaptive Document Processing: Automatically adapts to different document types and structures
Multi-Document Intelligence: Analyzes connections between documents in a collection
Tiered Extraction Strategy: Balances speed and accuracy based on document complexity
Robust Fallback Mechanisms: Ensures reliable performance across diverse PDF types
Parallel Processing Support: Handles multiple document collections efficiently
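
Adaptive processing depends on knowing what kind of PDF is being read. The sketch below shows one possible way to separate text-based, image-based, and hybrid documents from per-page character counts. The function name, its input, and the 50-character threshold are hypothetical and are not taken from pdf_section_extractor.py.

```python
def classify_document(chars_per_page, min_chars=50):
    """Classify a PDF as text-based, image-based, or hybrid.

    chars_per_page: number of extractable characters found on each page.
    min_chars: hypothetical threshold below which a page is treated as scanned.
    """
    if not chars_per_page:
        raise ValueError("document has no pages")
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    if text_pages == len(chars_per_page):
        return "text-based"
    if text_pages == 0:
        return "image-based"
    return "hybrid"
```

A caller could use the result to choose between plain text extraction and a slower fallback path for scanned pages.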

For more details on our implementation approach and methodology, see our approach explanation document.

Installation & Usage

Prerequisites

Docker with AMD64 support
At least 8 CPUs and 16GB RAM recommended

Building the Image

docker build --platform linux/amd64 -t connect-the-dots-pdf-challenge-1b:latest .

Running the Solution

The solution processes each collection individually. For each collection, run:

docker run --rm \
  -v "$(pwd)/Collection 1:/app/input" \
  -v "$(pwd)/Collection 1:/app/output" \
  --network none \
  connect-the-dots-pdf-challenge-1b:latest

Replace Collection 1 with the appropriate collection directory (Collection 2, Collection 3, etc.).

The container processes the collection found in the /app/input directory and generates the output in /app/output. Note that both input and output are mapped to the same collection directory for convenience.
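
To avoid repeating the command by hand for every collection, a small driver script can build the same docker run invocation for each directory. This is a sketch, not part of the repository: the helper names are hypothetical, and it assumes the collection directories are named "Collection N" and sit in the directory the script is run from.

```python
import subprocess
from pathlib import Path

IMAGE = "connect-the-dots-pdf-challenge-1b:latest"


def build_command(collection_dir):
    """Build the docker run argument list for one collection directory."""
    path = Path(collection_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{path}:/app/input",
        "-v", f"{path}:/app/output",
        "--network", "none",
        IMAGE,
    ]


def run_all(root="."):
    """Run the container once per 'Collection *' directory under root."""
    for collection in sorted(Path(root).glob("Collection *")):
        if collection.is_dir():
            # check=True stops at the first collection that fails.
            subprocess.run(build_command(collection), check=True)
```

Calling run_all() from the repository root processes every collection in turn. Passing the arguments as a list avoids shell quoting problems with the space in the directory names.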

Configuration

Input JSON structure:

{
  "challenge_info": {
    "challenge_id": "round_1b_XXX",
    "test_case_name": "specific_test_case"
  },
  "documents": [{"filename": "doc.pdf", "title": "Document Title"}],
  "persona": {"role": "User Persona"},
  "job_to_be_done": {"task": "Task description"}
}
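
As an illustration of how this input might be consumed, the sketch below parses the JSON and pulls out the document list, persona, and task. The function name and the way persona and task are joined into a single query string are assumptions, not the behavior of process_persona.py.

```python
import json


def load_request(raw_json):
    """Parse the input JSON and return the fields an analyzer would need."""
    data = json.loads(raw_json)
    missing = [k for k in ("documents", "persona", "job_to_be_done") if k not in data]
    if missing:
        raise ValueError(f"input JSON is missing keys: {missing}")
    filenames = [doc["filename"] for doc in data["documents"]]
    persona = data["persona"]["role"]
    task = data["job_to_be_done"]["task"]
    # Hypothetical query format: persona and task joined into one string.
    query = f"{persona}: {task}"
    return filenames, persona, task, query
```

Validating the top-level keys up front gives a clear error message instead of a KeyError deep inside the pipeline.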

Output JSON structure:

{
  "metadata": {
    "input_documents": ["doc1.pdf", "doc2.pdf"],
    "persona": "User Persona",
    "job_to_be_done": "Task description",
    "processing_timestamp": "2025-07-27T10:30:15.123456"
  },
  "extracted_sections": [
    {
      "document": "doc1.pdf",
      "section_title": "Important Section",
      "importance_rank": 1,
      "page_number": 3
    }
  ],
  "subsection_analysis": [
    {
      "document": "doc1.pdf",
      "refined_text": "Detailed content...",
      "page_number": 3
    }
  ]
}
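
The following sketch shows one way a result in this shape could be assembled from sections that are already sorted by relevance. The function name, the top_n cutoff, and the "text" key on each section are hypothetical; only the output field names come from the structure above.

```python
import json
from datetime import datetime


def build_output(filenames, persona, task, ranked_sections, top_n=5):
    """Assemble the output dictionary from sections sorted by relevance."""
    top = ranked_sections[:top_n]  # top_n is an illustrative cutoff
    return {
        "metadata": {
            "input_documents": filenames,
            "persona": persona,
            "job_to_be_done": task,
            "processing_timestamp": datetime.now().isoformat(),
        },
        "extracted_sections": [
            {
                "document": s["document"],
                "section_title": s["section_title"],
                "importance_rank": rank,  # 1 is the most relevant section
                "page_number": s["page_number"],
            }
            for rank, s in enumerate(top, start=1)
        ],
        "subsection_analysis": [
            {
                "document": s["document"],
                "refined_text": s["text"],
                "page_number": s["page_number"],
            }
            for s in top
        ],
    }
```

The returned dictionary can be written out with json.dump(result, f, indent=2).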

Performance Considerations

Processing time scales with document complexity and collection size
Typical processing time: 10-45 seconds for 5-10 documents
Model cache improves performance for subsequent runs
The system uses a maximum of 1GB for models and processing

Sample Collections

The solution includes three test collections:

Collection 1: Travel Planning
- 7 PDFs about South of France
- Persona: Travel Planner
- Task: Plan a 4-day trip for 10 college friends
Collection 2: Adobe Acrobat Learning
- 15 PDFs about Adobe Acrobat features
- Persona: HR Professional
- Task: Create and manage fillable forms
Collection 3: Recipe Collection
- 9 PDFs with food recipes
- Persona: Food Contractor
- Task: Prepare vegetarian buffet-style dinner menu

Each collection demonstrates the system's ability to extract relevant information across different domains and use cases.
