File Processing Architecture

Overview

File processing in the LinX platform is handled through a skill-based architecture. Files (images, documents, audio, video) are not processed by the core system directly. Instead, processing is delegated to dynamically loaded agent skills.

Design Philosophy

Skill-Based Processing

No Built-in Processing: The core agent system does not have built-in file processing capabilities
Dynamic Skills: File processing is handled by skills loaded from the skill library
Graceful Degradation: If an agent lacks the required skill, files are skipped without errors
Extensible: New file types can be supported by adding new skills

Benefits

Modularity: File processing logic is separated from core agent logic
Flexibility: Different agents can have different file processing capabilities
Scalability: Skills can be added/removed without changing core code
Resource Efficiency: Only load processing capabilities when needed
Customization: Each agent can have tailored file processing skills

Skill Types for File Processing

Image Processing Skills

Skill Name: image_processing

Capabilities:

Image recognition and classification
Object detection
Scene understanding
Visual question answering
Image captioning

Example Usage:

# Agent with image_processing skill
agent = Agent(
    name="Vision Agent",
    skills=["image_processing", "general_chat"]
)

# When user sends image + text
# Skill processes image and extracts information
# Agent receives: "User asked: 'What's in this image?' Image contains: [description]"

Document Processing Skills

Skill Name: document_processing

Capabilities:

PDF text extraction
Document structure analysis
Table extraction
Metadata extraction
Multi-page document handling

Example Usage:

# Agent with document_processing skill
agent = Agent(
    name="Document Analyst",
    skills=["document_processing", "data_analysis"]
)

# When user uploads PDF
# Skill extracts text and structure
# Agent receives: "Document content: [extracted text]"

OCR Skills

Skill Name: ocr

Capabilities:

Text extraction from images
Handwriting recognition
Multi-language support
Layout preservation

Example Usage:

# Agent with OCR skill
agent = Agent(
    name="OCR Agent",
    skills=["ocr", "text_analysis"]
)

# When user sends image with text
# Skill extracts text via OCR
# Agent receives: "Extracted text: [OCR result]"

Audio Processing Skills

Skill Name: audio_processing

Capabilities:

Speech-to-text transcription
Audio classification
Speaker identification
Audio analysis

Video Processing Skills

Skill Name: video_processing

Capabilities:

Frame extraction
Video summarization
Action recognition
Scene detection

Processing Flow

1. File Upload

User → Frontend → Backend API
                    ↓
              File Storage (MinIO)
                    ↓
              File Reference Created

2. Agent Receives Message

Message with Files → Agent Executor
                          ↓
                    Check Agent Skills
                          ↓
              ┌─────────────┴─────────────┐
              ↓                           ↓
        Has Required Skill          No Required Skill
              ↓                           ↓
        Load & Execute Skill         Skip File Processing
              ↓                           ↓
        Process File                 Use Text Only
              ↓                           ↓
        Extract Information          Continue Execution
              ↓                           ↓
        Augment Message              ────┘
              ↓
        Agent Processes Augmented Message

3. Skill Execution

# Pseudo-code for skill-based file processing

def process_message_with_files(agent, message, files):
    """Process message with attached files using agent skills."""
    
    # Check if agent has file processing skills
    has_image_skill = "image_processing" in agent.skills
    has_doc_skill = "document_processing" in agent.skills
    has_ocr_skill = "ocr" in agent.skills
    
    augmented_message = message
    
    for file in files:
        if file.type == "image":
            if has_image_skill:
                # Load and execute image processing skill
                skill = load_skill("image_processing")
                result = skill.process(file)
                augmented_message += f"\n\nImage analysis: {result}"
            elif has_ocr_skill:
                # Fallback to OCR if no image processing
                skill = load_skill("ocr")
                text = skill.extract_text(file)
                augmented_message += f"\n\nExtracted text: {text}"
            else:
                # Skip image processing
                logger.info(f"Agent {agent.id} lacks image processing skills, skipping image")
        
        elif file.type == "document":
            if has_doc_skill:
                # Load and execute document processing skill
                skill = load_skill("document_processing")
                content = skill.extract_content(file)
                augmented_message += f"\n\nDocument content: {content}"
            else:
                # Skip document processing
                logger.info(f"Agent {agent.id} lacks document processing skills, skipping document")
    
    # Agent processes the augmented message
    return agent.execute(augmented_message)

Implementation Details

Skill Library Integration

Skills are stored in the skill_library module:

backend/skill_library/
├── __init__.py
├── skill_registry.py          # Skill registration and loading
├── skill_executor.py           # Skill execution engine
├── default_skills.py           # Built-in skills
└── skills/
    ├── image_processing.py     # Image processing skill
    ├── document_processing.py  # Document processing skill
    ├── ocr.py                  # OCR skill
    ├── audio_processing.py     # Audio processing skill
    └── video_processing.py     # Video processing skill

Skill Definition

Each skill follows a standard interface:

from skill_library.skill_model import Skill, SkillParameter

class ImageProcessingSkill(Skill):
    """Skill for processing images."""
    
    name = "image_processing"
    description = "Processes images and extracts visual information"
    version = "1.0.0"
    
    parameters = [
        SkillParameter(
            name="image_path",
            type="string",
            description="Path to image file",
            required=True
        ),
        SkillParameter(
            name="analysis_type",
            type="string",
            description="Type of analysis: caption, objects, scene",
            required=False,
            default="caption"
        )
    ]
    
    def execute(self, image_path: str, analysis_type: str = "caption") -> dict:
        """Execute image processing."""
        # Load image
        image = load_image(image_path)
        
        # Process based on analysis type
        if analysis_type == "caption":
            result = generate_caption(image)
        elif analysis_type == "objects":
            result = detect_objects(image)
        elif analysis_type == "scene":
            result = analyze_scene(image)
        
        return {
            "success": True,
            "result": result,
            "metadata": {
                "image_size": image.size,
                "format": image.format
            }
        }

Agent Configuration

Agents specify their skills in configuration:

# Create agent with file processing skills
agent = Agent(
    name="Multimodal Assistant",
    type="general",
    skills=[
        "general_chat",
        "image_processing",
        "document_processing",
        "web_search"
    ],
    system_prompt="You are a helpful assistant that can process images and documents."
)

Frontend Integration

The frontend prepares files for upload:

// User attaches files
const attachedFiles = [
  { type: 'image', file: imageFile },
  { type: 'document', file: pdfFile }
];

// Send to backend
await agentsApi.testAgent(agentId, message, {
  files: attachedFiles,
  history: conversationHistory
});

Backend Processing

The backend handles file upload and skill execution:

@router.post("/{agent_id}/test")
async def test_agent(agent_id: str, request: TestAgentRequest, files: List[UploadFile] = None):
    """Test agent with message and optional files."""
    
    # Get agent and check skills
    agent = get_agent(agent_id)
    
    # Upload files to storage
    file_refs = []
    if files:
        for file in files:
            file_path = await upload_to_minio(file)
            file_refs.append({
                "path": file_path,
                "type": detect_file_type(file),
                "name": file.filename
            })
    
    # Process message with files using agent skills
    augmented_message = await process_with_skills(
        agent=agent,
        message=request.message,
        files=file_refs
    )
    
    # Execute agent with augmented message
    result = await agent.execute(augmented_message)
    
    return result

Graceful Degradation

No Skills Available

If an agent has no file processing skills:

# Agent without file processing skills
agent = Agent(
    name="Text-Only Agent",
    skills=["general_chat", "web_search"]
)

# User sends image + text
# System behavior:
# 1. Detects no image_processing skill
# 2. Logs: "Agent lacks image processing skills, skipping image"
# 3. Processes only the text message
# 4. Agent responds based on text only

Partial Skills

If an agent has some but not all file processing skills:

# Agent with only OCR skill
agent = Agent(
    name="OCR Agent",
    skills=["general_chat", "ocr"]
)

# User sends image + PDF
# System behavior:
# 1. Image: Uses OCR skill to extract text
# 2. PDF: No document_processing skill, skips PDF
# 3. Agent receives: "User message + OCR text from image"

Future Enhancements

1. Skill Marketplace

Users can browse and install skills
Community-contributed skills
Skill ratings and reviews

2. Skill Chaining

Combine multiple skills for complex processing
Example: OCR → Translation → Summarization

3. Skill Versioning

Multiple versions of same skill
Backward compatibility
Automatic updates

4. Skill Analytics

Track skill usage
Performance metrics
Error rates

5. Custom Skills

Users can create custom skills
Upload Python code
Sandbox execution

Best Practices

For Agent Creators

Choose Appropriate Skills: Select skills based on agent's purpose
Test with Files: Verify file processing works as expected
Provide Clear Instructions: Tell users what file types are supported
Handle Failures Gracefully: Agent should work even if file processing fails

For Skill Developers

Follow Skill Interface: Implement standard Skill class
Handle Errors: Return meaningful error messages
Optimize Performance: Process files efficiently
Document Capabilities: Clear description of what skill does
Version Properly: Use semantic versioning

FilesExpand file tree

file-processing-architecture.md

Latest commit

History

file-processing-architecture.md

File metadata and controls

File Processing Architecture

Overview

Design Philosophy

Skill-Based Processing

Benefits

Skill Types for File Processing

Image Processing Skills

Document Processing Skills

OCR Skills

Audio Processing Skills

Video Processing Skills

Processing Flow

1. File Upload

2. Agent Receives Message

3. Skill Execution

Implementation Details

Skill Library Integration

Skill Definition

Agent Configuration

Frontend Integration

Backend Processing

Graceful Degradation

No Skills Available

Partial Skills

Future Enhancements

1. Skill Marketplace

2. Skill Chaining

3. Skill Versioning

4. Skill Analytics

5. Custom Skills

Best Practices

For Agent Creators

For Skill Developers

For System Administrators

Security Considerations

File Validation

Skill Sandboxing

Access Control

References