Add file size validation to document upload endpoint #584
Conversation
📝 Walkthrough

Adds project and organization context to jobs via database migrations and model updates. Introduces file size validation for document uploads with a configurable maximum. Updates job creation across services to propagate new context fields. Includes test coverage for upload size restrictions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant RouteHandler as /documents/upload
    participant Validator as validate_document_file
    participant Storage as Cloud Storage
    participant Database as Database
    Client->>RouteHandler: POST file + metadata
    RouteHandler->>Validator: validate_document_file(file)
    alt File size exceeds limit
        Validator-->>RouteHandler: HTTP 413 (Payload Too Large)
        RouteHandler-->>Client: 413 Error
    else File is empty
        Validator-->>RouteHandler: HTTP 422 (Unprocessable Entity)
        RouteHandler-->>Client: 422 Error
    else File size valid
        Validator->>Validator: Seek to end, verify size, reset pointer
        Validator-->>RouteHandler: Return file_size (bytes)
        RouteHandler->>Storage: Upload file
        Storage-->>RouteHandler: Upload success
        RouteHandler->>Database: Create document record
        Database-->>RouteHandler: Record created
        RouteHandler-->>Client: 200 Success
    end
```
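Concretely, the validation step in the diagram might look like the following minimal sketch. This is an illustration only, assuming a FastAPI `UploadFile` and a module-level size constant; the PR's actual `validate_document_file` signature and error messages may differ.

```python
# Minimal sketch of the validation flow in the diagram above — an assumption,
# not the PR's actual implementation.
from fastapi import HTTPException, UploadFile

MAX_DOCUMENT_UPLOAD_SIZE_MB = 512  # the review below notes 512 as the code default


def validate_document_file(file: UploadFile) -> int:
    """Return the file size in bytes, raising 413/422 as in the diagram."""
    file.file.seek(0, 2)  # seek to end to measure the size
    file_size = file.file.tell()
    file.file.seek(0)  # reset the pointer so the subsequent upload reads from the start
    if file_size == 0:
        raise HTTPException(status_code=422, detail="Uploaded file is empty")
    if file_size > MAX_DOCUMENT_UPLOAD_SIZE_MB * 1024 * 1024:
        raise HTTPException(status_code=413, detail="File exceeds the maximum upload size")
    return file_size
```

A test for the size restrictions would then assert on the 413 and 422 status codes, e.g. by posting an oversized or empty payload with FastAPI's `TestClient`.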
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/response/jobs.py (1)
1-6: ⚠️ Potential issue | 🟡 Minor

Add a type hint for `task_instance`.

`task_instance` is untyped on Line 61, which breaks the project's type-hint requirement. As per coding guidelines, "Always add type hints to all function parameters and return values in Python code".

✅ Suggested fix

```diff
+from typing import Any
@@
 def execute_job(
     request_data: dict,
     project_id: int,
     organization_id: int,
     job_id: str,
     task_id: str,
-    task_instance,
+    task_instance: Any,
 ) -> None:
```

Also applies to: 55-62
🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/043_add_project_org_to_job_table.py`:
- Around line 22-41: The migration adds non-nullable columns organization_id and
project_id to the job table using op.add_column without a server_default, which
will fail if rows already exist; update the migration to perform a safe
two-phase change: either (A) add organization_id and project_id with a sensible
server_default (or temporary default value) so existing rows get backfilled,
commit, then remove the server_default and alter nullable to False, or (B) add
both columns as nullable (nullable=True) via op.add_column, run a data backfill
step to populate them, then run a follow-up ALTER to set nullable=False; adjust
the op.add_column calls for "organization_id" and "project_id" accordingly and
include a follow-up migration step to remove defaults or flip nullable once
backfill is done.
In `@backend/app/alembic/versions/044_optimize_conversation_query.py`:
- Around line 18-34: The migration functions upgrade and downgrade lack explicit
return type annotations; update their signatures (functions named upgrade and
downgrade in this migration) to include return type hints (i.e., -> None) so
they comply with the project's mandatory type-hints guideline, leaving the
function bodies unchanged and keeping the existing op.create_index/op.drop_index
calls intact.
- Around line 11-15: The migration functions upgrade() and downgrade() lack
return type annotations; update their definitions to include explicit return
types by changing them to "def upgrade() -> None:" and "def downgrade() ->
None:" so both functions are annotated as returning None (keep bodies unchanged
and only adjust the function signatures for upgrade and downgrade).
In `@backend/app/api/docs/documents/upload.md`:
- Around line 7-11: The docs claim a 50MB max but the code default constant
MAX_DOCUMENT_UPLOAD_SIZE_MB is 512; update the documentation in the upload.md
text to reflect the actual default (change "Maximum file size: 50MB" to "Maximum
file size: 512MB (configurable via MAX_DOCUMENT_UPLOAD_SIZE_MB environment
variable)") and ensure any related lines about rejection behavior remain
unchanged; reference the MAX_DOCUMENT_UPLOAD_SIZE_MB symbol so readers know the
source of truth.
🧹 Nitpick comments (2)
backend/app/models/job.py (1)

92-94: Consider adding `back_populates` for bidirectional navigation.

The relationships lack `back_populates`, meaning you cannot navigate from `Organization` or `Project` to their associated jobs. If bidirectional access is needed (e.g., `organization.jobs`), you'll need to add corresponding relationship fields to those models.

♻️ Example with back_populates

```diff
 # Relationships
-organization: Optional["Organization"] = Relationship()
-project: Optional["Project"] = Relationship()
+organization: Optional["Organization"] = Relationship(back_populates="jobs")
+project: Optional["Project"] = Relationship(back_populates="jobs")
```

Then add to the `Organization` and `Project` models:

```python
jobs: list["Job"] = Relationship(back_populates="organization", cascade_delete=True)
```

backend/app/crud/jobs.py (1)
15-31: Consider adding a log statement for job creation.

Per the coding guidelines, log messages should be prefixed with the function name. Adding a log entry here would improve observability for job creation events.

📝 Proposed logging addition

```diff
 self.session.add(new_job)
 self.session.commit()
 self.session.refresh(new_job)
+logger.info(
+    f"[create] Job created | job_id={new_job.id}, job_type={job_type}, "
+    f"project_id={project_id}, organization_id={organization_id}"
+)
 return new_job
```
```python
op.add_column(
    "job",
    sa.Column(
        "organization_id",
        sa.Integer(),
        nullable=False,
        comment="Reference to the organization",
    ),
)

# Add project_id column
op.add_column(
    "job",
    sa.Column(
        "project_id",
        sa.Integer(),
        nullable=False,
        comment="Reference to the project",
    ),
)
```
Non-nullable columns without defaults will fail if the job table has existing rows.
Adding organization_id and project_id as nullable=False without a server_default will cause the migration to fail on databases with existing job records. Consider one of these approaches:
- Add a `server_default` and backfill with valid IDs, then remove the default
- Make columns nullable initially, backfill, then alter to non-nullable
- If the table is guaranteed empty in all environments, document this assumption
🛠️ Option 1: Two-phase migration (add as nullable, backfill, then enforce NOT NULL)

```diff
 op.add_column(
     "job",
     sa.Column(
         "organization_id",
         sa.Integer(),
-        nullable=False,
+        nullable=True,
         comment="Reference to the organization",
     ),
 )
 op.add_column(
     "job",
     sa.Column(
         "project_id",
         sa.Integer(),
-        nullable=False,
+        nullable=True,
         comment="Reference to the project",
     ),
 )
+
+# TODO: Backfill existing rows with valid organization_id and project_id
+# op.execute("UPDATE job SET organization_id = ..., project_id = ... WHERE organization_id IS NULL")
+
+# Then alter columns to NOT NULL after backfill
+op.alter_column("job", "organization_id", nullable=False)
+op.alter_column("job", "project_id", nullable=False)
```
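For comparison, one variant of the `server_default` route (option A in the aggregated prompt above) might look roughly like this. The literal default of `"1"` is a placeholder assumption, not a known-valid organization or project ID:

```python
# Sketch of option (A): backfill via server_default, then drop the default.
# The "1" below is a hypothetical placeholder; a real migration needs an ID
# that actually exists in the organization/project tables.
op.add_column(
    "job",
    sa.Column("organization_id", sa.Integer(), nullable=False, server_default="1"),
)
op.add_column(
    "job",
    sa.Column("project_id", sa.Integer(), nullable=False, server_default="1"),
)
# Existing rows were backfilled by the default; remove it so new inserts
# must supply explicit values.
op.alter_column("job", "organization_id", server_default=None)
op.alter_column("job", "project_id", server_default=None)
```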
+ op.alter_column("job", "project_id", nullable=False)🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/043_add_project_org_to_job_table.py` around
lines 22 - 41, The migration adds non-nullable columns organization_id and
project_id to the job table using op.add_column without a server_default, which
will fail if rows already exist; update the migration to perform a safe
two-phase change: either (A) add organization_id and project_id with a sensible
server_default (or temporary default value) so existing rows get backfilled,
commit, then remove the server_default and alter nullable to False, or (B) add
both columns as nullable (nullable=True) via op.add_column, run a data backfill
step to populate them, then run a follow-up ALTER to set nullable=False; adjust
the op.add_column calls for "organization_id" and "project_id" accordingly and
include a follow-up migration step to remove defaults or flip nullable once
backfill is done.
```python
# revision identifiers, used by Alembic.
revision = "044"
down_revision = "043"
branch_labels = None
depends_on = None
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# List all Alembic revision IDs to confirm the latest revision number.
python - <<'PY'
import re, pathlib
path = pathlib.Path("backend/app/alembic/versions")
revs = []
for f in path.glob("*.py"):
    txt = f.read_text(encoding="utf-8")
    m = re.search(r'revision\s*=\s*"(\d+)"', txt)
    if m:
        revs.append((int(m.group(1)), f.name))
for rev, name in sorted(revs):
    print(f"{rev:03d} -> {name}")
PY
```

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2069
🏁 Script executed:

```bash
cat backend/app/alembic/versions/044_optimize_conversation_query.py
```

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 934
Add type hints to upgrade() and downgrade() functions.
The revision numbering (044) is correct. However, both the upgrade() and downgrade() functions are missing required type hints. Update them to:
```python
def upgrade() -> None:
def downgrade() -> None:
```
```python
def upgrade():
    # Create composite index to optimize the get_conversation_by_ancestor_id query
    # This query filters by: ancestor_response_id, project_id, is_deleted
    # and orders by: inserted_at DESC
    op.create_index(
        "ix_openai_conversation_ancestor_project_active_time",
        "openai_conversation",
        ["ancestor_response_id", "project_id", "is_deleted", "inserted_at"],
        unique=False,
    )


def downgrade():
    op.drop_index(
        "ix_openai_conversation_ancestor_project_active_time",
        table_name="openai_conversation",
    )
```
🛠️ Refactor suggestion | 🟠 Major
Add required return type hints to migration functions.
Both upgrade and downgrade should be annotated to comply with the mandatory type-hints guideline.
✅ Suggested fix
```diff
-def upgrade():
+def upgrade() -> None:
@@
-def downgrade():
+def downgrade() -> None:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def upgrade() -> None:
    # Create composite index to optimize the get_conversation_by_ancestor_id query
    # This query filters by: ancestor_response_id, project_id, is_deleted
    # and orders by: inserted_at DESC
    op.create_index(
        "ix_openai_conversation_ancestor_project_active_time",
        "openai_conversation",
        ["ancestor_response_id", "project_id", "is_deleted", "inserted_at"],
        unique=False,
    )


def downgrade() -> None:
    op.drop_index(
        "ix_openai_conversation_ancestor_project_active_time",
        table_name="openai_conversation",
    )
```
```markdown
### File Size Restrictions

- **Maximum file size**: 50MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
- Files exceeding the size limit will be rejected with a 413 (Payload Too Large) error
- Empty files will be rejected with a 422 (Unprocessable Entity) error
```
Documentation states 50MB but the code default is 512MB.
The documentation says "Maximum file size: 50MB" but backend/app/core/config.py defines MAX_DOCUMENT_UPLOAD_SIZE_MB: int = 512. Either update the documentation to reflect the actual default or change the config default to 50.
📝 Option: Update documentation to match code

```diff
 ### File Size Restrictions
-- **Maximum file size**: 50MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
+- **Maximum file size**: 512MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
 - Files exceeding the size limit will be rejected with a 413 (Payload Too Large) error
 - Empty files will be rejected with a 422 (Unprocessable Entity) error
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
### File Size Restrictions

- **Maximum file size**: 512MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
- Files exceeding the size limit will be rejected with a 413 (Payload Too Large) error
- Empty files will be rejected with a 422 (Unprocessable Entity) error
```
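As a rough sketch of how such an env-configurable limit is typically wired up (assuming pydantic-settings; the project's actual `backend/app/core/config.py` may be structured differently):

```python
# Illustrative only — shows how an env-overridable default typically works,
# not the project's actual config module.
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Default of 512 matches the review's reading of the code; setting the
    # MAX_DOCUMENT_UPLOAD_SIZE_MB environment variable overrides it.
    MAX_DOCUMENT_UPLOAD_SIZE_MB: int = 512


settings = Settings()
print(settings.MAX_DOCUMENT_UPLOAD_SIZE_MB)  # 512 unless overridden via env
```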
Prajna1999 left a comment
Is this not a duplicate PR?
Summary
Result
Improves reliability and prevents oversized uploads.
Summary by CodeRabbit
New Features
Documentation
Tests
Chores