Add file size validation to document upload endpoint #584
Conversation
📝 Walkthrough

Adds project and organization context to jobs via database migrations and model updates. Introduces file size validation for document uploads with a configurable maximum. Updates job creation across services to propagate new context fields. Includes test coverage for upload size restrictions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant RouteHandler as /documents/upload
    participant Validator as validate_document_file
    participant Storage as Cloud Storage
    participant Database as Database
    Client->>RouteHandler: POST file + metadata
    RouteHandler->>Validator: validate_document_file(file)
    alt File size exceeds limit
        Validator-->>RouteHandler: HTTP 413 (Payload Too Large)
        RouteHandler-->>Client: 413 Error
    else File is empty
        Validator-->>RouteHandler: HTTP 422 (Unprocessable Entity)
        RouteHandler-->>Client: 422 Error
    else File size valid
        Validator->>Validator: Seek to end, verify size, reset pointer
        Validator-->>RouteHandler: Return file_size (bytes)
        RouteHandler->>Storage: Upload file
        Storage-->>RouteHandler: Upload success
        RouteHandler->>Database: Create document record
        Database-->>RouteHandler: Record created
        RouteHandler-->>Client: 200 Success
    end
```
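Concretely, the validation step in the diagram might look like the following minimal sketch. This is an illustration only, assuming a FastAPI `UploadFile` and a module-level size constant; the PR's actual `validate_document_file` signature and error messages may differ.

```python
# Minimal sketch of the validation flow in the diagram above — an assumption,
# not the PR's actual implementation.
from fastapi import HTTPException, UploadFile

MAX_DOCUMENT_UPLOAD_SIZE_MB = 512  # the review below notes 512 as the code default


def validate_document_file(file: UploadFile) -> int:
    """Return the file size in bytes, raising 413/422 as in the diagram."""
    file.file.seek(0, 2)  # seek to end to measure the size
    file_size = file.file.tell()
    file.file.seek(0)  # reset the pointer so the subsequent upload reads from the start
    if file_size == 0:
        raise HTTPException(status_code=422, detail="Uploaded file is empty")
    if file_size > MAX_DOCUMENT_UPLOAD_SIZE_MB * 1024 * 1024:
        raise HTTPException(status_code=413, detail="File exceeds the maximum upload size")
    return file_size
```

A test for the size restrictions would then assert on the 413 and 422 status codes, e.g. by posting an oversized or empty payload with FastAPI's `TestClient`.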
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/response/jobs.py (1)
1-6: ⚠️ Potential issue | 🟡 Minor

Add a type hint for `task_instance`.

`task_instance` is untyped on Line 61, which breaks the project's type-hint requirement. As per coding guidelines, "Always add type hints to all function parameters and return values in Python code".

✅ Suggested fix

```diff
+from typing import Any
@@
 def execute_job(
     request_data: dict,
     project_id: int,
     organization_id: int,
     job_id: str,
     task_id: str,
-    task_instance,
+    task_instance: Any,
 ) -> None:
```

Also applies to: 55-62
🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/043_add_project_org_to_job_table.py`:
- Around line 22-41: The migration adds non-nullable columns organization_id and
project_id to the job table using op.add_column without a server_default, which
will fail if rows already exist; update the migration to perform a safe
two-phase change: either (A) add organization_id and project_id with a sensible
server_default (or temporary default value) so existing rows get backfilled,
commit, then remove the server_default and alter nullable to False, or (B) add
both columns as nullable (nullable=True) via op.add_column, run a data backfill
step to populate them, then run a follow-up ALTER to set nullable=False; adjust
the op.add_column calls for "organization_id" and "project_id" accordingly and
include a follow-up migration step to remove defaults or flip nullable once
backfill is done.
In `@backend/app/alembic/versions/044_optimize_conversation_query.py`:
- Around line 18-34: The migration functions upgrade and downgrade lack explicit
return type annotations; update their signatures (functions named upgrade and
downgrade in this migration) to include return type hints (i.e., -> None) so
they comply with the project's mandatory type-hints guideline, leaving the
function bodies unchanged and keeping the existing op.create_index/op.drop_index
calls intact.
- Around line 11-15: The migration functions upgrade() and downgrade() lack
return type annotations; update their definitions to include explicit return
types by changing them to "def upgrade() -> None:" and "def downgrade() ->
None:" so both functions are annotated as returning None (keep bodies unchanged
and only adjust the function signatures for upgrade and downgrade).
In `@backend/app/api/docs/documents/upload.md`:
- Around line 7-11: The docs claim a 50MB max but the code default constant
MAX_DOCUMENT_UPLOAD_SIZE_MB is 512; update the documentation in the upload.md
text to reflect the actual default (change "Maximum file size: 50MB" to "Maximum
file size: 512MB (configurable via MAX_DOCUMENT_UPLOAD_SIZE_MB environment
variable)") and ensure any related lines about rejection behavior remain
unchanged; reference the MAX_DOCUMENT_UPLOAD_SIZE_MB symbol so readers know the
source of truth.
🧹 Nitpick comments (2)
backend/app/models/job.py (1)

92-94: Consider adding `back_populates` for bidirectional navigation.

The relationships lack `back_populates`, meaning you cannot navigate from `Organization` or `Project` to their associated jobs. If bidirectional access is needed (e.g., `organization.jobs`), you'll need to add corresponding relationship fields to those models.

♻️ Example with back_populates

```diff
 # Relationships
-organization: Optional["Organization"] = Relationship()
-project: Optional["Project"] = Relationship()
+organization: Optional["Organization"] = Relationship(back_populates="jobs")
+project: Optional["Project"] = Relationship(back_populates="jobs")
```

Then add to the `Organization` and `Project` models:

```python
jobs: list["Job"] = Relationship(back_populates="organization", cascade_delete=True)
```

backend/app/crud/jobs.py (1)
15-31: Consider adding a log statement for job creation.

Per the coding guidelines, log messages should be prefixed with the function name. Adding a log entry here would improve observability for job creation events.

📝 Proposed logging addition

```diff
 self.session.add(new_job)
 self.session.commit()
 self.session.refresh(new_job)
+logger.info(
+    f"[create] Job created | job_id={new_job.id}, job_type={job_type}, "
+    f"project_id={project_id}, organization_id={organization_id}"
+)
 return new_job
```
```python
op.add_column(
    "job",
    sa.Column(
        "organization_id",
        sa.Integer(),
        nullable=False,
        comment="Reference to the organization",
    ),
)

# Add project_id column
op.add_column(
    "job",
    sa.Column(
        "project_id",
        sa.Integer(),
        nullable=False,
        comment="Reference to the project",
    ),
)
```
Non-nullable columns without defaults will fail if the job table has existing rows.
Adding organization_id and project_id as nullable=False without a server_default will cause the migration to fail on databases with existing job records. Consider one of these approaches:
- Add a `server_default` and backfill with valid IDs, then remove the default
- Make columns nullable initially, backfill, then alter to non-nullable
- If the table is guaranteed empty in all environments, document this assumption
🛠️ Option 1: Two-phase migration (add as nullable, backfill, then enforce NOT NULL)

```diff
 op.add_column(
     "job",
     sa.Column(
         "organization_id",
         sa.Integer(),
-        nullable=False,
+        nullable=True,
         comment="Reference to the organization",
     ),
 )
 op.add_column(
     "job",
     sa.Column(
         "project_id",
         sa.Integer(),
-        nullable=False,
+        nullable=True,
         comment="Reference to the project",
     ),
 )
+
+# TODO: Backfill existing rows with valid organization_id and project_id
+# op.execute("UPDATE job SET organization_id = ..., project_id = ... WHERE organization_id IS NULL")
+
+# Then alter columns to NOT NULL after backfill
+op.alter_column("job", "organization_id", nullable=False)
+op.alter_column("job", "project_id", nullable=False)
```
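For comparison, one variant of the `server_default` route (option A in the aggregated prompt above) might look roughly like this. The literal default of `"1"` is a placeholder assumption, not a known-valid organization or project ID:

```python
# Sketch of option (A): backfill via server_default, then drop the default.
# The "1" below is a hypothetical placeholder; a real migration needs an ID
# that actually exists in the organization/project tables.
op.add_column(
    "job",
    sa.Column("organization_id", sa.Integer(), nullable=False, server_default="1"),
)
op.add_column(
    "job",
    sa.Column("project_id", sa.Integer(), nullable=False, server_default="1"),
)
# Existing rows were backfilled by the default; remove it so new inserts
# must supply explicit values.
op.alter_column("job", "organization_id", server_default=None)
op.alter_column("job", "project_id", server_default=None)
```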
+ op.alter_column("job", "project_id", nullable=False)🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/043_add_project_org_to_job_table.py` around
lines 22 - 41, The migration adds non-nullable columns organization_id and
project_id to the job table using op.add_column without a server_default, which
will fail if rows already exist; update the migration to perform a safe
two-phase change: either (A) add organization_id and project_id with a sensible
server_default (or temporary default value) so existing rows get backfilled,
commit, then remove the server_default and alter nullable to False, or (B) add
both columns as nullable (nullable=True) via op.add_column, run a data backfill
step to populate them, then run a follow-up ALTER to set nullable=False; adjust
the op.add_column calls for "organization_id" and "project_id" accordingly and
include a follow-up migration step to remove defaults or flip nullable once
backfill is done.
```python
# revision identifiers, used by Alembic.
revision = "044"
down_revision = "043"
branch_labels = None
depends_on = None
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# List all Alembic revision IDs to confirm the latest revision number.
python - <<'PY'
import re, pathlib
path = pathlib.Path("backend/app/alembic/versions")
revs = []
for f in path.glob("*.py"):
    txt = f.read_text(encoding="utf-8")
    m = re.search(r'revision\s*=\s*"(\d+)"', txt)
    if m:
        revs.append((int(m.group(1)), f.name))
for rev, name in sorted(revs):
    print(f"{rev:03d} -> {name}")
PY
```

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2069
🏁 Script executed:

```bash
cat backend/app/alembic/versions/044_optimize_conversation_query.py
```

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 934
Add type hints to upgrade() and downgrade() functions.
The revision numbering (044) is correct. However, both the upgrade() and downgrade() functions are missing required type hints. Update them to:
```python
def upgrade() -> None:
def downgrade() -> None:
```
```python
def upgrade():
    # Create composite index to optimize the get_conversation_by_ancestor_id query
    # This query filters by: ancestor_response_id, project_id, is_deleted
    # and orders by: inserted_at DESC
    op.create_index(
        "ix_openai_conversation_ancestor_project_active_time",
        "openai_conversation",
        ["ancestor_response_id", "project_id", "is_deleted", "inserted_at"],
        unique=False,
    )


def downgrade():
    op.drop_index(
        "ix_openai_conversation_ancestor_project_active_time",
        table_name="openai_conversation",
    )
```
🛠️ Refactor suggestion | 🟠 Major
Add required return type hints to migration functions.
Both upgrade and downgrade should be annotated to comply with the mandatory type-hints guideline.
✅ Suggested fix
```diff
-def upgrade():
+def upgrade() -> None:
@@
-def downgrade():
+def downgrade() -> None:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def upgrade() -> None:
    # Create composite index to optimize the get_conversation_by_ancestor_id query
    # This query filters by: ancestor_response_id, project_id, is_deleted
    # and orders by: inserted_at DESC
    op.create_index(
        "ix_openai_conversation_ancestor_project_active_time",
        "openai_conversation",
        ["ancestor_response_id", "project_id", "is_deleted", "inserted_at"],
        unique=False,
    )


def downgrade() -> None:
    op.drop_index(
        "ix_openai_conversation_ancestor_project_active_time",
        table_name="openai_conversation",
    )
```
```markdown
### File Size Restrictions

- **Maximum file size**: 50MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
- Files exceeding the size limit will be rejected with a 413 (Payload Too Large) error
- Empty files will be rejected with a 422 (Unprocessable Entity) error
```
Documentation states 50MB but the code default is 512MB.
The documentation says "Maximum file size: 50MB" but backend/app/core/config.py defines MAX_DOCUMENT_UPLOAD_SIZE_MB: int = 512. Either update the documentation to reflect the actual default or change the config default to 50.
📝 Option: Update documentation to match code

```diff
 ### File Size Restrictions
-- **Maximum file size**: 50MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
+- **Maximum file size**: 512MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
 - Files exceeding the size limit will be rejected with a 413 (Payload Too Large) error
 - Empty files will be rejected with a 422 (Unprocessable Entity) error
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
### File Size Restrictions

- **Maximum file size**: 512MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
- Files exceeding the size limit will be rejected with a 413 (Payload Too Large) error
- Empty files will be rejected with a 422 (Unprocessable Entity) error
```
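As a rough sketch of how such an env-configurable limit is typically wired up (assuming pydantic-settings; the project's actual `backend/app/core/config.py` may be structured differently):

```python
# Illustrative only — shows how an env-overridable default typically works,
# not the project's actual config module.
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Default of 512 matches the review's reading of the code; setting the
    # MAX_DOCUMENT_UPLOAD_SIZE_MB environment variable overrides it.
    MAX_DOCUMENT_UPLOAD_SIZE_MB: int = 512


settings = Settings()
print(settings.MAX_DOCUMENT_UPLOAD_SIZE_MB)  # 512 unless overridden via env
```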
Prajna1999 left a comment
Is this not a duplicate PR?
Summary
Result
Improves reliability and prevents oversized uploads.
Summary by CodeRabbit
New Features
Documentation
Tests
Chores