Evaluation: STT #571
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

Adds end-to-end Speech-to-Text (STT) evaluation: API endpoints, models, CRUD, services, a Gemini batch provider, batch polling/processing, storage/file handling, a DB migration, and comprehensive tests and docs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as STT API
    participant Service
    participant Storage
    participant DB as Database
    Client->>API: POST /files/audio (multipart)
    API->>Service: validate_audio_file(file)
    Service->>Service: check extension & size
    Service->>Storage: upload file to object store
    Storage-->>Service: object_store_url
    Service->>DB: create_file record
    DB-->>API: file metadata
    API-->>Client: AudioUploadResponse (s3_url, file_id)
```

```mermaid
sequenceDiagram
    participant Client
    participant API as STT API
    participant CRUD
    participant BatchSvc as Batch Service
    participant Storage
    participant Gemini
    Client->>API: POST /runs (start evaluation)
    API->>CRUD: validate dataset & list samples
    CRUD-->>API: samples
    API->>CRUD: create run & create result records
    API->>BatchSvc: start_stt_evaluation_batch(run, samples)
    BatchSvc->>Storage: generate signed URLs for sample files
    BatchSvc->>BatchSvc: build JSONL requests
    BatchSvc->>Gemini: submit batch job(s)
    Gemini-->>BatchSvc: provider batch id/status
    BatchSvc->>CRUD: update_stt_run(status=processing, batch ids)
    API-->>Client: STTEvaluationRunPublic (processing)
```

```mermaid
sequenceDiagram
    participant Cron
    participant CRUD as Run CRUD
    participant Gemini
    participant Results as Result CRUD
    participant DB as Database
    Cron->>CRUD: get_pending_stt_runs()
    CRUD-->>Cron: pending runs with batch_job_id
    loop each run
        Cron->>Gemini: get_batch_status(batch_id)
        Gemini-->>Cron: state
        alt terminal
            Cron->>Gemini: download_batch_results(batch_output_id)
            Gemini-->>Cron: results JSONL
            Cron->>Results: update_stt_result entries
            Results->>DB: persist updates
            Cron->>CRUD: update_stt_run(status=completed/failed)
        else
            Cron-->>Cron: continue polling
        end
    end
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 3 passed checks
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/models/evaluation.py (1)
374-394: ⚠️ Potential issue | 🟠 Major

`EvaluationRunPublic` and `EvaluationDatasetPublic` are missing the new `type`, `language_id`, and `providers` fields. The table models were extended with `type`, `language_id` (on both), and `providers` (on `EvaluationRun`), but their corresponding public response models don't expose these fields. API consumers won't see the new metadata.

Proposed fix

Add to `EvaluationRunPublic`:

```diff
 class EvaluationRunPublic(SQLModel):
     """Public model for evaluation runs."""

     id: int
     run_name: str
     dataset_name: str
+    type: str
+    language_id: int | None
+    providers: list[str] | None
     config_id: UUID | None
```

Add to `EvaluationDatasetPublic`:

```diff
 class EvaluationDatasetPublic(SQLModel):
     """Public model for evaluation datasets."""

     id: int
     name: str
     description: str | None
+    type: str
+    language_id: int | None
     dataset_metadata: dict[str, Any]
```
🤖 Fix all issues with AI agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py`:
- Around line 137-162: The code re-fetches the run with get_stt_run_by_id (which
returns EvaluationRun | None) then immediately dereferences run fields; add a
None check after the call and return a 404/error response if run is None, or
replace the re-query with session.refresh(run) to reload the existing ORM object
instead of calling get_stt_run_by_id; update the logic around run (used in
STTEvaluationRunPublic and APIResponse.success_response) to only proceed when
run is present.
- Line 108: The dict comprehension sample_to_result = {r.stt_sample_id: r.id for
r in results} loses entries when multiple providers produce results for the same
stt_sample_id; change sample_to_result in evaluation.py to map each
stt_sample_id to a list of result ids (e.g., append to a list per key) so no
results are overwritten, and then update batch.py (functions around
start_stt_evaluation_batch and the batch failure handling) to accept and iterate
list-valued mapping entries when collecting result IDs to mark as FAILED
(flatten lists or handle both single id and list cases) so all provider-specific
results transition from PENDING to FAILED on batch failure.
In `@backend/app/crud/file.py`:
- Line 8: The import of FileType from app.models.file is unused in
backend/app/crud/file.py; remove FileType from the import statement (leave File)
or, if intended, use FileType in function signatures or type hints (e.g., in
functions that create or filter files) so it is referenced; update the import
line that currently references File and FileType accordingly and run linters to
ensure no other unused imports remain.
In `@backend/app/crud/stt_evaluations/batch.py`:
- Around line 144-156: The except block in start_stt_evaluation_batch currently
raises a new Exception which loses the original traceback; change the final
raise to preserve the original exception (either re-raise the caught exception
or use raise Exception(...) from e) so the cause chain is kept, and keep the
existing logging and update_stt_result calls (referencing update_stt_result,
STTResultStatus, sample_to_result, and logger) intact.
In `@backend/app/services/stt_evaluations/dataset.py`:
- Around line 69-72: The metadata calculation currently counts ground truth by
truthiness which skips empty strings; update the has_ground_truth_count
computation to count samples where s.ground_truth is not None (e.g., replace the
truthiness check with an explicit "s.ground_truth is not None") so empty string
ground truths are counted; change the generator used to compute
metadata["has_ground_truth_count"] in the dataset.py code that builds metadata
for samples.
- Around line 54-103: upload_stt_dataset currently calls create_stt_dataset and
then create_stt_samples in separate commits, which can leave an orphan dataset
if sample creation (e.g., validate_file_ids) fails; wrap both dataset and sample
creation inside a single DB transaction (use session.begin_nested() or
session.begin()) so that a failure during create_stt_samples triggers a
rollback, and if you prefer cleanup instead, catch exceptions around
create_stt_samples, call session.delete(dataset) and session.commit() (or
rollback then delete) to remove the dataset; update upload_stt_dataset to
perform these changes and reference the existing functions create_stt_dataset,
create_stt_samples and the failing validator validate_file_ids to ensure all DB
writes are atomic.
In `@backend/app/tests/api/routes/test_stt_evaluation.py`:
- Around line 543-607: Add a happy-path test in the TestSTTEvaluationRun suite
that posts to "/api/v1/evaluations/stt/runs" with a valid dataset (create or
fixture with samples) and a run_name/providers payload, mock the
start_stt_evaluation_batch function to avoid external calls, assert a 200
response (or 201 per API) and that response.json() contains success True and
expected run metadata, and verify the mocked start_stt_evaluation_batch was
called once with the created dataset id and the provided providers/run_name;
reference the test helper names used in this file and the
start_stt_evaluation_batch symbol for locating the code to mock and assert.
- Around line 514-541: The fixture test_dataset_with_samples creates samples but
leaves the dataset metadata (created via create_test_stt_dataset) with a stale
sample_count=0, which will make evaluations reject it; after creating the three
samples with create_test_file and create_test_stt_sample, update the
EvaluationDataset instance returned by create_test_stt_dataset by setting its
dataset_metadata["sample_count"] to the actual number of created samples (e.g.,
3) and persist that change using the provided db Session (flush/commit or the
project/dataset update helper) so the dataset reflects the correct sample_count
when used by start_stt_evaluation.
🧹 Nitpick comments (18)
backend/app/tests/services/stt_evaluations/test_audio.py (2)
299-327: `test_successful_upload` asserts on `s3_url` but doesn't verify the generated S3 key contains a UUID.

The service generates a unique filename with a UUID before uploading. The test asserts the exact URL returned by `mock_storage.put.return_value`, which is fine for verifying the pass-through, but doesn't verify `storage.put` was called with a UUID-based key. Consider asserting on `mock_storage.put.call_args` to verify the key pattern.
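For example, a sketch of that assertion, assuming the service passes the key via the `file_path` keyword as `storage.put` is called elsewhere in this PR:

```python
import re

# Sketch: verify the uploaded key embeds a UUID, e.g. "<uuid>.mp3"
_, put_kwargs = mock_storage.put.call_args
uploaded_key = str(put_kwargs["file_path"])
uuid_pattern = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
assert re.search(uuid_pattern, uploaded_key), f"expected a UUID-based key, got {uploaded_key}"
```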
330-346: Importing `HTTPException` inside the test method body is unconventional.

The `from app.core.exception_handlers import HTTPException` import on Line 332 (and similarly Line 352) could be moved to the module's top-level imports for clarity and consistency.

backend/app/services/stt_evaluations/dataset.py (1)

144-148: Use `logger.warning` instead of `logger.info` when the upload returns `None`.

When `upload_csv_to_object_store` returns `None`, it signals a failure to persist the CSV. Logging this at `info` level makes it easy to miss in production. The error path on Line 153 already uses `logger.warning`, so this branch should be consistent.

Proposed fix

```diff
     else:
-        logger.info(
+        logger.warning(
             "[_upload_samples_to_object_store] Upload returned None | "
             "continuing without object store storage"
         )
```

backend/app/alembic/versions/044_add_stt_evaluation_tables.py (2)
110-115: `size_bytes` uses `sa.Integer()` — this limits file sizes to ~2.1 GB.

PostgreSQL `INTEGER` is a signed 32-bit type (max ~2,147,483,647 bytes ≈ 2 GB). While there's an application-level `MAX_FILE_SIZE_BYTES` check, the `file` table is generic and may store non-audio files in the future. Consider using `sa.BigInteger()` for forward-compatibility.

290-401: Consider adding a unique constraint on `(stt_sample_id, evaluation_run_id, provider)` in `stt_result`.

Without this, nothing prevents duplicate transcription results for the same sample, run, and provider combination. A retry or idempotency bug could insert duplicates silently.
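A rough sketch of the model-level constraint (column set from this suggestion; the constraint name is illustrative, and the migration would add a matching `op.create_unique_constraint`):

```python
from sqlalchemy import UniqueConstraint
from sqlmodel import Field as SQLField, SQLModel


class STTResult(SQLModel, table=True):
    __tablename__ = "stt_result"
    __table_args__ = (
        UniqueConstraint(
            "stt_sample_id",
            "evaluation_run_id",
            "provider",
            name="uq_stt_result_sample_run_provider",  # illustrative name
        ),
    )

    id: int | None = SQLField(default=None, primary_key=True)
    stt_sample_id: int
    evaluation_run_id: int
    provider: str
    # ... remaining columns as defined in the PR ...
```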
backend/app/models/file.py (1)
87-99: `FilePublic` extends `BaseModel` — other public models in this PR use `SQLModel`.

`EvaluationDatasetPublic` and `EvaluationRunPublic` in `evaluation.py` extend `SQLModel`, while `FilePublic` extends `pydantic.BaseModel`. Both work, but being consistent within the project makes the pattern predictable. Consider aligning with the existing convention.

Proposed fix

```diff
-from pydantic import BaseModel
 from sqlmodel import Field as SQLField
 from sqlmodel import SQLModel
 ...
-class FilePublic(BaseModel):
+class FilePublic(SQLModel):
     """Public model for file responses."""
```

backend/app/crud/file.py (1)
13-67: `create_file` commits immediately — this prevents callers from using it within a larger transaction.

The `upload_stt_dataset` workflow in `dataset.py` calls `create_stt_dataset` (which commits) and then `create_stt_samples` (which also commits). If `create_file` is invoked in upstream flows that also need atomic operations, the eager commit would break transactional guarantees. Consider accepting a `commit` flag or using `session.flush()` by default, letting the caller decide when to commit.
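One possible shape for the `commit` flag, sketched with a simplified signature (the real `create_file` arguments may differ):

```python
from sqlmodel import Session

from app.models.file import File


def create_file(*, session: Session, file: File, commit: bool = True) -> File:
    """Persist a File record; pass commit=False when composing a larger transaction."""
    session.add(file)
    if commit:
        session.commit()
        session.refresh(file)
    else:
        # Defer the commit to the caller; flush so generated IDs are available.
        session.flush()
    return file
```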
backend/app/models/evaluation.py (1)

104-109: Use the `EvaluationType` enum for the `type` field in both `EvaluationDataset` and `EvaluationRun`.

The `type` fields at lines 104-109 and 209-214 accept any string up to 20 characters, but they should only hold one of three values: `"text"`, `"stt"`, or `"tts"`. An `EvaluationType` enum already exists in `backend/app/models/stt_evaluation.py` with these exact values. Import and apply it to enforce type safety at both the model and database levels.
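A rough sketch of one way to apply it (column options and defaults here are illustrative, not the PR's actual definitions):

```python
from sqlalchemy import Column, Enum as SAEnum
from sqlmodel import Field as SQLField, SQLModel

from app.models.stt_evaluation import EvaluationType  # values: "text", "stt", "tts"


class EvaluationRun(SQLModel, table=True):
    id: int | None = SQLField(default=None, primary_key=True)
    # Illustrative: typing the column with the enum rejects values outside
    # text/stt/tts at both the model and the database level.
    type: EvaluationType = SQLField(
        default=EvaluationType.TEXT,
        sa_column=Column(SAEnum(EvaluationType), nullable=False),
    )
```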
backend/app/services/stt_evaluations/audio.py (1)

165-170: Silent exception swallowing hides debugging information.

The bare `except Exception` discards the error detail. If the S3 size lookup consistently fails for a particular storage configuration, this will be invisible in logs.

Suggested fix

```diff
     try:
         size_kb = storage.get_file_size_kb(s3_url)
         size_bytes = int(size_kb * 1024)
-    except Exception:
+    except Exception as e:
+        logger.warning(
+            f"[upload_audio_file] Could not get file size from S3 | "
+            f"s3_url: {s3_url}, error: {str(e)}"
+        )
         # If we can't get size from S3, use the upload file size
         size_bytes = file.size or 0
```

backend/app/crud/stt_evaluations/dataset.py (2)

59-70: Inconsistent timestamp usage — two separate `now()` calls vs. a shared variable.

`create_stt_samples` (Line 173) uses a single `timestamp = now()` variable for both fields, but here `now()` is called twice. Consider storing it in a local variable for consistency.

Suggested fix

```diff
+    timestamp = now()
     dataset = EvaluationDataset(
         name=name,
         description=description,
         type=EvaluationType.STT.value,
         language_id=language_id,
         object_store_url=object_store_url,
         dataset_metadata=dataset_metadata or {},
         organization_id=org_id,
         project_id=project_id,
-        inserted_at=now(),
-        updated_at=now(),
+        inserted_at=timestamp,
+        updated_at=timestamp,
     )
```

196-198: Redundant `flush()` before `commit()`.

`session.commit()` implicitly flushes, so the explicit `session.flush()` on Line 197 is unnecessary.

Suggested fix

```diff
     session.add_all(created_samples)
-    session.flush()
     session.commit()
```

backend/app/crud/stt_evaluations/result.py (2)

68-70: Redundant `flush()` before `commit()`.

Same pattern as in `dataset.py` — `commit()` implies a flush.

Suggested fix

```diff
     session.add_all(results)
-    session.flush()
     session.commit()
```

345-353: Simplify dict comprehension to `dict(rows)`.

Per static analysis (Ruff C416), the dict comprehension is unnecessary since `rows` already yields `(key, value)` tuples.

Suggested fix

```diff
-    return {status: count for status, count in rows}
+    return dict(rows)
```

backend/app/crud/stt_evaluations/batch.py (2)

113-114: Use a more specific exception type instead of bare `Exception`.

Line 114 raises a generic `Exception`. Consider using `RuntimeError` or a custom exception for clearer error handling upstream.

Suggested fix

```diff
-        raise Exception("Failed to generate signed URLs for any audio files")
+        raise RuntimeError("Failed to generate signed URLs for any audio files")
```

1-40: This module contains orchestration logic beyond CRUD — consider relocating to `services/`.

This module initializes a Gemini client, generates signed URLs, builds JSONL payloads, and submits batch jobs — all of which are business/orchestration logic. As per coding guidelines, `backend/app/crud/` should contain database access operations, while `backend/app/services/` should implement business logic. Consider moving this to `backend/app/services/stt_evaluations/batch.py`.

backend/app/api/routes/stt_evaluations/evaluation.py (1)
240-261: Consider extracting a helper to build `STTEvaluationRunPublic` from an `EvaluationRun`.

The field-by-field construction of the public model is repeated verbatim in `start_stt_evaluation` (lines 145-161) and here (lines 241-258). A small helper (or `model_validate`/`from_orm`) would eliminate this duplication and reduce the risk of the two copies drifting apart.
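For illustration, a sketch of such a helper (the name `_to_public` is hypothetical; the field list mirrors the construction quoted earlier in this review):

```python
def _to_public(run: EvaluationRun) -> STTEvaluationRunPublic:
    """Build the public response model from an EvaluationRun row."""
    return STTEvaluationRunPublic(
        id=run.id,
        run_name=run.run_name,
        dataset_name=run.dataset_name,
        type=run.type,
        language_id=run.language_id,
        providers=run.providers,
        dataset_id=run.dataset_id,
        status=run.status,
        total_items=run.total_items,
        score=run.score,
        error_message=run.error_message,
        organization_id=run.organization_id,
        project_id=run.project_id,
        inserted_at=run.inserted_at,
        updated_at=run.updated_at,
    )
```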
backend/app/tests/api/routes/test_stt_evaluation.py (2)

382-410: Pagination test doesn't assert `metadata["total"]`, and `len(data) == 2` may be fragile.

If other tests in the session leave datasets in the DB (depending on test isolation), `len(data) == 2` could fail. `test_list_stt_datasets_with_data` correctly uses `>= 2`. Also consider asserting `metadata["total"]` to verify the server-side count is consistent.

610-642: Consider adding a happy-path test for feedback update.

Both tests here cover error cases. A test that creates a result record and then successfully updates feedback would verify the core feedback flow works end-to-end.
```python
run = get_stt_run_by_id(
    session=_session,
    run_id=run.id,
    org_id=auth_context.organization_.id,
    project_id=auth_context.project_.id,
)

return APIResponse.success_response(
    data=STTEvaluationRunPublic(
        id=run.id,
        run_name=run.run_name,
        dataset_name=run.dataset_name,
        type=run.type,
        language_id=run.language_id,
        providers=run.providers,
        dataset_id=run.dataset_id,
        status=run.status,
        total_items=run.total_items,
        score=run.score,
        error_message=run.error_message,
        organization_id=run.organization_id,
        project_id=run.project_id,
        inserted_at=run.inserted_at,
        updated_at=run.updated_at,
    )
)
```
Missing None check after re-fetching the run.
get_stt_run_by_id returns EvaluationRun | None. If the run were missing (e.g., concurrent deletion), accessing run.id on line 146 would raise an AttributeError. Add a guard or use session.refresh(run) instead, which is simpler and avoids re-querying.
Proposed fix — use session.refresh instead
```diff
-    # Refresh run to get updated status
-    run = get_stt_run_by_id(
-        session=_session,
-        run_id=run.id,
-        org_id=auth_context.organization_.id,
-        project_id=auth_context.project_.id,
-    )
+    # Refresh run to get updated status
+    _session.refresh(run)
```

🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` around lines 137 - 162,
The code re-fetches the run with get_stt_run_by_id (which returns EvaluationRun
| None) then immediately dereferences run fields; add a None check after the
call and return a 404/error response if run is None, or replace the re-query
with session.refresh(run) to reload the existing ORM object instead of calling
get_stt_run_by_id; update the logic around run (used in STTEvaluationRunPublic
and APIResponse.success_response) to only proceed when run is present.
```python
from sqlmodel import Session, select

from app.core.util import now
from app.models.file import File, FileType
```
FileType is imported but unused.
FileType from app.models.file is imported on Line 8 but never referenced in any function body or signature in this file.
Proposed fix
```diff
-from app.models.file import File, FileType
+from app.models.file import File
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
from app.models.file import File
```
🤖 Prompt for AI Agents
In `@backend/app/crud/file.py` at line 8, The import of FileType from
app.models.file is unused in backend/app/crud/file.py; remove FileType from the
import statement (leave File) or, if intended, use FileType in function
signatures or type hints (e.g., in functions that create or filter files) so it
is referenced; update the import line that currently references File and
FileType accordingly and run linters to ensure no other unused imports remain.
```python
except Exception as e:
    logger.error(
        f"[start_stt_evaluation_batch] Failed to submit batch | "
        f"model: {model}, error: {str(e)}"
    )
    for result_id in sample_to_result.values():
        update_stt_result(
            session=session,
            result_id=result_id,
            status=STTResultStatus.FAILED.value,
            error_message=f"Batch submission failed: {str(e)}",
        )
    raise Exception(f"Batch submission failed: {str(e)}")
```
Re-raised exception loses the original traceback.
Wrapping in a new Exception discards the cause chain. Use raise ... from e to preserve it, or use a domain-specific exception type.
Suggested fix
```diff
-    raise Exception(f"Batch submission failed: {str(e)}")
+    raise RuntimeError(f"Batch submission failed: {str(e)}") from e
```

🤖 Prompt for AI Agents
+ raise RuntimeError(f"Batch submission failed: {str(e)}") from e🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/batch.py` around lines 144 - 156, The except
block in start_stt_evaluation_batch currently raises a new Exception which loses
the original traceback; change the final raise to preserve the original
exception (either re-raise the caught exception or use raise Exception(...) from
e) so the cause chain is kept, and keep the existing logging and
update_stt_result calls (referencing update_stt_result, STTResultStatus,
sample_to_result, and logger) intact.
```python
logger.info(
    f"[upload_stt_dataset] Uploading STT dataset | name={name} | "
    f"sample_count={len(samples)} | org_id={organization_id} | "
    f"project_id={project_id}"
)

# Step 1: Convert samples to CSV and upload to object store
object_store_url = _upload_samples_to_object_store(
    session=session,
    project_id=project_id,
    dataset_name=name,
    samples=samples,
)

# Step 2: Calculate metadata
metadata = {
    "sample_count": len(samples),
    "has_ground_truth_count": sum(1 for s in samples if s.ground_truth),
}

# Step 3: Create dataset record
dataset = create_stt_dataset(
    session=session,
    name=name,
    org_id=organization_id,
    project_id=project_id,
    description=description,
    language_id=language_id,
    object_store_url=object_store_url,
    dataset_metadata=metadata,
)

logger.info(
    f"[upload_stt_dataset] Created dataset record | "
    f"id={dataset.id} | name={name}"
)

# Step 4: Create sample records
created_samples = create_stt_samples(
    session=session,
    dataset=dataset,
    samples=samples,
)

logger.info(
    f"[upload_stt_dataset] Created sample records | "
    f"dataset_id={dataset.id} | sample_count={len(created_samples)}"
)

return dataset, created_samples
```
🧩 Analysis chain

🏁 Scripts executed (verification commands against ProjectTech4DevAI/kaapi-backend; outputs elided):

```bash
# First, verify the existence and structure of the CRUD file
fd 'dataset.py' backend/app/crud/stt_evaluations/

# Check for session.commit() calls in the CRUD code
rg -n 'session\.commit\(\)' backend/app/crud/stt_evaluations/dataset.py

# Also check the service file to see the full context and any transaction handling
cat -n backend/app/services/stt_evaluations/dataset.py

# Examine the create_stt_dataset and create_stt_samples functions in the CRUD file
cat -n backend/app/crud/stt_evaluations/dataset.py | head -100

# Get the section around line 74 and 198 to see what functions they're in
sed -n '60,90p' backend/app/crud/stt_evaluations/dataset.py
sed -n '185,210p' backend/app/crud/stt_evaluations/dataset.py

# Check if there's any transaction/session management wrapping in the service or a middleware
rg -n 'try|except|rollback|begin|transaction' backend/app/services/stt_evaluations/dataset.py

# Check the imports and session type to understand if it's auto-commit or explicit
rg -n 'from sqlmodel|from sqlalchemy' backend/app/services/stt_evaluations/dataset.py

# Get the full create_stt_samples function
sed -n '155,210p' backend/app/crud/stt_evaluations/dataset.py

# Check if there's any API-level session management or rollback handling
rg -n 'upload_stt_dataset' backend/app --type py -A 5 -B 5 | head -50

# Check if the API endpoint or caller wraps this in error handling
rg -n 'upload_stt_dataset' backend/app/api --type py -B 5 -A 10

# Also check if there's any router/endpoint level transaction management
fd 'router.py\|route.py' backend/app/api --type py | head -5

# Get the full endpoint code
cat -n backend/app/api/routes/stt_evaluations/dataset.py | head -80

# Check if there's any Depends or session management in the route
rg -n 'get_session\|SessionDep\|Depends' backend/app/api/routes/stt_evaluations/dataset.py | head -10

# Check the core dependencies to understand session management
rg -n 'SessionDep\|get_session' backend/app/api/deps.py

# Check the validate_file_ids function to see if it can throw exceptions
rg -n 'def validate_file_ids' backend/app/crud/stt_evaluations/dataset.py -A 20

# Check if the deps.py file exists to understand SessionDep
fd 'deps.py' backend/app/api/ | xargs cat -n | head -50

# Final verification: check if there's any error handling at API level that might clean up orphaned datasets
sed -n '37,80p' backend/app/api/routes/stt_evaluations/dataset.py
```
Dataset and samples are created in separate, non-transactional commits — partial failure leaves an orphaned dataset.
create_stt_dataset (Step 3) commits the dataset to the database, then create_stt_samples (Step 4) commits samples separately. If sample creation fails (e.g., validate_file_ids raises HTTPException for invalid file IDs), the dataset persists in the database without any samples. The API endpoint has no error handling to roll back or clean up the orphaned dataset.
Consider wrapping both operations in a single database transaction using session.begin_nested() with explicit rollback on failure, or implement cleanup logic that removes the dataset if sample creation fails.
🤖 Prompt for AI Agents
In `@backend/app/services/stt_evaluations/dataset.py` around lines 54 - 103,
upload_stt_dataset currently calls create_stt_dataset and then
create_stt_samples in separate commits, which can leave an orphan dataset if
sample creation (e.g., validate_file_ids) fails; wrap both dataset and sample
creation inside a single DB transaction (use session.begin_nested() or
session.begin()) so that a failure during create_stt_samples triggers a
rollback, and if you prefer cleanup instead, catch exceptions around
create_stt_samples, call session.delete(dataset) and session.commit() (or
rollback then delete) to remove the dataset; update upload_stt_dataset to
perform these changes and reference the existing functions create_stt_dataset,
create_stt_samples and the failing validator validate_file_ids to ensure all DB
writes are atomic.
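As a sketch of the cleanup variant (the helper name is hypothetical and the `samples` type is left loose; it reuses `create_stt_samples` as shown above):

```python
from sqlmodel import Session

from app.crud.stt_evaluations.dataset import create_stt_samples
from app.models.evaluation import EvaluationDataset


def _create_samples_or_cleanup(
    session: Session, dataset: EvaluationDataset, samples: list
) -> list:
    """Create samples for a freshly created dataset; delete the dataset on failure."""
    try:
        return create_stt_samples(session=session, dataset=dataset, samples=samples)
    except Exception:
        # e.g. validate_file_ids rejected a file id: drop the already-committed
        # dataset so no orphan row is left behind, then re-raise for the API layer.
        session.rollback()
        session.delete(dataset)
        session.commit()
        raise
```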
```python
metadata = {
    "sample_count": len(samples),
    "has_ground_truth_count": sum(1 for s in samples if s.ground_truth),
}
```
has_ground_truth_count uses truthiness check — empty string "" would be counted as no ground truth.
sum(1 for s in samples if s.ground_truth) treats "" as falsy. If a sample's ground_truth is explicitly set to "", it won't be counted. If the intent is to count non-None values, use s.ground_truth is not None instead.
🤖 Prompt for AI Agents
In `@backend/app/services/stt_evaluations/dataset.py` around lines 69 - 72, The
metadata calculation currently counts ground truth by truthiness which skips
empty strings; update the has_ground_truth_count computation to count samples
where s.ground_truth is not None (e.g., replace the truthiness check with an
explicit "s.ground_truth is not None") so empty string ground truths are
counted; change the generator used to compute metadata["has_ground_truth_count"]
in the dataset.py code that builds metadata for samples.
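To make the difference concrete with toy data:

```python
samples = [{"ground_truth": "hello"}, {"ground_truth": ""}, {"ground_truth": None}]

# Truthiness skips the empty string; an explicit None check counts it.
truthy_count = sum(1 for s in samples if s["ground_truth"])                 # -> 1
explicit_count = sum(1 for s in samples if s["ground_truth"] is not None)   # -> 2
```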
```python
@pytest.fixture
def test_dataset_with_samples(
    self, db: Session, user_api_key: TestAuthContext
) -> EvaluationDataset:
    """Create a test dataset with samples for evaluation."""
    dataset = create_test_stt_dataset(
        db=db,
        organization_id=user_api_key.organization_id,
        project_id=user_api_key.project_id,
        name="eval_test_dataset",
    )
    # Create some samples (file will be created automatically)
    for i in range(3):
        file = create_test_file(
            db=db,
            organization_id=user_api_key.organization_id,
            project_id=user_api_key.project_id,
            object_store_url=f"s3://bucket/audio_{i}.mp3",
            filename=f"audio_{i}.mp3",
        )
        create_test_stt_sample(
            db=db,
            dataset_id=dataset.id,
            organization_id=user_api_key.organization_id,
            project_id=user_api_key.project_id,
            file_id=file.id,
        )
    return dataset
```
Fixture test_dataset_with_samples is unused and has a stale sample_count.
This fixture is defined but never referenced by any test method. Additionally, create_test_stt_dataset initializes dataset_metadata={"sample_count": 0, ...}, so even if this fixture were used in a start_stt_evaluation test, the endpoint would reject it with "Dataset has no samples" (evaluation.py line 65).
Proposed fix — update metadata after adding samples
```diff
     for i in range(3):
         file = create_test_file(
             db=db,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
             object_store_url=f"s3://bucket/audio_{i}.mp3",
             filename=f"audio_{i}.mp3",
         )
         create_test_stt_sample(
             db=db,
             dataset_id=dataset.id,
             organization_id=user_api_key.organization_id,
             project_id=user_api_key.project_id,
             file_id=file.id,
         )
+    # Update metadata to reflect actual sample count
+    dataset.dataset_metadata = {"sample_count": 3, "has_ground_truth_count": 0}
+    db.add(dataset)
+    db.commit()
+    db.refresh(dataset)
     return dataset
```

🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_stt_evaluation.py` around lines 514 - 541,
The fixture test_dataset_with_samples creates samples but leaves the dataset
metadata (created via create_test_stt_dataset) with a stale sample_count=0,
which will make evaluations reject it; after creating the three samples with
create_test_file and create_test_stt_sample, update the EvaluationDataset
instance returned by create_test_stt_dataset by setting its
dataset_metadata["sample_count"] to the actual number of created samples (e.g.,
3) and persist that change using the provided db Session (flush/commit or the
project/dataset update helper) so the dataset reflects the correct sample_count
when used by start_stt_evaluation.
```python
def test_start_stt_evaluation_invalid_dataset(
    self,
    client: TestClient,
    user_api_key_header: dict[str, str],
) -> None:
    """Test starting an STT evaluation with invalid dataset ID."""
    response = client.post(
        "/api/v1/evaluations/stt/runs",
        json={
            "run_name": "test_run",
            "dataset_id": 99999,
            "providers": ["gemini-2.5-pro"],
        },
        headers=user_api_key_header,
    )

    assert response.status_code == 404
    response_data = response.json()
    error_str = response_data.get("detail", response_data.get("error", ""))
    assert "not found" in error_str.lower()

def test_start_stt_evaluation_without_authentication(
    self,
    client: TestClient,
) -> None:
    """Test starting an STT evaluation without authentication."""
    response = client.post(
        "/api/v1/evaluations/stt/runs",
        json={
            "run_name": "test_run",
            "dataset_id": 1,
            "providers": ["gemini-2.5-pro"],
        },
    )

    assert response.status_code == 401

def test_list_stt_runs_empty(
    self,
    client: TestClient,
    user_api_key_header: dict[str, str],
) -> None:
    """Test listing STT runs when none exist."""
    response = client.get(
        "/api/v1/evaluations/stt/runs",
        headers=user_api_key_header,
    )

    assert response.status_code == 200
    response_data = response.json()
    assert response_data["success"] is True
    assert isinstance(response_data["data"], list)

def test_get_stt_run_not_found(
    self,
    client: TestClient,
    user_api_key_header: dict[str, str],
) -> None:
    """Test getting a non-existent STT run."""
    response = client.get(
        "/api/v1/evaluations/stt/runs/99999",
        headers=user_api_key_header,
    )

    assert response.status_code == 404
```
🛠️ Refactor suggestion | 🟠 Major
No happy-path test for starting an STT evaluation run.
All tests in TestSTTEvaluationRun cover error/edge paths (invalid dataset, missing auth, empty list, not found). The core flow — creating a run against a valid dataset with samples and verifying the response — is untested. This is the primary feature of the PR and warrants at least one test (mocking start_stt_evaluation_batch to avoid external calls).
🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_stt_evaluation.py` around lines 543 - 607,
Add a happy-path test in the TestSTTEvaluationRun suite that posts to
"/api/v1/evaluations/stt/runs" with a valid dataset (create or fixture with
samples) and a run_name/providers payload, mock the start_stt_evaluation_batch
function to avoid external calls, assert a 200 response (or 201 per API) and
that response.json() contains success True and expected run metadata, and verify
the mocked start_stt_evaluation_batch was called once with the created dataset
id and the provided providers/run_name; reference the test helper names used in
this file and the start_stt_evaluation_batch symbol for locating the code to
mock and assert.
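A possible shape for such a test, as a sketch (it would sit inside `TestSTTEvaluationRun`; the patch target and the corrected `test_dataset_with_samples` fixture metadata are assumptions based on the symbols named in this review):

```python
from unittest.mock import patch


def test_start_stt_evaluation_success(
    self,
    client: TestClient,
    user_api_key_header: dict[str, str],
    test_dataset_with_samples: EvaluationDataset,
) -> None:
    """Happy path: a run is created against a valid dataset with samples."""
    # Patch target assumes the route module imports start_stt_evaluation_batch directly.
    with patch(
        "app.api.routes.stt_evaluations.evaluation.start_stt_evaluation_batch"
    ) as mock_batch:
        response = client.post(
            "/api/v1/evaluations/stt/runs",
            json={
                "run_name": "happy_path_run",
                "dataset_id": test_dataset_with_samples.id,
                "providers": ["gemini-2.5-pro"],
            },
            headers=user_api_key_header,
        )

    assert response.status_code == 200
    body = response.json()
    assert body["success"] is True
    assert body["data"]["run_name"] == "happy_path_run"
    mock_batch.assert_called_once()
```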
```python
@router.post(
    "/runs",
```
i think we can optimize on the number of API calls here; current flow:
- n /files calls for n samples
- POST /datasets call
- POST /runs call
the POST /datasets call can be consumed inside the POST /runs call; can we instead do:
- n /files calls for n samples (in the future, we can see if we can optimize this even, but fine for now)
- POST /runs call takes in a list of audio sample ids, creates a dataset entry in the db, proceeds on to the evaluation run logic
the GET endpoints for /datasets can still remain, but the POST endpoint for it seems redundant
We follow the same sequence in the existing evaluation flow as well: the dataset is created separately and the same dataset can be reused across multiple evaluations.
It may look like too many API calls when running from Swagger, but with the UI it is good to have these API calls. We can discuss more over a call.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py`:
- Around line 116-127: The except block in start_stt_evaluation currently
returns str(e) to the client, leaking internal error details; instead keep
logging the full error and saving the detailed message via
update_stt_run(session=_session, run_id=run.id, status="failed",
error_message=str(e)) but change the HTTPException detail to a generic message
(optionally include a non-sensitive identifier like run.id or an internal error
code) so the client receives no internal stack/secret data; locate the except
block that references logger.error, update_stt_run, and raise HTTPException and
replace the HTTPException detail payload accordingly.
- Around line 62-80: The run's total_items is being computed from
dataset.dataset_metadata.sample_count which can be stale; instead fetch the
actual samples first (the variable named samples used later) and compute
total_items = len(samples) * len(run_create.providers) before calling
create_stt_run (update the create_stt_run call to pass that computed value).
Ensure you still fall back to dataset.dataset_metadata.get("sample_count", 0)
only if samples is empty or the samples fetch fails, and keep using language_id
= dataset.language_id and other create_stt_run parameters unchanged.
🧹 Nitpick comments (7)
backend/app/models/stt_evaluation.py (5)
74-82: `default_factory=dict` produces `{}` but column is `nullable=True` — pick one semantic.

The Python default is `{}` (via `default_factory=dict`), so `sample_metadata` will never be `None` when created through the ORM without an explicit `None` assignment, yet the DB column allows `NULL`. This may cause confusion for downstream consumers checking `is None` vs `== {}`. Consider aligning: either use `default=None` (matching `nullable=True`) or set `nullable=False, server_default=text("'{}'")`.
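A sketch of the non-nullable option (the column type is assumed to be JSONB; the exact column options in the PR may differ):

```python
from sqlalchemy import Column, text
from sqlalchemy.dialects.postgresql import JSONB
from sqlmodel import Field as SQLField

# Sketch for STTSample.sample_metadata: keep the Python-side default and
# let the database default to an empty object instead of NULL.
sample_metadata: dict = SQLField(
    default_factory=dict,
    sa_column=Column(JSONB, nullable=False, server_default=text("'{}'")),
)
```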
136-147: `provider` and `status` on `STTResult` are bare `str` — consider enum validation or constraints.

`status` has an `STTResultStatus` enum defined but the column is typed as `str` with no DB-level check constraint. Similarly, `provider` could drift from `SUPPORTED_STT_PROVIDERS`. At minimum, adding a `CheckConstraint` on `status` ensures DB-level integrity.

273-277: `STTFeedbackUpdate` allows an empty payload (both fields `None`) — this is a no-op.

If both `is_correct` and `comment` are `None`, the update request does nothing meaningful. Consider a model-level validator to require at least one field.

Example validator

```diff
 class STTFeedbackUpdate(BaseModel):
     """Request model for updating human feedback on a result."""

     is_correct: bool | None = Field(None, description="Is the transcription correct?")
     comment: str | None = Field(None, description="Feedback comment")
+
+    @model_validator(mode="after")
+    def check_at_least_one_field(self) -> "STTFeedbackUpdate":
+        if self.is_correct is None and self.comment is None:
+            raise ValueError("At least one of 'is_correct' or 'comment' must be provided")
+        return self
```

108-112: `updated_at` uses `default_factory=now` which only fires on INSERT — but CRUD updates explicitly set it, making this a best-practice refactoring suggestion.

Both `STTSample.updated_at` and `STTResult.updated_at` use `default_factory=now`, which only executes on INSERT. However, inspection of the CRUD layer shows that all update operations explicitly set `updated_at = now()` (e.g., `backend/app/crud/stt_evaluations/run.py:222`, `backend/app/crud/stt_evaluations/result.py:238`). While this explicit management works, consider adding SQLAlchemy's `onupdate` to the column definition for additional safety against future updates that might miss manual assignment.

Example using sa_column with onupdate

```diff
-    updated_at: datetime = SQLField(
-        default_factory=now,
-        nullable=False,
-        sa_column_kwargs={"comment": "Timestamp when the sample was last updated"},
-    )
+    updated_at: datetime = SQLField(
+        default_factory=now,
+        nullable=False,
+        sa_column_kwargs={
+            "comment": "Timestamp when the sample was last updated",
+            "onupdate": now,
+        },
+    )
```

Also applies to: 216-219
232-265: Consider adding `model_config = ConfigDict(from_attributes=True)` to `STTSamplePublic` and `STTResultPublic` for idiomatic Pydantic usage.

These models could benefit from Pydantic v2's `from_attributes=True` configuration. While the current explicit keyword argument construction (e.g., `STTSamplePublic(id=sample.id, file_id=sample.file_id, ...)`) works correctly, adopting `from_attributes=True` would enable the idiomatic pattern `model_validate(orm_instance)` and reduce duplication across construction sites.

backend/app/api/routes/stt_evaluations/evaluation.py (2)

137-155: Verbose manual field-by-field model construction — use `model_validate` or `from_orm`.

Both `STTEvaluationRunPublic` and `STTEvaluationRunWithResults` are constructed by manually mapping every field from the ORM object. This is error-prone (easy to miss a field when the model evolves) and verbose. If you add `model_config = ConfigDict(from_attributes=True)` to the Pydantic models, you can replace this with `STTEvaluationRunPublic.model_validate(run)`. This would also simplify `STTEvaluationRunWithResults` construction.

Also applies to: 233-253

190-206: Response model mismatch when `include_results=False`.

The endpoint declares `response_model=APIResponse[STTEvaluationRunWithResults]`, but when `include_results=False`, `results` will be an empty list and `results_total` will be 0. While this technically validates, it's semantically misleading — the response schema always advertises results. Consider using a union type or separate endpoint, or at minimum document this behavior clearly.
```python
sample_count = (dataset.dataset_metadata or {}).get("sample_count", 0)

if sample_count == 0:
    raise HTTPException(status_code=400, detail="Dataset has no samples")

# Use language_id from the dataset
language_id = dataset.language_id

# Create run record
run = create_stt_run(
    session=_session,
    run_name=run_create.run_name,
    dataset_id=run_create.dataset_id,
    dataset_name=dataset.name,
    org_id=auth_context.organization_.id,
    project_id=auth_context.project_.id,
    providers=run_create.providers,
    language_id=language_id,
    total_items=sample_count * len(run_create.providers),
```
total_items derived from metadata may diverge from actual sample count.
sample_count is read from dataset.dataset_metadata (line 62), but the actual samples are fetched separately on line 84. If metadata becomes stale (e.g., samples added/removed without metadata update), total_items stored on the run will be incorrect. Consider computing total_items from len(samples) after fetching them.
Proposed fix
```diff
+    # Get samples for the dataset
+    samples = get_samples_by_dataset_id(
+        session=_session,
+        dataset_id=run_create.dataset_id,
+        org_id=auth_context.organization_.id,
+        project_id=auth_context.project_.id,
+    )
+
+    if not samples:
+        raise HTTPException(status_code=400, detail="Dataset has no samples")
+
     # Create run record
     run = create_stt_run(
         session=_session,
         run_name=run_create.run_name,
         dataset_id=run_create.dataset_id,
         dataset_name=dataset.name,
         org_id=auth_context.organization_.id,
         project_id=auth_context.project_.id,
         providers=run_create.providers,
         language_id=language_id,
-        total_items=sample_count * len(run_create.providers),
+        total_items=len(samples) * len(run_create.providers),
     )
-
-    # Get samples for the dataset
-    samples = get_samples_by_dataset_id(
-        session=_session,
-        dataset_id=run_create.dataset_id,
-        org_id=auth_context.organization_.id,
-        project_id=auth_context.project_.id,
-    )
```

🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` around lines 62 - 80,
The run's total_items is being computed from
dataset.dataset_metadata.sample_count which can be stale; instead fetch the
actual samples first (the variable named samples used later) and compute
total_items = len(samples) * len(run_create.providers) before calling
create_stt_run (update the create_stt_run call to pass that computed value).
Ensure you still fall back to dataset.dataset_metadata.get("sample_count", 0)
only if samples is empty or the samples fetch fails, and keep using language_id
= dataset.language_id and other create_stt_run parameters unchanged.
```python
except Exception as e:
    logger.error(
        f"[start_stt_evaluation] Batch submission failed | "
        f"run_id: {run.id}, error: {str(e)}"
    )
    update_stt_run(
        session=_session,
        run_id=run.id,
        status="failed",
        error_message=str(e),
    )
    raise HTTPException(status_code=500, detail=f"Batch submission failed: {e}")
```
Internal error details leaked to the client.
str(e) is included in the HTTP 500 response detail, which may expose internal implementation details (stack traces, service URLs, credentials in connection strings, etc.) to API consumers.
Proposed fix
```diff
-    raise HTTPException(status_code=500, detail=f"Batch submission failed: {e}")
+    raise HTTPException(status_code=500, detail="Batch submission failed")
```

🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` around lines 116 - 127,
The except block in start_stt_evaluation currently returns str(e) to the client,
leaking internal error details; instead keep logging the full error and saving
the detailed message via update_stt_run(session=_session, run_id=run.id,
status="failed", error_message=str(e)) but change the HTTPException detail to a
generic message (optionally include a non-sensitive identifier like run.id or an
internal error code) so the client receives no internal stack/secret data;
locate the except block that references logger.error, update_stt_run, and raise
HTTPException and replace the HTTPException detail payload accordingly.
```python
        )
        return cls(api_key=api_key)

    def validate_connection(self) -> bool:
```
I think once the client instantiation is successful this check is rendered redundant unless there is a wifi outage.
Actionable comments posted: 6
🤖 Fix all issues with AI agents
In `@backend/app/api/docs/stt_evaluation/start_evaluation.md`:
- Around line 1-8: The docs refer to "providers" but the API expects "models";
update the wording in this doc so it matches the STTEvaluationRunCreate contract
by replacing mentions of "providers" with "models" and noting the field name
`models: list[str]` where examples or supported options are listed (e.g., change
"**Supported providers:** gemini-2.5-pro" to "**Supported models:**
gemini-2.5-pro") and ensure any descriptive text referencing "providers" (lines
describing processing through providers) instead mentions processing through the
specified `models`.
In `@backend/app/crud/stt_evaluations/batch.py`:
- Around line 133-198: The code only saves first_batch_job_id
(first_batch_job_id) to the EvaluationRun so only the first model's batch job is
discoverable by poll_all_pending_stt_evaluations (which uses
EvaluationRun.batch_job_id); fix by persisting all created batch job IDs instead
of only the first: collect batch_job.id values from start_batch_job into
batch_jobs (or a list batch_job_ids) and call update_stt_run with that
collection (e.g., update_stt_run(session, run_id=run.id, status="processing",
batch_job_id=None, metadata={"batch_job_ids": batch_job_ids}) or add a new
parameter/field to store batch_job_ids); alternatively, if multi-model runs are
not supported, add an explicit assertion or comment near the loop and avoid
creating multiple jobs — reference functions/vars: start_batch_job, batch_jobs,
first_batch_job_id, update_stt_run, and poll_all_pending_stt_evaluations when
making the change.
In `@backend/app/crud/stt_evaluations/cron.py`:
- Around line 282-284: Remove the unused boolean any_succeeded (set at
any_succeeded = False and later modified) or else use it in the final-status
calculation; specifically, edit the block that defines/updates all_terminal,
any_failed, and any_succeeded (search for any_succeeded and any_failed in
cron.py) and either delete all references and assignments to any_succeeded, or
integrate it into the terminal-state logic (e.g., use any_succeeded to
distinguish "all failed" vs "partially succeeded" when computing the overall
status) and remove any linter warnings. Ensure no other code relies on
any_succeeded before removing.
- Around line 435-452: The loop currently skips when result_record is None which
hides failures; update the branch handling to log a warning (including
identifiers like the batch_result payload or an id/metadata if available) and
increment failed_count so missing DB rows are visible; specifically, in the
block where you check result_record (the variable result_record inside the loop
that later calls update_stt_result), add a processLogger.warning (or existing
logger) call describing the missing result_record and the batch_result context,
and increment failed_count (not processed_count) so metrics reflect the skipped
result instead of silently continuing.
- Around line 346-363: The run finalization currently leaves runs stuck in
"processing" when all batches are terminal but there remain pending results;
modify the finalization logic in the block using count_results_by_status,
STTResultStatus, any_failed, errors and update_stt_run so that if all_terminal
is True and pending > 0 you treat the run as terminal (set status to "completed"
or "failed") and populate error_message accordingly (e.g., include failed_count
and a message about unresolved pending results or list errors); ensure
update_stt_run is called with the terminal status and a clear error_message so
subsequent cron cycles won't reprocess the same run.
In `@backend/app/models/collection.py`:
- Around line 42-48: The provider Field's description mentions 'google' but the
ProviderType enum (ProviderType) does not include a GOOGLE/GEMINI member; fix by
either adding a GOOGLE (or GEMINI) enum value to ProviderType (ensure downstream
serialization/name matches existing enum style) or update the provider Field
description in collection.py to only list supported providers (e.g., 'openai')
so OpenAPI/docs match the actual ProviderType; modify the ProviderType
definition or the description string in the provider Field accordingly and run
tests/type checks to confirm no regressions.
🧹 Nitpick comments (15)
backend/app/core/providers.py (1)
1-4: Consider modernizing type hints to Python 3.11+ builtins.Pre-existing code uses
Dict,List,Optionalfromtyping(lines 2, 22, 26, 62, 86). Since the coding guidelines require Python 3.11+, these can be replaced with the built-indict,list, andX | Nonesyntax. Not blocking, but worth a follow-up cleanup. As per coding guidelines: "Use Python 3.11+ with type hints throughout the codebase."backend/app/models/stt_evaluation.py (4)
15-17: Remove commented-out code with duplicate entries.The commented-out list on line 16 contains
"gemini-2.5-pro"twice (likely intended to be"gemini-2.5-flash"). Remove the dead comment to avoid confusion.Proposed fix
-# Supported STT models for evaluation -# SUPPORTED_STT_MODELS = ["gemini-2.5-pro", "gemini-2.5-pro", "gemini-2.0-flash"] -SUPPORTED_STT_MODELS = ["gemini-2.5-pro"] +# Supported STT models for evaluation +SUPPORTED_STT_MODELS = ["gemini-2.5-pro"]
327-341: Confusing variable names invalidate_models.The parameter
valid_modelreads as a singular validated model, but it's actually the full list being validated. The loop variablemodelsshadows the field name conceptually and reads as plural when it represents a single model string.Proposed fix
`@field_validator`("models") `@classmethod` - def validate_models(cls, valid_model: list[str]) -> list[str]: + def validate_models(cls, models: list[str]) -> list[str]: """Validate that all models are supported.""" - if not valid_model: + if not models: raise ValueError("At least one model must be specified") unsupported = [ - models for models in valid_model if models not in SUPPORTED_STT_MODELS + m for m in models if m not in SUPPORTED_STT_MODELS ] if unsupported: raise ValueError( f"Unsupported model(s): {', '.join(unsupported)}. " f"Supported models are: {', '.join(SUPPORTED_STT_MODELS)}" ) - return valid_model + return models
36-113: Consider adding database indexes on frequently queried foreign keys.
STTSamplewill be queried bydataset_id(listing samples for a dataset) andSTTResultbyevaluation_run_idandstt_sample_id(listing results for a run, looking up results for a sample). Without indexes on these columns, those queries will degrade as data grows.Example: adding table-level indexes
from sqlalchemy import Index class STTSample(SQLModel, table=True): __tablename__ = "stt_sample" __table_args__ = ( Index("idx_stt_sample_dataset", "dataset_id"), ) # ... fields ... class STTResult(SQLModel, table=True): __tablename__ = "stt_result" __table_args__ = ( Index("idx_stt_result_run", "evaluation_run_id"), Index("idx_stt_result_sample", "stt_sample_id"), ) # ... fields ...Also applies to: 116-221
274-278: No-op feedback update is silently accepted.Both
is_correctandcommentdefault toNone. A request body{}passes validation but performs a no-op update. Consider adding a model-level validator to require at least one field.Proposed fix
class STTFeedbackUpdate(BaseModel): """Request model for updating human feedback on a result.""" is_correct: bool | None = Field(None, description="Is the transcription correct?") comment: str | None = Field(None, description="Feedback comment") + + `@model_validator`(mode="after") + def check_at_least_one_field(self) -> "STTFeedbackUpdate": + if self.is_correct is None and self.comment is None: + raise ValueError("At least one of 'is_correct' or 'comment' must be provided") + return selfbackend/app/crud/evaluations/processing.py (1)
669-674: Use theEvaluationTypeenum for consistency.The STT polling in
cron.pyusesEvaluationType.STT.value, but here the type filter uses a raw string"text". Using the enum constant is safer against typos and keeps the codebase consistent.Proposed fix
+from app.models.stt_evaluation import EvaluationType + # Single query to fetch all processing text evaluation runs # STT/TTS evaluations have their own polling statement = select(EvaluationRun).where( EvaluationRun.status == "processing", - EvaluationRun.type == "text", + EvaluationRun.type == EvaluationType.TEXT.value, )backend/app/crud/stt_evaluations/batch.py (1)
123-124: UseRuntimeError(or a domain exception) instead of bareException.Raising bare
Exceptionmakes it harder for callers to catch specific errors. This also applies to line 190.Proposed fix
if not signed_urls: - raise Exception("Failed to generate signed URLs for any audio files") + raise RuntimeError("Failed to generate signed URLs for any audio files") ... if not batch_jobs: - raise Exception("Batch submission failed for all models") + raise RuntimeError("Batch submission failed for all models")Also applies to: 189-190
backend/app/tests/api/routes/test_stt_evaluation.py (1)
14-99: Consider moving factory helpers to a shared conftest or utility module.

The `create_test_file`, `create_test_stt_dataset`, and `create_test_stt_sample` helpers are well-structured factory functions. Per the project's testing guidelines, these could be extracted to a shared test utilities module (e.g., `backend/app/tests/utils/stt_factories.py`) so they can be reused by other test files that may need STT test data. As per coding guidelines, "Use factory pattern for test fixtures in `backend/app/tests/`".
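As one possible shape for that shared module, here is a minimal sketch; the module path, model import, and field defaults are assumptions for illustration, not code from this PR:

```python
# Hypothetical shared factory module: backend/app/tests/utils/stt_factories.py
from sqlmodel import Session


def create_test_stt_dataset(session: Session, **overrides):
    """Create and persist a minimal STT dataset row reusable across test modules."""
    from app.models import STTDataset  # placeholder import path/model name

    defaults = {"name": "test-dataset", "description": "fixture dataset"}
    defaults.update(overrides)
    dataset = STTDataset(**defaults)
    session.add(dataset)
    session.commit()
    session.refresh(dataset)
    return dataset
```

Individual test files would then import the helper (`from app.tests.utils.stt_factories import create_test_stt_dataset`) instead of redefining it.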
backend/app/core/storage_utils.py (1)

139-206: `upload_jsonl_to_object_store` duplicates the upload logic instead of delegating to `upload_to_object_store`.

The new `upload_to_object_store` was introduced as the generic upload function, and `upload_csv_to_object_store` correctly delegates to it. However, `upload_jsonl_to_object_store` still manually creates `UploadFile`, calls `storage.put`, and duplicates the same error handling. This defeats the purpose of the generic function.

Proposed fix
```diff
 def upload_jsonl_to_object_store(
     storage: CloudStorage,
     results: list[dict],
     filename: str,
     subdirectory: str,
     format: Literal["json", "jsonl"] = "jsonl",
 ) -> str | None:
     ...
     try:
-        file_path = Path(subdirectory) / filename
-
         if format == "jsonl":
             jsonl_content = (
                 "\n".join(json.dumps(result, ensure_ascii=False) for result in results)
                 + "\n"
             )
-            content_type = {"content-type": "application/jsonl"}
+            ct = "application/jsonl"
         else:
             jsonl_content = json.dumps(results, ensure_ascii=False)
-            content_type = {"content-type": "application/json"}
+            ct = "application/json"

         content_bytes = jsonl_content.encode("utf-8")
-        headers = Headers(content_type)
-        upload_file = UploadFile(
-            filename=filename,
-            file=BytesIO(content_bytes),
-            headers=headers,
+        return upload_to_object_store(
+            storage=storage,
+            content=content_bytes,
+            filename=filename,
+            subdirectory=subdirectory,
+            content_type=ct,
         )
-
-        destination = storage.put(source=upload_file, file_path=file_path)
-        object_store_url = str(destination)
-        ...
     except CloudStorageError as e:
         ...
```

backend/app/tests/core/test_storage_utils.py (2)
359-369: Uniqueness test doesn't assert uniqueness.

The test is named `test_unique_filenames` but only asserts prefix format — it never compares `filename1 != filename2`. Combined with the comment acknowledging same-second collisions, this test adds no value beyond `test_timestamp_format`.

Either assert inequality or remove the test to avoid misleading coverage.
Proposed fix
```diff
 def test_unique_filenames(self):
     """Test that consecutive calls produce different filenames."""
     import time

     filename1 = generate_timestamped_filename("test")
-    time.sleep(0.01)  # Small delay to ensure different timestamp
+    time.sleep(1.1)  # Ensure different second-level timestamp
     filename2 = generate_timestamped_filename("test")

-    # They may be the same if called in the same second
-    # but the format should be correct
-    assert filename1.startswith("test_")
-    assert filename2.startswith("test_")
+    assert filename1 != filename2, "Consecutive filenames should differ"
```
24-27: Add return type hints to test methods for consistency.

Several test methods lack `-> None` return type annotations (e.g., `test_mp3_url`, `test_wav_url`, `test_successful_upload`, etc.), while others like `test_upload_json_file_success` have them. As per coding guidelines, all functions should have type hints on return values.

Also applies to: 29-32, 97-109, 166-176, 197-219, 327-331, 333-337, 353-357
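For illustration, the change on one such method is just the annotation; the assertion body here is a placeholder, not the real test:

```python
# Before
def test_mp3_url(self):
    assert ".mp3" in "https://example.com/sample.mp3"  # placeholder body

# After: explicit return annotation, matching the already-annotated tests
def test_mp3_url(self) -> None:
    assert ".mp3" in "https://example.com/sample.mp3"  # placeholder body
```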
backend/app/crud/stt_evaluations/run.py (1)
218-220: Cannot explicitly clear a field to `None` (e.g., `error_message`).

The `if value is not None` guard means callers can never reset a field — for instance, clearing `error_message` on a retry. If intentional, document it; otherwise, consider a sentinel or explicit include-list.

Example using a sentinel
```diff
+_UNSET = object()
+
 def update_stt_run(
     *,
     session: Session,
     run_id: int,
-    status: str | None = None,
-    score: dict[str, Any] | None = None,
-    error_message: str | None = None,
-    object_store_url: str | None = None,
-    batch_job_id: int | None = None,
+    status: str | None = _UNSET,
+    score: dict[str, Any] | None = _UNSET,
+    error_message: str | None = _UNSET,
+    object_store_url: str | None = _UNSET,
+    batch_job_id: int | None = _UNSET,
 ) -> EvaluationRun | None:
     ...
     for field, value in updates.items():
-        if value is not None:
+        if value is not _UNSET:
             setattr(run, field, value)
```

backend/app/crud/stt_evaluations/result.py (3)
22-77: `session.flush()` before `session.commit()` is redundant.

`commit()` already calls `flush()` internally. The extra `flush()` on line 69 adds no value.

Proposed fix
```diff
 session.add_all(results)
-session.flush()
 session.commit()
```
247-299: CRUD layer raises `HTTPException` — mixing concerns.

`update_human_feedback` raises `HTTPException` (line 280), coupling the data-access layer to HTTP semantics. If this function is called from a non-HTTP context (e.g., a script, background job), the HTTP exception is inappropriate.

Return `None` and let the API route handler raise the 404.

Proposed fix
```diff
-from app.core.exception_handlers import HTTPException
 ...
 def update_human_feedback(
     ...
 ) -> STTResult | None:
     ...
     result = get_stt_result_by_id(...)
     if not result:
-        raise HTTPException(status_code=404, detail="Result not found")
+        return None
     ...
```

Then in the API route, check for `None` and raise `HTTPException(status_code=404, ...)` there.
350-358: Simplify dict comprehension per Ruff C416.

Proposed fix
```diff
-    return {status: count for status, count in rows}
+    return dict(rows)
```
```markdown
Start an STT evaluation run on a dataset.

The evaluation will:
1. Process each audio sample through the specified providers
2. Generate transcriptions using Gemini Batch API
3. Store results for human review

**Supported providers:** gemini-2.5-pro
```
Documentation uses "providers" but the API model uses "models".
STTEvaluationRunCreate defines the field as models: list[str], but this doc references "providers" (lines 4 and 8). Update the terminology to match the API contract to avoid confusing consumers.
Proposed fix
```diff
 Start an STT evaluation run on a dataset.

 The evaluation will:
-1. Process each audio sample through the specified providers
+1. Process each audio sample through the specified models
 2. Generate transcriptions using Gemini Batch API
 3. Store results for human review

-**Supported providers:** gemini-2.5-pro
+**Supported models:** gemini-2.5-pro
```
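For consumers, the request body would then carry a `models` list; a rough client-side sketch (the endpoint path and the `dataset_id` field are illustrative assumptions, not confirmed by this PR):

```python
import httpx

payload = {
    "dataset_id": 42,                # illustrative field
    "models": ["gemini-2.5-pro"],    # matches STTEvaluationRunCreate.models
}
response = httpx.post("http://localhost:8000/api/v1/stt/runs", json=payload)
print(response.status_code, response.json())
```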
🤖 Prompt for AI Agents
In `@backend/app/api/docs/stt_evaluation/start_evaluation.md` around lines 1 - 8,
The docs refer to "providers" but the API expects "models"; update the wording
in this doc so it matches the STTEvaluationRunCreate contract by replacing
mentions of "providers" with "models" and noting the field name `models:
list[str]` where examples or supported options are listed (e.g., change
"**Supported providers:** gemini-2.5-pro" to "**Supported models:**
gemini-2.5-pro") and ensure any descriptive text referencing "providers" (lines
describing processing through providers) instead mentions processing through the
specified `models`.
```python
# Submit one batch job per model
batch_jobs: dict[str, Any] = {}
first_batch_job_id: int | None = None

for model in models:
    model_path = f"models/{model}"
    batch_provider = GeminiBatchProvider(
        client=gemini_client.client, model=model_path
    )

    try:
        batch_job = start_batch_job(
            session=session,
            provider=batch_provider,
            provider_name="google",
            job_type="stt_evaluation",
            organization_id=org_id,
            project_id=project_id,
            jsonl_data=jsonl_data,
            config={
                "model": model,
                "stt_provider": model,
                "evaluation_run_id": run.id,
            },
        )

        batch_jobs[model] = {
            "batch_job_id": batch_job.id,
            "provider_batch_id": batch_job.provider_batch_id,
        }

        if first_batch_job_id is None:
            first_batch_job_id = batch_job.id

        logger.info(
            f"[start_stt_evaluation_batch] Batch job created | "
            f"run_id: {run.id}, model: {model}, "
            f"batch_job_id: {batch_job.id}"
        )

    except Exception as e:
        logger.error(
            f"[start_stt_evaluation_batch] Failed to submit batch | "
            f"model: {model}, error: {str(e)}"
        )
        pending = get_pending_results_for_run(
            session=session, run_id=run.id, provider=model
        )
        for result in pending:
            update_stt_result(
                session=session,
                result_id=result.id,
                status=STTResultStatus.FAILED.value,
                error_message=f"Batch submission failed for {model}: {str(e)}",
            )

if not batch_jobs:
    raise Exception("Batch submission failed for all models")

# Link first batch job to the evaluation run (for pending run detection)
update_stt_run(
    session=session,
    run_id=run.id,
    status="processing",
    batch_job_id=first_batch_job_id,
)
```
Only the first batch job is linked to the evaluation run — multi-model runs will lose track of subsequent jobs.
first_batch_job_id (line 164-165) records only the first model's batch job. The cron poller in poll_all_pending_stt_evaluations discovers runs via EvaluationRun.batch_job_id, so batch jobs for any additional models won't be polled or processed. While currently only one model is supported, this code path explicitly loops over multiple models and would silently drop results for all but the first.
Consider either: (a) creating one EvaluationRun per model, (b) storing all batch job IDs (e.g., in the run's metadata/config), or (c) adding a comment/assertion that multi-model is not yet fully supported.
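As a rough sketch of option (b), the run could keep every per-model batch job ID rather than only the first; the `batch_job_ids` parameter shown here does not exist in the PR and is purely illustrative:

```python
# Collect every submitted batch job, not just the first one
batch_job_ids = [info["batch_job_id"] for info in batch_jobs.values()]

update_stt_run(
    session=session,
    run_id=run.id,
    status="processing",
    batch_job_id=first_batch_job_id,   # kept for the current pending-run lookup
    # Hypothetical extra field (e.g., a JSON column); the cron poller would then
    # iterate over all stored IDs instead of relying on EvaluationRun.batch_job_id alone.
    batch_job_ids=batch_job_ids,
)
```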
🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/batch.py` around lines 133 - 198, The code
only saves first_batch_job_id (first_batch_job_id) to the EvaluationRun so only
the first model's batch job is discoverable by poll_all_pending_stt_evaluations
(which uses EvaluationRun.batch_job_id); fix by persisting all created batch job
IDs instead of only the first: collect batch_job.id values from start_batch_job
into batch_jobs (or a list batch_job_ids) and call update_stt_run with that
collection (e.g., update_stt_run(session, run_id=run.id, status="processing",
batch_job_id=None, metadata={"batch_job_ids": batch_job_ids}) or add a new
parameter/field to store batch_job_ids); alternatively, if multi-model runs are
not supported, add an explicit assertion or comment near the loop and avoid
creating multiple jobs — reference functions/vars: start_batch_job, batch_jobs,
first_batch_job_id, update_stt_run, and poll_all_pending_stt_evaluations when
making the change.
```python
all_terminal = True
any_succeeded = False
any_failed = False
```
any_succeeded is assigned but never used.
The variable is set on lines 283 and 329 but never read. Static analysis (Ruff F841) confirms this. Either remove it or incorporate it into the final status logic if the intent was to distinguish "all failed" from "partially succeeded."
Proposed fix — remove unused variable
```diff
     all_terminal = True
-    any_succeeded = False
     any_failed = False
     errors: list[str] = []

     for batch_job in batch_jobs:
         provider_name = batch_job.config.get("stt_provider", "unknown")

         # Skip batch jobs already in terminal state that have been processed
         if batch_job.provider_status in TERMINAL_STATES:
             if batch_job.provider_status == BatchJobState.SUCCEEDED.value:
-                any_succeeded = True
+                pass
             else:
                 any_failed = True
     ...
             if provider_status == BatchJobState.SUCCEEDED.value:
                 await process_completed_stt_batch(
                     session=session,
                     run=run,
                     batch_job=batch_job,
                     batch_provider=batch_provider,
                 )
-                any_succeeded = True
+            else:
```

Also applies to: 329-329
🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/cron.py` around lines 282 - 284, Remove the
unused boolean any_succeeded (set at any_succeeded = False and later modified)
or else use it in the final-status calculation; specifically, edit the block
that defines/updates all_terminal, any_failed, and any_succeeded (search for
any_succeeded and any_failed in cron.py) and either delete all references and
assignments to any_succeeded, or integrate it into the terminal-state logic
(e.g., use any_succeeded to distinguish "all failed" vs "partially succeeded"
when computing the overall status) and remove any linter warnings. Ensure no
other code relies on any_succeeded before removing.
```python
# All batch jobs are done - finalize the run
status_counts = count_results_by_status(session=session, run_id=run.id)
pending = status_counts.get(STTResultStatus.PENDING.value, 0)
failed_count = status_counts.get(STTResultStatus.FAILED.value, 0)

final_status = "completed" if pending == 0 else "processing"
error_message = None
if any_failed:
    error_message = "; ".join(errors)
elif failed_count > 0:
    error_message = f"{failed_count} transcription(s) failed"

update_stt_run(
    session=session,
    run_id=run.id,
    status=final_status,
    error_message=error_message,
)
```
Run stuck in "processing" forever when pending results remain after all batches are terminal.
When all_terminal is True but pending > 0 (e.g., unmatched custom_ids or missing result records), final_status is set to "processing" (line 351). On the next cron cycle, all batch jobs are already terminal and skipped (line 291-299), so the same finalization logic re-runs and sets "processing" again — creating an infinite loop.
If all batches have completed but results are still pending, those results will never be resolved. Mark the run as "completed" (or "failed") with an appropriate error message instead.
Proposed fix
```diff
-    final_status = "completed" if pending == 0 else "processing"
+    if pending > 0:
+        # All batches are terminal but some results weren't matched/processed
+        final_status = "completed"
+        if error_message:
+            error_message += f"; {pending} result(s) still unresolved"
+        else:
+            error_message = f"{pending} result(s) still unresolved after all batches completed"
+    else:
+        final_status = "completed"
```

🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/cron.py` around lines 346 - 363, The run
finalization currently leaves runs stuck in "processing" when all batches are
terminal but there remain pending results; modify the finalization logic in the
block using count_results_by_status, STTResultStatus, any_failed, errors and
update_stt_run so that if all_terminal is True and pending > 0 you treat the run
as terminal (set status to "completed" or "failed") and populate error_message
accordingly (e.g., include failed_count and a message about unresolved pending
results or list errors); ensure update_stt_run is called with the terminal
status and a clear error_message so subsequent cron cycles won't reprocess the
same run.
```python
if result_record:
    if batch_result.get("response"):
        text = batch_result["response"].get("text", "")
        update_stt_result(
            session=session,
            result_id=result_record.id,
            transcription=text,
            status=STTResultStatus.COMPLETED.value,
        )
        processed_count += 1
    else:
        update_stt_result(
            session=session,
            result_id=result_record.id,
            status=STTResultStatus.FAILED.value,
            error_message=batch_result.get("error", "Unknown error"),
        )
        failed_count += 1
```
Silently skips results when result_record is not found — no logging or error counting.
If result_record is None (line 433), the loop continues implicitly without any warning. This makes it impossible to diagnose why results are stuck in PENDING after batch completion. At minimum, log a warning and increment failed_count.
Proposed fix
```diff
 result_record = session.exec(stmt).one_or_none()
 if result_record:
     if batch_result.get("response"):
         ...
     else:
         ...
+else:
+    logger.warning(
+        f"[process_completed_stt_batch] No result record found | "
+        f"run_id={run.id}, sample_id={sample_id}, provider={stt_provider}"
+    )
+    failed_count += 1
```
🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/cron.py` around lines 435 - 452, The loop
currently skips when result_record is None which hides failures; update the
branch handling to log a warning (including identifiers like the batch_result
payload or an id/metadata if available) and increment failed_count so missing DB
rows are visible; specifically, in the block where you check result_record (the
variable result_record inside the loop that later calls update_stt_result), add
a processLogger.warning (or existing logger) call describing the missing
result_record and the batch_result context, and increment failed_count (not
processed_count) so metrics reflect the skipped result instead of silently
continuing.
```diff
 provider: ProviderType = (
     Field(
         nullable=False,
-        description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
+        description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'google', etc)",
         sa_column_kwargs={"comment": "LLM provider used for this collection"},
     ),
 )
```
Description references 'google' but ProviderType enum doesn't include it.
The description on line 45 lists 'google' as an example provider, but the ProviderType enum (lines 15–20) only defines openai—GEMINI is commented out and there's no GOOGLE member. This is misleading for API consumers who see the field description (e.g., in OpenAPI docs). Either add GOOGLE to ProviderType or keep the description aligned with what's actually accepted.
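If the enum route is taken, the change is a one-line addition; this sketch assumes `ProviderType` is a `str`-backed `Enum` and that the existing member is `OPENAI` (names not verified against the PR):

```python
from enum import Enum


class ProviderType(str, Enum):
    OPENAI = "openai"
    # Adding this keeps the field description's 'google' example accurate;
    # otherwise, drop 'google' from the description instead.
    GOOGLE = "google"
```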
🤖 Prompt for AI Agents
In `@backend/app/models/collection.py` around lines 42 - 48, The provider Field's
description mentions 'google' but the ProviderType enum (ProviderType) does not
include a GOOGLE/GEMINI member; fix by either adding a GOOGLE (or GEMINI) enum
value to ProviderType (ensure downstream serialization/name matches existing
enum style) or update the provider Field description in collection.py to only
list supported providers (e.g., 'openai') so OpenAPI/docs match the actual
ProviderType; modify the ProviderType definition or the description string in
the provider Field accordingly and run tests/type checks to confirm no
regressions.
Codecov Report

❌ Patch coverage is

📢 Thoughts on this report? Let us know!
Summary
Target issue is #533
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.
New Features
Functional Requirements Testing
Summary by CodeRabbit
New Features
Documentation
Tests