Job status monitoring for chuck data #36
Conversation
Force-pushed from 7d0e9d3 to afe5103.
Force-pushed from bc5316d to cdc8dc8.
```diff
  )
- if response.status_code == 200 or response.status_code == 201:
+ if response.status_code in (200, 201, 204):
```
This is an old bug. The chuck-api endpoint actually returns a 204 response, hence the change. Maybe someone changed the response code and forgot to update this. Kept the old status codes just to be safe.
```python
payload_str = json.dumps(sanitized_payload)
logging.debug(f"Sending metric: {payload_str[:100]}...")

return self._client.submit_metrics(payload, token)
```
This is an old existing bug in chuck-data: this request always failed because there was some unserialisable data in the JSON. Sanitised it using pydantic and it works now.
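For context, a minimal sketch of what such pydantic-based sanitisation might look like; the helper name `sanitize_payload` and the use of `TypeAdapter` are assumptions, not the actual chuck-data implementation:

```python
from typing import Any

from pydantic import TypeAdapter

# Serialiser for arbitrary data; in "json" mode it coerces datetimes,
# Decimals, sets, etc. into JSON-compatible Python types.
_ANY_ADAPTER = TypeAdapter(Any)


def sanitize_payload(payload: dict) -> dict:
    # Hypothetical helper: returns a version of the payload that
    # json.dumps() can always serialise, avoiding the
    # "unserialisable data" failure described above.
    return _ANY_ADAPTER.dump_python(payload, mode="json")
```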
Force-pushed from 3516cd2 to 11b2322.
chuck_data/commands/job_status.py (outdated)
```python
if (
    fetch_live
    and databricks_run_id
    and databricks_run_id != "UNSET_DATABRICKS_RUN_ID"
```
If the job has finished running in Databricks, the cluster will remove the Databricks run id from its env. So if we try to fetch "live" stats, the AI endpoint will return an error, hence this line.
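A sketch of the guard under discussion, with the sentinel pulled into a module-level constant (as the review below also suggests). `maybe_fetch_live_stats` is a hypothetical wrapper; `_extract_databricks_run_info` is the helper introduced in this PR:

```python
# Sentinel value the backend uses before a Databricks run id is recorded.
UNSET_DATABRICKS_RUN_ID = "UNSET_DATABRICKS_RUN_ID"


def maybe_fetch_live_stats(client, job_data, fetch_live, databricks_run_id):
    # Once the job finishes, the cluster removes the Databricks run id
    # from its env, so a "live" lookup would error out. Only hit the
    # Databricks API when we still have a real run id.
    if (
        fetch_live
        and databricks_run_id
        and databricks_run_id != UNSET_DATABRICKS_RUN_ID
    ):
        databricks_raw = client.get_job_run_status(databricks_run_id)
        job_data["databricks_live"] = _extract_databricks_run_info(databricks_raw)
    return job_data
```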
Force-pushed from 11b2322 to 165ebe6.
```python
return CommandResult(False, message="Either --job-id or --run-id is required")

try:
    # Get the job run status
```
Extracted all this out into a method _extract_databricks_run_info above.
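A hedged sketch of what a helper like `_extract_databricks_run_info` could look like. The field names follow the Databricks Jobs API runs/get response, but the exact selection here is an assumption, not the PR's code:

```python
def _extract_databricks_run_info(raw: dict) -> dict:
    # Reduce a raw Databricks run-status response to the fields the
    # job-status command reports (state, timings, per-task results).
    state = raw.get("state", {})
    info = {
        "life_cycle_state": state.get("life_cycle_state"),
        "result_state": state.get("result_state"),
        "start_time": raw.get("start_time"),
        "end_time": raw.get("end_time"),
    }
    tasks = raw.get("tasks") or []
    if tasks:
        info["tasks"] = [
            {
                "task_key": task.get("task_key"),
                "result_state": task.get("state", {}).get("result_state"),
            }
            for task in tasks
        ]
    return info
```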
```python
databricks_raw = client.get_job_run_status(databricks_run_id)
job_data["databricks_live"] = _extract_databricks_run_info(databricks_raw)

# Format output message - build a comprehensive summary
```
Can be extracted as a separate method _format_job_status_message... with an example of how it would look for a given use case... I think this formatting may change a lot in later iterations.
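For instance, a formatter extracted along these lines might look like the sketch below, producing the kind of one-line summary shown at the end of this thread. All field names here are illustrative:

```python
def _format_job_status_message(job_data: dict) -> str:
    # Build a one-line summary such as:
    #   Job chk-...: succeeded, Records: 7,155,216, Duration: 7.4m
    parts = [f"Job {job_data['job_id']}: {job_data.get('status', 'unknown')}"]
    if job_data.get("record_count") is not None:
        parts.append(f"Records: {job_data['record_count']:,}")
    if job_data.get("duration_minutes") is not None:
        parts.append(f"Duration: {job_data['duration_minutes']:.1f}m")
    return ", ".join(parts)
```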
pragyan-amp left a comment:
LGTM...
Already looks great... just a few nitpicks:
- UNSET_DATABRICKS_RUN_ID: extract into a constant... as it's being compared at two different places.
- Format the message in a _query_by_job_id method, as commented.

Approving the PR as things look fine.
Implement comprehensive job status monitoring in chuck-data CLI using job-id as the primary identifier and Chuck backend as the source of truth.

AmperityAPIClient (chuck_data/clients/amperity.py):
- Add get_job_status(job_id, token) to query Chuck backend for job state
- Add record_job_submission(databricks_run_id, token, job_id) to link identifiers
- Both methods use Amperity API with JWT (CLI token) authentication

Job-status command (chuck_data/commands/job_status.py):
- Refactor into three private helper functions for better testability:
  * _extract_databricks_run_info() for cleaning Databricks API responses
  * _query_by_job_id() for primary Chuck backend queries
  * _query_by_run_id() for legacy Databricks API fallback
- Use job-id as primary parameter, run_id as legacy fallback
- Support --live flag to enrich Chuck data with Databricks telemetry
- Extract and structure task information, durations, and cluster details
- Handle UNSET_DATABRICKS_RUN_ID gracefully (skip Databricks API call)

Stitch tools (chuck_data/commands/stitch_tools.py):
- Extract job-id from /api/job/launch response during prepare phase
- Propagate job-id through metadata (prepare → launch phases)
- Call record_job_submission() after Databricks job launch
- Pass job-id in request body for CLI token authentication
- Return job-id in launch results for monitoring
- Add non-fatal error handling with graceful degradation

The TUI now tracks job-id as the primary identifier and queries status from the Chuck backend, enabling proper state tracking (:pending → :submitted → :running → :succeeded/:failed) with Chuck telemetry (record counts, credits, errors) while maintaining backward compatibility with run_id-based queries.

Related to CHUCK-3
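Based only on the description above, the two new client methods could plausibly take the shape below. The endpoint paths and payload keys are assumptions (the kebab-case keys are mentioned in the test commit that follows), not the real Amperity API:

```python
import requests


class AmperityAPIClient:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def get_job_status(self, job_id: str, token: str) -> dict:
        # Query the Chuck backend (the source of truth) for job state.
        resp = requests.get(
            f"{self.base_url}/api/job/{job_id}",  # hypothetical path
            headers={"Authorization": f"Bearer {token}"},  # JWT (CLI token)
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    def record_job_submission(
        self, databricks_run_id: str, token: str, job_id: str
    ) -> None:
        # Link the Databricks run id to the Chuck job id after launch.
        resp = requests.post(
            f"{self.base_url}/api/job/submission",  # hypothetical path
            headers={"Authorization": f"Bearer {token}"},
            json={"databricks-run-id": databricks_run_id, "job-id": job_id},
            timeout=30,
        )
        # chuck-api may answer 204, as noted earlier in the review.
        if resp.status_code not in (200, 201, 204):
            resp.raise_for_status()
```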
Add 100+ tests covering the job status monitoring implementation across all layers: AmperityAPIClient, job-status command, and stitch-tools integration.

AmperityAPIClient tests (tests/unit/clients/test_amperity.py):
- get_job_status() with success (200), not found (404), and network errors
- record_job_submission() with success (200/201), failures (4xx/5xx), network errors
- Payload format validation (kebab-case keys, correct headers)
- Job-id parameter inclusion in request body
- Authentication token handling and error propagation

Job-status command tests (tests/unit/commands/test_job_status.py):
- Integration tests (11 tests):
  * Query by job_id (primary) using Chuck backend with/without live data
  * Query by run_id (legacy) using Databricks API with task information
  * Authentication handling (missing/invalid tokens)
  * Error handling and message formatting (credits, records, errors)
- Unit tests for private functions (9 tests):
  * _extract_databricks_run_info(): basic extraction, with tasks, without tasks
  * _query_by_job_id(): basic query, live data enrichment, missing token
  * _query_by_run_id(): basic query, missing client, not found

Stitch-tools tests (tests/unit/commands/test_stitch_tools.py):
- Job-id extraction from /api/job/launch response during prepare phase
- Job-id propagation through metadata (prepare → launch phases)
- Job-id returned in launch results for monitoring
- record_job_submission() invocation with correct parameters
- Error handling when job-id is missing from API response
- Edge cases: missing token, missing job-id (graceful degradation)

The refactored code structure with private helper functions enables better test isolation and validates both high-level integration flows and low-level data transformation logic. All 27 tests passing.

Related to CHUCK-3
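As a flavour of the unit tests described above, a pytest sketch for _extract_databricks_run_info. The asserted shape matches the hypothetical helper sketched earlier in this thread, not necessarily the real test suite:

```python
from chuck_data.commands.job_status import _extract_databricks_run_info


def test_extract_databricks_run_info_with_tasks():
    raw = {
        "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
        "start_time": 1730221580427,
        "end_time": 1730222026892,
        "tasks": [
            {"task_key": "stitch", "state": {"result_state": "SUCCESS"}},
        ],
    }

    info = _extract_databricks_run_info(raw)

    assert info["life_cycle_state"] == "TERMINATED"
    assert info["result_state"] == "SUCCESS"
    assert info["tasks"] == [{"task_key": "stitch", "result_state": "SUCCESS"}]
```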
Force-pushed from 165ebe6 to e736d20.
Both points addressed 👍
```
chuck > /job-status --job-id chk-20251029-61073-VZdjbQfw66r --live
Job chk-20251029-61073-VZdjbQfw66r: succeeded, Records: 7,155,216, Build: stitch-service-build/7766-6a34055, Created: 2025-10-29T16:57:53.890Z, Started: 2025-10-29T17:06:20.427060788Z, Ended: 2025-10-29T17:13:46.892749542Z, Duration: 7.4m
```

Added a new job status command ⬆️
Will be doing cosmetic changes, i.e. properly displaying the job status in a nice table, in subsequent PRs.