
Conversation


@punit-naik-amp punit-naik-amp commented Oct 28, 2025

chuck > /job-status --job-id chk-20251029-61073-VZdjbQfw66r --live
Job chk-20251029-61073-VZdjbQfw66r: succeeded, Records: 7,155,216, Build: stitch-service-build/7766-6a34055, Created: 2025-10-29T16:57:53.890Z, Started: 2025-10-29T17:06:20.427060788Z, Ended: 2025-10-29T17:13:46.892749542Z, Duration: 7.4m

Added a new job status command ⬆️
Cosmetic changes, i.e. properly displaying the job status in a nice table, will follow in subsequent PRs.

@punit-naik-amp punit-naik-amp force-pushed the CHUCK-3-job-status-monitoring-for-chuck-data branch from 7d0e9d3 to afe5103 Compare October 28, 2025 09:12
@pragyan-amp pragyan-amp marked this pull request as ready for review October 28, 2025 14:04
@pragyan-amp pragyan-amp marked this pull request as draft October 28, 2025 14:23
@punit-naik-amp punit-naik-amp force-pushed the CHUCK-3-job-status-monitoring-for-chuck-data branch 5 times, most recently from bc5316d to cdc8dc8 Compare October 28, 2025 20:06

- if response.status_code == 200 or response.status_code == 201:
+ if response.status_code in (200, 201, 204):

@punit-naik-amp punit-naik-amp Oct 30, 2025


This is an old bug. The chuck-api endpoint actually returns a 204 response, hence the change. Maybe someone changed the response code and forgot to update this check. Kept the old status codes (200, 201) just to be safe.

payload_str = json.dumps(sanitized_payload)
logging.debug(f"Sending metric: {payload_str[:100]}...")

return self._client.submit_metrics(payload, token)
Contributor Author

This is an old existing bug in chuck-data. This request always failed because there was unserialisable data in the JSON. Sanitised it using pydantic and it works now.
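
For illustration, a minimal sketch of the sanitisation idea, assuming pydantic v2 is installed; the payload values below are only examples of data that json.dumps cannot handle, not the actual metrics payload.

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# to_jsonable_python converts datetimes, UUIDs, etc. into JSON-safe primitives.
from pydantic_core import to_jsonable_python

payload = {
    "event": "job-submitted",
    "run-id": uuid4(),                           # UUID: json.dumps would raise TypeError
    "submitted-at": datetime.now(timezone.utc),  # datetime: also not serialisable as-is
}

sanitized_payload = to_jsonable_python(payload)  # now plain str/int/float/bool/None/dict/list
payload_str = json.dumps(sanitized_payload)
print(payload_str[:100])
```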

@punit-naik-amp punit-naik-amp force-pushed the CHUCK-3-job-status-monitoring-for-chuck-data branch from 3516cd2 to 11b2322 Compare October 30, 2025 09:07
@punit-naik-amp punit-naik-amp self-assigned this Oct 30, 2025
if (
    fetch_live
    and databricks_run_id
    and databricks_run_id != "UNSET_DATABRICKS_RUN_ID"
Contributor Author

If the job has finished running in Databricks, the cluster will remove the Databricks run id from its env. So if we try to fetch "live" stats, the AI endpoint will return an error, hence this check.
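
A later review comment asks for this sentinel to live in a constant; a sketch of the guard with that change applied could look like the following (the constant placement and helper name are assumptions, not the merged code).

```python
# Sentinel value left in the environment when no Databricks run id is available.
UNSET_DATABRICKS_RUN_ID = "UNSET_DATABRICKS_RUN_ID"


def should_fetch_live(fetch_live: bool, databricks_run_id: str | None) -> bool:
    # Only call the Databricks API when live data was requested and the run id
    # is present and not the unset sentinel.
    return bool(
        fetch_live
        and databricks_run_id
        and databricks_run_id != UNSET_DATABRICKS_RUN_ID
    )
```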

@punit-naik-amp punit-naik-amp force-pushed the CHUCK-3-job-status-monitoring-for-chuck-data branch from 11b2322 to 165ebe6 Compare October 30, 2025 09:15
return CommandResult(False, message="Either --job-id or --run-id is required")

try:
    # Get the job run status
Contributor Author

Extracted all this out into a method _extract_databricks_run_info above.
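
Roughly, the extracted helper might look like this sketch; the field names follow common Databricks Jobs API run responses and are assumptions about the real helper, not a copy of it.

```python
def _extract_databricks_run_info(raw_run: dict) -> dict:
    """Trim a raw Databricks run payload down to the fields the command reports."""
    state = raw_run.get("state", {})
    info = {
        "life_cycle_state": state.get("life_cycle_state"),
        "result_state": state.get("result_state"),
        "run_duration_ms": raw_run.get("run_duration"),
        "cluster_id": raw_run.get("cluster_instance", {}).get("cluster_id"),
    }
    # Keep a compact per-task view when the run has tasks.
    info["tasks"] = [
        {
            "task_key": task.get("task_key"),
            "state": task.get("state", {}).get("life_cycle_state"),
        }
        for task in raw_run.get("tasks", [])
    ]
    return info
```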

@punit-naik-amp punit-naik-amp marked this pull request as ready for review October 30, 2025 09:22
databricks_raw = client.get_job_run_status(databricks_run_id)
job_data["databricks_live"] = _extract_databricks_run_info(databricks_raw)

# Format output message - build a comprehensive summary
Contributor

This can be extracted into a separate method _format_job_status_message, with an example of how the output would look for a given use case. I think this formatting may change a lot in later iterations.
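
A sketch of what that extraction might look like, reusing the fields visible in the /job-status example output at the top of this thread; the key names are assumptions. For the job above it would render roughly as "Job chk-20251029-...: succeeded, Records: 7,155,216, ...".

```python
def _format_job_status_message(job_data: dict) -> str:
    """Build the one-line summary shown by /job-status from structured job data."""
    parts = [f"Job {job_data.get('job-id')}: {job_data.get('state')}"]
    if job_data.get("records") is not None:
        parts.append(f"Records: {job_data['records']:,}")
    if job_data.get("build"):
        parts.append(f"Build: {job_data['build']}")
    for label, key in (("Created", "created"), ("Started", "started"), ("Ended", "ended")):
        if job_data.get(key):
            parts.append(f"{label}: {job_data[key]}")
    if job_data.get("duration"):
        parts.append(f"Duration: {job_data['duration']}")
    return ", ".join(parts)
```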

@pragyan-amp pragyan-amp left a comment

LGTM...

Already looks great... just a few nitpicks:

  • Extract UNSET_DATABRICKS_RUN_ID into a constant, as it's being compared in two different places.
  • Format the message in the _query_by_job_id method as commented.

Approving the PR as things look fine.

Implement comprehensive job status monitoring in chuck-data CLI using job-id
as the primary identifier and Chuck backend as the source of truth.

AmperityAPIClient (chuck_data/clients/amperity.py):
- Add get_job_status(job_id, token) to query Chuck backend for job state
- Add record_job_submission(databricks_run_id, token, job_id) to link identifiers
- Both methods use Amperity API with JWT (CLI token) authentication

Job-status command (chuck_data/commands/job_status.py):
- Refactor into three private helper functions for better testability:
  * _extract_databricks_run_info() for cleaning Databricks API responses
  * _query_by_job_id() for primary Chuck backend queries
  * _query_by_run_id() for legacy Databricks API fallback
- Use job-id as primary parameter, run_id as legacy fallback
- Support --live flag to enrich Chuck data with Databricks telemetry
- Extract and structure task information, durations, and cluster details
- Handle UNSET_DATABRICKS_RUN_ID gracefully (skip Databricks API call)

Stitch tools (chuck_data/commands/stitch_tools.py):
- Extract job-id from /api/job/launch response during prepare phase
- Propagate job-id through metadata (prepare → launch phases)
- Call record_job_submission() after Databricks job launch
- Pass job-id in request body for CLI token authentication
- Return job-id in launch results for monitoring
- Add non-fatal error handling with graceful degradation

The TUI now tracks job-id as the primary identifier and queries status from
the Chuck backend, enabling proper state tracking (:pending → :submitted →
:running → :succeeded/:failed) with Chuck telemetry (record counts, credits,
errors) while maintaining backward compatibility with run_id-based queries.

Related to CHUCK-3
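
For reference, a minimal sketch of what the two new client methods could look like, assuming a requests-based client; the endpoint paths, constructor, and header shapes here are illustrative assumptions, not the actual chuck_data/clients/amperity.py implementation.

```python
import requests


class AmperityAPIClient:
    """Sketch only; the real client lives in chuck_data/clients/amperity.py."""

    def __init__(self, base_url: str):
        self.base_url = base_url  # assumption: the real client resolves this from config

    def get_job_status(self, job_id: str, token: str) -> dict:
        # Query the Chuck backend (source of truth) for the job's current state.
        response = requests.get(
            f"{self.base_url}/api/job/status",   # hypothetical path
            params={"job-id": job_id},           # kebab-case keys per the commit message
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    def record_job_submission(self, databricks_run_id: str, token: str, job_id: str) -> bool:
        # Link the Databricks run id to the Chuck job id after the job is launched.
        response = requests.post(
            f"{self.base_url}/api/job/submission",  # hypothetical path
            json={"databricks-run-id": databricks_run_id, "job-id": job_id},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        return response.status_code in (200, 201, 204)
```
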
Add 100+ tests covering job status monitoring implementation across all layers:
AmperityAPIClient, job-status command, and stitch-tools integration.

AmperityAPIClient tests (tests/unit/clients/test_amperity.py):
- get_job_status() with success (200), not found (404), and network errors
- record_job_submission() with success (200/201), failures (4xx/5xx), network errors
- Payload format validation (kebab-case keys, correct headers)
- Job-id parameter inclusion in request body
- Authentication token handling and error propagation

Job-status command tests (tests/unit/commands/test_job_status.py):
Integration tests (11 tests):
- Query by job_id (primary) using Chuck backend with/without live data
- Query by run_id (legacy) using Databricks API with task information
- Authentication handling (missing/invalid tokens)
- Error handling and message formatting (credits, records, errors)

Unit tests for private functions (9 tests):
- _extract_databricks_run_info(): basic extraction, with tasks, without tasks
- _query_by_job_id(): basic query, live data enrichment, missing token
- _query_by_run_id(): basic query, missing client, not found

Stitch-tools tests (tests/unit/commands/test_stitch_tools.py):
- Job-id extraction from /api/job/launch response during prepare phase
- Job-id propagation through metadata (prepare → launch phases)
- Job-id returned in launch results for monitoring
- record_job_submission() invocation with correct parameters
- Error handling when job-id is missing from API response
- Edge cases: missing token, missing job-id (graceful degradation)

The refactored code structure with private helper functions enables better
test isolation and validates both high-level integration flows and low-level
data transformation logic. All 27 tests passing.

Related to CHUCK-3
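
As a rough illustration of the test style, here are a couple of get_job_status() cases against a requests-based client like the sketch above; the patch target, constructor, and assertions are assumptions rather than excerpts from tests/unit/clients/test_amperity.py.

```python
from unittest.mock import MagicMock, patch

import pytest
import requests

from chuck_data.clients.amperity import AmperityAPIClient  # assumed import path


@patch("requests.get")
def test_get_job_status_success(mock_get):
    # Simulate a 200 response from the Chuck backend.
    mock_get.return_value = MagicMock(status_code=200, json=lambda: {"state": "succeeded"})
    client = AmperityAPIClient(base_url="https://example.invalid")  # hypothetical constructor
    assert client.get_job_status("chk-123", token="jwt")["state"] == "succeeded"


@patch("requests.get")
def test_get_job_status_network_error(mock_get):
    # Network failures should propagate (or be wrapped, depending on the real client).
    mock_get.side_effect = requests.ConnectionError("boom")
    client = AmperityAPIClient(base_url="https://example.invalid")
    with pytest.raises(requests.ConnectionError):
        client.get_job_status("chk-123", token="jwt")
```
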
@punit-naik-amp punit-naik-amp force-pushed the CHUCK-3-job-status-monitoring-for-chuck-data branch from 165ebe6 to e736d20 Compare October 30, 2025 19:46
@punit-naik-amp
Contributor Author

> LGTM...
>
> Already looks great... just a few nitpicks:
>
>   • Extract UNSET_DATABRICKS_RUN_ID into a constant, as it's being compared in two different places.
>   • Format the message in the _query_by_job_id method as commented.
>
> Approving the PR as things look fine.

Both points addressed 👍

@punit-naik-amp punit-naik-amp merged commit fa66aba into main Oct 30, 2025
2 checks passed
@punit-naik-amp punit-naik-amp deleted the CHUCK-3-job-status-monitoring-for-chuck-data branch October 30, 2025 19:47