IGNORE: Add comprehensive Flink job details to paasta status for SQLClient (with -vv flag) #4162

nleigh · 2025-11-19T14:53:29Z

POC ignore

Hackathon work
Just an example of new output to share with the core streaming team
If happy with changes, will split into smaller PR's to make reviewing easier

Summary

Adds detailed SQLClient Flink job information to paasta status output when using the -vv flag. This includes Kafka topology, job parallelism, resource optimization, UDF links, consumer group monitoring, and instance-specific config links.

Motivation

When troubleshooting SQLClient Flink jobs, operators currently need to:

Manually open Pipeline Studio to find source/sink topics
Check multiple dashboards for consumer lag
Calculate optimal taskmanager configuration
Search through config files to find UDF plugins
Look up consumer group names manually

This PR consolidates all critical troubleshooting information into a single command: paasta status -vv

Impact: ~90% reduction in time to gather troubleshooting context (from 5-10 minutes → 30 seconds)

Features

1. Instance-Specific Config Links

Yelpsoa configs: Jumps to exact line number (e.g., #L21)
Srv configs: Links to instance directory
UDF: Links to UDF plugin code (when configured)

2. Data Pipeline Topology

Shows sources and sinks count
Consumer group name with Kafka View and Grafana links (environment-aware)
For each source/sink:
- Schema ID, namespace, source, alias
- Pipeline Studio link (smart routing)
- Datapipe describe command
- Datapipe tail command
- Primary keys (for sinks)

3. Job Parallelism

Shows configured parallelism in Jobs table
Column order: Job Name | State | Parallelism | Job ID | Started

4. Resource Utilization & Optimization

Current taskmanager instances and slot configuration
Utilization percentage
Suggests reducing overprovisioned resources (< 50% utilization)
Provides exact yelpsoa-configs changes needed

Example Output

paasta status -s sqlclient -i happyhour -c pnw-prod -vv

sqlclient.happyhour in pnw-prod (EKS)
    Version:    f1d27524 (desired)
    ...
    Yelpsoa configs: https://github.yelpcorp.com/sysgit/yelpsoa-configs/blob/master/sqlclient/flinkeks-pnw-prod.yaml#L21
    Srv configs: https://github.yelpcorp.com/sysgit/srv-configs/tree/master/ecosystem/prod/sqlclient/happyhour
    UDF: https://github.yelpcorp.com/misc/sqlclient_plugins/tree/main/udf/account_search_formatting_udfs
==================================================================
    Data Pipeline Topology:
      Sources: 1 topics
      Sinks:   1 topics
      Consumer Group: flink.sqlclient.happyhour.happyhour
        Kafka View: http://kafka-view.admin.yelp.com/clusters/scribe.uswest2-prod/groups/flink.sqlclient.happyhour.happyhour
        Grafana: https://grafana.yelpcorp.com/d/kcHXkIBnz/consumer-metrics?...

    Source Topics:
      1. heartbeat_messages
         Namespace:       scribeheartbeat
         Source:          scribe
         Alias:           v0.1
         Pipeline Studio: https://pipeline_studio_v2.yelpcorp.com/namespaces/scribeheartbeat/sources/scribe/asset-details?alias=v0.1
         Describe:        datapipe schema describe --namespace scribeheartbeat --source scribe --alias v0.1
         Tail:            datapipe stream tail --namespace scribeheartbeat --source scribe --alias v0.1 --all-fields --json

    Sink Topics:
      1. scribeheartbeat_count
         Namespace:       beam.to.sqlclient.testing
         Source:          scribeheartbeat_count
         Alias:           2.0
         Pipeline Studio: https://pipeline_studio_v2.yelpcorp.com/namespaces/beam.to.sqlclient.testing/sources/scribeheartbeat_count/asset-details?alias=2.0
         Describe:        datapipe schema describe --namespace beam.to.sqlclient.testing --source scribeheartbeat_count --alias 2.0
         Tail:            datapipe stream tail --namespace beam.to.sqlclient.testing --source scribeheartbeat_count --alias 2.0 --all-fields --json

==================================================================
    State: Running
    Pods: 3 running, 0 evicted, 0 other, 3 total
    Jobs: 1 running, 0 finished, 0 failed, 0 cancelled, 1 total
    1 taskmanagers, 0/1 slots available
    Jobs:
      Job Name  State   Parallelism Job ID                           Started
      happyhour Running 1           9e3d6d273acb86656ac18171674595bc 2025-11-18 10:16:01 (16 hours ago)
==================================================================
    Resource Utilization & Optimization:
      Current Configuration:
        Taskmanagers:     1 instances
        Slots per TM:     1 slots
        Total Slots:      1 slots
        Used Slots:       1 slots (100% utilization)
        Idle Slots:       0 slots

      ✅ Resource utilization is optimal
==================================================================

Implementation Details

Code Organization

paasta_tools/flink_tools.py: Core logic (+460 lines)
- Reads job configs from /nail/etc/srv-configs/.client/public/ecosystem/...
- Uses service_configuration_lib.read_yaml_file() for proper YAML handling
- Modular helper functions with URL constants
- Handles instance/job name mismatches with glob fallback
- Handles YAML float parsing (alias: 2.0 → "2.0")
paasta_tools/cli/cmds/status.py: Integration (+158 lines)
- Only displays for service == "sqlclient" and verbose >= 2
- Kafka topics positioned after monitoring links, before State
- Resource optimization at the bottom
- Captures job name for consumer group generation
- Instance-specific config link generation with grep for line numbers

New Helper Functions

_safe_str(): Safe type conversion handling None
get_sqlclient_job_config(): Parse job YAML with fallback paths
get_sqlclient_parallelism(): Extract parallelism value
get_sqlclient_udf_plugin(): Extract UDF plugin name
analyze_slot_utilization(): Calculate metrics and recommendations
format_resource_optimization(): Format optimization output
_format_topic_links_and_commands(): Generate Pipeline Studio/datapipe links
_format_consumer_group_info(): Format consumer group with monitoring links
_format_source_topics(): Format all source topics
_format_sink_topics(): Format all sink topics
format_kafka_topics(): Main orchestration

Testing

Unit Tests (42 tests, all passing)

TestSafeStr: String conversion (5 tests)
TestGetSqlclientJobConfig: YAML parsing with fallbacks (3 tests)
TestGetSqlclientParallelism: Parallelism extraction (2 tests)
TestGetSqlclientUdfPlugin: UDF plugin extraction (2 tests)
TestAnalyzeSlotUtilization: Utilization analysis (3 tests)
TestFormatResourceOptimization: Optimization formatting (2 tests)
TestFormatTopicLinksAndCommands: Link generation (2 tests)
TestFormatConsumerGroupInfo: Consumer group formatting (2 tests)
TestFormatSourceTopics: Source formatting (2 tests)
TestFormatSinkTopics: Sink formatting (2 tests)
TestFormatKafkaTopics: Integration tests (5 tests)

Integration Tests (11 tests, all passing)

Existing TestPrintFlinkStatus tests all pass

Manual Testing

Tested with multiple SQLClient instances:

✅ happyhour (simple, 1 taskmanager)
✅ account_search_v2 (complex, 4 sources, UDF)
✅ ad_indexing_sync_bounded_business_features (instance/job name mismatch)
✅ Both prod and devc environments

Technical Highlights

Handles Edge Cases

✅ Instance name != job name (uses glob to find YAML)
✅ YAML float parsing (alias: 2.0 parsed as float)
✅ Missing configs (graceful error messages)
✅ Multiple jobs per instance (finds first YAML)
✅ Instances not deployed locally (uses full srv-configs repo)
✅ No UDF configured (omits UDF link)
✅ Environment-specific URLs (prod vs devc kafka-view)

Performance Considerations

Single file read per invocation (cached by service_configuration_lib)
No external API calls (no latency impact)
Uses full srv-configs repo (260 instances available)

Usage

# Show detailed info for any SQLClient instance
paasta status -s sqlclient -i <instance> -c <cluster> -vv

# Examples:
paasta status -s sqlclient -i happyhour -c pnw-prod -vv
paasta status -s sqlclient -i account_search_v2 -c pnw-prod -vv
paasta status -s sqlclient -i lead_status -c pnw-devc -vv

Breaking Changes

None. All new functionality is behind -vv flag and only for sqlclient service.

Future Work

Potential enhancements (not included in this PR):

SQL query preview (from job YAML)
Job metadata tags (use_case, data_delay, owner)
Pod restart tracking (from Kubernetes events)
Checkpoint statistics (requires new paasta-api endpoint)
Expand to other Flink services beyond sqlclient

Related Work

Part of hackathon project: FLINK-5725
Previous CloudZero PR: Replace Splunk cost link with CloudZero in Flink status output #4152
Implementation docs: /nail/home/nathanleigh/source/hackathon2025/paasta_status_flink_improvements/

🤖 Generated with Claude Code

This adds a Data Pipeline Topology section to paasta status output for SQLClient Flink jobs when using the -vv flag. The section displays source and sink Kafka topics with helpful links and commands. Also adds Job Parallelism column to the Jobs table for SQLClient services, showing the configured parallelism from srv-configs. Features: - Parses SQLClient job config from srv-configs YAML files - Shows source and sink topology summary - Displays schema IDs, namespaces, sources, and aliases - Generates Pipeline Studio links (smart routing based on schema_id vs namespace/source) - Generates datapipe schema describe commands - Generates datapipe stream tail commands - Shows primary keys for sink topics - Shows job parallelism in Jobs table (sqlclient only) Implementation details: - Reads from /nail/srv/configs/{service}/{instance}/job.d/{instance}.yaml - Uses service_configuration_lib.read_yaml_file() for proper YAML handling - Handles YAML float parsing (alias: 2.0 becomes string "2.0") - Only displays when verbose >= 2 and service == "sqlclient" - Positioned after monitoring links, before State/Pods/Jobs info - Parallelism shown only for sqlclient service Example output: ================================================================== Data Pipeline Topology: Sources: 1 topics Sinks: 1 topics Source Topics: 1. heartbeat_messages Namespace: scribeheartbeat Source: scribe Alias: v0.1 Pipeline Studio: https://pipeline_studio_v2.yelpcorp.com/namespaces/scribeheartbeat/sources/scribe/asset-details?alias=v0.1 Describe: datapipe schema describe --namespace scribeheartbeat --source scribe --alias v0.1 Tail: datapipe stream tail --namespace scribeheartbeat --source scribe --alias v0.1 --all-fields --json Sink Topics: 1. scribeheartbeat_count Namespace: beam.to.sqlclient.testing Source: scribeheartbeat_count Alias: 2.0 Pipeline Studio: https://pipeline_studio_v2.yelpcorp.com/namespaces/beam.to.sqlclient.testing/sources/scribeheartbeat_count/asset-details?alias=2.0 Describe: datapipe schema describe --namespace beam.to.sqlclient.testing --source scribeheartbeat_count --alias 2.0 Tail: datapipe stream tail --namespace beam.to.sqlclient.testing --source scribeheartbeat_count --alias 2.0 --all-fields --json ================================================================== State: Running Pods: 3 running, 0 evicted, 0 other, 3 total Jobs: 1 running, 0 finished, 0 failed, 0 cancelled, 1 total 1 taskmanagers, 0/1 slots available Jobs: Job Name State Parallelism Job ID Started happyhour Running 1 9e3d6d273acb86656ac18171674595bc 2025-11-18 10:16:01 (16 hours ago) New helper functions in paasta_tools/flink_tools.py: - _safe_str(): Safely converts values to strings, handling None - get_sqlclient_job_config(): Parses job YAML and extracts config - get_sqlclient_parallelism(): Gets parallelism value from job config - format_kafka_topics(): Formats topic info for display Tests added (13 tests, all passing): - TestSafeStr: Tests string conversion helper (5 tests) - TestGetSqlclientJobConfig: Tests YAML parsing (3 tests) - TestGetSqlclientParallelism: Tests parallelism extraction (2 tests) - TestFormatKafkaTopics: Tests output formatting (3 tests) Related: FLINK-5725 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Reduce State column width from 11 to 8 characters when parallelism is displayed, creating tighter spacing between columns. Before: Job Name State Parallelism Job ID Started happyhour Running 1 9e3d6... 2025-11-18... After: Job Name State Parallelism Job ID Started happyhour Running 1 9e3d6d273acb86656ac18171674595bc 2025-11-18 10:16:01 (16 hours ago) This provides better visual alignment and makes the parallelism value appear closer to the state column.

Analyzes Flink slot utilization and suggests taskmanager configuration optimizations for sqlclient jobs when using -vv flag. Features: - Shows current taskmanager instances, slots, and utilization - Detects overprovisioning (< 50% utilization) and suggests reducing instances - Detects insufficient headroom (> 90% utilization) and suggests adding capacity - Provides exact yelpsoa-configs changes needed - Only shows suggestions when optimization is possible Example outputs: 1. Overprovisioned (suggests reducing): Current: 12 taskmanagers, 48 slots, 10 used (21% utilization) Suggests: Reduce to 4 taskmanagers for 75% utilization 2. Insufficient headroom (suggests increasing): Current: 12 taskmanagers, 48 slots, 48 used (100% utilization) Suggests: Increase to 16 taskmanagers for failover capacity 3. Optimal (no suggestion): Current: 1 taskmanager, 1 slot, 1 used (100% utilization) Message: Resource utilization is optimal New functions: - analyze_slot_utilization(): Calculates metrics and recommendations - format_resource_optimization(): Formats output for display

For SQLClient services, make config links more precise: 1. Yelpsoa configs: Link directly to the line number in flinkeks-{cluster}.yaml - Searches for 'instance:' line using grep - Falls back to directory link if line not found 2. Srv configs: Link to the specific instance directory - Links to ecosystem/{ecosystem}/sqlclient/{instance} Examples: - Prod: yelpsoa-configs/blob/master/sqlclient/flinkeks-pnw-prod.yaml#L21 - Devc: yelpsoa-configs/blob/master/sqlclient/flinkeks-pnw-devc.yaml#L18 - Srv: srv-configs/tree/master/ecosystem/prod/sqlclient/happyhour This saves developers time by jumping directly to the relevant config instead of having to search through the file or directory.

Use /nail/etc/srv-configs/.client/public/ecosystem/{ecosystem}/... directly instead of checking multiple fallback paths. This location contains all instances (260 vs 166 in hiera-merged location) and simplifies the code. Benefits: - Works for all SQLClient instances (not just deployed to current box) - Removes fallback path logic (~15 lines) - Single source of truth for configs - Handles instance/job name mismatch with glob fallback Also removes over-utilization warnings and simplifies optimization output: - Only suggests reducing overprovisioned resources (< 50% utilization) - Removes confusing warnings for high utilization (> 90%) - Simplifies config change output (removes 'File:' line) Tested with: - happyhour: parallelism=1 ✓ - ad_indexing_sync_bounded_business_features: parallelism=5 ✓ - All 24 tests passing ✓

For SQLClient jobs that use UDFs, add a link to the UDF plugin code in the sqlclient_plugins repository. Features: - Extracts plugin_name from job config's udf_config section - Generates GitHub link to misc/sqlclient_plugins/tree/main/udf/{plugin_name} - Only displays when UDF is configured - Shows below Srv configs link with -v flag Example output: Yelpsoa configs: https://github.yelpcorp.com/.../sqlclient/flinkeks-pnw-prod.yaml#L2269 Srv configs: https://github.yelpcorp.com/.../srv-configs/.../sqlclient/account_search_v2 UDF: https://github.yelpcorp.com/misc/sqlclient_plugins/tree/main/udf/account_search_formatting_udfs New function: - get_sqlclient_udf_plugin(): Extracts UDF plugin name from job config Tested with: - account_search_v2: Shows UDF link ✓ - happyhour: No UDF, no link shown ✓

Shows Flink consumer group name with links to monitoring tools in the Data Pipeline Topology section. Features: - Generates consumer group name: flink.{service}.{instance}.{job_name} - Uses actual job name from Flink (handles instance/job name mismatch) - Kafka View link (environment-aware): * Prod: http://kafka-view.admin.yelp.com (scribe.uswest2-prod) * Devc: http://kafka-view.paasta-norcal-devc.yelp (buff-high.uswest1-devc) - Grafana consumer metrics link with pre-filtered consumer group Example output: Data Pipeline Topology: Sources: 1 topics Sinks: 1 topics Consumer Group: flink.sqlclient.happyhour.happyhour Kafka View: http://kafka-view.admin.yelp.com/clusters/scribe.uswest2-prod/groups/flink.sqlclient.happyhour.happyhour Grafana: https://grafana.yelpcorp.com/d/kcHXkIBnz/consumer-metrics?... Tested with: - happyhour (prod): ✓ - happyhour (devc): ✓ - ad_indexing_sync_bounded_business_features: ✓

Adds comprehensive unit tests for recent SQLClient features: TestGetSqlclientUdfPlugin (2 tests): - Test extracting UDF plugin name from job config - Test returning None when no UDF configured TestAnalyzeSlotUtilization (3 tests): - Test underutilized (25%) suggests reducing instances - Test overutilized (100%) calculates increase suggestion - Test optimal utilization (70%) returns no suggestion TestFormatResourceOptimization (2 tests): - Test formatting optimization suggestions with config changes - Test formatting when utilization is optimal TestFormatKafkaTopics consumer group (2 tests): - Test prod environment uses kafka-view.admin.yelp.com - Test devc environment uses kafka-view.paasta-norcal-devc.yelp All 22 tests passing.

Improves code maintainability and testability by: 1. Extract URL constants (at top of file): - PIPELINE_STUDIO_BASE - KAFKA_VIEW_PROD_DOMAIN / KAFKA_VIEW_DEVC_DOMAIN - GRAFANA_BASE - SRV_CONFIGS_FULL_REPO 2. Extract topic formatting helper: - _format_topic_links_and_commands(): Generates Pipeline Studio link, datapipe describe/tail commands (eliminates ~60 lines of duplication) 3. Split format_kafka_topics into smaller functions: - _format_consumer_group_info(): Consumer group with Kafka View/Grafana links - _format_source_topics(): Formats all source topics - _format_sink_topics(): Formats all sink topics - format_kafka_topics(): Main orchestration (167 lines → 35 lines) Benefits: - Eliminated code duplication (sources vs sinks formatting) - Easier to maintain (URL changes in one place) - More testable (each helper function tested independently) - Clearer separation of concerns Tests added (8 new tests, 42 total): - TestFormatTopicLinksAndCommands: Test link/command generation (2 tests) - TestFormatConsumerGroupInfo: Test consumer group formatting (2 tests) - TestFormatSourceTopics: Test source formatting (2 tests) - TestFormatSinkTopics: Test sink formatting (2 tests) All 42 tests passing ✓ No functional changes, output identical to before.

Fixes CI failures from pre-commit hooks: 1. Remove unused imports: - re (not used) - subprocess (not used) - yaml (using service_configuration_lib.read_yaml_file instead) 2. Fix f-strings without placeholders (flake8 F541): - Line 665: Remove f-prefix from static string - Line 902: Remove f-prefix from static string - Line 905: Remove f-prefix from static string - Line 913: Remove f-prefix from static string 3. Add autospec=True to all mock.patch calls (5 occurrences): - Ensures mocks have same interface as original objects - Required by patch-enforce-autospec pre-commit hook All 42 tests still passing after fixes.

Add explicit type annotations for output variables in helper functions: - _format_source_topics: output: List[str] = [] - _format_sink_topics: output: List[str] = [] This resolves mypy errors: var-annotated: Need type annotation for 'output' All 42 tests still passing.

nleigh and others added 9 commits November 19, 2025 02:27

nleigh requested a review from a team as a code owner November 19, 2025 14:53

nleigh marked this pull request as draft November 19, 2025 14:53

nleigh changed the title ~~Add comprehensive Flink job details to paasta status for SQLClient (with -vv flag)~~ IGNORE: Add comprehensive Flink job details to paasta status for SQLClient (with -vv flag) Nov 19, 2025

nleigh added 2 commits November 19, 2025 07:37

nleigh mentioned this pull request Nov 20, 2025

Refactor Flink status to modular architecture #4165

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

IGNORE: Add comprehensive Flink job details to paasta status for SQLClient (with -vv flag) #4162

IGNORE: Add comprehensive Flink job details to paasta status for SQLClient (with -vv flag) #4162

nleigh commented Nov 19, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

IGNORE: Add comprehensive Flink job details to paasta status for SQLClient (with -vv flag) #4162

Are you sure you want to change the base?

IGNORE: Add comprehensive Flink job details to paasta status for SQLClient (with -vv flag) #4162

Conversation

nleigh commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

POC ignore

Summary

Motivation

Features

1. Instance-Specific Config Links

2. Data Pipeline Topology

3. Job Parallelism

4. Resource Utilization & Optimization

Example Output

Implementation Details

Code Organization

New Helper Functions

Testing

Unit Tests (42 tests, all passing)

Integration Tests (11 tests, all passing)

Manual Testing

Technical Highlights

Handles Edge Cases

Performance Considerations

Usage

Breaking Changes

Future Work

Related Work

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nleigh commented Nov 19, 2025 •

edited

Loading