Add Posit Team Overview Dashboard #178

Open
t-margheim wants to merge 32 commits into main from dashboard-posit-team-overview

Conversation

@t-margheim
Contributor

Description

This PR adds a new dashboard focused on Posit Team application metrics. I also included some basic documentation about how the dashboards work with the IaC.

Code Flow

The dashboard is defined in posit-team-overview.json, which is deployed in the cluster step as a ConfigMap. When Grafana detects a change to the ConfigMap, it reloads the corresponding dashboard.
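
As a rough illustration of the flow described above, the sketch below wraps a dashboard JSON file in a ConfigMap manifest. It is not the repository's Pulumi code: the function name, the `grafana-dashboard-` name prefix, the `grafana` namespace, and the `grafana_dashboard` label are all assumptions made for the example.

```python
import json
from pathlib import Path


def dashboard_configmap(path: Path, namespace: str = "grafana") -> dict:
    """Wrap one dashboard JSON file in a Kubernetes ConfigMap manifest."""
    dashboard = json.loads(path.read_text())  # fails fast on invalid JSON
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            # Hypothetical naming scheme; the real name comes from the IaC code.
            "name": f"grafana-dashboard-{path.stem}",
            "namespace": namespace,
            # Assumed label that Grafana would use to discover dashboards.
            "labels": {"grafana_dashboard": "1"},
        },
        # One key per dashboard file, holding the serialized dashboard.
        "data": {path.name: json.dumps(dashboard, indent=2)},
    }
```

Calling `dashboard_configmap(Path("posit-team-overview.json"))` would return a manifest whose `data` holds the dashboard under its filename.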

Category of change

  • Bug fix (non-breaking change which fixes an issue)
  • Version upgrade (upgrading the version of a service or product)
  • New feature (non-breaking change which adds functionality)
  • Build: a code change that affects the build system or external dependencies
  • Performance: a code change that improves performance
  • Refactor: a code change that neither fixes a bug nor adds a feature
  • Documentation: documentation changes
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

  • I have reviewed my own diff and added inline comments on lines I want reviewers to focus on or that I am uncertain about

t-margheim and others added 30 commits March 11, 2026 10:52
…erview dashboard

Add Grafana transformations and field overrides to the Running Version panel to improve
readability and presentation:

- Hide unnecessary "Value" and "Time" columns using organize transformation
- Rename columns to user-friendly names (Site, Product, Version, Cluster)
- Set logical column order (Site → Product → Version → Cluster)
- Configure appropriate column widths (Site: 200px, Product: 150px, Version: 120px, Cluster: 150px)

This change is presentation-only and does not modify the underlying Prometheus query.
The dashboard will now display a cleaner, more operator-friendly table view.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
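
The transformation and overrides described in the commit above could look roughly like the following, written as Python dicts that serialize to the dashboard JSON. The label names `ptd_site`, `release_name`, and `version` are taken from a later commit in this PR; mapping `release_name` to "Product" and the helper names are assumptions.

```python
def organize_transformation() -> dict:
    """Build an 'organize' transformation: hide, order, and rename columns."""
    return {
        "id": "organize",
        "options": {
            # Hide columns that add no information to the table.
            "excludeByName": {"Time": True, "Value": True},
            # Column order: Site, Product, Version, Cluster.
            "indexByName": {"ptd_site": 0, "release_name": 1, "version": 2, "cluster": 3},
            # Operator-friendly column names.
            "renameByName": {
                "ptd_site": "Site",
                "release_name": "Product",
                "version": "Version",
                "cluster": "Cluster",
            },
        },
    }


def width_override(column: str, width: int) -> dict:
    """Build a field override that fixes one table column's width in pixels."""
    return {
        "matcher": {"id": "byName", "options": column},
        "properties": [{"id": "custom.width", "value": width}],
    }


overrides = [
    width_override("Site", 200),
    width_override("Product", 150),
    width_override("Version", 120),
    width_override("Cluster", 150),
]
```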
…overview

Fix three provisioning issues in the Posit Team Overview dashboard:

1. Set dashboard ID to null (was hardcoded to 48) to prevent import conflicts
   across different Grafana instances. Grafana auto-assigns IDs on import.

2. Add meaningful UID "posit_team_overview" (was empty string) for programmatic
   references and consistency with other dashboards.

3. Default cluster_name template variable to "All" (was hardcoded test cluster
   "default_duplicado03-staging-20250411-control-plane") to prevent exposure
   of internal cluster names and provide better default behavior.

These changes align the dashboard with provisioning best practices used in
other dashboards (alerts_dashboard.json, k8s-views-global.json).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
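
The three provisioning fixes above can be sketched as a small normalization step. This is only an illustration: the function name is hypothetical, and the shape of the variable's `current` entry is an assumption about the exported JSON.

```python
import copy


def normalize_for_provisioning(dashboard: dict, uid: str) -> dict:
    """Return a copy of the dashboard with portable id, uid, and defaults."""
    result = copy.deepcopy(dashboard)
    result["id"] = None  # serialized as JSON null, so Grafana assigns the ID
    result["uid"] = uid  # stable identifier for programmatic references
    for variable in result.get("templating", {}).get("list", []):
        if variable.get("name") == "cluster_name":
            # Default to "All" rather than a specific internal cluster name.
            variable["current"] = {"selected": True, "text": "All", "value": "$__all"}
    return result
```

Working on a deep copy keeps the exported dashboard untouched, which makes it easy to diff the input against the normalized output.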
…view transformations

The PromQL query at line 104 groups by (release_name, version, ptd_site) and does
not include cluster, but the transformation configuration still referenced cluster
in both indexByName and renameByName. This mismatch would cause the Cluster column
to be missing or empty in the rendered table.

Removed orphaned references:
- Deleted "cluster": 3 from indexByName
- Deleted "cluster": "Cluster" from renameByName

The transformation now correctly matches the query output, displaying only Site,
Product, and Version columns.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
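
A mismatch like the one fixed above can be caught mechanically. The following sketch compares the labels a query groups by with the fields an organize transformation references; the function name is hypothetical and the check is not part of this PR.

```python
def orphaned_fields(query_labels: list[str], transformation: dict) -> list[str]:
    """List fields the transformation references that the query never returns."""
    options = transformation.get("options", {})
    # Time and Value are always present in the query result frame.
    known = set(query_labels) | {"Time", "Value"}
    referenced = set(options.get("indexByName", {})) | set(options.get("renameByName", {}))
    return sorted(referenced - known)


before_fix = {
    "id": "organize",
    "options": {
        "indexByName": {"ptd_site": 0, "release_name": 1, "version": 2, "cluster": 3},
        "renameByName": {"ptd_site": "Site", "cluster": "Cluster"},
    },
}
print(orphaned_fields(["release_name", "version", "ptd_site"], before_fix))
```

With the pre-fix configuration this reports `cluster` as the only orphaned field.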
…mption gauge

Fixed two display issues in the "License Consumption by Site" gauge panel:

1. Division by Zero: Added value mappings to display "No Limit" (blue) for
   infinite values when license_seats is 0 (unlimited licenses), and "No Data"
   for NaN values when metrics are unavailable. This replaces the confusing
   infinity symbol (∞) with clear text.

2. Gauge Labels: Added field override using byName matcher to display only
   the site name (e.g., "site1") instead of showing the expression reference
   prefix (e.g., "C site1").

Changes to posit-team-overview.json:
- Added special value mappings in fieldConfig.defaults.mappings for inf and null+nan
- Replaced byValue override with byName override targeting expression "C"
- Set displayName to ${__field.labels.ptd_site} to extract only the site label

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
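
To make the division-by-zero case concrete, the sketch below imitates the arithmetic and the resulting display text in plain Python. It assumes the gauge divides seats in use by `license_seats`; the function names and the percentage formatting are illustrative, not taken from the dashboard.

```python
import math


def seat_ratio(used: float, seats: float) -> float:
    """Divide with IEEE 754 semantics (as PromQL does) instead of raising."""
    if seats == 0:
        # x / 0 is infinite for non-zero x, and 0 / 0 is NaN.
        return math.nan if used == 0 else math.copysign(math.inf, used)
    return used / seats


def display_text(value: float) -> str:
    """Map special values to readable text, mirroring the value mappings."""
    if math.isnan(value):
        return "No Data"
    if math.isinf(value):
        return "No Limit"
    return f"{value:.0%}"


print(display_text(seat_ratio(10, 0)))   # unlimited license
print(display_text(seat_ratio(5, 10)))   # half of the seats in use
```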
- update dashboard ID to null
- correct UID format in Posit Team Overview
- correct infinity value mapping case
…d panels

Added cluster=~"$cluster_name" filter to all non-library panel queries to enable
per-cluster filtering in the Posit Team Overview dashboard. This allows users to
view metrics for specific clusters or all clusters using the dashboard variable.

Updated panels:
- Avg session start time (24h)
- Active IDE sessions
- Active user sessions
- Registered users
- Licensed users
- License expires
- Build info (both queries)
- Requests/min(1m)
- Avg resp secs (1m)
- Response size /min (kb)
- Request time quantiles (all 4 histogram queries)
- Active sessions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
… Team Overview

Addresses review feedback by adding cluster=~"$cluster_name" filter to four
unlinked panels that were missing it:
- Panel 2: Avg request duration (secs)
- Panel 10: Requests in flight (second target)
- Panel 4: Session start secs
- Panel 8: Session start to join secs

Also restores cluster variable default to "All" ($__all) and increments
dashboard version from 1 to 7 to track evolution properly.

These changes ensure consistent cluster filtering behavior across all panels
when users select a specific cluster in the dashboard.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
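
Panels that miss a filter, like the four listed above, can be found with a simple scan of the dashboard JSON. The sketch below is a hypothetical helper, not part of this PR; it uses a plain substring check, so it does not parse PromQL and can be fooled by unusual spacing.

```python
def iter_panels(panels: list[dict]):
    """Yield every panel, including panels nested inside collapsed rows."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def targets_missing_matcher(dashboard: dict, matcher: str) -> list[tuple]:
    """Return (panel id, refId) for each query that lacks the given matcher."""
    missing = []
    for panel in iter_panels(dashboard.get("panels", [])):
        if "libraryPanel" in panel:
            continue  # library panels are maintained outside this dashboard
        for target in panel.get("targets", []):
            expr = target.get("expr", "")
            if expr and matcher not in expr:
                missing.append((panel.get("id"), target.get("refId", "?")))
    return missing
```

Running `targets_missing_matcher(dashboard, 'cluster=~"$cluster_name"')` on the loaded JSON returns an empty list once every non-library query carries the filter.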
…osit Team Overview

Applied ptd_site="$site_name" filter to all 22 pwb_* metric queries across
the dashboard, ensuring the Site dropdown filters all panels consistently
(previously only 2 of 24 panels were filtered). Also incremented dashboard
version from 1 to 8 to properly continue version tracking.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…m Overview

Changed all site filters from ptd_site="$site_name" to ptd_site=~"$site_name"
to support regex pattern matching, which allows the "All" option in the Site
dropdown to work correctly (when $site_name is set to ".*").

Updated all 22 pwb_* metric queries across the dashboard.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
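
The difference between the two matchers can be shown with a small model of label matching. This is a simplified sketch: PromQL regex matchers are fully anchored, which `re.fullmatch` approximates here, and the `matches` helper is hypothetical.

```python
import re


def matches(label_value: str, operator: str, pattern: str) -> bool:
    """Model PromQL's '=' (exact) and '=~' (fully anchored regex) matchers."""
    if operator == "=":
        return label_value == pattern
    if operator == "=~":
        return re.fullmatch(pattern, label_value) is not None
    raise ValueError(f"unsupported operator: {operator}")


# With "All" selected, $site_name expands to ".*".
print(matches("site1", "=", ".*"))    # exact match against the literal ".*"
print(matches("site1", "=~", ".*"))   # regex match against any value
```

With `=`, the literal string `.*` matches no real site, so every panel would come back empty when "All" is selected.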
…metrics

Changed "Registered users" panel from `max by(ptd_site, cluster)` to
`max by(cluster, ptd_site)` to match the consistent label ordering pattern
used throughout the dashboard and in the "Licensed users" panel.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added a complete "Connect" row with 17 panels that mirror the Workbench
metrics structure, providing visibility into Connect operational metrics.

Changes:
- Added Connect row at y=23 with 17 panels (IDs 100-117)
- Panels include HTTP metrics, user activity, queue metrics, and Shiny sessions
- Updated site_name variable query to include both pwb_build_info and go_build_info
- Updated dashboard version from 10 to 11
- Total panels increased from 18 to 36

Panel breakdown:
- 7 gauge panels for key metrics (content views, active sessions, users)
- 2 stat panels (build info, queue jobs)
- 8 timeseries panels for detailed monitoring

All panels use cluster and ptd_site filters for consistent filtering
across both Workbench and Connect sections.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Changed aggregation from sum to max for Connect metrics that are
reported identically by all pods, preventing double-counting when
multiple Connect pods are running.

Fixed panels:
- Panel 103: Active users (24h) - changed sum to max
- Panel 104: Active users (7d) - changed sum to max
- Panel 105: Active users (30d) - changed sum to max
- Panel 107: Queue jobs - changed sum to max

Rationale: connect_users_active and connect_jobs_queue_total_jobs_in_queue
represent global system state queried from shared resources (database, queue).
All Connect pods report identical values, so sum() incorrectly multiplies
by the number of pods. Using max() deduplicates across pods while preserving
the correct value.

This mirrors the fix in commit 29c44bf for Workbench license metrics.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
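
A small numeric example of the double-counting described above, using made-up pod names and values:

```python
# Three Connect pods all report the same global value read from shared state.
pod_samples = {"connect-0": 42.0, "connect-1": 42.0, "connect-2": 42.0}

summed = sum(pod_samples.values())        # inflated by the number of pods
deduplicated = max(pod_samples.values())  # the value each pod actually reported

print(f"sum: {summed}, max: {deduplicated}")
```

Here `sum` reports 126 queued jobs for a queue that holds 42, while `max` returns 42 regardless of how many pods are running.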
Fixed two panel titles where the title did not accurately reflect what
the query was calculating:

Panel 112 - Response size:
- Changed: "Response size/min (kb)" → "Response size/5m (kb)"
- Query: increase(...[5m]) / 1024
- Rationale: Query uses increase() over 5m window, which returns total
  bytes over 5 minutes, not per-minute rate. Title must match.
- This reverts the problematic title change from commit 14a8fa1 and
  restores the correct title from commit c8765d8.

Panel 109 - Request rate gauge:
- Changed: "Requests/5m" → "Requests/min (5m)"
- Query: rate(...[5m]) * 60
- Rationale: Query calculates rate (per-second) * 60 = per-minute rate.
  The "(5m)" clarifies the lookback window used for the rate calculation.
  Previous title "Requests/5m" incorrectly suggested total requests over
  5 minutes rather than a per-minute rate.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
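
The arithmetic behind both title fixes can be sketched as follows. This is a simplified model that ignores counter resets and the extrapolation Prometheus applies at window edges; the sample counter values are invented.

```python
WINDOW_SECONDS = 5 * 60  # the [5m] range used by both queries


def increase(first: float, last: float) -> float:
    """Total growth of a counter over the window."""
    return last - first


def rate(first: float, last: float, window_seconds: int = WINDOW_SECONDS) -> float:
    """Average per-second growth of a counter over the window."""
    return (last - first) / window_seconds


# Panel 109: rate(...[5m]) * 60 is a per-minute rate.
requests_per_minute = rate(1_000, 1_600) * 60

# Panel 112: increase(...[5m]) / 1024 is total kb over the 5-minute window.
response_kb_per_5m = increase(0, 5_242_880) / 1024

print(requests_per_minute, response_kb_per_5m)
```

For these samples, 600 requests in 5 minutes is 120 requests per minute, and 5,242,880 bytes is 5120 kb for the whole window.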
Added a complete "Package Manager" row with 18 panels that provide
comprehensive monitoring of Package Manager operational metrics.

Changes:
- Added Package Manager row at y=46 with 18 panels (IDs 200-218)
- Dashboard version updated from 14 to 15
- Total panels increased from 36 to 54

Panel Categories:

Storage Health (critical for PPM):
- Storage used % (gauge) - Calculated percentage of storage consumed
- Storage free (GB) (gauge) - Available storage space
- Storage used over time (timeseries) - Historical usage by path

HTTP Performance (parallel to Workbench/Connect):
- Requests/5m (timeseries) - HTTP request volume
- Requests/min (5m) (gauge) - Current request rate
- Avg response size (timeseries) - Response size by status code
- Avg resp size (5m) (gauge) - Current average response size
- Response size/5m (kb) (timeseries) - Total response bytes
- Requests in flight (stat + timeseries) - Current/historical in-flight requests

Package Operations (unique to PPM):
- Package downloads/min (24h) (gauge) - Average daily download rate
- Package downloads/5m (timeseries) - Download activity by repo

Repository Sync Operations:
- Sync duration (timeseries) - p95 sync time by source type (CRAN, PyPI, etc.)

Binary Routing:
- Binary routing fallbacks (timeseries) - Failed binary routing by reason

License Status:
- License days left (gauge) - Days until license expiry
- License expires (stat) - Expiry timestamp
- Failed Git builds (stat) - Total failed Git builds
- Build info (stat) - Version information

All panels use cluster and ptd_site filters for consistent filtering
across all three product sections (Workbench, Connect, Package Manager).

Key metrics use max() aggregation to prevent double-counting when multiple
PPM pods report identical global state (storage, license).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed two issues in Package Manager panels:

Panel 202 - Package downloads/min (24h):
- Removed `@ end()` modifier from query
- Previous: increase(...[24h] @ end()) / (24 * 60)
- Fixed: increase(...[24h]) / (24 * 60)
- Rationale: The @ end() modifier locks the query to the range end
  timestamp, breaking historical playback in time-range dashboards and
  causing incorrect behavior in real-time views. For a gauge showing
  current download rate, the query should evaluate at the current
  dashboard time, not a fixed endpoint.

Panel 216 - Storage used (GB):
- Added missing unit configuration: "unit": "decgbytes"
- Rationale: Panel 204 (Storage free GB) uses "decgbytes" unit for
  consistent GB formatting. Panel 216 was displaying raw metric values
  without unit formatting, creating display inconsistency. Both panels
  now show storage metrics with consistent GB display formatting.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…left panel

Changed panel 203 (License days left) to display the value with a fixed
"days" suffix instead of allowing Grafana to auto-convert to larger time
units like weeks.

Changes:
- Unit changed from "d" (days) to "none"
- Added custom suffix: " days"

Previous behavior:
- Value of 43 displayed as "6.14 weeks"
- Grafana's "d" unit automatically converts to larger time units

Fixed behavior:
- Value of 43 displays as "43 days"
- Plain numeric value with static " days" suffix

This provides clearer, more consistent display of license expiry timing
without confusing time unit conversions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
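
The display difference above comes down to simple unit scaling. The sketch below is a rough imitation of an auto-scaling days unit next to a fixed suffix; it is not Grafana's formatting code, and the 7-day threshold is an assumption.

```python
def auto_scaled(days: float) -> str:
    """Imitate a time unit that scales days up to weeks."""
    if days >= 7:
        return f"{days / 7:.2f} weeks"
    return f"{days:g} days"


def fixed_suffix(days: float) -> str:
    """Show the plain number with a static ' days' suffix."""
    return f"{days:g} days"


print(auto_scaled(43))   # 43 / 7 = 6.142857...
print(fixed_suffix(43))
```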
Added README.md to grafana_dashboards/ directory with complete
documentation for creating, editing, and deploying Grafana dashboards.

Documentation includes:

Dashboard Deployment:
- How dashboards are deployed via ConfigMaps
- Automatic Grafana provisioning process
- Important note: version field is not used for version control

Creating New Dashboards:
- Step-by-step UI workflow
- JSON export process
- JSON cleanup guidelines (removing ID, formatting)

Editing Existing Dashboards:
- UI editing workflow
- JSON update process
- Commit and deployment steps

Best Practices:
- Template variable usage (cluster_name, site_name)
- Panel naming conventions
- Query best practices (max vs sum, avoiding @ end())
- Panel layout guidelines
- Units configuration (preventing auto-conversion)

Testing:
- JSON syntax validation
- Deployment testing workflow
- Common issues and verification steps

Troubleshooting:
- Dashboard not updating after deployment
- Variables not populating
- Panels showing "N/A"
- JSON validation errors

This provides a complete reference for anyone working with PTD
Grafana dashboards, from first-time contributors to experienced
developers.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
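
For the JSON syntax validation step mentioned above, a minimal check could look like this. It is a sketch rather than the procedure from the README: the function name is hypothetical, and only syntax is checked, not the dashboard schema.

```python
import json
from pathlib import Path


def validate_dashboards(directory: Path) -> dict[str, str]:
    """Return a mapping of filename to error message for invalid JSON files."""
    errors = {}
    for path in sorted(directory.glob("*.json")):
        try:
            json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            errors[path.name] = f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return errors
```

An empty result from `validate_dashboards(Path("grafana_dashboards"))` means every dashboard file parsed cleanly.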
Fixed three critical inaccuracies in the Grafana dashboard README:

1. Corrected file reference (High severity):
   - Changed: pulumi_resources/aws_workload_helm.py
   - Fixed: pulumi_resources/aws_eks_cluster.py
   - Rationale: Dashboards are loaded by aws_eks_cluster.py method
     _create_dashboard_configmaps (line 2556), not aws_workload_helm.py

2. Added Azure limitation note (High severity):
   - Added prominent warning at top of Dashboard Deployment section
   - Documented that ConfigMap provisioning only works for AWS
   - Added Azure manual import workflow (Grafana UI → Import)
   - Rationale: Azure's _define_grafana in azure_workload_helm.py (line 602)
     does not configure Grafana sidecar for dashboard watching, unlike AWS
     which enables it at aws_eks_cluster.py:2110-2114. All deployment
     workflows in the README only apply to AWS.

3. Fixed UID field guidance (Medium severity):
   - Removed incorrect advice: "remove uid field to generate new identifier"
   - Added correct info: "uid automatically set to match filename"
   - Rationale: Code at aws_eks_cluster.py:2586 enforces uid = filename
     for idempotency. Auto-generated UIDs are not allowed.

These fixes ensure the documentation accurately reflects the actual
implementation and prevents confusion for users working with Azure
deployments or understanding how dashboard UIDs work.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
t-margheim and others added 2 commits March 11, 2026 15:38
Fixed two issues in the simplified Grafana dashboard README:

1. Removed incorrect "sanitized version" claim (Medium severity):
   - Changed: "uid is set to match a sanitized version of the filename"
   - Fixed: "uid is set to match the filename"
   - Rationale: Code at aws_eks_cluster.py:2586 sets dashboard_json["uid"]
     = dashboard_name, where dashboard_name is the raw filename stem (line
     2571). The sanitize_k8s_name() function (line 2574) is only applied
     to k8s_safe_name for the ConfigMap name, not the dashboard UID.

2. Added missing trailing newline (Low severity):
   - File ended without a newline, violating POSIX text file standards
   - Can cause issues with some tools and git diffs
   - Added newline at end of file

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
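
The distinction drawn above, between the raw filename stem used for the UID and the sanitized name used for the ConfigMap, can be sketched like this. The body of `sanitize_k8s_name` is a hypothetical stand-in (the real function is not shown in this PR), and `prepare_dashboard` is an illustrative name.

```python
import re
from pathlib import Path


def sanitize_k8s_name(name: str) -> str:
    """Hypothetical stand-in: lowercase and replace invalid characters with '-'."""
    return re.sub(r"[^a-z0-9-]+", "-", name.lower()).strip("-")


def prepare_dashboard(path: Path, dashboard: dict) -> tuple[str, dict]:
    """Return the ConfigMap-safe name and the dashboard with its uid enforced."""
    dashboard_name = path.stem                           # raw filename stem
    k8s_safe_name = sanitize_k8s_name(dashboard_name)    # only for the ConfigMap name
    return k8s_safe_name, {**dashboard, "uid": dashboard_name}


name, dashboard = prepare_dashboard(Path("k8s_Views Global.json"), {"uid": "old"})
print(name, dashboard["uid"])
```

For this made-up filename the ConfigMap name becomes `k8s-views-global`, while the UID stays `k8s_Views Global`.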
@t-margheim t-margheim marked this pull request as ready for review March 11, 2026 21:50
@t-margheim t-margheim requested a review from a team as a code owner March 11, 2026 21:50