Add Posit Team Overview Dashboard by t-margheim · Pull Request #178 · posit-dev/ptd

t-margheim · 2026-03-11T21:45:39Z

Description

This PR adds a new dashboard focused on Posit Team application metrics. I also included some basic documentation about how the dashboards work with the IaC.

Code Flow

The dashboard is defined in posit-team-overview.json, which is deployed in the cluster step as a ConfigMap. When Grafana spots a change to the ConfigMap, the corresponding dashboard is redeployed.
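The flow above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `dashboard_configmap`, the `grafana` namespace, and the `grafana_dashboard` label are assumptions (the label a Grafana sidecar watches for is configurable per deployment).

```python
import json


def dashboard_configmap(name: str, dashboard: dict, namespace: str = "grafana") -> dict:
    """Build a Kubernetes ConfigMap manifest that carries one dashboard as JSON."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,  # assumed namespace
            # Assumed label: the sidecar is typically configured to watch
            # ConfigMaps carrying a specific label and reload on change.
            "labels": {"grafana_dashboard": "1"},
        },
        # One data key per dashboard file; the value is the serialized JSON.
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }


manifest = dashboard_configmap("posit-team-overview", {"title": "Posit Team Overview"})
```

Changing the JSON changes the ConfigMap's `data`, which is the change a watching Grafana instance would react to.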

Category of change

Bug fix (non-breaking change which fixes an issue)
Version upgrade (upgrading the version of a service or product)
New feature (non-breaking change which adds functionality)
Build: a code change that affects the build system or external dependencies
Performance: a code change that improves performance
Refactor: a code change that neither fixes a bug nor adds a feature
Documentation: documentation changes
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist

I have reviewed my own diff and added inline comments on lines I want reviewers to focus on or that I am uncertain about

…erview dashboard

Add Grafana transformations and field overrides to the Running Version panel to improve readability and presentation:

- Hide unnecessary "Value" and "Time" columns using organize transformation
- Rename columns to user-friendly names (Site, Product, Version, Cluster)
- Set logical column order (Site → Product → Version → Cluster)
- Configure appropriate column widths (Site: 200px, Product: 150px, Version: 120px, Cluster: 150px)

This change is presentation-only and does not modify the underlying Prometheus query. The dashboard will now display a cleaner, more operator-friendly table view.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…overview

Fix three provisioning issues in the Posit Team Overview dashboard:

1. Set dashboard ID to null (was hardcoded to 48) to prevent import conflicts across different Grafana instances. Grafana auto-assigns IDs on import.
2. Add meaningful UID "posit_team_overview" (was empty string) for programmatic references and consistency with other dashboards.
3. Default cluster_name template variable to "All" (was hardcoded test cluster "default_duplicado03-staging-20250411-control-plane") to prevent exposure of internal cluster names and provide better default behavior.

These changes align the dashboard with provisioning best practices used in other dashboards (alerts_dashboard.json, k8s-views-global.json).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
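The three fixes above amount to a small normalization pass over an exported dashboard. The sketch below is a hypothetical helper (`normalize_for_provisioning` is not part of this PR), and the shape of the variable's `current` entry is an assumption based on typical Grafana exports.

```python
import copy


def normalize_for_provisioning(dashboard: dict, uid: str,
                               all_variables=("cluster_name",)) -> dict:
    """Return a copy of an exported dashboard that is safe to provision."""
    out = copy.deepcopy(dashboard)
    out["id"] = None  # serialized as JSON null; Grafana assigns an ID on import
    out["uid"] = uid  # stable identifier instead of an empty string
    for var in out.get("templating", {}).get("list", []):
        if var.get("name") in all_variables:
            # Assumed export shape for an "All" selection.
            var["current"] = {"selected": True, "text": "All", "value": "$__all"}
    return out


exported = {
    "id": 48,
    "uid": "",
    "templating": {"list": [{"name": "cluster_name",
                             "current": {"text": "some-test-cluster",
                                         "value": "some-test-cluster"}}]},
}
cleaned = normalize_for_provisioning(exported, uid="posit_team_overview")
```

Working on a deep copy keeps the exported dictionary intact, which makes it easy to compare the two before committing.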

…view transformations

The PromQL query at line 104 groups by (release_name, version, ptd_site) and does not include cluster, but the transformation configuration still referenced cluster in both indexByName and renameByName. This mismatch would cause the Cluster column to be missing or empty in the rendered table.

Removed orphaned references:
- Deleted "cluster": 3 from indexByName
- Deleted "cluster": "Cluster" from renameByName

The transformation now correctly matches the query output, displaying only Site, Product, and Version columns.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…mption gauge

Fixed two display issues in the "License Consumption by Site" gauge panel:

1. Division by Zero: Added value mappings to display "No Limit" (blue) for infinite values when license_seats is 0 (unlimited licenses), and "No Data" for NaN values when metrics are unavailable. This replaces the confusing infinity symbol (∞) with clear text.
2. Gauge Labels: Added field override using byName matcher to display only the site name (e.g., "site1") instead of showing the expression reference prefix (e.g., "C site1").

Changes to posit-team-overview.json:
- Added special value mappings in fieldConfig.defaults.mappings for inf and null+nan
- Replaced byValue override with byName override targeting expression "C"
- Set displayName to ${__field.labels.ptd_site} to extract only the site label

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
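To illustrate why the gauge showed ∞ in the first place: floating-point division by zero yields infinity (or NaN for 0/0), and the value mappings translate those special values into text. The sketch below mimics that behaviour in Python; the function names and the percent formatting are illustrative assumptions, not the panel's actual configuration.

```python
import math


def consumption_ratio(used: float, seats: float) -> float:
    """Divide with IEEE 754 semantics instead of raising ZeroDivisionError.

    Assumes non-negative inputs.
    """
    if seats == 0:
        return math.nan if used == 0 else math.inf
    return used / seats


def display(value: float) -> str:
    """Apply the two special-value mappings described above, else show a percentage."""
    if math.isnan(value):
        return "No Data"
    if math.isinf(value):
        return "No Limit"
    return f"{value:.0%}"


unlimited = display(consumption_ratio(10, 0))
half_used = display(consumption_ratio(5, 10))
missing = display(math.nan)
```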

- update dashboard ID to null
- correct UID format in Posit Team Overview
- correct infinity value mapping case

…d panels

Added cluster=~"$cluster_name" filter to all non-library panel queries to enable per-cluster filtering in the Posit Team Overview dashboard. This allows users to view metrics for specific clusters or all clusters using the dashboard variable.

Updated panels:
- Avg session start time (24h)
- Active IDE sessions
- Active user sessions
- Registered users
- Licensed users
- License expires
- Build info (both queries)
- Requests/min(1m)
- Avg resp secs (1m)
- Response size /min (kb)
- Request time quantiles (all 4 histogram queries)
- Active sessions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

… Team Overview

Addresses review feedback by adding cluster=~"$cluster_name" filter to four unlinked panels that were missing it:
- Panel 2: Avg request duration (secs)
- Panel 10: Requests in flight (second target)
- Panel 4: Session start secs
- Panel 8: Session start to join secs

Also restores cluster variable default to "All" ($__all) and increments dashboard version from 1 to 7 to track evolution properly.

These changes ensure consistent cluster filtering behavior across all panels when users select a specific cluster in the dashboard.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
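Panels that miss a filter, as in the review feedback above, can be found mechanically. The following is a hypothetical lint helper (not part of this PR) that walks a dashboard's panels, including panels nested inside collapsed rows, and reports every query target whose expression lacks the cluster filter. It assumes the usual dashboard JSON layout of `panels[].targets[].expr`.

```python
def targets_missing_filter(dashboard: dict,
                           needle: str = 'cluster=~"$cluster_name"') -> list:
    """Return (panel id, panel title, refId) for each query lacking `needle`."""
    missing = []

    def walk(panels):
        for panel in panels:
            for target in panel.get("targets", []):
                expr = target.get("expr", "")
                if expr and needle not in expr:
                    missing.append((panel.get("id"), panel.get("title"),
                                    target.get("refId")))
            # Collapsed rows keep their child panels in a nested "panels" list.
            walk(panel.get("panels", []))

    walk(dashboard.get("panels", []))
    return missing


sample = {"panels": [
    {"id": 1, "title": "Active sessions",
     "targets": [{"refId": "A", "expr": 'up{cluster=~"$cluster_name"}'}]},
    {"type": "row", "panels": [
        {"id": 4, "title": "Session start secs",
         "targets": [{"refId": "A", "expr": "some_metric"}]},
    ]},
]}
report = targets_missing_filter(sample)
```

A plain substring check is deliberately simple; it would miss filters written with different spacing, so treat the output as a prompt for review and not as proof.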

…osit Team Overview

Applied ptd_site="$site_name" filter to all 22 pwb_* metric queries across the dashboard, ensuring the Site dropdown filters all panels consistently (previously only 2 of 24 panels were filtered).

Also incremented dashboard version from 1 to 8 to properly continue version tracking.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…m Overview

Changed all site filters from ptd_site="$site_name" to ptd_site=~"$site_name" to support regex pattern matching, which allows the "All" option in the Site dropdown to work correctly (when $site_name is set to ".*").

Updated all 22 pwb_* metric queries across the dashboard.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
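The difference between the two matchers can be shown with a small sketch. Prometheus regex matchers are fully anchored, which `re.fullmatch` approximates here; Python's `re` is not the RE2 engine Prometheus uses, so this only holds for simple patterns like the ones below. The helper `label_matches` is illustrative.

```python
import re


def label_matches(op: str, pattern: str, value: str) -> bool:
    """Approximate PromQL label matching for the `=` and `=~` operators."""
    if op == "=":
        return value == pattern  # literal string equality
    if op == "=~":
        return re.fullmatch(pattern, value) is not None  # fully anchored regex
    raise ValueError(f"unsupported operator: {op}")


# With "All" selected, $site_name expands to ".*".
equality_with_all = label_matches("=", ".*", "site1")
regex_with_all = label_matches("=~", ".*", "site1")
```

With `=`, the "All" value only matches a site literally named `.*`, so every panel goes empty; with `=~`, it matches every site, while a single selected site still matches only itself.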

…metrics

Changed "Registered users" panel from `max by(ptd_site, cluster)` to `max by(cluster, ptd_site)` to match the consistent label ordering pattern used throughout the dashboard and in the "Licensed users" panel.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Added a complete "Connect" row with 17 panels that mirror the Workbench metrics structure, providing visibility into Connect operational metrics.

Changes:
- Added Connect row at y=23 with 17 panels (IDs 100-117)
- Panels include HTTP metrics, user activity, queue metrics, and Shiny sessions
- Updated site_name variable query to include both pwb_build_info and go_build_info
- Updated dashboard version from 10 to 11
- Total panels increased from 18 to 36

Panel breakdown:
- 7 gauge panels for key metrics (content views, active sessions, users)
- 2 stat panels (build info, queue jobs)
- 8 timeseries panels for detailed monitoring

All panels use cluster and ptd_site filters for consistent filtering across both Workbench and Connect sections.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Changed aggregation from sum to max for Connect metrics that are reported identically by all pods, preventing double-counting when multiple Connect pods are running.

Fixed panels:
- Panel 103: Active users (24h) - changed sum to max
- Panel 104: Active users (7d) - changed sum to max
- Panel 105: Active users (30d) - changed sum to max
- Panel 107: Queue jobs - changed sum to max

Rationale: connect_users_active and connect_jobs_queue_total_jobs_in_queue represent global system state queried from shared resources (database, queue). All Connect pods report identical values, so sum() incorrectly multiplies by the number of pods. Using max() deduplicates across pods while preserving the correct value.

This mirrors the fix in commit 29c44bf for Workbench license metrics.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
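A small numeric illustration of the rationale, using made-up pod names and values: when every pod reports the same global figure, summing scales it by the pod count, while taking the maximum returns the figure itself.

```python
# Hypothetical samples: three Connect pods each report the same global
# count of active users read from the shared database.
pod_samples = {"connect-0": 42.0, "connect-1": 42.0, "connect-2": 42.0}

summed = sum(pod_samples.values())   # 3 pods x 42 = 126, an overcount
deduped = max(pod_samples.values())  # 42, the value every pod agrees on
```

This only holds for metrics that describe shared state; per-pod counters such as request totals still need `sum`.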

Fixed two panel titles where the title did not accurately reflect what the query was calculating:

Panel 112 - Response size:
- Changed: "Response size/min (kb)" → "Response size/5m (kb)"
- Query: increase(...[5m]) / 1024
- Rationale: Query uses increase() over 5m window, which returns total bytes over 5 minutes, not per-minute rate. Title must match.
- This reverts the problematic title change from commit 14a8fa1 and restores the correct title from commit c8765d8.

Panel 109 - Request rate gauge:
- Changed: "Requests/5m" → "Requests/min (5m)"
- Query: rate(...[5m]) * 60
- Rationale: Query calculates rate (per-second) * 60 = per-minute rate. The "(5m)" clarifies the lookback window used for the rate calculation. Previous title "Requests/5m" incorrectly suggested total requests over 5 minutes rather than a per-minute rate.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
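The distinction between the two queries can be worked through with simplified arithmetic. The functions below are toy versions of `increase()` and `rate()` that use only the first and last sample in the window; they ignore the extrapolation and counter-reset handling that Prometheus performs, and the sample values are invented.

```python
def increase(samples):
    """Total growth of a counter across the window (first to last sample)."""
    return samples[-1][1] - samples[0][1]


def rate(samples):
    """Average per-second growth of a counter across the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# (timestamp in seconds, counter value) over a 5-minute window
requests = [(0, 1000.0), (300, 1600.0)]
response_bytes = [(0, 0.0), (300, 3072000.0)]

requests_per_min = rate(requests) * 60          # per-second rate scaled to a minute
kb_per_5m = increase(response_bytes) / 1024     # total over the whole window
```

Here the request counter grows by 600 over 300 seconds, so the gauge reads 120 requests per minute, while the response-size query reports 3000 kb for the full five minutes. The units differ, which is why the titles have to differ too.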

Added a complete "Package Manager" row with 18 panels that provide comprehensive monitoring of Package Manager operational metrics.

Changes:
- Added Package Manager row at y=46 with 18 panels (IDs 200-218)
- Dashboard version updated from 14 to 15
- Total panels increased from 36 to 54

Panel Categories:

Storage Health (critical for PPM):
- Storage used % (gauge) - Calculated percentage of storage consumed
- Storage free (GB) (gauge) - Available storage space
- Storage used over time (timeseries) - Historical usage by path

HTTP Performance (parallel to Workbench/Connect):
- Requests/5m (timeseries) - HTTP request volume
- Requests/min (5m) (gauge) - Current request rate
- Avg response size (timeseries) - Response size by status code
- Avg resp size (5m) (gauge) - Current average response size
- Response size/5m (kb) (timeseries) - Total response bytes
- Requests in flight (stat + timeseries) - Current/historical in-flight requests

Package Operations (unique to PPM):
- Package downloads/min (24h) (gauge) - Average daily download rate
- Package downloads/5m (timeseries) - Download activity by repo

Repository Sync Operations:
- Sync duration (timeseries) - p95 sync time by source type (CRAN, PyPI, etc.)

Binary Routing:
- Binary routing fallbacks (timeseries) - Failed binary routing by reason

License Status:
- License days left (gauge) - Days until license expiry
- License expires (stat) - Expiry timestamp
- Failed Git builds (stat) - Total failed Git builds
- Build info (stat) - Version information

All panels use cluster and ptd_site filters for consistent filtering across all three product sections (Workbench, Connect, Package Manager). Key metrics use max() aggregation to prevent double-counting when multiple PPM pods report identical global state (storage, license).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fixed two issues in Package Manager panels:

Panel 202 - Package downloads/min (24h):
- Removed `@ end()` modifier from query
- Previous: increase(...[24h] @ end()) / (24 * 60)
- Fixed: increase(...[24h]) / (24 * 60)
- Rationale: The @ end() modifier locks the query to the range end timestamp, breaking historical playback in time-range dashboards and causing incorrect behavior in real-time views. For a gauge showing current download rate, the query should evaluate at the current dashboard time, not a fixed endpoint.

Panel 216 - Storage used (GB):
- Added missing unit configuration: "unit": "decgbytes"
- Rationale: Panel 204 (Storage free GB) uses "decgbytes" unit for consistent GB formatting. Panel 216 was displaying raw metric values without unit formatting, creating display inconsistency.

Both panels now show storage metrics with consistent GB display formatting.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…left panel

Changed panel 203 (License days left) to display the value with a fixed "days" suffix instead of allowing Grafana to auto-convert to larger time units like weeks.

Changes:
- Unit changed from "d" (days) to "none"
- Added custom suffix: " days"

Previous behavior:
- Value of 43 displayed as "6.14 weeks"
- Grafana's "d" unit automatically converts to larger time units

Fixed behavior:
- Value of 43 displays as "43 days"
- Plain numeric value with static " days" suffix

This provides clearer, more consistent display of license expiry timing without confusing time unit conversions.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
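The "6.14 weeks" figure is simply 43 divided by 7. The sketch below reproduces both displays; it is an illustrative stand-in for the two behaviours described above, not Grafana's actual unit formatter, and the function names are made up.

```python
def as_auto_scaled(days: int) -> str:
    """Mimic a unit that promotes values of a week or more to weeks."""
    if days >= 7:
        return f"{days / 7:.2f} weeks"
    return f"{days} days"


def as_fixed_days(days: int) -> str:
    """Plain number with a static ' days' suffix."""
    return f"{days} days"


before = as_auto_scaled(43)
after = as_fixed_days(43)
```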

Added README.md to grafana_dashboards/ directory with complete documentation for creating, editing, and deploying Grafana dashboards.

Documentation includes:

Dashboard Deployment:
- How dashboards are deployed via ConfigMaps
- Automatic Grafana provisioning process
- Important note: version field is not used for version control

Creating New Dashboards:
- Step-by-step UI workflow
- JSON export process
- JSON cleanup guidelines (removing ID, formatting)

Editing Existing Dashboards:
- UI editing workflow
- JSON update process
- Commit and deployment steps

Best Practices:
- Template variable usage (cluster_name, site_name)
- Panel naming conventions
- Query best practices (max vs sum, avoiding @ end())
- Panel layout guidelines
- Units configuration (preventing auto-conversion)

Testing:
- JSON syntax validation
- Deployment testing workflow
- Common issues and verification steps

Troubleshooting:
- Dashboard not updating after deployment
- Variables not populating
- Panels showing "N/A"
- JSON validation errors

This provides a complete reference for anyone working with PTD Grafana dashboards, from first-time contributors to experienced developers.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
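The "JSON syntax validation" step listed above can be done with the standard library alone. The helper below is a hypothetical example and not necessarily the procedure the README describes: it returns a short error location for invalid text and `None` for valid JSON.

```python
import json


def json_error(text: str):
    """Return a human-readable parse error, or None if `text` is valid JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


valid = json_error('{"title": "Posit Team Overview"}')
trailing_comma = json_error('{"title": "Posit Team Overview",}')
```

A trailing comma is a common leftover after hand-editing an exported dashboard, and it is rejected by strict JSON parsers.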

Fixed three critical inaccuracies in the Grafana dashboard README:

1. Corrected file reference (High severity):
   - Changed: pulumi_resources/aws_workload_helm.py
   - Fixed: pulumi_resources/aws_eks_cluster.py
   - Rationale: Dashboards are loaded by aws_eks_cluster.py method _create_dashboard_configmaps (line 2556), not aws_workload_helm.py

2. Added Azure limitation note (High severity):
   - Added prominent warning at top of Dashboard Deployment section
   - Documented that ConfigMap provisioning only works for AWS
   - Added Azure manual import workflow (Grafana UI → Import)
   - Rationale: Azure's _define_grafana in azure_workload_helm.py (line 602) does not configure Grafana sidecar for dashboard watching, unlike AWS which enables it at aws_eks_cluster.py:2110-2114. All deployment workflows in the README only apply to AWS.

3. Fixed UID field guidance (Medium severity):
   - Removed incorrect advice: "remove uid field to generate new identifier"
   - Added correct info: "uid automatically set to match filename"
   - Rationale: Code at aws_eks_cluster.py:2586 enforces uid = filename for idempotency. Auto-generated UIDs are not allowed.

These fixes ensure the documentation accurately reflects the actual implementation and prevents confusion for users working with Azure deployments or understanding how dashboard UIDs work.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Fixed two issues in the simplified Grafana dashboard README:

1. Removed incorrect "sanitized version" claim (Medium severity):
   - Changed: "uid is set to match a sanitized version of the filename"
   - Fixed: "uid is set to match the filename"
   - Rationale: Code at aws_eks_cluster.py:2586 sets dashboard_json["uid"] = dashboard_name, where dashboard_name is the raw filename stem (line 2571). The sanitize_k8s_name() function (line 2574) is only applied to k8s_safe_name for the ConfigMap name, not the dashboard UID.

2. Added missing trailing newline (Low severity):
   - File ended without a newline, violating POSIX text file standards
   - Can cause issues with some tools and git diffs
   - Added newline at end of file

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
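The distinction in item 1 can be sketched as follows. The function `k8s_safe_name` is a hypothetical stand-in written for this example; the repository's real `sanitize_k8s_name()` is not shown in this PR and may behave differently. Only the split between the raw stem (used as the UID) and the sanitized form (used for the ConfigMap name) is taken from the commit message.

```python
import re
from pathlib import Path


def k8s_safe_name(name: str) -> str:
    """Hypothetical sanitizer: lowercase, with runs of invalid characters replaced by '-'."""
    cleaned = re.sub(r"[^a-z0-9-]+", "-", name.lower()).strip("-")
    return cleaned[:253].strip("-")


def names_for(path: str):
    """Return (dashboard uid, ConfigMap name) for a dashboard file path."""
    stem = Path(path).stem           # raw filename stem, used unchanged as the uid
    return stem, k8s_safe_name(stem)  # only the ConfigMap name is sanitized


uid, configmap_name = names_for("grafana_dashboards/alerts_dashboard.json")
```

For a file such as `alerts_dashboard.json`, the UID keeps its underscore while the ConfigMap name does not, which is why describing the UID as "sanitized" was inaccurate.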

t-margheim and others added 30 commits March 11, 2026 10:52

Rename file

5abff79

Simplify table

4d3354e

fix(grafana):

f18abf8

- update dashboard ID to null
- correct UID format in Posit Team Overview
- correct infinity value mapping case

Unlink panels

83e1827

Start adding site to queries

71d750f

Aggregate user metrics

29c44bf

Add grouping for Workbench metrics

425ac88

Change scrape interval for some panels

8ca1d88

Fix resp bytes metric

65e4cfc

Fix panel title

c8765d8

Remove empty build info panel, update collection intervals

14a8fa1

Minor UI adjustments

c22ec8a

Remove some invalid metrics

f625a26

t-margheim and others added 2 commits March 11, 2026 15:38

Simplify documentation

9446605

t-margheim marked this pull request as ready for review March 11, 2026 21:50

t-margheim requested a review from a team as a code owner March 11, 2026 21:50

t-margheim wants to merge 32 commits into main from dashboard-posit-team-overview
