From d64fff812f2dc9aa42edcf29bf1c715ba5fb059f Mon Sep 17 00:00:00 2001 From: Jens Neuse Date: Wed, 14 Jan 2026 06:25:36 -0500 Subject: [PATCH] feat: add capability matrix --- capabilities/access-control/api-keys.md | 183 +++++++++++ capabilities/access-control/audit-logging.md | 183 +++++++++++ capabilities/access-control/groups.md | 178 +++++++++++ capabilities/access-control/rbac.md | 170 +++++++++++ .../access-control/scim-provisioning.md | 184 +++++++++++ .../access-control/session-management.md | 182 +++++++++++ capabilities/access-control/single-sign-on.md | 186 +++++++++++ .../access-control/user-invitations.md | 184 +++++++++++ capabilities/ai/mcp.md | 234 ++++++++++++++ capabilities/analytics/analytics-dashboard.md | 148 +++++++++ .../analytics/client-identification.md | 162 ++++++++++ capabilities/analytics/metrics-analytics.md | 147 +++++++++ capabilities/analytics/operations-tracking.md | 161 ++++++++++ capabilities/analytics/schema-field-usage.md | 149 +++++++++ capabilities/analytics/trace-analytics.md | 147 +++++++++ capabilities/cli/cosmo-cli.md | 190 ++++++++++++ .../compliance/advanced-data-privacy.md | 164 ++++++++++ .../compliance/compliance-certifications.md | 151 +++++++++ capabilities/compliance/ip-anonymization.md | 160 ++++++++++ .../compliance/variable-export-control.md | 160 ++++++++++ capabilities/deployment/cluster-management.md | 140 +++++++++ capabilities/deployment/cosmo-cloud.md | 138 +++++++++ capabilities/deployment/docker.md | 151 +++++++++ capabilities/deployment/kubernetes.md | 144 +++++++++ .../router-compatibility-versions.md | 138 +++++++++ capabilities/deployment/self-hosted.md | 142 +++++++++ capabilities/deployment/storage-providers.md | 141 +++++++++ capabilities/deployment/terraform.md | 142 +++++++++ .../breaking-change-overrides.md | 155 ++++++++++ .../developer-experience/changelog.md | 151 +++++++++ .../custom-playground-scripts.md | 155 ++++++++++ .../developer-experience/graph-pruning.md | 158 ++++++++++ 
.../graphiql-playground.md | 146 +++++++++ .../developer-experience/lint-policies.md | 169 ++++++++++ .../query-plan-visualization.md | 146 +++++++++ .../developer-experience/schema-explorer.md | 149 +++++++++ .../shared-playground-state.md | 155 ++++++++++ capabilities/extensibility/custom-modules.md | 157 ++++++++++ .../subgraph-check-extensions.md | 156 ++++++++++ capabilities/feature-flags/feature-flags.md | 230 ++++++++++++++ .../federation/federation-directives.md | 201 ++++++++++++ capabilities/federation/graphql-federation.md | 171 +++++++++++ capabilities/federation/monograph-support.md | 208 +++++++++++++ capabilities/federation/schema-checks.md | 183 +++++++++++ capabilities/federation/schema-composition.md | 171 +++++++++++ capabilities/federation/schema-contracts.md | 187 ++++++++++++ capabilities/federation/schema-registry.md | 170 +++++++++++ .../federation/subgraph-management.md | 223 ++++++++++++++ capabilities/grpc/cosmo-connect.md | 209 +++++++++++++ capabilities/grpc/grpc-services.md | 213 +++++++++++++ capabilities/grpc/router-plugins.md | 215 +++++++++++++ capabilities/index.md | 276 +++++++++++++++++ capabilities/migration/apollo-migration.md | 156 ++++++++++ .../migration/apollo-router-migration.md | 163 ++++++++++ .../migration/federation-compatibility.md | 174 +++++++++++ .../notifications/alerts-notifications.md | 199 ++++++++++++ .../notifications/slack-integration.md | 200 ++++++++++++ .../notifications/webhook-notifications.md | 222 ++++++++++++++ capabilities/observability/access-logs.md | 233 ++++++++++++++ .../observability/advanced-request-tracing.md | 201 ++++++++++++ .../observability/distributed-tracing.md | 204 +++++++++++++ .../observability/grafana-integration.md | 212 +++++++++++++ capabilities/observability/opentelemetry.md | 205 +++++++++++++ .../otel-collector-integration.md | 248 +++++++++++++++ capabilities/observability/profiling.md | 226 ++++++++++++++ .../observability/prometheus-metrics.md | 219 +++++++++++++ 
.../automatic-persisted-queries.md | 177 +++++++++++ capabilities/performance/cache-control.md | 181 +++++++++++ capabilities/performance/cache-warmer.md | 165 ++++++++++ .../performance/performance-debugging.md | 162 ++++++++++ .../performance/persisted-operations.md | 147 +++++++++ capabilities/proxy/file-upload.md | 148 +++++++++ .../proxy/forward-client-extensions.md | 150 +++++++++ .../proxy/override-subgraph-config.md | 146 +++++++++ .../proxy/request-header-operations.md | 149 +++++++++ .../proxy/response-header-operations.md | 148 +++++++++ capabilities/real-time/cosmo-streams.md | 211 +++++++++++++ .../real-time/graphql-subscriptions.md | 195 ++++++++++++ capabilities/router/config-hot-reload.md | 170 +++++++++++ capabilities/router/development-mode.md | 156 ++++++++++ .../router/graphql-federation-router.md | 141 +++++++++ capabilities/router/query-batching.md | 164 ++++++++++ capabilities/router/query-planning.md | 140 +++++++++ capabilities/router/router-configuration.md | 147 +++++++++ .../security/authorization-directives.md | 159 ++++++++++ capabilities/security/config-signing.md | 164 ++++++++++ .../security/introspection-control.md | 178 +++++++++++ capabilities/security/jwt-authentication.md | 159 ++++++++++ capabilities/security/security-hardening.md | 217 +++++++++++++ .../security/subgraph-error-propagation.md | 214 +++++++++++++ capabilities/security/tls-https.md | 158 ++++++++++ capabilities/template.md | 289 ++++++++++++++++++ .../traffic-management/circuit-breaker.md | 238 +++++++++++++++ .../traffic-management/retry-mechanism.md | 216 +++++++++++++ .../timeout-configuration.md | 220 +++++++++++++ .../traffic-management/traffic-shaping.md | 202 ++++++++++++ 96 files changed, 17086 insertions(+) create mode 100644 capabilities/access-control/api-keys.md create mode 100644 capabilities/access-control/audit-logging.md create mode 100644 capabilities/access-control/groups.md create mode 100644 capabilities/access-control/rbac.md create mode 
100644 capabilities/access-control/scim-provisioning.md create mode 100644 capabilities/access-control/session-management.md create mode 100644 capabilities/access-control/single-sign-on.md create mode 100644 capabilities/access-control/user-invitations.md create mode 100644 capabilities/ai/mcp.md create mode 100644 capabilities/analytics/analytics-dashboard.md create mode 100644 capabilities/analytics/client-identification.md create mode 100644 capabilities/analytics/metrics-analytics.md create mode 100644 capabilities/analytics/operations-tracking.md create mode 100644 capabilities/analytics/schema-field-usage.md create mode 100644 capabilities/analytics/trace-analytics.md create mode 100644 capabilities/cli/cosmo-cli.md create mode 100644 capabilities/compliance/advanced-data-privacy.md create mode 100644 capabilities/compliance/compliance-certifications.md create mode 100644 capabilities/compliance/ip-anonymization.md create mode 100644 capabilities/compliance/variable-export-control.md create mode 100644 capabilities/deployment/cluster-management.md create mode 100644 capabilities/deployment/cosmo-cloud.md create mode 100644 capabilities/deployment/docker.md create mode 100644 capabilities/deployment/kubernetes.md create mode 100644 capabilities/deployment/router-compatibility-versions.md create mode 100644 capabilities/deployment/self-hosted.md create mode 100644 capabilities/deployment/storage-providers.md create mode 100644 capabilities/deployment/terraform.md create mode 100644 capabilities/developer-experience/breaking-change-overrides.md create mode 100644 capabilities/developer-experience/changelog.md create mode 100644 capabilities/developer-experience/custom-playground-scripts.md create mode 100644 capabilities/developer-experience/graph-pruning.md create mode 100644 capabilities/developer-experience/graphiql-playground.md create mode 100644 capabilities/developer-experience/lint-policies.md create mode 100644 
capabilities/developer-experience/query-plan-visualization.md create mode 100644 capabilities/developer-experience/schema-explorer.md create mode 100644 capabilities/developer-experience/shared-playground-state.md create mode 100644 capabilities/extensibility/custom-modules.md create mode 100644 capabilities/extensibility/subgraph-check-extensions.md create mode 100644 capabilities/feature-flags/feature-flags.md create mode 100644 capabilities/federation/federation-directives.md create mode 100644 capabilities/federation/graphql-federation.md create mode 100644 capabilities/federation/monograph-support.md create mode 100644 capabilities/federation/schema-checks.md create mode 100644 capabilities/federation/schema-composition.md create mode 100644 capabilities/federation/schema-contracts.md create mode 100644 capabilities/federation/schema-registry.md create mode 100644 capabilities/federation/subgraph-management.md create mode 100644 capabilities/grpc/cosmo-connect.md create mode 100644 capabilities/grpc/grpc-services.md create mode 100644 capabilities/grpc/router-plugins.md create mode 100644 capabilities/index.md create mode 100644 capabilities/migration/apollo-migration.md create mode 100644 capabilities/migration/apollo-router-migration.md create mode 100644 capabilities/migration/federation-compatibility.md create mode 100644 capabilities/notifications/alerts-notifications.md create mode 100644 capabilities/notifications/slack-integration.md create mode 100644 capabilities/notifications/webhook-notifications.md create mode 100644 capabilities/observability/access-logs.md create mode 100644 capabilities/observability/advanced-request-tracing.md create mode 100644 capabilities/observability/distributed-tracing.md create mode 100644 capabilities/observability/grafana-integration.md create mode 100644 capabilities/observability/opentelemetry.md create mode 100644 capabilities/observability/otel-collector-integration.md create mode 100644 
capabilities/observability/profiling.md create mode 100644 capabilities/observability/prometheus-metrics.md create mode 100644 capabilities/performance/automatic-persisted-queries.md create mode 100644 capabilities/performance/cache-control.md create mode 100644 capabilities/performance/cache-warmer.md create mode 100644 capabilities/performance/performance-debugging.md create mode 100644 capabilities/performance/persisted-operations.md create mode 100644 capabilities/proxy/file-upload.md create mode 100644 capabilities/proxy/forward-client-extensions.md create mode 100644 capabilities/proxy/override-subgraph-config.md create mode 100644 capabilities/proxy/request-header-operations.md create mode 100644 capabilities/proxy/response-header-operations.md create mode 100644 capabilities/real-time/cosmo-streams.md create mode 100644 capabilities/real-time/graphql-subscriptions.md create mode 100644 capabilities/router/config-hot-reload.md create mode 100644 capabilities/router/development-mode.md create mode 100644 capabilities/router/graphql-federation-router.md create mode 100644 capabilities/router/query-batching.md create mode 100644 capabilities/router/query-planning.md create mode 100644 capabilities/router/router-configuration.md create mode 100644 capabilities/security/authorization-directives.md create mode 100644 capabilities/security/config-signing.md create mode 100644 capabilities/security/introspection-control.md create mode 100644 capabilities/security/jwt-authentication.md create mode 100644 capabilities/security/security-hardening.md create mode 100644 capabilities/security/subgraph-error-propagation.md create mode 100644 capabilities/security/tls-https.md create mode 100644 capabilities/template.md create mode 100644 capabilities/traffic-management/circuit-breaker.md create mode 100644 capabilities/traffic-management/retry-mechanism.md create mode 100644 capabilities/traffic-management/timeout-configuration.md create mode 100644 
capabilities/traffic-management/traffic-shaping.md diff --git a/capabilities/access-control/api-keys.md b/capabilities/access-control/api-keys.md new file mode 100644 index 00000000..c833eba7 --- /dev/null +++ b/capabilities/access-control/api-keys.md @@ -0,0 +1,183 @@ +# API Keys + +Granular API key permissions with resource-level access for secure automation and CI/CD integration. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ac-003` | +| **Category** | Access Control | +| **Status** | GA | +| **Availability** | Free, Pro, Scale, Enterprise | +| **Related Capabilities** | `cap-ac-001` (RBAC), `cap-ac-002` (Groups), `cap-ac-005` (SCIM) | + +--- + +## Quick Reference + +### Name +API Keys + +### Tagline +Secure automation with granular permissions. + +### Elevator Pitch +API Keys enable secure programmatic access to Cosmo for automation, CI/CD pipelines, and CLI usage. Each key can be scoped to specific groups and permissions, ensuring your automated systems have exactly the access they need - no more, no less. Built-in expiration controls and group-based permissions make key management simple and secure. + +--- + +## Problem & Solution + +### The Problem +Automation and CI/CD systems need programmatic access to manage schemas, run checks, and deploy changes. Without proper controls, API keys often receive overly broad permissions, creating security risks. Managing multiple keys across different systems and environments becomes complex, and rotating or revoking keys is error-prone. + +### The Solution +Cosmo API Keys integrate with the groups system, inheriting permissions from assigned groups. Keys can be created with specific expiration dates, assigned to groups with precisely scoped permissions, and managed centrally. Special permissions like SCIM can be enabled only when needed, maintaining the principle of least privilege. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| All-or-nothing API key access | Group-based granular permissions | +| No key expiration management | Configurable expiration dates | +| Unclear what each key can access | Permissions visible via group assignment | +| Separate permission systems for keys and users | Unified groups for both users and API keys | + +--- + +## Key Benefits + +1. **Group-Based Permissions**: API keys inherit permissions from assigned groups, using the same permission model as users for consistency. + +2. **Flexible Expiration**: Set expiration dates when creating keys, or choose never-expiring keys for long-running automation with proper security controls. + +3. **Scoped Access**: Through groups, limit keys to specific namespaces, graphs, or subgraphs, following the principle of least privilege. + +4. **Special Permissions**: Enable additional capabilities like SCIM provisioning only when needed, keeping default keys minimal. + +5. **Centralized Management**: View and manage all organization API keys from a single interface with visibility into creator, expiration, and group assignment. + +--- + +## Target Audience + +### Primary Persona +- **Role**: DevOps Engineer / Platform Engineer +- **Pain Points**: Managing CI/CD credentials securely, ensuring automation has correct permissions, rotating keys safely +- **Goals**: Enable automation without security risks, maintain visibility into what automated systems can do + +### Secondary Personas +- Security engineers auditing API access +- Developers using CLI tools locally +- Infrastructure teams managing multiple environments + +--- + +## Use Cases + +### Use Case 1: CI/CD Pipeline Integration +**Scenario**: A development team needs their GitHub Actions workflow to publish schema changes to their subgraph on every merge to main. + +**How it works**: +1. Create a group with Subgraph Publisher role scoped to the team's subgraph +2. 
Generate an API key assigned to this group +3. Store the key in GitHub Secrets +4. Configure the workflow to use the key with the wgc CLI + +**Outcome**: Automated schema publishing with permissions limited to exactly what the pipeline needs. + +### Use Case 2: Local Development Access +**Scenario**: Developers need CLI access to check schemas and explore the graph during development without admin permissions. + +**How it works**: +1. Create a group with Subgraph Checker role for dev namespace and Graph Viewer for visibility +2. Generate personal API keys for each developer assigned to this group +3. Developers configure their local wgc CLI with their personal key + +**Outcome**: Developers can work productively while production resources remain protected. + +### Use Case 3: SCIM Integration Setup +**Scenario**: An organization wants to automate user provisioning from Okta to Cosmo. + +**How it works**: +1. Create an API key with SCIM permission enabled +2. Set expiration to "Never" for long-running integration +3. Configure Okta SCIM connector with the API key as the authorization header +4. Users added in Okta automatically receive invitations to join the organization + +**Outcome**: Fully automated user lifecycle management without manual intervention. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Unified permission model for users and API keys through groups +2. SCIM permission enables enterprise identity automation +3. Clear visibility into key permissions via group assignment + +--- + +## Technical Summary + +### How It Works +API keys are created through Cosmo Studio and assigned to a group at creation time. The key inherits all permissions from the assigned group's rules. When making API calls using the key, Cosmo validates the request against the group's permissions. Special permissions like SCIM are configured separately during key creation. 
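As a concrete illustration of the CI/CD flow from Use Case 1, a pipeline step might wrap the wgc CLI in a small helper. This is a minimal sketch, not an official integration: the subgraph name, namespace, and schema path are placeholders, and it assumes the key is supplied through the `COSMO_API_KEY` environment variable as described in the wgc CLI documentation.

```python
import os
import subprocess

def publish_subgraph(name: str, schema_path: str, namespace: str = "default") -> None:
    """Publish a subgraph schema from CI using the wgc CLI.

    Assumes the group-scoped API key is provided via the COSMO_API_KEY
    environment variable; all resource names here are illustrative.
    """
    api_key = os.environ.get("COSMO_API_KEY")
    if not api_key:
        raise RuntimeError(
            "COSMO_API_KEY is not set; generate a group-scoped key in Cosmo Studio"
        )
    # check=True fails the CI job if the publish is rejected,
    # e.g. when the key's group lacks the Subgraph Publisher role.
    subprocess.run(
        ["npx", "wgc", "subgraph", "publish", name,
         "--namespace", namespace,
         "--schema", schema_path],
        check=True,
        env={**os.environ, "COSMO_API_KEY": api_key},
    )
```

Because the key inherits its permissions from the assigned group, the helper itself needs no permission logic; a rejected publish surfaces as a failed CI step.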
+ +### Key Technical Features +- Create keys with name, expiration, and group assignment +- One-time key display with secure copy functionality +- Configurable expiration: 30 days, 6 months, 1 year, or never +- SCIM permission for identity management integration +- Keys visible to Admin and Developer roles only +- Legacy resource-scoped keys migrated to groups system + +### Integration Points +- wgc CLI for local and automation use +- CI/CD platforms (GitHub Actions, GitLab CI, Jenkins, etc.) +- SCIM-compatible identity providers (Okta, etc.) +- Custom automation scripts and tools + +### Requirements & Prerequisites +- Admin or Developer role to create API keys +- RBAC enabled for group-based permissions +- Secure storage for keys (shown only once at creation) + +--- + +## Documentation References + +- Primary docs: `/docs/studio/api-keys` +- API key permissions: `/docs/studio/api-keys/api-key-permissions` +- API key resources: `/docs/studio/api-keys/api-key-resources` +- Groups: `/docs/studio/groups` +- CLI reference: `/docs/cli/intro` + +--- + +## Keywords & SEO + +### Primary Keywords +- API key management +- GraphQL API keys +- Automation credentials + +### Secondary Keywords +- CI/CD API access +- Programmatic GraphQL access +- Federation automation + +### Related Search Terms +- How to create API keys for GraphQL +- Secure CI/CD credentials for federation +- SCIM API key configuration + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/access-control/audit-logging.md b/capabilities/access-control/audit-logging.md new file mode 100644 index 00000000..0631bb30 --- /dev/null +++ b/capabilities/access-control/audit-logging.md @@ -0,0 +1,183 @@ +# Audit Logging + +Complete audit trail of all user and API actions for security, compliance, and operational visibility. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ac-006` | +| **Category** | Access Control | +| **Status** | GA | +| **Availability** | Pro, Scale, Enterprise | +| **Related Capabilities** | `cap-ac-001` (RBAC), `cap-ac-003` (API Keys), `cap-ac-008` (Sessions) | + +--- + +## Quick Reference + +### Name +Audit Logging + +### Tagline +Complete visibility into all organization actions. + +### Elevator Pitch +Audit Logging provides a detailed, immutable record of every action taken within your organization - whether by users directly or through API keys. Track who did what, when they did it, and how they authenticated, giving you the visibility needed for security analysis, compliance reporting, and incident investigation. + +--- + +## Problem & Solution + +### The Problem +When security incidents occur or compliance audits require documentation, organizations often struggle to answer basic questions: Who made this change? When did it happen? Was it a human or an automated system? Without comprehensive audit trails, incident response is slow, accountability is unclear, and compliance becomes a manual, error-prone exercise. + +### The Solution +Cosmo's Audit Logging automatically captures every significant action in your organization with rich context including the actor (user or API key), the action performed, the affected resources, and the timestamp. Logs are presented in chronological order with clear visual indicators distinguishing human actions from automated ones, making it simple to investigate incidents and demonstrate compliance. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Unknown who made changes | Clear actor attribution | +| No visibility into API key actions | Separate tracking for user vs key actions | +| Manual compliance documentation | Automatic audit trail | +| Slow incident investigation | Searchable, filterable log history | + +--- + +## Key Benefits + +1. 
**Complete Visibility**: Every action is logged with full context - actor, action, resource, and timestamp - providing end-to-end visibility into organization activity. + +2. **Actor Attribution**: Clear distinction between actions performed by users directly versus actions performed through API keys, with visual indicators for quick identification. + +3. **Compliance Ready**: Immutable audit records support compliance requirements for SOC 2, GDPR, and other regulatory frameworks requiring activity logging. + +4. **Incident Response**: When issues occur, quickly identify what changed, who made the change, and when it happened, accelerating root cause analysis. + +5. **Platform Events**: System-generated events from the Cosmo platform are also logged, providing visibility into automated platform actions alongside user activities. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Security Engineer / Compliance Officer +- **Pain Points**: Demonstrating compliance during audits, investigating security incidents, understanding system changes +- **Goals**: Maintain comprehensive audit trails, enable fast incident response, satisfy compliance requirements + +### Secondary Personas +- Platform administrators troubleshooting configuration issues +- Engineering managers understanding team activity +- DevOps engineers investigating production changes + +--- + +## Use Cases + +### Use Case 1: Security Incident Investigation +**Scenario**: An unexpected schema change was deployed to production and the team needs to understand what happened. + +**How it works**: +1. Navigate to the Audit Log in your organization settings +2. Filter by the relevant time range and resource +3. Identify the exact action, actor, and timestamp +4. Determine if the action was via user interface or API key +5. Follow up with the appropriate team or review API key usage + +**Outcome**: Root cause identified in minutes rather than hours, with clear accountability established. 
+ +### Use Case 2: Compliance Audit Documentation +**Scenario**: An auditor requires evidence of access control and change management processes for SOC 2 compliance. + +**How it works**: +1. Access the Audit Log for the relevant time period +2. Export or screenshot evidence of role assignments and changes +3. Demonstrate that all changes are attributed to specific actors +4. Show separation of duties through role-based actions + +**Outcome**: Compliance evidence readily available without manual documentation efforts. + +### Use Case 3: API Key Activity Monitoring +**Scenario**: A security team wants to regularly review what actions are being performed by CI/CD API keys. + +**How it works**: +1. Open the Audit Log and identify entries with the API key icon +2. Review the actions performed by each API key +3. Verify actions align with expected CI/CD operations +4. Investigate any unexpected actions or patterns + +**Outcome**: Ongoing visibility into automated system behavior, enabling early detection of misuse or misconfiguration. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Visual distinction between user and API key actions +2. Platform-generated events included alongside user actions +3. Chronological presentation with rich context per event + +--- + +## Technical Summary + +### How It Works +Every API call and user action in Cosmo is intercepted and logged with contextual metadata. The audit log captures the actor (user email or API key identifier), the action type, affected resources, and timestamp. Logs are stored immutably and presented in the Studio interface sorted by creation date in descending order. 
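The actor-attribution model described above can be sketched as a small data structure. The field names below are assumptions for illustration, not the actual Cosmo API schema; the sketch shows the kind of filtering a security team might apply when reviewing API key activity (Use Case 3).

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AuditLogEntry:
    """Illustrative shape of an audit log entry (field names assumed)."""
    actor: str          # user email or API key identifier
    actor_type: str     # "user" or "api_key"
    action: str         # e.g. "subgraph.published"
    resource: str
    created_at: datetime

def api_key_actions(entries):
    """Return API-key-driven entries, newest first.

    Mirrors the Studio presentation: chronological ordering by creation
    date, descending, restricted to automated (API key) actors.
    """
    keyed = [e for e in entries if e.actor_type == "api_key"]
    return sorted(keyed, key=lambda e: e.created_at, reverse=True)
```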
+ +### Key Technical Features +- Automatic capture of all significant actions +- User vs API key actor distinction with visual indicators +- Platform-generated event logging +- Chronological ordering by creation date +- Resource and action type metadata +- Namespace-aware action tracking + +### Integration Points +- Cosmo Studio for log viewing +- All CLI (wgc) operations logged +- Platform API actions logged +- Studio UI actions logged + +### Requirements & Prerequisites +- Pro plan or higher +- No additional configuration required - logging is automatic +- Access to organization settings to view logs + +--- + +## Documentation References + +- Primary docs: `/docs/studio/audit-log` +- RBAC: `/docs/studio/rbac` +- API keys: `/docs/studio/api-keys` + +--- + +## Keywords & SEO + +### Primary Keywords +- Audit logging +- Activity tracking +- Change management + +### Secondary Keywords +- GraphQL audit trail +- Compliance logging +- Security audit log + +### Related Search Terms +- How to track changes in GraphQL federation +- API activity monitoring +- SOC 2 compliance for GraphQL + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/access-control/groups.md b/capabilities/access-control/groups.md new file mode 100644 index 00000000..3beac71c --- /dev/null +++ b/capabilities/access-control/groups.md @@ -0,0 +1,178 @@ +# Groups & Group Rules + +Centralized user and API key access management with SSO rule mapping for streamlined team permissions. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ac-002` | +| **Category** | Access Control | +| **Status** | GA | +| **Availability** | Scale, Enterprise | +| **Related Capabilities** | `cap-ac-001` (RBAC), `cap-ac-003` (API Keys), `cap-ac-004` (SSO), `cap-ac-005` (SCIM) | + +--- + +## Quick Reference + +### Name +Groups & Group Rules + +### Tagline +Centralized team access without manual setup. + +### Elevator Pitch +Groups provide a unified way to manage access for both organization members and API keys. By defining group rules that specify roles and resources, you can control exactly what each team or service can access. Combined with SCIM integration, groups enable fully automated access management that scales with your organization. + +--- + +## Problem & Solution + +### The Problem +Managing individual user permissions across a growing organization becomes increasingly complex. When team members join, leave, or change roles, administrators must manually update each user's access. This manual process is error-prone, time-consuming, and often results in permission drift where users accumulate more access than they need. + +### The Solution +Cosmo Groups centralize access management by grouping users and API keys with shared permission requirements. Group rules define the roles and resource scopes for all group members, making it simple to grant, modify, or revoke access for entire teams at once. Integration with SSO and SCIM ensures group memberships stay synchronized with your identity provider. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Individual permission management | Group-based access control | +| Manual onboarding/offboarding | Automated via SSO/SCIM integration | +| Permission drift over time | Consistent access through group rules | +| Separate user and API key management | Unified groups for both users and keys | + +--- + +## Key Benefits + +1. 
**Unified Access Management**: Manage both organization members and API keys through the same group system, eliminating duplicate configuration. + +2. **Flexible Role Assignment**: Assign multiple roles per group with different resource scopes, enabling complex permission patterns. + +3. **SSO/SCIM Integration**: Automatically assign users to groups based on identity provider attributes, eliminating manual group membership management. + +4. **Safe Group Lifecycle**: Built-in safeguards when deleting groups ensure users and API keys are reassigned, preventing accidental access loss. + +5. **Built-in Defaults**: Pre-configured admin, developer, and viewer groups provide sensible defaults while allowing custom group creation for specific needs. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Administrator / Security Engineer +- **Pain Points**: Manual user permission management, keeping access synchronized with HR systems, ensuring consistent permissions +- **Goals**: Automate access management and reduce administrative overhead while maintaining security + +### Secondary Personas +- Team leads managing team member access +- DevOps engineers configuring CI/CD access +- IT administrators integrating identity systems + +--- + +## Use Cases + +### Use Case 1: Team Onboarding Automation +**Scenario**: New developers join the platform team and need immediate access to their team's subgraphs with read access to the broader system. + +**How it works**: +1. Create a group named "Platform Team" with Subgraph Admin role scoped to the platform namespace and Graph Viewer for all graphs +2. Configure OIDC mapper or SCIM to automatically assign users with "platform-team" attribute to this group +3. When new developers are added in your identity provider, they automatically receive correct permissions + +**Outcome**: Zero-touch onboarding with developers productive immediately without manual permission configuration. 
+ +### Use Case 2: Service Account Management +**Scenario**: Multiple CI/CD pipelines need different levels of access - some should only run checks, others need to publish schemas. + +**How it works**: +1. Create a "CI Checkers" group with Subgraph Checker role +2. Create a "CI Publishers" group with Subgraph Publisher role scoped to specific namespaces +3. Generate API keys assigned to the appropriate group for each pipeline + +**Outcome**: Each pipeline has precisely the permissions it needs, following the principle of least privilege. + +### Use Case 3: Cross-Functional Access Patterns +**Scenario**: A platform architect needs admin access to production namespace but also needs to check schemas across all development namespaces. + +**How it works**: +1. Create a custom group with multiple rules: Namespace Admin for production, Subgraph Checker for dev namespaces +2. Assign the architect to this group + +**Outcome**: Complex access requirements are satisfied through a single group assignment without over-provisioning access. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Unified groups for both human users and API keys +2. Multiple role assignments per group with different resource scopes +3. Safe deletion workflow with automatic reassignment options + +--- + +## Technical Summary + +### How It Works +Groups contain one or more group rules, each defining a role and optional resource scope. When a user or API key is assigned to a group, they inherit all permissions from the group's rules. If a rule has no explicit resources, it grants access to all resources of that type in the organization. Multiple roles can coexist in one group, and the most permissive access applies when scopes overlap. 
+ +### Key Technical Features +- Create custom groups with name and description +- Add multiple rules per group with different roles +- Scope rules to specific namespaces, graphs, or subgraphs +- Built-in groups (admin, developer, viewer) for common patterns +- Group reassignment workflow during deletion +- OIDC mapper support for automatic group assignment + +### Integration Points +- SSO providers via OIDC for automatic group assignment +- SCIM for automated user provisioning into groups +- API key system for service account access + +### Requirements & Prerequisites +- Scale plan or higher +- RBAC must be enabled in organization settings +- Built-in groups cannot be modified or deleted + +--- + +## Documentation References + +- Primary docs: `/docs/studio/groups` +- Group rules: `/docs/studio/groups/group-rules` +- RBAC overview: `/docs/studio/rbac` +- API keys: `/docs/studio/api-keys` + +--- + +## Keywords & SEO + +### Primary Keywords +- User group management +- Access control groups +- Team permissions + +### Secondary Keywords +- GraphQL team access +- API access groups +- Permission groups + +### Related Search Terms +- How to manage team access to GraphQL +- Group-based access control +- Federation team permissions + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/access-control/rbac.md b/capabilities/access-control/rbac.md new file mode 100644 index 00000000..52984ff2 --- /dev/null +++ b/capabilities/access-control/rbac.md @@ -0,0 +1,170 @@ +# Role-Based Access Control (RBAC) + +Granular permission management by role for secure, structured access to your federated graph platform. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ac-001` | +| **Category** | Access Control | +| **Status** | GA | +| **Availability** | Scale, Enterprise | +| **Related Capabilities** | `cap-ac-002` (Groups), `cap-ac-003` (API Keys), `cap-ac-004` (SSO) | + +--- + +## Quick Reference + +### Name +Role-Based Access Control (RBAC) + +### Tagline +Secure access through role-based permissions. + +### Elevator Pitch +Role-Based Access Control simplifies managing who can do what within your organization by assigning permissions to roles rather than individuals. Users are associated with specific roles, and their access rights are automatically determined by those roles, ensuring consistent security policies across your entire federated graph infrastructure. + +--- + +## Problem & Solution + +### The Problem +As organizations scale their GraphQL federation deployments, managing individual user permissions becomes unmanageable. Without structured access control, teams face security risks from overly permissive access, compliance challenges from inconsistent permissions, and administrative overhead from manually managing each user's access rights. + +### The Solution +Cosmo's RBAC system assigns permissions to roles at organizational, namespace, graph, and subgraph levels. Users inherit permissions through their assigned roles, ensuring consistent access policies. Integration with SSO providers keeps role assignments synchronized with your existing identity infrastructure. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manual permission management per user | Role-based permissions inherited automatically | +| Inconsistent access across environments | Unified access control across namespaces | +| No visibility into who can access what | Clear role hierarchy with defined permissions | +| Security risks from permission sprawl | Principle of least privilege enforced by design | + +--- + +## Key Benefits + +1. **Simplified Administration**: Manage permissions through roles instead of individual user assignments, reducing administrative overhead significantly. + +2. **Consistent Security Policies**: Apply uniform access policies across your organization through well-defined role hierarchies at organization, namespace, graph, and subgraph levels. + +3. **SSO Integration**: Seamlessly synchronize roles with your identity provider, ensuring users always have the correct permissions based on your authorization server. + +4. **Granular Control**: Define access at multiple levels - from organization-wide admin access down to specific subgraph permissions. + +5. **Compliance Ready**: Maintain clear audit trails of role assignments for regulatory and security compliance requirements. 
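As a rough sketch of the inheritance model — permissions attach to roles, and a user or API key holds whatever its assigned roles grant at each level — consider the following. The role-to-permission table is illustrative, not Cosmo's actual permission matrix.

```python
# Illustrative role -> permission map at two of the four levels
# (organization, namespace, graph, subgraph); not Cosmo's real matrix.
ROLE_PERMISSIONS = {
    ("organization", "admin"): {"manage-members", "create-namespace", "view"},
    ("organization", "viewer"): {"view"},
    ("subgraph", "publisher"): {"publish-schema", "run-checks", "view"},
    ("subgraph", "checker"): {"run-checks", "view"},
}

def can(assignments, level, action):
    """True if any role assigned at `level` grants `action`."""
    return any(
        action in ROLE_PERMISSIONS.get((lvl, role), set())
        for lvl, role in assignments
        if lvl == level
    )

# A CI pipeline key that should only run checks, never publish.
ci_key = [("subgraph", "checker")]
assert can(ci_key, "subgraph", "run-checks")
assert not can(ci_key, "subgraph", "publish-schema")
```

The same check applies uniformly to humans and API keys, which is what keeps access consistent as the role list grows.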
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Security Administrator +- **Pain Points**: Managing access for growing teams, ensuring least-privilege access, maintaining compliance +- **Goals**: Implement secure access control that scales with the organization without manual overhead + +### Secondary Personas +- Engineering Managers overseeing team access +- DevOps teams managing CI/CD pipeline access +- Compliance officers ensuring appropriate access controls + +--- + +## Use Cases + +### Use Case 1: Team-Based Access Segmentation +**Scenario**: A platform team needs to give different development teams access only to their specific subgraphs while maintaining read-only visibility into the overall federated graph. + +**How it works**: Create groups for each team with Subgraph Admin roles scoped to their namespaces, plus a Graph Viewer role for the federated graph. Team members automatically receive appropriate permissions. + +**Outcome**: Teams can publish and manage their subgraphs independently while understanding the broader system context without risking accidental changes. + +### Use Case 2: CI/CD Pipeline Access Control +**Scenario**: An organization needs to grant their CI/CD systems the ability to publish schema changes but not modify organization settings or create new graphs. + +**How it works**: Create a dedicated group with Subgraph Publisher role, then generate API keys assigned to that group for pipeline use. + +**Outcome**: Automated systems have precisely the permissions needed for deployments with no risk of unintended administrative actions. + +### Use Case 3: Environment Isolation +**Scenario**: A company wants to ensure developers can freely experiment in dev namespaces but have restricted access to production. + +**How it works**: Configure group rules with Namespace Admin for dev namespace and Namespace Viewer for production. Only senior engineers receive Admin access to production. 
+ +**Outcome**: Development velocity is maintained while production environments remain protected from unauthorized changes. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Multi-level permission hierarchy (Organization, Namespace, Graph, Subgraph) +2. Native SSO integration ensures role synchronization with identity providers +3. Unified access control for both users and API keys through the groups system + +--- + +## Technical Summary + +### How It Works +RBAC in Cosmo operates through a groups-based permission system. Each group contains rules that define roles and their associated resources. Users and API keys are assigned to groups, inheriting all permissions from that group's rules. When access is checked, the system evaluates all applicable role assignments to determine if the action is permitted. + +### Key Technical Features +- Organization-level roles: Admin, Developer, API Key Manager, Viewer +- Namespace-level roles: Admin, Viewer +- Graph-level roles: Admin, Viewer +- Subgraph-level roles: Admin, Publisher, Checker, Viewer +- Built-in groups for common permission patterns +- Custom group creation for specific access requirements + +### Integration Points +- OpenID Connect (OIDC) providers for SSO synchronization +- SCIM for automated user provisioning with role assignment +- API key system for programmatic access + +### Requirements & Prerequisites +- Scale plan or higher +- RBAC must be enabled in organization settings +- SSO configuration recommended for role synchronization + +--- + +## Documentation References + +- Primary docs: `/docs/studio/rbac` +- Groups configuration: `/docs/studio/groups` +- Group rules: `/docs/studio/groups/group-rules` +- SSO integration: `/docs/studio/sso` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL RBAC +- Role-based access control +- Federation access management + +### Secondary Keywords +- GraphQL permissions +- API access control +- Team access management + +### Related Search 
Terms +- How to manage GraphQL API permissions +- Federation security best practices +- Subgraph access control + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/access-control/scim-provisioning.md b/capabilities/access-control/scim-provisioning.md new file mode 100644 index 00000000..6ca711aa --- /dev/null +++ b/capabilities/access-control/scim-provisioning.md @@ -0,0 +1,184 @@ +# SCIM Provisioning + +Automated user provisioning and deprovisioning through the SCIM standard for seamless identity lifecycle management. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ac-005` | +| **Category** | Access Control | +| **Status** | GA | +| **Availability** | Enterprise | +| **Related Capabilities** | `cap-ac-003` (API Keys), `cap-ac-004` (SSO), `cap-ac-002` (Groups) | + +--- + +## Quick Reference + +### Name +SCIM Provisioning + +### Tagline +Automate user lifecycle from your IdP. + +### Elevator Pitch +SCIM (System for Cross-domain Identity Management) automates user provisioning, updates, and deprovisioning between your identity provider and Cosmo. When users are added to your IdP application, they automatically receive invitations to join Cosmo. When they're removed, their accounts are deactivated. This eliminates manual user management and ensures your Cosmo user base stays synchronized with your organization. + +--- + +## Problem & Solution + +### The Problem +Organizations with significant employee turnover face a constant challenge: ensuring that new employees get access to the tools they need immediately, while departing employees have their access revoked completely. Manual processes are slow, error-prone, and create security risks when offboarding is delayed or forgotten. + +### The Solution +Cosmo's SCIM integration connects to your identity provider to automate the entire user lifecycle. 
New users added to the SCIM application automatically receive email invitations to join Cosmo. User attribute changes are synchronized in real-time. When users are removed from the SCIM application, their Cosmo accounts are immediately deactivated, ensuring no orphaned access remains. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manual user invitations | Automatic invitation on IdP addition | +| Delayed offboarding | Immediate deactivation on IdP removal | +| Attribute drift between systems | Real-time synchronization | +| Security risks from orphaned accounts | Clean user lifecycle management | + +--- + +## Key Benefits + +1. **Automated Onboarding**: Users added to your SCIM application automatically receive invitations to join Cosmo, reducing time-to-productivity. + +2. **Immediate Offboarding**: User removal from the SCIM application instantly deactivates their Cosmo account, eliminating security gaps. + +3. **Attribute Synchronization**: User attribute changes in your IdP are automatically reflected in Cosmo, keeping data consistent. + +4. **Standards-Based**: Built on the SCIM 2.0 standard, ensuring compatibility with major identity providers and future-proofing your integration. + +5. **SSO Complementary**: Works alongside SSO to provide complete identity management - SSO for authentication, SCIM for provisioning. 
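At the protocol level, these lifecycle events map onto standard SCIM 2.0 payload shapes (RFC 7643/7644). The sketch below shows the two key operations; the field values are examples, and the exact endpoint and authentication header Cosmo expects are documented in the setup guides.

```python
# User creation: the IdP POSTs a SCIM 2.0 core-schema User resource
# (RFC 7643); in Cosmo this triggers the invitation email.
create_user = {
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
    "userName": "jane.doe@example.com",  # example value
    "name": {"givenName": "Jane", "familyName": "Doe"},
    "active": True,
}

# Deactivation: a PATCH that flips `active` to false (RFC 7644 s3.5.2);
# in Cosmo this immediately disables the account.
deactivate_user = {
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
    "Operations": [{"op": "replace", "value": {"active": False}}],
}

assert create_user["active"] is True
assert deactivate_user["Operations"][0]["value"]["active"] is False
```

Because both payloads are plain SCIM 2.0, any compliant identity provider can drive the same lifecycle without Cosmo-specific adapters.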
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: IT Administrator / Identity Manager +- **Pain Points**: Manual user provisioning across multiple systems, security risks from delayed offboarding, keeping user data synchronized +- **Goals**: Automate user lifecycle management, ensure immediate access control on employee changes + +### Secondary Personas +- HR teams managing employee onboarding/offboarding +- Security officers ensuring timely access revocation +- Compliance teams requiring audit trails of provisioning + +--- + +## Use Cases + +### Use Case 1: New Employee Onboarding +**Scenario**: A new developer joins the company and needs access to Cosmo as part of their day-one setup. + +**How it works**: +1. HR adds the new employee to the company IdP (e.g., Okta) +2. Employee is assigned to the Cosmo SCIM application +3. Cosmo automatically sends an invitation email to the employee +4. Employee accepts the invitation and gains access with assigned permissions + +**Outcome**: New employees have access to Cosmo on day one without any manual administrator intervention. + +### Use Case 2: Employee Offboarding +**Scenario**: An employee leaves the company and all their access needs to be revoked immediately. + +**How it works**: +1. HR removes the employee from the IdP or the SCIM application +2. SCIM automatically communicates the removal to Cosmo +3. Employee's Cosmo account is immediately deactivated +4. All active sessions are terminated + +**Outcome**: Zero-gap offboarding with no manual steps required, eliminating the risk of unauthorized access by former employees. + +### Use Case 3: Role/Department Change +**Scenario**: An employee transfers from the platform team to the security team and needs different access levels. + +**How it works**: +1. HR updates the employee's group membership in the IdP +2. SCIM synchronizes the attribute changes to Cosmo +3. 
When combined with SSO role mappings, the user's permissions automatically update + +**Outcome**: Access changes are handled through normal HR workflows without requiring separate Cosmo administration. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Complete lifecycle support: create, update, and deactivate +2. Standards-compliant SCIM 2.0 implementation +3. Complementary to SSO for comprehensive identity management + +--- + +## Technical Summary + +### How It Works +SCIM uses RESTful APIs to communicate user lifecycle events between your identity provider (e.g., Okta) and Cosmo. An API key with SCIM permission serves as the authentication mechanism. When users are created, modified, or deleted in your IdP's SCIM application, those changes are pushed to Cosmo in real-time via the SCIM protocol. + +### Key Technical Features +- SCIM 2.0 protocol support +- Create users: Triggers invitation email +- Update user attributes: Synchronizes profile data +- Deactivate users: Immediately disables account +- Password sync support (provider-dependent) +- HTTP Header authentication using Cosmo API key + +### Integration Points +- Okta (with dedicated setup guide) +- Any SCIM 2.0 compliant identity provider +- Cosmo API key system for authentication +- SSO for complementary authentication management + +### Requirements & Prerequisites +- Enterprise plan +- SCIM-compatible identity provider +- API key with SCIM permission enabled +- SSO recommended for complete identity management +- Matching user assignments between SSO and SCIM apps + +--- + +## Documentation References + +- Primary docs: `/docs/studio/scim` +- Okta setup guide: `/docs/studio/scim/okta` +- API keys: `/docs/studio/api-keys` +- SSO: `/docs/studio/sso` + +--- + +## Keywords & SEO + +### Primary Keywords +- SCIM provisioning +- User provisioning +- Automated user management + +### Secondary Keywords +- Identity lifecycle management +- Okta SCIM integration +- GraphQL user provisioning + +### 
Related Search Terms +- How to automate user provisioning for GraphQL +- SCIM integration for API platforms +- Automated onboarding for federation tools + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/access-control/session-management.md b/capabilities/access-control/session-management.md new file mode 100644 index 00000000..321ce20a --- /dev/null +++ b/capabilities/access-control/session-management.md @@ -0,0 +1,182 @@ +# Session Management + +User session tracking and activity monitoring with security-focused timeouts and authentication controls. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ac-008` | +| **Category** | Access Control | +| **Status** | GA | +| **Availability** | Free, Pro, Scale, Enterprise | +| **Related Capabilities** | `cap-ac-004` (SSO), `cap-ac-006` (Audit Logging) | + +--- + +## Quick Reference + +### Name +Session Management + +### Tagline +Secure sessions with smart timeouts. + +### Elevator Pitch +Session Management in Cosmo follows industry security standards to balance user convenience with security. Sessions automatically renew during active use, timeout after inactivity, and have a maximum lifetime to ensure regular reauthentication. High-risk operations are protected with additional confirmation steps, and users can leverage their existing identity providers for enhanced security. + +--- + +## Problem & Solution + +### The Problem +Balancing security and usability in session management is challenging. Sessions that never expire create security risks, while sessions that expire too quickly frustrate users. Without proper controls, organizations face risks from abandoned sessions, shared credentials, and inadequate protection for sensitive operations. 
+ +### The Solution +Cosmo implements session management following security standards established by Auth0, Cloudflare, and other industry leaders. Sessions renew automatically every 8 hours during active use, terminate after 72 hours of inactivity, and have a maximum lifetime of 14 days. High-risk operations require email confirmation, providing defense-in-depth for the most sensitive actions. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Indefinite session concerns | 14-day maximum session lifetime | +| No inactivity timeout | 72-hour inactivity timeout | +| All operations equally trusted | High-risk operations require confirmation | +| Password-only authentication | SSO and social login integration | + +--- + +## Key Benefits + +1. **Industry-Standard Security**: Session policies follow best practices from Auth0, Cloudflare, and other security leaders, ensuring appropriate protection. + +2. **Seamless Active Use**: Sessions automatically renew every 8 hours during active use, minimizing disruption for users who are actively working. + +3. **Automatic Cleanup**: 72-hour inactivity timeout ensures abandoned sessions don't remain active indefinitely, reducing security exposure. + +4. **Maximum Lifetime Protection**: 14-day session limit ensures users reauthenticate regularly, reducing risk from compromised sessions. + +5. **High-Risk Operation Protection**: Sensitive actions like organization deletion require email confirmation, preventing accidental or unauthorized destructive changes. 
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Security Administrator / Platform Administrator +- **Pain Points**: Ensuring appropriate session security, protecting sensitive operations, meeting compliance requirements for session management +- **Goals**: Implement secure session management without frustrating users + +### Secondary Personas +- Compliance officers verifying security controls +- End users expecting secure, convenient access +- IT administrators managing authentication methods + +--- + +## Use Cases + +### Use Case 1: Active Developer Workflow +**Scenario**: A developer works in Cosmo Studio throughout the day, managing schemas and reviewing analytics. + +**How it works**: +1. Developer logs in at start of day +2. Session automatically renews every 8 hours during active use +3. Developer works uninterrupted throughout their workday +4. Session remains valid as long as activity continues + +**Outcome**: Productive workflow without constant reauthentication, while maintaining security through regular renewal. + +### Use Case 2: Abandoned Session Protection +**Scenario**: An engineer leaves for vacation without explicitly logging out. + +**How it works**: +1. Engineer's last activity was Friday afternoon +2. No activity over the weekend or following week +3. After 72 hours of inactivity, session automatically terminates +4. If someone accesses the browser later, they must reauthenticate + +**Outcome**: Automatic protection against abandoned sessions without requiring explicit logout. + +### Use Case 3: Sensitive Operation Protection +**Scenario**: An admin attempts to delete an organization (potentially accidental click or unauthorized access). + +**How it works**: +1. Admin initiates organization deletion +2. System requires email confirmation before proceeding +3. Confirmation email sent to admin's registered email +4. 
Only after confirmation does deletion proceed + +**Outcome**: Defense-in-depth protection ensures destructive operations require multi-factor confirmation. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Session policies based on industry-leading security standards +2. Email confirmation for high-risk operations +3. Flexible authentication via SSO/social providers + +--- + +## Technical Summary + +### How It Works +Sessions are created upon successful authentication and tracked server-side. During active use, sessions renew every 8 hours. Inactivity for 72 hours triggers automatic session termination. Regardless of activity, sessions have a maximum lifetime of 14 days, after which reauthentication is required. High-risk operations trigger an additional email confirmation flow. + +### Key Technical Features +- 8-hour session renewal during active use +- 72-hour inactivity timeout +- 14-day maximum session lifetime +- Email confirmation for high-risk operations (e.g., organization deletion) +- Google and GitHub social login support +- SSO integration for enterprise identity providers + +### Integration Points +- Google OAuth for social login +- GitHub OAuth for social login +- SSO via OIDC for enterprise identity +- Email system for high-risk confirmations + +### Requirements & Prerequisites +- No special configuration required - session management is automatic +- SSO available for organizations wanting IdP-controlled authentication +- Email access required for high-risk operation confirmation + +--- + +## Documentation References + +- Primary docs: `/docs/studio/sessions` +- SSO: `/docs/studio/sso` +- Audit logging: `/docs/studio/audit-log` + +--- + +## Keywords & SEO + +### Primary Keywords +- Session management +- Authentication security +- Session timeout + +### Secondary Keywords +- GraphQL session security +- Login session control +- Access timeout + +### Related Search Terms +- How long do GraphQL sessions last +- API platform session 
security +- Federation authentication timeouts + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/access-control/single-sign-on.md b/capabilities/access-control/single-sign-on.md new file mode 100644 index 00000000..38091cff --- /dev/null +++ b/capabilities/access-control/single-sign-on.md @@ -0,0 +1,186 @@ +# Single Sign-On (SSO) + +OIDC-based authentication supporting Okta, Auth0, Keycloak, and Microsoft Entra for unified identity management. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ac-004` | +| **Category** | Access Control | +| **Status** | GA | +| **Availability** | Enterprise | +| **Related Capabilities** | `cap-ac-001` (RBAC), `cap-ac-002` (Groups), `cap-ac-005` (SCIM) | + +--- + +## Quick Reference + +### Name +Single Sign-On (SSO) + +### Tagline +Unified authentication with your identity provider. + +### Elevator Pitch +Cosmo integrates with your existing identity provider through OpenID Connect, enabling seamless authentication for your entire organization. Users sign in with their existing credentials and automatically receive appropriate permissions based on your configured role mappings. No separate password management, no manual user provisioning - just secure, unified access. + +--- + +## Problem & Solution + +### The Problem +Organizations managing multiple tools face authentication sprawl - separate passwords, inconsistent access controls, and manual user management for each system. When employees join, leave, or change roles, each system requires individual updates, creating security risks and administrative burden. + +### The Solution +Cosmo's SSO integration connects to your existing OpenID Connect identity provider, centralizing authentication through your established identity infrastructure. 
Users are automatically enrolled in your organization when they sign in, receiving roles based on your configured mappings. When SSO is disconnected, all users are safely downgraded to viewer access as a security measure. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Separate credentials per system | Single sign-on with existing identity | +| Manual user provisioning | Automatic enrollment on first sign-in | +| Inconsistent role management | Centralized role mapping from IdP | +| Password management overhead | Leverages existing IdP security | + +--- + +## Key Benefits + +1. **Seamless User Experience**: Users sign in with their existing organizational credentials - no new passwords to remember or manage. + +2. **Automatic User Enrollment**: When users sign in via SSO for the first time, they are automatically added to your organization with appropriate permissions. + +3. **Role Synchronization**: Roles are assigned based on mappings configured during SSO setup, ensuring permissions stay synchronized with your authorization server. + +4. **Security by Default**: Disconnecting SSO automatically downgrades all SSO users to viewer role, preventing unauthorized access if synchronization is lost. + +5. **Provider Flexibility**: Support for major OIDC providers including Okta, Auth0, Keycloak, and any OIDC-compliant identity provider. 
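The security-by-default behavior in benefit 4 — every SSO-enrolled user drops to viewer when the integration is disconnected — can be sketched as follows. The member records and role names are illustrative, not Cosmo's data model.

```python
# Hypothetical member records; `via_sso` marks users enrolled through the IdP.
members = [
    {"email": "admin@example.com", "role": "admin", "via_sso": True},
    {"email": "dev@example.com", "role": "developer", "via_sso": True},
    {"email": "founder@example.com", "role": "admin", "via_sso": False},
]

def disconnect_sso(members):
    """On SSO disconnect, SSO-enrolled users drop to viewer; local accounts keep their roles."""
    return [{**m, "role": "viewer"} if m["via_sso"] else m for m in members]

after = disconnect_sso(members)
assert [m["role"] for m in after] == ["viewer", "viewer", "admin"]
```

The downgrade errs on the side of least privilege: once the IdP can no longer vouch for a user's role, no elevated access survives on trust alone.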
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: IT Administrator / Identity Manager +- **Pain Points**: Managing multiple identity systems, ensuring consistent access across tools, user lifecycle management +- **Goals**: Centralize identity management, reduce security risks from separate credentials, streamline onboarding/offboarding + +### Secondary Personas +- Security officers ensuring compliance with identity policies +- Platform administrators managing user access +- End users seeking simpler authentication + +--- + +## Use Cases + +### Use Case 1: Enterprise Okta Integration +**Scenario**: A company uses Okta for all employee authentication and wants Cosmo access managed through Okta. + +**How it works**: +1. Configure OIDC integration in Cosmo with Okta credentials +2. Set up role mappings to assign Cosmo roles based on Okta groups +3. Share the generated Login URL with employees +4. Employees sign in using "Login with SSO" and receive appropriate permissions + +**Outcome**: Employees access Cosmo using existing Okta credentials with permissions automatically assigned based on their Okta group membership. + +### Use Case 2: Multi-Environment Access Control +**Scenario**: Different teams need different access levels based on their identity provider group membership. + +**How it works**: +1. Configure SSO with role mappings linking IdP groups to Cosmo groups +2. Map "platform-admins" IdP group to Organization Admin role +3. Map "developers" IdP group to Developer role +4. Map "viewers" IdP group to Viewer role + +**Outcome**: Team permissions are automatically determined by IdP group membership, ensuring consistent access across environments. + +### Use Case 3: Contractor Access Management +**Scenario**: An organization needs to provide contractors limited access that can be easily revoked when the contract ends. + +**How it works**: +1. Create contractors in IdP with specific group membership +2. 
Map contractor IdP group to a Cosmo group with Viewer access only +3. When contract ends, remove user from IdP - they lose Cosmo access immediately + +**Outcome**: Contractor access is fully controlled through the existing HR/identity workflow with no separate Cosmo management needed. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Automatic security downgrade when SSO is disconnected +2. Unique organization-specific login URLs for each SSO integration +3. Native support for major enterprise identity providers + +--- + +## Technical Summary + +### How It Works +SSO integration uses the OpenID Connect (OIDC) protocol to authenticate users through your identity provider. When users sign in via the organization-specific Login URL, Cosmo validates their identity with the IdP and applies role mappings to determine their permissions. Sessions are maintained according to Cosmo's session management policies. + +### Key Technical Features +- OpenID Connect (OIDC) protocol compliance +- Support for Okta, Auth0, Keycloak, Microsoft Entra +- Custom OIDC provider support for compliant systems +- Organization-specific Login URLs +- Automatic user enrollment on first sign-in +- Role mapping configuration during setup +- Secure disconnection with automatic access downgrade + +### Integration Points +- Okta (with dedicated setup guide) +- Auth0 (with dedicated setup guide) +- Keycloak (with dedicated setup guide) +- Any OIDC-compliant identity provider +- SCIM for enhanced user lifecycle management + +### Requirements & Prerequisites +- Enterprise plan +- OIDC-compliant identity provider +- Administrator access to both Cosmo and IdP +- Configuration of role mappings during setup + +--- + +## Documentation References + +- Primary docs: `/docs/studio/sso` +- Okta setup: `/docs/studio/sso/okta` +- Auth0 setup: `/docs/studio/sso/auth0` +- Keycloak setup: `/docs/studio/sso/keycloak` +- RBAC: `/docs/studio/rbac` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL SSO +- 
Single sign-on +- OIDC authentication + +### Secondary Keywords +- Enterprise authentication +- Identity provider integration +- Okta GraphQL integration + +### Related Search Terms +- How to set up SSO for GraphQL +- Federation identity management +- OIDC integration for API management + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/access-control/user-invitations.md b/capabilities/access-control/user-invitations.md new file mode 100644 index 00000000..1f4b1d7a --- /dev/null +++ b/capabilities/access-control/user-invitations.md @@ -0,0 +1,184 @@ +# User Invitations + +Team member onboarding and collaboration through streamlined invitation workflows. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ac-007` | +| **Category** | Access Control | +| **Status** | GA | +| **Availability** | Free, Pro, Scale, Enterprise | +| **Related Capabilities** | `cap-ac-001` (RBAC), `cap-ac-002` (Groups), `cap-ac-005` (SCIM) | + +--- + +## Quick Reference + +### Name +User Invitations + +### Tagline +Simple team onboarding in clicks. + +### Elevator Pitch +User Invitations provide a straightforward way for organization admins to bring team members into Cosmo. Invite colleagues via email, and they receive a link to join your organization with appropriate permissions. Whether they're new to Cosmo or existing users, the invitation flow guides them through setup and grants access to your organization's resources. + +--- + +## Problem & Solution + +### The Problem +Getting new team members access to shared tools often involves manual account creation, separate credential management, and unclear permission assignment. When colleagues from other teams need access, administrators must navigate complex provisioning processes, delaying collaboration and productivity. 
+ +### The Solution +Cosmo's invitation system enables admins to invite users with a simple email workflow. New users receive instructions to set up their password, while existing Cosmo users are directed to their invitations page. All invitees can accept or decline, and upon acceptance, they immediately gain access to organization resources based on their assigned permissions. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manual account creation | Email-based invitation flow | +| Unclear onboarding steps | Guided setup experience | +| Delayed access for new team members | Immediate access upon acceptance | +| Complex permission assignment | Permissions assigned at invitation | + +--- + +## Key Benefits + +1. **Frictionless Onboarding**: New team members receive clear email instructions and can be productive in minutes, not days. + +2. **Choice for Invitees**: Users can accept or decline invitations, maintaining control over their organization memberships. + +3. **Existing User Support**: Users already on Cosmo can easily join additional organizations without creating new accounts. + +4. **New User Guided Setup**: Users new to Cosmo receive password setup instructions, ensuring a smooth first-time experience. + +5. **Admin Control**: Organization admins manage who gets invited and can track pending invitations. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Organization Admin / Team Lead +- **Pain Points**: Getting new team members access quickly, managing access for cross-functional collaborators +- **Goals**: Enable team productivity with minimal administrative overhead + +### Secondary Personas +- New employees joining a team +- Contractors needing temporary access +- Cross-team collaborators requiring visibility + +--- + +## Use Cases + +### Use Case 1: New Employee Onboarding +**Scenario**: A new developer joins the platform team and needs access to the team's Cosmo organization. 
+ +**How it works**: +1. Admin navigates to organization settings and initiates an invitation +2. Enters the new employee's email address +3. New employee receives email with setup instructions +4. Employee creates password and accepts invitation +5. Employee gains immediate access to organization resources + +**Outcome**: New team member is productive on day one with proper access to all necessary resources. + +### Use Case 2: Cross-Team Collaboration +**Scenario**: A developer from another team needs read access to understand the federated graph architecture. + +**How it works**: +1. Admin invites the developer with Viewer permissions +2. Developer (existing Cosmo user) receives notification +3. Developer navigates to invitations page and accepts +4. Developer gains read access to the organization + +**Outcome**: Cross-team visibility enabled without compromising security or creating unnecessary access. + +### Use Case 3: Contractor Onboarding +**Scenario**: An external contractor needs temporary access to contribute to a specific subgraph. + +**How it works**: +1. Admin invites contractor email with scoped permissions via group assignment +2. Contractor receives setup email and creates account +3. Contractor accepts invitation and begins work +4. When contract ends, admin removes the user from the organization + +**Outcome**: Controlled access for external contributors with clear onboarding and offboarding paths. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Dual-path flow for new and existing users +2. Accept/decline choice for invitees +3. Integrated with RBAC for immediate permission assignment + +--- + +## Technical Summary + +### How It Works +Organization admins initiate invitations through the Studio interface. Cosmo sends an email to the invitee with instructions appropriate to their status - password setup for new users, or a direct link for existing users. 
Invitations appear on the invitee's invitations page where they can accept or decline. Acceptance immediately grants access to the organization with assigned permissions. + +### Key Technical Features +- Email-based invitation delivery +- New user password setup flow +- Existing user direct acceptance +- Accept/decline options for invitees +- Invitations page for managing pending invitations +- Admin visibility into invitation status + +### Integration Points +- Email delivery system +- Cosmo authentication system +- RBAC for permission assignment +- SCIM as an automated alternative + +### Requirements & Prerequisites +- Admin role to send invitations +- Valid email address for invitees +- No plan restrictions - available on all tiers + +--- + +## Documentation References + +- Primary docs: `/docs/studio/invitations` +- RBAC: `/docs/studio/rbac` +- Groups: `/docs/studio/groups` +- SCIM (for automated provisioning): `/docs/studio/scim` + +--- + +## Keywords & SEO + +### Primary Keywords +- User invitations +- Team onboarding +- Member management + +### Secondary Keywords +- GraphQL team access +- Organization invitations +- Collaboration access + +### Related Search Terms +- How to invite team members to GraphQL platform +- Add users to federation organization +- Team onboarding for API management + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/ai/mcp.md b/capabilities/ai/mcp.md new file mode 100644 index 00000000..7459dc9b --- /dev/null +++ b/capabilities/ai/mcp.md @@ -0,0 +1,234 @@ +# MCP (Model Context Protocol) Gateway + +Expose persisted operations to LLMs through MCP for secure, controlled AI integration with your GraphQL APIs. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ai-001` | +| **Category** | AI & LLMs | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-ops-persisted` (Persisted Operations) | + +--- + +## Quick Reference + +### Name +MCP Gateway + +### Tagline +Connect AI models to your GraphQL API securely. + +### Elevator Pitch +WunderGraph's MCP Gateway enables AI models like Claude, Cursor, and Windsurf to discover and interact with your GraphQL APIs through a standardized protocol. By exposing only predefined, validated operations as tools, you maintain complete control over what data AI systems can access while enabling powerful AI-driven workflows. + +--- + +## Problem & Solution + +### The Problem +Organizations want to integrate AI assistants into their workflows to improve productivity and customer experience, but face significant challenges: + +1. **Security risks**: Allowing AI models to execute arbitrary queries against APIs can expose sensitive data or enable unintended operations. +2. **Integration complexity**: Each AI model or platform requires custom integration code, creating development bottlenecks. +3. **Lack of governance**: Without granular control, organizations cannot meet regulatory requirements for tracking and limiting what data AI systems can access. +4. **Documentation burden**: AI models need to understand API capabilities, requiring extensive documentation and context. + +### The Solution +Cosmo's MCP Gateway provides a secure bridge between AI models and your GraphQL APIs. It exposes only predefined, validated GraphQL operations as tools that AI models can discover and use. The protocol provides rich schema information and descriptions, making your API self-documenting for AI consumption. This approach gives you complete control over what operations AI can execute while enabling seamless integration with any MCP-compatible AI platform. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Custom integration code for each AI platform | Single MCP endpoint works with all MCP-compatible AI tools | +| Risk of AI executing arbitrary, potentially harmful queries | Only predefined, validated operations are exposed | +| Months of security review for AI integrations | Compliance sign-off in weeks with operation-level control | +| External documentation required for AI to understand APIs | Self-documenting operations with embedded descriptions | +| Separate "AI-safe" APIs needed | Same GraphQL API with controlled operation exposure | + +--- + +## Key Benefits + +1. **Secure by Design**: AI models can only execute predefined, validated GraphQL operations, eliminating the risk of arbitrary query execution against your data. + +2. **Universal AI Compatibility**: Works with Claude, Cursor, Windsurf, VS Code Copilot, and any MCP-compatible AI platform without custom integration code. + +3. **Self-Documenting Operations**: Embed rich descriptions directly in your GraphQL operations using the September 2025 GraphQL spec, making them immediately understandable to AI models. + +4. **Granular Access Control**: Control exactly which operations and fields are exposed to AI systems, with the ability to exclude mutations entirely for read-only access. + +5. **Federation-Ready**: Works seamlessly with federated GraphQL schemas, giving AI access to data across your entire organization through a single endpoint. 
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Architect +- **Pain Points**: Need to enable AI integrations without compromising security; tired of building custom integrations for each AI tool; concerned about data governance and compliance +- **Goals**: Provide secure, controlled AI access to APIs; reduce integration complexity; maintain single source of truth for API access + +### Secondary Personas +- **AI/ML Engineers**: Want to leverage organizational data in AI workflows without waiting for custom integrations +- **Security/Compliance Teams**: Need visibility and control over what data AI systems can access +- **Engineering Managers**: Want to accelerate AI adoption while managing risk + +--- + +## Use Cases + +### Use Case 1: AI-Powered Customer Support +**Scenario**: A financial services company wants AI assistants to help support agents answer customer questions about transactions, but cannot expose sensitive account data. + +**How it works**: +1. Define GraphQL operations that return only non-sensitive transaction fields (masked merchant names, categories, status) +2. Add clear descriptions explaining what data is included and excluded +3. Configure MCP Gateway to expose only these read-only operations +4. Support AI tools connect via MCP and can answer questions like "Did my payment go through?" without accessing full account numbers + +**Outcome**: AI assistants provide helpful customer support while maintaining strict data boundaries and compliance requirements. + +### Use Case 2: Developer Productivity with AI Coding Assistants +**Scenario**: An engineering team wants Cursor or VS Code Copilot to understand and work with their internal APIs during development. + +**How it works**: +1. Create GraphQL operations for common development tasks (fetching user data, checking order status, etc.) +2. Configure MCP Gateway with the operations directory +3. 
Developers add the MCP endpoint to their AI coding assistant +4. AI assistants can now query real API data to help with debugging, testing, and development + +**Outcome**: Developers get AI-powered assistance that understands their actual API structure and can fetch real data for context. + +### Use Case 3: AI-Driven Internal Tools +**Scenario**: A company wants to build AI chatbots and assistants that can query internal data across multiple microservices. + +**How it works**: +1. Federated GraphQL schema combines data from multiple services +2. Define purpose-built operations for AI consumption with clear descriptions +3. MCP Gateway exposes these operations to internal AI tools +4. AI assistants can access cross-service data through a single, controlled interface + +**Outcome**: Internal AI tools have access to comprehensive organizational data while respecting access boundaries and maintaining audit trails. + +--- + +## Competitive Positioning + +### Key Differentiators +1. **Built on Persisted Operations**: Unlike solutions that expose entire schemas, MCP Gateway uses the proven persisted operations pattern for security +2. **Native GraphQL Federation Support**: Works seamlessly with federated schemas for organization-wide AI access +3. 
**Zero Custom Code**: No integration code needed - works with any MCP-compatible AI platform out of the box + +### Comparison with Alternatives + +| Aspect | Cosmo MCP Gateway | Custom REST APIs | Raw GraphQL Exposure | +|--------|-------------------|------------------|---------------------| +| Security | Only predefined operations | Must build from scratch | Full schema exposed | +| Setup Time | Minutes | Months | Minutes | +| AI Compatibility | All MCP platforms | Custom per platform | Limited | +| Documentation | Self-documenting | Manual | Schema only | +| Federation Support | Native | Must aggregate | Varies | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We already have REST APIs for AI" | MCP Gateway provides standardized AI discovery and eliminates per-platform integration work. One endpoint serves all MCP-compatible AI tools. | +| "Is MCP a stable standard?" | MCP is backed by major AI platforms including Anthropic, and Cosmo supports the latest 2025-06-18 specification with Streamable HTTP. | +| "How do we control what AI can access?" | You define exactly which operations are exposed. Use `exclude_mutations: true` for read-only access, and operation descriptions to guide AI behavior. | + +--- + +## Technical Summary + +### How It Works +The Cosmo Router implements an MCP server that loads GraphQL operations from a specified directory, validates them against your schema, and exposes them as tools. AI models discover available tools, read their descriptions and input schemas, and execute them through the standardized MCP protocol. The router handles execution and returns structured data that AI models can interpret and use. 
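The flow above is driven by router configuration. A minimal sketch of what enabling MCP might look like follows; `exclude_mutations: true` is the option this document itself references, while the remaining keys and values are illustrative assumptions rather than the router's exact configuration schema:

```yaml
# Hypothetical router config sketch -- key names other than
# exclude_mutations are illustrative assumptions.
mcp:
  enabled: true
  # Directory of predefined .graphql operations exposed as MCP tools
  operations_dir: "operations/"
  # Read-only access: never expose mutation operations to AI tools
  exclude_mutations: true
```

With a setup along these lines, any MCP-compatible client (Claude Desktop, Cursor, Windsurf, and so on) points at the router's MCP endpoint and discovers the predefined operations as tools.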
+ +### Key Technical Features +- Streamable HTTP support (MCP specification 2025-06-18) +- JSON schema generation for operation variables +- Operation description extraction (September 2025 GraphQL spec) +- Stateless mode for scalable deployments +- Full header forwarding for authentication and tracing +- Configurable mutation exclusion + +### Integration Points +- Claude Desktop +- Cursor (v0.48.0+) +- Windsurf +- VS Code Copilot +- Any MCP-compatible AI platform + +### Requirements & Prerequisites +- Cosmo Router with MCP enabled +- GraphQL operations directory with `.graphql` files +- Storage provider configuration for operations + +--- + +## Proof Points + +### Metrics & Benchmarks +- 95% reduction in security review effort for AI integrations +- Compliance sign-off in weeks instead of months +- Zero custom integration code required per AI platform + +### Case Studies +- Financial services company achieved compliance for AI customer support in weeks vs. 6+ month estimate for custom REST APIs + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Exists | https://wundergraph.com/mcp-gateway | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/mcp` +- Persisted Operations: `/docs/router/persisted-operations` +- Header Forwarding: `/docs/router/proxy-capabilities/request-headers-operations` + +--- + +## Keywords & SEO + +### Primary Keywords +- MCP Gateway +- Model Context Protocol GraphQL +- AI GraphQL integration + +### Secondary Keywords +- LLM API access +- AI-safe API +- GraphQL for AI models + +### Related Search Terms +- How to connect AI to GraphQL +- Secure AI API access +- Claude GraphQL integration +- Cursor MCP setup +- AI assistant API integration + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| 
+| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/analytics/analytics-dashboard.md b/capabilities/analytics/analytics-dashboard.md new file mode 100644 index 00000000..a3a2a63c --- /dev/null +++ b/capabilities/analytics/analytics-dashboard.md @@ -0,0 +1,148 @@ +# Analytics Dashboard + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-analytics-dashboard` | +| **Category** | Analytics | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-metrics-analytics`, `cap-trace-analytics`, `cap-schema-field-usage`, `cap-client-identification` | + +--- + +## Quick Reference + +### Name +Analytics Dashboard + +### Tagline +Comprehensive request analytics with powerful filtering and grouping. + +### Elevator Pitch +The Analytics Dashboard provides a detailed breakdown of all requests made to your federated graph. With built-in grouping, filtering, and date range selection, teams can quickly analyze API traffic patterns, identify performance bottlenecks, and understand how clients interact with their GraphQL services. + +--- + +## Problem & Solution + +### The Problem +Platform teams managing federated GraphQL APIs need visibility into request patterns, but raw logs and basic metrics lack the context needed to understand how different operations, clients, and time periods affect system behavior. Without proper analytics, teams struggle to identify trends, debug issues, and make data-driven decisions about their API. + +### The Solution +Cosmo's Analytics Dashboard centralizes all request data into a single, intuitive interface. It provides multiple views including metrics, traces, schema field usage, and client identification, with powerful filtering and grouping capabilities that let teams slice and dice their data to answer any question about their API traffic. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Scattered logs across multiple services | Unified analytics dashboard for entire federated graph | +| Manual correlation of metrics and traces | Integrated views linking metrics to individual requests | +| Limited filtering on raw data | Rich filtering by operation, client, date range, and more | +| No grouping capabilities | Group by operation name, client, or error message | + +--- + +## Key Benefits + +1. **Unified Visibility**: Single dashboard for all federated graph analytics including metrics, traces, and field usage +2. **Flexible Analysis**: Group data by operation name, client, or error message to identify patterns +3. **Time-Based Insights**: Select custom date ranges or predefined periods to analyze trends over time +4. **Filter-Driven Investigation**: Narrow down to specific requests using powerful filtering capabilities +5. **Integrated Views**: Seamlessly navigate between metrics overview, individual traces, and schema usage + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: Need comprehensive visibility into API traffic; difficulty correlating metrics across federated services +- **Goals**: Monitor system health; identify performance issues; understand client usage patterns + +### Secondary Personas +- API developers debugging specific operations +- Engineering managers tracking API adoption and usage +- Product teams understanding feature usage through API traffic + +--- + +## Use Cases + +### Use Case 1: Traffic Pattern Analysis +**Scenario**: An engineering team wants to understand peak usage times and traffic distribution across their federated graph. +**How it works**: Use the Analytics Dashboard to select a date range spanning a week, then group by operation name to see which operations drive the most traffic. Use the metrics view to visualize request rates over time. 
+**Outcome**: Team identifies peak hours and most-used operations, enabling capacity planning and optimization prioritization. + +### Use Case 2: Client Usage Monitoring +**Scenario**: A platform team needs to understand which client applications are consuming their API and how usage varies by client version. +**How it works**: Navigate to the Analytics Dashboard and group data by client. Filter by date range to compare usage across different time periods. Drill down into specific clients to see their operation patterns. +**Outcome**: Team gains visibility into client adoption, can identify outdated client versions, and prioritize client-specific optimizations. + +### Use Case 3: Error Investigation +**Scenario**: A spike in errors is detected and the team needs to identify the root cause. +**How it works**: Use the Analytics Dashboard to filter by error status and group by error message. Identify the most common errors, then drill down to specific traces to understand the context of failures. +**Outcome**: Team quickly identifies the error pattern and can trace it back to specific operations or clients causing issues. + +--- + +## Technical Summary + +### How It Works +The Analytics Dashboard aggregates telemetry data collected by the Cosmo Router through OpenTelemetry instrumentation. Data is stored and indexed to enable fast querying with filters and grouping. The dashboard provides multiple views (metrics, traces, field usage) that share the same underlying data with consistent filtering capabilities. 
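Conceptually, the filter-then-group behavior the dashboard exposes can be sketched in a few lines. The record shape below is invented for illustration and is not Cosmo's actual telemetry schema:

```python
from collections import Counter

# Hypothetical request records; the field names are illustrative only,
# not Cosmo's actual telemetry data model.
requests = [
    {"operation": "GetUser", "client": "web", "status": 200},
    {"operation": "GetUser", "client": "ios", "status": 500},
    {"operation": "ListOrders", "client": "web", "status": 200},
    {"operation": "GetUser", "client": "web", "status": 200},
]

def group_requests(records, key, predicate=lambda r: True):
    """Filter records with `predicate`, then count them by `key` --
    the same filter-then-group slicing the dashboard performs."""
    return Counter(r[key] for r in records if predicate(r))

# Group all traffic by operation name.
by_operation = group_requests(requests, "operation")

# Group only failed requests (4xx/5xx) by client.
errors_by_client = group_requests(requests, "client", lambda r: r["status"] >= 400)
```

The same helper generalizes to the dashboard's other groupings (client, error message) simply by changing `key` and `predicate`.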
+ +### Key Technical Features +- Date range selection with predefined ranges and custom date/time picker +- Multiple grouping options: none, operation name, client, error message +- Cross-view navigation between metrics, traces, and field usage +- Real-time data with configurable auto-refresh intervals +- Filter persistence across views + +### Integration Points +- Cosmo Router (data collection) +- OpenTelemetry (instrumentation standard) +- Cosmo Studio (visualization) + +### Requirements & Prerequisites +- Cosmo Router deployed and configured +- OTEL instrumentation enabled on the router +- Client applications configured with proper headers for client identification + +--- + +## Documentation References + +- Primary docs: `/docs/studio/analytics` +- Metrics documentation: `/docs/studio/analytics/metrics` +- Traces documentation: `/docs/studio/analytics/traces` +- Schema Field Usage: `/docs/studio/analytics/schema-field-usage` +- Client Identification: `/docs/studio/analytics/client-identification` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL analytics +- Federated graph analytics +- API request analytics + +### Secondary Keywords +- GraphQL metrics dashboard +- API traffic analysis +- Federation observability + +### Related Search Terms +- GraphQL monitoring dashboard +- Federated GraphQL analytics +- API request filtering and grouping +- GraphQL client usage tracking + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/analytics/client-identification.md b/capabilities/analytics/client-identification.md new file mode 100644 index 00000000..136d819c --- /dev/null +++ b/capabilities/analytics/client-identification.md @@ -0,0 +1,162 @@ +# Client Identification + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-client-identification` | +| **Category** | Analytics | +| **Status** | GA | +| 
**Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-analytics-dashboard`, `cap-schema-field-usage`, `cap-operations-tracking` | + +--- + +## Quick Reference + +### Name +Client Identification + +### Tagline +Track client versions and usage patterns across your API. + +### Elevator Pitch +Client Identification enables teams to distinguish and track different client applications consuming their federated GraphQL API. By adding simple headers to requests, clients are automatically identified and their usage is tracked throughout the analytics platform, enabling client-specific filtering, usage analysis, and targeted deprecation communication. + +--- + +## Problem & Solution + +### The Problem +When multiple client applications (web, mobile, partners) consume a GraphQL API, teams lose visibility into which clients are responsible for specific traffic patterns, errors, or field usage. Without client identification, it's impossible to make targeted performance optimizations, communicate breaking changes to affected teams, or understand client-specific behavior. + +### The Solution +Cosmo's Client Identification uses standard HTTP headers to identify and track client applications. By adding client name and version headers to requests, all analytics data becomes client-aware. Teams can filter metrics, traces, and field usage by client, enabling targeted analysis and communication. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Anonymous traffic with no client attribution | Every request identified by client and version | +| Unable to isolate client-specific issues | Filter analytics by specific client versions | +| Blanket communication about breaking changes | Targeted communication to affected clients only | +| No visibility into client version distribution | Track which versions are active across your API | + +--- + +## Key Benefits + +1. 
**Client-Aware Analytics**: Filter all analytics data by client name and version +2. **Version Tracking**: Understand which client versions are actively using your API +3. **Targeted Communication**: Identify exactly which clients use deprecated fields or problematic operations +4. **Easy Implementation**: Simple HTTP headers enable identification without SDK changes +5. **Ecosystem Compatibility**: Supports both vendor-neutral and Apollo-compatible header formats + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Team Lead +- **Pain Points**: No visibility into which clients drive traffic; unable to target communication for breaking changes +- **Goals**: Understand client usage patterns; communicate changes effectively to client teams + +### Secondary Personas +- Client application developers tracking their app's API usage +- Engineering managers overseeing multi-client API platforms +- Support teams investigating client-specific issues + +--- + +## Use Cases + +### Use Case 1: Client Version Monitoring +**Scenario**: A team wants to understand which versions of their mobile app are still actively calling the API to plan deprecation of older API features. +**How it works**: View analytics with client grouping to see all unique client name and version combinations. Identify older versions still making requests and their traffic volume. Use this data to set sunset timelines for older API features. +**Outcome**: Team has clear visibility into client version distribution and can make informed decisions about API evolution. + +### Use Case 2: Client-Specific Performance Investigation +**Scenario**: Users of the iOS app report slow response times, but web users don't experience the same issue. +**How it works**: Filter metrics and traces by the iOS client name. Compare P95 latency and error rates against other clients. Analyze which operations the iOS client uses most frequently to identify potential optimization targets. 
+**Outcome**: Team isolates iOS-specific performance issues and can work with the mobile team on targeted optimizations. + +### Use Case 3: Deprecation Impact Assessment +**Scenario**: A field is being deprecated and the team needs to notify affected client teams. +**How it works**: Use Schema Field Usage to identify which clients use the deprecated field. Get the list of client names and their request counts. Reach out to each client team with specific data about their usage. +**Outcome**: Client teams receive targeted, relevant communication with specific data about how the deprecation affects them. + +--- + +## Technical Summary + +### How It Works +Client Identification relies on HTTP headers sent with each GraphQL request. The Cosmo Router extracts these headers and associates the client information with all telemetry data. This enables client-aware filtering and grouping throughout the analytics platform. + +### Header Formats + +**Vendor-Neutral (Recommended):** +``` +GraphQL-Client-Name: +GraphQL-Client-Version: +``` + +**Apollo-Compatible:** +``` +ApolloGraphQL-Client-Name: +ApolloGraphQL-Client-Version: +``` + +### Key Technical Features +- Automatic header extraction by Cosmo Router +- Client attribution across all analytics data +- Support for both vendor-neutral and Apollo header formats +- Client-based filtering in metrics, traces, and field usage +- Client-based grouping in trace analytics + +### Integration Points +- Cosmo Router (header extraction) +- All analytics views (client filtering) +- Schema Field Usage (per-client field usage) +- Trace Analytics (client grouping) + +### Requirements & Prerequisites +- Client applications must include client identification headers +- Cosmo Router deployed (header extraction is automatic) +- No additional configuration required on Cosmo side + +--- + +## Documentation References + +- Primary docs: `/docs/studio/analytics/client-identification` +- Analytics overview: `/docs/studio/analytics` +- Schema Field 
Usage: `/docs/studio/analytics/schema-field-usage` +- Traces documentation: `/docs/studio/analytics/traces` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL client tracking +- API client identification +- Client version analytics + +### Secondary Keywords +- GraphQL client headers +- API consumer tracking +- Multi-client GraphQL analytics + +### Related Search Terms +- GraphQL client name header +- Track GraphQL client versions +- API client usage monitoring +- GraphQL consumer identification + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/analytics/metrics-analytics.md b/capabilities/analytics/metrics-analytics.md new file mode 100644 index 00000000..b99f4356 --- /dev/null +++ b/capabilities/analytics/metrics-analytics.md @@ -0,0 +1,147 @@ +# Metrics Analytics + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-metrics-analytics` | +| **Category** | Analytics | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-analytics-dashboard`, `cap-trace-analytics`, `cap-operations-tracking` | + +--- + +## Quick Reference + +### Name +Metrics Analytics + +### Tagline +Request rate, latency, and error tracking at a glance. + +### Elevator Pitch +Metrics Analytics provides a high-level performance overview of your federated graph with key indicators including request rate, P95 latency, and error percentage. Teams can quickly assess system health, identify trends over time, and drill down into specific time ranges or filter by operation name, client, and version. + +--- + +## Problem & Solution + +### The Problem +Engineering teams need to quickly understand the health and performance of their federated GraphQL API. 
Without aggregated metrics, teams must piece together information from multiple sources, making it difficult to get a quick overview of system status or identify when performance degraded. + +### The Solution +Cosmo's Metrics Analytics presents the most important performance indicators in a clear, visual format. Request rate shows throughput, P95 latency reveals performance characteristics, and error rate highlights reliability issues. Time-based charts show how these metrics evolve, while filters allow teams to zoom in on specific operations or clients. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Multiple dashboards for different metrics | Single unified metrics view | +| Delayed awareness of performance issues | Real-time visibility into request rate and latency | +| Aggregating error data from multiple services | Consolidated error rate across entire federated graph | +| Manual calculation of percentiles | Automatic P95 latency calculation | + +--- + +## Key Benefits + +1. **Instant Health Assessment**: See request rate, P95 latency, and error rate at a glance +2. **Trend Visualization**: Track how metrics change over time with visual charts +3. **Granular Filtering**: Filter metrics by operation name, client name, and client version +4. **Time Range Flexibility**: Analyze metrics across custom date ranges or predefined periods +5. 
**Error Breakdown**: Understand error distribution including 4xx and 5xx error types + +--- + +## Target Audience + +### Primary Persona +- **Role**: Site Reliability Engineer (SRE) +- **Pain Points**: Need quick visibility into system health; difficulty identifying performance degradation +- **Goals**: Maintain SLAs; proactively identify and address performance issues + +### Secondary Personas +- Platform engineers monitoring API performance +- Engineering managers tracking reliability metrics +- DevOps teams maintaining operational dashboards + +--- + +## Use Cases + +### Use Case 1: Performance Baseline Establishment +**Scenario**: A team is launching a new federated graph and needs to establish performance baselines for SLA definition. +**How it works**: Use Metrics Analytics to observe request rate, P95 latency, and error rate over a representative time period. Filter by different operation types to understand performance characteristics of various query patterns. +**Outcome**: Team establishes realistic SLAs based on actual performance data and identifies operations that may need optimization. + +### Use Case 2: Incident Detection +**Scenario**: An on-call engineer receives an alert and needs to quickly assess the situation. +**How it works**: Open Metrics Analytics to see current request rate (checking for traffic spikes or drops), P95 latency (identifying slowdowns), and error rate (quantifying failure impact). Use the error rate over time chart to identify when the issue started. +**Outcome**: Engineer quickly understands incident scope and timeline, enabling faster triage and communication. + +### Use Case 3: Client-Specific Performance Analysis +**Scenario**: A mobile team reports that their app is experiencing slow GraphQL responses. +**How it works**: Filter metrics by the mobile client name and version to isolate their traffic. Compare P95 latency and error rate against other clients to determine if the issue is client-specific or systemic. 
+**Outcome**: Team determines whether the issue is specific to the mobile client's queries or a broader system problem, directing investigation appropriately. + +--- + +## Technical Summary + +### How It Works +Metrics Analytics aggregates telemetry data from the Cosmo Router collected via OpenTelemetry instrumentation. The system calculates request rate (average requests per minute), P95 latency (95th percentile response time), and error rate (percentage of 4xx and 5xx responses) for the selected time range. Data is visualized in charts showing trends over time. + +### Key Technical Features +- Request rate (requests per minute) calculation +- P95 latency percentile computation +- Error rate aggregation including 4xx and 5xx responses +- Time-series visualization of errors and requests +- Filtering by operation name, client name, and client version + +### Integration Points +- Cosmo Router (data source) +- OpenTelemetry (instrumentation) +- Cosmo Studio (visualization interface) + +### Requirements & Prerequisites +- Cosmo Router with OTEL instrumentation enabled +- Sufficient traffic to generate meaningful metrics +- Client headers configured for client-specific filtering + +--- + +## Documentation References + +- Primary docs: `/docs/studio/analytics/metrics` +- Analytics overview: `/docs/studio/analytics` +- Traces documentation: `/docs/studio/analytics/traces` +- Client identification: `/docs/studio/analytics/client-identification` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL metrics +- API performance metrics +- Request rate monitoring + +### Secondary Keywords +- P95 latency tracking +- GraphQL error rate +- Federation performance analytics + +### Related Search Terms +- GraphQL latency monitoring +- API throughput metrics +- GraphQL error tracking +- Federation request metrics + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git 
a/capabilities/analytics/operations-tracking.md b/capabilities/analytics/operations-tracking.md new file mode 100644 index 00000000..22c7a0a0 --- /dev/null +++ b/capabilities/analytics/operations-tracking.md @@ -0,0 +1,161 @@ +# Operations Tracking + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-operations-tracking` | +| **Category** | Analytics | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-analytics-dashboard`, `cap-trace-analytics`, `cap-schema-field-usage`, `cap-client-identification` | + +--- + +## Quick Reference + +### Name +Operations Tracking + +### Tagline +Monitor and analyze every operation with detailed insights. + +### Elevator Pitch +Operations Tracking provides a comprehensive view into all GraphQL operations executed against your federated graph. Monitor performance metrics, identify deprecated field usage, track client consumption patterns, and navigate seamlessly to detailed traces. Make safe schema changes with data-driven insights and debug performance issues by sorting operations by latency or error rate. + +--- + +## Problem & Solution + +### The Problem +Managing a federated GraphQL API requires understanding which operations are being executed, how they perform, and which clients use them. Without operation-level visibility, teams struggle to prioritize optimization efforts, plan schema migrations, or identify problematic queries impacting system reliability. + +### The Solution +Cosmo's Operations Tracking centralizes all operation data into a searchable, filterable interface. Teams can sort operations by request count, latency, or error rate to identify priorities. Deprecated field indicators highlight operations needing migration attention. Client usage data enables targeted communication. Direct navigation to traces enables deep-dive debugging. 
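The sort-to-prioritize workflow can be pictured offline. A minimal sketch, assuming per-operation stats exported as CSV — the file name, columns, and numbers below are invented for illustration; Cosmo computes and sorts this inside Studio:

```shell
# Hypothetical per-operation stats: name,requests,p95_ms,error_rate_pct
cat > ops.csv <<'EOF'
GetUser,120000,85,0.2
SubmitOrder,8000,420,4.1
SearchProducts,64000,210,0.9
EOF

# "Sort by error rate, descending" surfaces the most failure-prone operation first
sort -t, -k4,4 -nr ops.csv | head -n 1
```

Sorting the same data by column 2 (`-k2,2 -nr`) instead mirrors the request-count view, and cross-referencing the two orderings is the prioritization step described above.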
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Unknown operation inventory | Complete list of all executed operations | +| Guessing which operations cause issues | Sort by latency or error rate to find problems | +| Manual tracking of deprecated field usage | Automatic indicators for deprecated field usage | +| Disconnected metrics and traces | Seamless navigation from operations to traces | + +--- + +## Key Benefits + +1. **Complete Operation Inventory**: See every operation executed against your federated graph +2. **Performance Prioritization**: Sort by latency, request count, or error rate to focus optimization efforts +3. **Deprecated Field Tracking**: Identify operations using deprecated fields and affected clients +4. **Client Usage Visibility**: Track which clients use each operation for impact analysis +5. **Integrated Debugging**: Navigate directly to traces with pre-applied filters for deep investigation + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Owner +- **Pain Points**: Difficulty prioritizing optimization; uncertainty about schema change impact +- **Goals**: Maintain API performance; evolve schema safely; understand operation usage + +### Secondary Personas +- Backend developers debugging specific operations +- Engineering managers planning capacity and migrations +- SREs identifying problematic operations during incidents + +--- + +## Use Cases + +### Use Case 1: Performance Optimization Prioritization +**Scenario**: An engineering team has limited time for optimization and needs to identify which operations will have the most impact. +**How it works**: Open the Operations page and sort by request count to see highest-traffic operations. Then sort by latency to identify slowest operations. Cross-reference to find high-traffic, high-latency operations that would benefit most from optimization. 
+**Outcome**: Team prioritizes optimization efforts based on data, maximizing impact of limited engineering time. + +### Use Case 2: Schema Migration Planning +**Scenario**: A team is planning to remove deprecated fields and needs to understand migration scope. +**How it works**: Use the deprecated fields filter to show only operations using deprecated fields. For each operation, view the client usage data to understand which teams need to migrate. Use this data to create a migration timeline and communication plan. +**Outcome**: Team creates a comprehensive migration plan with clear timelines and targeted client communication. + +### Use Case 3: Incident Investigation +**Scenario**: An incident is detected with elevated error rates and the team needs to identify affected operations. +**How it works**: Open the Operations page and sort by error rate in descending order. Identify operations with highest error rates. Click through to traces with pre-applied filters to investigate specific failures. View the operation content to understand the query structure. +**Outcome**: Team quickly identifies problematic operations and traces specific failures for root cause analysis. + +### Use Case 4: Client Impact Analysis +**Scenario**: Before making a change to an operation's behavior, the team needs to understand client impact. +**How it works**: Select the target operation and view the client usage section. See all clients using this operation along with their request counts. Identify high-usage clients for proactive communication about the upcoming change. +**Outcome**: Team communicates changes to affected clients before deployment, preventing unexpected breakage. + +--- + +## Technical Summary + +### How It Works +Operations Tracking aggregates data from all requests processed by the Cosmo Router. Operations are identified by name (or marked as "Unnamed Operation") and type (Query, Mutation, Subscription). 
Performance metrics are calculated for each operation, and usage is correlated with client identification headers. Deprecated field usage is determined by comparing operations against the current schema. + +### Key Technical Features +- Searchable, filterable operation list +- Search by operation name or hash +- Sort by request count, latency, or error rate +- Deprecated fields filter and indicators +- Client name filtering +- Flexible date range selection with retention limit awareness +- Two-panel layout: operation list and detail view +- Client usage breakdown per operation +- Performance charts (request rate, P95 latency, error percentage) +- Direct navigation to traces with pre-applied filters +- Operation content inspection with syntax highlighting + +### Integration Points +- Cosmo Router (operation data collection) +- Analytics Traces (navigation with filters) +- Schema (deprecated field detection) +- Client identification (usage attribution) + +### Requirements & Prerequisites +- Cosmo Router deployed with telemetry enabled +- Client identification headers for client usage tracking +- Schema with deprecation directives for deprecated field detection + +--- + +## Documentation References + +- Primary docs: `/docs/studio/operations` +- Analytics overview: `/docs/studio/analytics` +- Traces documentation: `/docs/studio/analytics/traces` +- Schema Field Usage: `/docs/studio/analytics/schema-field-usage` +- Client identification: `/docs/studio/analytics/client-identification` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL operations monitoring +- API operation analytics +- GraphQL query tracking + +### Secondary Keywords +- Operation performance metrics +- Deprecated field tracking +- GraphQL operation debugging + +### Related Search Terms +- GraphQL operation list +- API operation performance +- Track GraphQL queries +- Monitor GraphQL mutations +- GraphQL operation client usage + +--- + +## Version History + +| Date | Version | Changes | 
+|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/analytics/schema-field-usage.md b/capabilities/analytics/schema-field-usage.md new file mode 100644 index 00000000..bf571706 --- /dev/null +++ b/capabilities/analytics/schema-field-usage.md @@ -0,0 +1,149 @@ +# Schema Field Usage + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-schema-field-usage` | +| **Category** | Analytics | +| **Status** | GA | +| **Availability** | Pro / Enterprise | +| **Related Capabilities** | `cap-analytics-dashboard`, `cap-operations-tracking`, `cap-client-identification` | + +--- + +## Quick Reference + +### Name +Schema Field Usage + +### Tagline +Track field popularity and detect unused schema fields. + +### Elevator Pitch +Schema Field Usage enables teams to evolve their GraphQL schema with confidence by providing detailed insights into how every field is used. See which clients and operations use each field, track request counts, identify first and last usage timestamps, and understand which subgraphs contribute to field resolution. Make data-driven decisions about deprecation and removal. + +--- + +## Problem & Solution + +### The Problem +Evolving a GraphQL schema in production is risky. Teams don't know which fields are actively used, which clients depend on specific fields, or when it's safe to remove deprecated fields. Without usage data, schema changes become guesswork that can break client applications and damage user trust. + +### The Solution +Cosmo's Schema Field Usage provides comprehensive visibility into field-level usage across your entire federated graph. For every field, see exactly which clients use it, which operations include it, and how many requests touch it. Date-based filtering shows usage trends, while first/last seen timestamps reveal field lifecycle patterns. This data empowers teams to safely evolve their schema. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Guessing which fields are safe to deprecate | Data-driven deprecation decisions | +| No visibility into client-specific field usage | Per-client breakdown of field consumption | +| Unknown impact of schema changes | Clear understanding of which operations use each field | +| Manual tracking of field lifecycle | Automatic first seen / last seen timestamps | + +--- + +## Key Benefits + +1. **Safe Schema Evolution**: Know exactly which fields are used before making changes +2. **Client-Aware Decisions**: See which clients depend on specific fields for targeted communication +3. **Operation Visibility**: Understand which operations use each field for impact analysis +4. **Usage Tracking**: Monitor request counts to identify popular vs. neglected fields +5. **Lifecycle Insights**: First seen and last seen timestamps reveal field usage patterns over time + +--- + +## Target Audience + +### Primary Persona +- **Role**: API / Schema Designer +- **Pain Points**: Uncertainty about field usage; risk of breaking clients with schema changes +- **Goals**: Evolve schema safely; deprecate fields with confidence; maintain clean, efficient schema + +### Secondary Personas +- Platform engineers managing federated graphs +- Engineering managers planning schema migrations +- Technical leads making deprecation decisions + +--- + +## Use Cases + +### Use Case 1: Safe Field Deprecation +**Scenario**: A team wants to deprecate an old field that has been replaced by a new implementation. +**How it works**: Navigate to Schema Field Usage and search for the target field. Review which clients and operations still use it. Check the last seen timestamp to understand recent activity. Contact client teams if necessary before deprecation. +**Outcome**: Team deprecates the field with full knowledge of impact and clear communication to affected stakeholders. 
+ +### Use Case 2: Unused Field Discovery +**Scenario**: A team wants to clean up their schema by removing fields that are no longer used. +**How it works**: Use Schema Field Usage with a date range covering the past 90 days. Identify fields with zero requests or fields where the last seen timestamp is very old. Cross-reference with client teams before removal. +**Outcome**: Team identifies and safely removes unused fields, reducing schema complexity and maintenance burden. + +### Use Case 3: Client Migration Planning +**Scenario**: A field is being removed and the team needs to coordinate migration with client teams. +**How it works**: View Schema Field Usage for the affected field. Get a complete list of clients using the field along with their request counts. Identify the most impacted clients for prioritized outreach. Set a deprecation timeline based on client usage patterns. +**Outcome**: Team creates a targeted migration plan with clear communication to each affected client team, prioritized by impact. + +--- + +## Technical Summary + +### How It Works +Schema Field Usage aggregates field-level usage data from requests processed by the Cosmo Router. Each GraphQL request is analyzed to identify which fields are accessed. This data is correlated with client information (from headers) and operation details to build a comprehensive usage profile for every field in the schema. The data is accessible from both the schema explorer and schema check page. 
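The per-request analysis can be pictured with a deliberately naive sketch. The router's real collection walks the parsed operation AST and attributes each field to its schema type; here we only tokenize an invented query string to show the kind of raw signal involved:

```shell
query='query GetUser { user { id email profile { avatarUrl } } }'

# Naive stand-in for field extraction: split on non-identifier characters, drop blanks.
# (A real implementation parses the operation rather than scanning raw text.)
printf '%s' "$query" | tr -c 'A-Za-z0-9_' '\n' | grep -v '^$'
```

Note that the naive pass also emits non-field tokens such as the `query` keyword and the operation name `GetUser` — exactly why AST-based analysis against the schema is required for accurate per-field counts.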
+ +### Key Technical Features +- Per-field usage breakdown by client and operation +- Request count tracking per client per field +- First seen and last seen timestamps +- Subgraph contribution visibility (which subgraphs serve each field) +- Date and time filtering for trend analysis +- Available for all GraphQL types + +### Integration Points +- Cosmo Router (usage data collection) +- Schema Explorer (field usage access point) +- Schema Check page (usage context during checks) +- Client identification headers (client attribution) + +### Requirements & Prerequisites +- Cosmo Router with telemetry enabled +- Client applications configured with identification headers +- Sufficient traffic history for meaningful usage data + +--- + +## Documentation References + +- Primary docs: `/docs/studio/analytics/schema-field-usage` +- Analytics overview: `/docs/studio/analytics` +- Client identification: `/docs/studio/analytics/client-identification` +- Schema checks: `/docs/studio/schema-checks` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL field usage analytics +- Schema field tracking +- API field usage monitoring + +### Secondary Keywords +- GraphQL deprecation planning +- Schema evolution analytics +- Field usage per client + +### Related Search Terms +- GraphQL unused field detection +- Schema field popularity tracking +- GraphQL client field usage +- Safe schema deprecation + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/analytics/trace-analytics.md b/capabilities/analytics/trace-analytics.md new file mode 100644 index 00000000..25f1be39 --- /dev/null +++ b/capabilities/analytics/trace-analytics.md @@ -0,0 +1,147 @@ +# Trace Analytics + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-trace-analytics` | +| **Category** | Analytics | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise 
| +| **Related Capabilities** | `cap-analytics-dashboard`, `cap-metrics-analytics`, `cap-operations-tracking` | + +--- + +## Quick Reference + +### Name +Trace Analytics + +### Tagline +Inspect every request with detailed trace visualization. + +### Elevator Pitch +Trace Analytics lists all requests made to your router with detailed information including the operation performed, the requesting client, and any error messages. With powerful filtering, flexible grouping, and auto-refresh capabilities, teams can investigate individual requests, identify patterns, and debug issues in real-time. + +--- + +## Problem & Solution + +### The Problem +When issues occur in a federated GraphQL system, teams need to investigate individual requests to understand what happened. Without request-level visibility, debugging becomes guesswork, and correlating errors across services is nearly impossible. Teams waste time searching through logs and manually piecing together request flows. + +### The Solution +Cosmo's Trace Analytics provides a comprehensive list of all requests with relevant details visible at a glance. Filters narrow down to specific requests, grouping reveals patterns across operations or clients, and drill-down capabilities let teams inspect individual traces in detail. Auto-refresh keeps the view current during active investigation. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Searching through distributed logs | Centralized list of all requests | +| Manual correlation of request data | Filters and grouping reveal patterns instantly | +| Static log views during incidents | Auto-refresh keeps data current | +| Limited context for each request | Full details including operation, client, and errors | + +--- + +## Key Benefits + +1. **Complete Request Visibility**: See every request made to your federated graph with full details +2. 
**Powerful Filtering**: Narrow down to specific requests by date range, operation, client, or error +3. **Flexible Grouping**: Group by operation name, client, or error message to identify patterns +4. **Real-Time Monitoring**: Auto-refresh at configurable intervals (10s, 30s, 1min, 5min) +5. **Seamless Drill-Down**: Click on grouped rows to filter and investigate specific request sets + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / Platform Engineer +- **Pain Points**: Difficulty debugging production issues; need visibility into individual request behavior +- **Goals**: Quickly identify and resolve issues; understand request patterns + +### Secondary Personas +- SREs investigating incidents +- QA engineers validating API behavior +- Support teams researching customer-reported issues + +--- + +## Use Cases + +### Use Case 1: Error Pattern Identification +**Scenario**: A team notices elevated error rates and needs to understand what types of errors are occurring. +**How it works**: Open Trace Analytics and group by error message. The view shows all unique error messages clustered together with occurrence counts. Identify the most common errors, then click on an error group to see all individual requests with that error. +**Outcome**: Team quickly identifies that 80% of errors are from a specific error message, directing investigation to the root cause. + +### Use Case 2: Client Behavior Investigation +**Scenario**: A specific mobile app version is suspected of causing issues due to malformed queries. +**How it works**: Use Trace Analytics filters to select the specific client name and version. Review the operations being executed by this client, check for errors, and compare behavior to other client versions. +**Outcome**: Team confirms that a specific client version is sending problematic queries and can work with the mobile team to deploy a fix. 
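Filtering by client only works when clients identify themselves. A minimal sketch of the request shape, using the `GraphQL-Client-Name` and `GraphQL-Client-Version` headers the router reads — the endpoint and client values are placeholders, and the leading `echo` makes this a dry run (remove it to actually send):

```shell
ROUTER_URL='https://router.example.com/graphql'   # placeholder endpoint

# Dry run: prints the curl invocation instead of sending it.
echo curl -s "$ROUTER_URL" \
  -H 'Content-Type: application/json' \
  -H 'GraphQL-Client-Name: mobile-app' \
  -H 'GraphQL-Client-Version: 1.4.2' \
  --data '{"query":"{ __typename }"}'
```

With these headers in place, traffic from this app appears under `mobile-app` / `1.4.2` in the client filter, enabling investigations like Use Case 2 above.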
+ +### Use Case 3: Real-Time Incident Monitoring +**Scenario**: During an active incident, the team needs to monitor requests in real-time to understand if a deployed fix is working. +**How it works**: Configure auto-refresh to 10-second intervals. Apply filters relevant to the incident (e.g., specific operation or error type). Watch the trace list update automatically to see if error patterns decrease after the fix is deployed. +**Outcome**: Team confirms fix effectiveness in real-time without manual refreshing, enabling faster incident resolution. + +--- + +## Technical Summary + +### How It Works +Trace Analytics collects request data from the Cosmo Router through OTEL instrumentation. The sampling rate configured on the router determines how many traces are captured. Each trace includes operation details, client information (from headers), timing data, and error information. The UI provides filtering, grouping, and pagination capabilities for exploring this data. + +### Key Technical Features +- Date range selection with predefined ranges and custom date/time picker +- Filtering by operation, client, status, and other attributes +- Grouping by: None (individual requests), Operation Name, Client, Error Message +- Auto-refresh intervals: 10 seconds, 30 seconds, 1 minute, 5 minutes +- Click-through from grouped views to filtered individual traces + +### Integration Points +- Cosmo Router (data collection via OTEL) +- Client applications (via GraphQL-Client-Name and GraphQL-Client-Version headers) +- Cosmo Studio (visualization interface) + +### Requirements & Prerequisites +- Cosmo Router with OTEL instrumentation enabled +- Sampling rate configured to capture desired trace volume +- Client headers configured for proper client identification + +--- + +## Documentation References + +- Primary docs: `/docs/studio/analytics/traces` +- Analytics overview: `/docs/studio/analytics` +- Distributed tracing: `/docs/studio/analytics/distributed-tracing` +- Client 
identification: `/docs/studio/analytics/client-identification` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL request tracing +- API trace analysis +- Request debugging + +### Secondary Keywords +- GraphQL trace filtering +- API request grouping +- Federation trace analytics + +### Related Search Terms +- GraphQL request inspection +- API request debugging tool +- Federation request visibility +- GraphQL error trace analysis + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/cli/cosmo-cli.md b/capabilities/cli/cosmo-cli.md new file mode 100644 index 00000000..a53a10aa --- /dev/null +++ b/capabilities/cli/cosmo-cli.md @@ -0,0 +1,190 @@ +# Cosmo CLI (wgc) + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-cli-001` | +| **Category** | CLI | +| **Status** | GA | +| **Availability** | Free | +| **Related Capabilities** | `cap-fed-001`, `cap-router-001` | + +--- + +## Quick Reference + +### Name +Cosmo CLI (wgc) + +### Tagline +Complete command-line control for federated GraphQL management. + +### Elevator Pitch +The Cosmo CLI (wgc) is a comprehensive command-line tool that enables developers to manage every aspect of their federated GraphQL infrastructure. From creating and publishing subgraphs to validating schema changes and managing router configurations, wgc provides full lifecycle control over your GraphQL APIs directly from your terminal or CI/CD pipelines. + +--- + +## Problem & Solution + +### The Problem +Managing federated GraphQL architectures involves numerous operational tasks: creating graphs, publishing schemas, validating changes, managing namespaces, and configuring routers. 
Without a unified toolchain, teams struggle with: +- Manual, error-prone schema deployments +- Lack of pre-deployment validation leading to production incidents +- Difficulty integrating GraphQL management into CI/CD workflows +- No centralized way to manage multiple environments and teams + +### The Solution +Cosmo CLI provides a single, powerful command-line interface that handles all aspects of federated GraphQL management. It integrates seamlessly with CI/CD pipelines, enables schema validation before deployment, and provides consistent commands for managing namespaces, subgraphs, federated graphs, routers, and more. + +### Before & After + +| Before Cosmo CLI | With Cosmo CLI | +|------------------|----------------| +| Manual schema deployments through web interfaces | Automated deployments via `wgc subgraph publish` | +| No pre-deployment validation of schema changes | `wgc subgraph check` catches breaking changes before production | +| Complex scripts to manage multiple environments | Simple namespace management with `-n` flag | +| Separate tools for different GraphQL operations | One unified CLI for all operations | +| Difficult CI/CD integration | Native support for environment variables and proxy configuration | + +--- + +## Key Benefits + +1. **Complete Lifecycle Management**: Create, update, publish, check, and delete subgraphs, federated graphs, and monographs all from a single tool. + +2. **CI/CD Native**: Built for automation with environment variable authentication (`COSMO_API_KEY`), proxy support, and machine-readable outputs for seamless pipeline integration. + +3. **Safe Schema Evolution**: Pre-deployment schema checks validate changes against client traffic patterns and composition rules, preventing breaking changes from reaching production. + +4. **Multi-Environment Support**: Namespaces provide isolated environments (dev, staging, prod) manageable through consistent CLI commands with the `-n` flag. + +5. 
**Extensibility**: Support for plugins, gRPC services, and custom router configurations enables advanced use cases while maintaining a consistent CLI experience. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / DevOps Engineer +- **Pain Points**: Need to automate GraphQL infrastructure management; require reliable CI/CD integration; must manage multiple environments consistently +- **Goals**: Automate schema deployments; implement safe change management; standardize team workflows + +### Secondary Personas +- **Backend Developers**: Need to publish schema changes and validate them against the federated graph +- **API Architects**: Manage federated graph composition and subgraph organization +- **Engineering Managers**: Ensure consistent, auditable processes for schema changes across teams + +--- + +## Use Cases + +### Use Case 1: CI/CD Schema Deployment Pipeline +**Scenario**: A team wants to automate subgraph schema deployments through their CI/CD pipeline with safety checks. + +**How it works**: +1. Developer pushes schema changes to a feature branch +2. CI pipeline runs `wgc subgraph check --schema schema.graphql` to validate changes +3. Check results are reported, including breaking changes and composition errors +4. On merge to main, `wgc subgraph publish --schema schema.graphql` deploys the schema +5. The federated graph automatically recomposes with the new schema + +**Outcome**: Safe, automated schema deployments with pre-merge validation that prevents breaking changes from reaching production. + +### Use Case 2: Multi-Environment Namespace Management +**Scenario**: An organization needs to manage separate development, staging, and production environments for their federated graph. + +**How it works**: +1. Create namespaces: `wgc namespace create staging` and `wgc namespace create production` +2. Create federated graphs in each namespace with matching subgraphs +3. 
Use the `-n` flag to target specific environments: `wgc subgraph publish products -n staging --schema schema.graphql` +4. Promote changes through environments by publishing to each namespace sequentially + +**Outcome**: Clean environment isolation with consistent commands, enabling safe promotion of changes from dev to production. + +### Use Case 3: Schema Change Validation with Traffic Analysis +**Scenario**: A team needs to understand the impact of a schema change on existing clients before deploying. + +**How it works**: +1. Run `wgc subgraph check products --schema new-schema.graphql` +2. The CLI validates composition with all connected federated graphs +3. Breaking changes are analyzed against actual client traffic patterns +4. Results show which changes are breaking vs. non-breaking and which clients would be affected +5. VCS context can be added via environment variables for traceability + +**Outcome**: Data-driven decision making about schema changes with clear visibility into client impact. + +--- + +## Technical Summary + +### How It Works +The Cosmo CLI (wgc) is distributed via npm and communicates with the Cosmo Control Plane API. Authentication is handled via API keys stored in environment variables. Commands are organized by resource type (subgraph, federated-graph, namespace, router, etc.) with consistent CRUD operations. The CLI supports proxy configurations for enterprise environments and can run behind corporate firewalls. 
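The check-then-publish sequence from the use cases above condenses to a short pipeline script. A sketch, assuming `wgc` is available on the runner and `COSMO_API_KEY` is injected by the CI system; `run` is a dry-run shim that echoes each command so the flow reads top to bottom — inline the commands directly in a real pipeline:

```shell
#!/bin/sh
# COSMO_API_KEY must be set in the CI environment for authentication.
run() { echo "$@"; }   # dry-run shim: replace with direct execution in CI

# Pre-merge: validate composition and detect breaking changes against client traffic.
run wgc subgraph check products -n staging --schema schema.graphql

# On merge: publish the schema, triggering recomposition of the federated graph.
run wgc subgraph publish products -n staging --schema schema.graphql
```

In a real pipeline, a nonzero exit code from the check step is what fails the build before a breaking change can merge.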
+ +### Key Technical Features +- **Authentication**: API key-based auth via `COSMO_API_KEY` environment variable +- **Proxy Support**: `HTTPS_PROXY` and `HTTP_PROXY` environment variables (v0.63.0+) +- **Schema Validation**: Breaking change detection and composition error checking +- **Label-Based Composition**: Flexible subgraph-to-federated-graph mapping using labels +- **VCS Integration**: Environment variables for commit, branch, and author context +- **Multiple Subscription Protocols**: WebSocket (graphql-ws, subscription-transport-ws), SSE, SSE POST + +### Integration Points +- **CI/CD Systems**: GitHub Actions, GitLab CI, Jenkins, CircleCI, etc. +- **Version Control**: Git integration via VCS context environment variables +- **Cosmo Control Plane**: Full API access for graph management +- **Cosmo Router**: Configuration management and token generation +- **Event Systems**: Support for Event-Driven Graphs with Kafka, NATS, Redis + +### Requirements & Prerequisites +- Node.js LTS version installed +- `COSMO_API_KEY` environment variable set +- Network access to Cosmo Control Plane (cloud.wundergraph.com or self-hosted) + +--- + +## Documentation References + +- Primary docs: `/docs/cli/intro` +- Essentials guide: `/docs/cli/essentials` +- Namespace management: `/docs/cli/namespace` +- Subgraph commands: `/docs/cli/subgraph` +- Federated graph commands: `/docs/cli/federated-graph` +- Monograph commands: `/docs/cli/monograph` +- Router commands: `/docs/cli/router` +- Plugin commands: `/docs/cli/router/plugin` +- gRPC service commands: `/docs/cli/grpc-service` +- Operations commands: `/docs/cli/operations` +- Proposal commands: `/docs/cli/proposal` +- Authentication commands: `/docs/cli/auth` +- CI/CD Tutorial: `/docs/tutorial/pr-based-workflow-for-federation` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL CLI +- Federation CLI +- Cosmo CLI +- wgc + +### Secondary Keywords +- GraphQL schema management +- Federated graph CLI +- Subgraph management 
+- GraphQL CI/CD + +### Related Search Terms +- How to manage federated GraphQL from command line +- GraphQL schema deployment automation +- Federated GraphQL CI/CD integration +- GraphQL namespace management + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/compliance/advanced-data-privacy.md b/capabilities/compliance/advanced-data-privacy.md new file mode 100644 index 00000000..4ff9e77f --- /dev/null +++ b/capabilities/compliance/advanced-data-privacy.md @@ -0,0 +1,164 @@ +# Advanced Data Privacy + +Field-level data obfuscation with custom value renderers for fine-grained access control. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-compliance-003` | +| **Category** | Compliance | +| **Status** | GA | +| **Availability** | Pro / Enterprise | +| **Related Capabilities** | `cap-compliance-001`, `cap-compliance-002`, `cap-compliance-004` | + +--- + +## Quick Reference + +### Name +Advanced Data Privacy + +### Tagline +Field-level data obfuscation for sensitive information protection. + +### Elevator Pitch +Cosmo's Advanced Data Privacy feature enables organizations to implement custom data obfuscation at the field level within GraphQL responses. Using Custom Modules, you can dynamically mask sensitive information like credit card numbers, social security numbers, or personal data based on user roles, request context, or any custom logic. This provides fine-grained data access control without modifying your subgraph implementations. + +--- + +## Problem & Solution + +### The Problem +Organizations often need to restrict access to sensitive data fields based on user roles or context. Developers debugging production issues might need to see query structures without exposing actual customer data. Data scientists analyzing patterns need to work with realistic data structures while protecting PII. 
AI models interacting with APIs should not have access to sensitive personal information. Implementing these restrictions at the subgraph level requires significant code changes and creates maintenance overhead. + +### The Solution +Cosmo's Custom Modules allow you to implement custom value renderers that intercept and transform GraphQL response data at the Router level. You can obfuscate specific field types (String, Int, Float) or implement sophisticated logic based on authentication claims, user roles, or request context. This provides centralized, consistent data privacy enforcement without touching subgraph code. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Implement data masking in each subgraph | Centralized obfuscation at the Router | +| Complex authorization logic in resolvers | Role-based obfuscation via request context | +| Separate data pipelines for AI/analytics | Same API with dynamic data filtering | +| Code changes required for new privacy rules | Configuration-based privacy policies | + +--- + +## Key Benefits + +1. **Centralized Privacy Control**: Implement data obfuscation once at the Router level, applied consistently across all subgraphs +2. **Role-Based Access**: Dynamically apply different obfuscation rules based on user roles or authentication claims +3. **AI-Safe Data Access**: Protect sensitive information when exposing APIs to AI models or third-party systems +4. **Developer-Friendly Debugging**: Allow developers to debug production queries without exposing actual customer data +5. 
**No Subgraph Changes**: Add privacy controls without modifying existing subgraph implementations + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer, Security Engineer +- **Pain Points**: Implementing consistent data masking across distributed services; providing safe data access for different user types; protecting data when integrating with AI systems +- **Goals**: Implement field-level access control; enable secure debugging workflows; protect sensitive data in all contexts + +### Secondary Personas +- Data scientists needing to work with anonymized production data +- DevOps engineers setting up safe development environments +- AI/ML engineers integrating GraphQL APIs with models + +--- + +## Use Cases + +### Use Case 1: Developer Debugging with Masked Data +**Scenario**: Developers need to debug production GraphQL queries without seeing actual customer data like names, addresses, or payment information. +**How it works**: Implement a custom value renderer that checks for the "developer" role in authentication claims. When a developer makes a request, all String fields are replaced with "xxx" and numeric fields with placeholder values, preserving query structure while hiding data. +**Outcome**: Developers can debug query execution, response structures, and performance issues without exposure to sensitive customer information. + +### Use Case 2: AI Model Data Access Control +**Scenario**: An organization wants to allow AI models to query their GraphQL API but needs to prevent exposure of sensitive fields like SSN, credit card numbers, or medical information. +**How it works**: Create a custom value renderer that identifies requests from AI systems (via headers or authentication) and obfuscates fields marked as sensitive. The AI model receives realistic response structures with placeholder data for protected fields. 
+**Outcome**: AI models can interact with the API for non-sensitive use cases while being prevented from accessing or learning from protected data. + +### Use Case 3: Data Science Analytics Pipeline +**Scenario**: Data scientists need to analyze GraphQL query patterns and response structures using production data shapes without accessing actual PII. +**How it works**: Deploy a custom module that applies consistent obfuscation to all responses for data science service accounts. String values become masked, numbers become placeholder values, while maintaining accurate types and field structures. +**Outcome**: Data scientists can develop and test analytics pipelines using production-realistic data without compliance concerns. + +--- + +## Technical Summary + +### How It Works +Cosmo's Custom Modules feature allows you to implement a `CustomFieldValueRenderer` interface that intercepts response data before it's sent to the client. The renderer receives each field value with its GraphQL type information and can transform or replace the value based on any logic you implement. This is combined with the `RouterOnRequest` hook to apply different renderers based on request context. 
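The per-request decision described above can be sketched in plain Go, independent of the Cosmo module API (the `Claims` type and role names here are illustrative assumptions):

```go
package main

import "fmt"

// Claims represents decoded authentication claims on a request.
// The type and the role names are illustrative, not the Cosmo module API.
type Claims struct {
	Roles []string
}

// shouldObfuscate decides, per request, whether the masking renderer
// should be applied -- e.g. for developers debugging production data
// or for requests identified as coming from AI systems.
func shouldObfuscate(c Claims) bool {
	for _, r := range c.Roles {
		if r == "developer" || r == "ai-agent" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldObfuscate(Claims{Roles: []string{"developer"}})) // true
	fmt.Println(shouldObfuscate(Claims{Roles: []string{"admin"}}))     // false
}
```

In a real module, a check like this would run in the `RouterOnRequest` hook to select which value renderer handles the response.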
+ +### Key Technical Features +- Custom value renderer interface for field-level control +- Access to GraphQL type information (String, Int, Float, custom types) +- Integration with authentication claims and request context +- Per-request renderer selection via `RouterOnRequest` hook +- Written in Go for high performance + +### Code Example +```go +func (c *CustomValueRenderer) RenderFieldValue(ctx *resolve.Context, value resolve.FieldValue, out io.Writer) error { + var err error + switch value.Type { + case "String": + _, err = out.Write([]byte(`"xxx"`)) + case "Int", "Float": + _, err = out.Write([]byte(`123`)) + default: + _, err = out.Write(value.Data) + } + return err +} +``` + +### Integration Points +- Cosmo Router Custom Modules system +- Authentication/Authorization middleware +- Request context for conditional logic + +### Requirements & Prerequisites +- Custom Module development capability (Go) +- Understanding of GraphQL response structure +- Router deployment with custom module support + +--- + +## Documentation References + +- Primary docs: `/docs/router/advanced-data-privacy` +- Custom Modules guide: `/docs/router/custom-modules` +- Router configuration: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL data masking +- Field-level obfuscation +- PII protection GraphQL + +### Secondary Keywords +- Data privacy API +- Role-based data access +- AI data protection + +### Related Search Terms +- How to mask sensitive data in GraphQL +- GraphQL field-level security +- Obfuscate GraphQL response data + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/compliance/compliance-certifications.md b/capabilities/compliance/compliance-certifications.md new file mode 100644 index 00000000..c506f4bf --- /dev/null +++ b/capabilities/compliance/compliance-certifications.md @@ -0,0 +1,151 @@ +# Compliance Certifications + 
+Enterprise-grade security and regulatory compliance for GraphQL APIs. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-compliance-001` | +| **Category** | Compliance | +| **Status** | GA | +| **Availability** | Pro / Enterprise | +| **Related Capabilities** | `cap-compliance-002`, `cap-compliance-003`, `cap-compliance-004` | + +--- + +## Quick Reference + +### Name +Compliance Certifications + +### Tagline +Enterprise-grade security certifications for regulated industries. + +### Elevator Pitch +Cosmo provides comprehensive compliance certifications including SOC 2 Type II, ISO 27001, GDPR, and HIPAA support, enabling organizations in regulated industries to deploy federated GraphQL with confidence. With built-in security controls, audit logging, and privacy safeguards, Cosmo meets the highest industry standards for data protection. + +--- + +## Problem & Solution + +### The Problem +Organizations in regulated industries (healthcare, finance, government) face significant barriers when adopting new API technologies. They need to demonstrate compliance with multiple frameworks (SOC 2, ISO 27001, GDPR, HIPAA) to auditors, customers, and partners. Building and maintaining compliant GraphQL infrastructure in-house requires substantial investment in security controls, documentation, and ongoing audit processes. + +### The Solution +Cosmo comes with pre-built compliance certifications and security controls that satisfy enterprise requirements out of the box. The platform implements security by design with continuous fuzz testing, cryptographic configuration validation, and strict data privacy controls. Organizations can leverage Cosmo's existing certifications and compliance documentation to accelerate their own audit processes. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Months spent building compliant GraphQL infrastructure | Deploy with enterprise-grade compliance from day one | +| Expensive security audits for custom solutions | Leverage existing SOC 2 Type II certification | +| Uncertainty about data privacy guarantees | Built-in privacy safeguards that cannot be bypassed | +| Complex compliance documentation requirements | Comprehensive compliance reports available upon request | + +--- + +## Key Benefits + +1. **Accelerated Compliance**: Leverage Cosmo's existing SOC 2 Type II certification to speed up your own audit processes +2. **Security by Design**: Continuous fuzz testing, cryptographic validation, and HMAC-SHA256 signing prevent tampering and ensure configuration integrity +3. **Data Privacy Guarantees**: Request/response data never leaves your infrastructure, with built-in anonymization that cannot be bypassed +4. **Regulatory Coverage**: Support for GDPR, HIPAA, and ISO 27001 frameworks enables deployment in highly regulated industries +5. 
**Enterprise Insurance**: $5M E&O and Cyber Insurance Coverage provides additional risk mitigation + +--- + +## Target Audience + +### Primary Persona +- **Role**: Security/Compliance Officer, CISO +- **Pain Points**: Demonstrating compliance to auditors; ensuring new technologies meet regulatory requirements; managing vendor risk +- **Goals**: Adopt modern API technologies without compromising compliance posture; reduce time to compliance certification + +### Secondary Personas +- Platform Engineers in regulated industries who need to deploy compliant infrastructure +- CTOs and VPs of Engineering evaluating GraphQL federation for enterprise adoption +- Legal and Procurement teams assessing vendor compliance + +--- + +## Use Cases + +### Use Case 1: Healthcare Organization Deploying GraphQL +**Scenario**: A healthcare company wants to modernize their API layer with GraphQL federation but must maintain HIPAA compliance for handling PHI. +**How it works**: Deploy Cosmo with default privacy settings that anonymize IP addresses and prevent sensitive data from leaving the infrastructure. Leverage Cosmo's HIPAA-compliant infrastructure and compliance documentation for audit purposes. +**Outcome**: GraphQL federation deployed in production within compliance requirements, with audit documentation ready for regulators. + +### Use Case 2: Financial Services SOC 2 Audit +**Scenario**: A fintech company needs to pass SOC 2 Type II audit and their GraphQL infrastructure is in scope. +**How it works**: Use Cosmo's existing SOC 2 Type II certification as part of vendor compliance. Access Cosmo's SOC 2 report upon request to demonstrate vendor due diligence. Leverage built-in audit logging and RBAC controls. +**Outcome**: Vendor compliance section of SOC 2 audit completed efficiently with documented controls and certifications. 
+ +### Use Case 3: GDPR Compliance for European Operations +**Scenario**: An organization operating in Europe needs to ensure their GraphQL infrastructure complies with GDPR data protection requirements. +**How it works**: Deploy Cosmo with self-hosted Router to keep all request data within your infrastructure. Configure IP anonymization (redact or hash) to minimize PII collection. Use namespace isolation to segregate data by region if needed. +**Outcome**: GraphQL infrastructure deployed in compliance with GDPR requirements for data minimization and privacy by design. + +--- + +## Technical Summary + +### How It Works +Cosmo implements a comprehensive security framework with multiple layers of protection. The Router can be self-hosted to keep all request/response data within your infrastructure, while only anonymized metadata is sent to the Control Plane for analytics. Configuration updates are cryptographically validated using HMAC-SHA256 signatures to prevent tampering. Access is controlled through RBAC and SSO (OIDC/SAML) integration. 
+ +### Key Technical Features +- SOC 2 Type II certified security controls +- HMAC-SHA256 configuration signing and validation +- IP anonymization enabled by default (redact or hash options) +- SSO support via OIDC and SAML +- Role-Based Access Control (RBAC) +- Continuous fuzz testing for vulnerabilities +- Webhook signature verification (SHA-256) +- Domain-based subgraph URL validation + +### Integration Points +- OIDC/SAML identity providers for SSO +- Existing audit and SIEM systems via logging +- OpenTelemetry-compatible observability stack + +### Requirements & Prerequisites +- Enterprise plan for full compliance features +- Self-hosted Router deployment for maximum data isolation (hybrid deployment) + +--- + +## Documentation References + +- Primary docs: `/docs/security-and-compliance` +- Router compliance: `/docs/router/compliance-and-data-management` +- Config validation: `/docs/router/security/config-validation-and-signing` + +--- + +## Keywords & SEO + +### Primary Keywords +- SOC 2 GraphQL +- HIPAA compliant GraphQL +- Enterprise GraphQL compliance + +### Secondary Keywords +- GDPR GraphQL API +- ISO 27001 API management +- Compliant federation + +### Related Search Terms +- GraphQL security certifications +- Regulated industry GraphQL +- Enterprise API compliance + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/compliance/ip-anonymization.md b/capabilities/compliance/ip-anonymization.md new file mode 100644 index 00000000..79b7e2c6 --- /dev/null +++ b/capabilities/compliance/ip-anonymization.md @@ -0,0 +1,160 @@ +# IP Anonymization + +Protect user privacy by automatically redacting or hashing IP addresses in telemetry and logs. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-compliance-002` | +| **Category** | Compliance | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-compliance-001`, `cap-compliance-003`, `cap-compliance-004` | + +--- + +## Quick Reference + +### Name +IP Anonymization + +### Tagline +Privacy-first telemetry with automatic IP redaction. + +### Elevator Pitch +Cosmo Router automatically anonymizes IP addresses by default, ensuring user privacy without any configuration. Choose between complete redaction for maximum privacy or hashing for anonymous analytics. This built-in feature ensures compliance with privacy regulations like GDPR while maintaining full observability capabilities. + +--- + +## Problem & Solution + +### The Problem +IP addresses are considered personally identifiable information (PII) under privacy regulations like GDPR. Organizations collecting telemetry data, logs, or analytics from their GraphQL APIs face the challenge of maintaining useful observability while protecting user privacy. Manual anonymization is error-prone and often overlooked, creating compliance risks. + +### The Solution +Cosmo Router enables IP anonymization by default, ensuring that no raw IP addresses are exported from your infrastructure. With a simple configuration option, you can choose between complete redaction (removing IP addresses entirely) or hashing (allowing anonymous user tracking for analytics while preserving privacy). + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manual implementation of IP anonymization | Automatic, enabled by default | +| Risk of accidentally exposing user IPs | Privacy safeguards built into the platform | +| Complex configuration for GDPR compliance | Simple one-line configuration | +| Inconsistent anonymization across services | Consistent privacy protection at the gateway | + +--- + +## Key Benefits + +1. 
**Privacy by Default**: IP anonymization is enabled out of the box, no configuration required for basic protection +2. **Flexible Anonymization Methods**: Choose between redaction (complete removal) or hashing (anonymous tracking) based on your needs +3. **Regulatory Compliance**: Built-in support for GDPR data minimization requirements +4. **Zero Performance Impact**: Anonymization happens efficiently at the router level without affecting request latency +5. **Consistent Protection**: All telemetry, logs, and analytics are protected uniformly + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer, DevOps Engineer +- **Pain Points**: Implementing privacy controls across distributed systems; ensuring consistent anonymization; maintaining compliance without sacrificing observability +- **Goals**: Deploy privacy-compliant infrastructure with minimal configuration; meet regulatory requirements efficiently + +### Secondary Personas +- Data Protection Officers ensuring GDPR compliance +- Security engineers implementing privacy controls +- Compliance teams auditing data collection practices + +--- + +## Use Cases + +### Use Case 1: GDPR-Compliant Analytics +**Scenario**: A European e-commerce company needs to collect API analytics while complying with GDPR data minimization principles. +**How it works**: Deploy Cosmo Router with default IP anonymization settings. All analytics and telemetry automatically exclude raw IP addresses. For user journey analysis, switch to hash mode to enable anonymous user tracking. +**Outcome**: Full analytics capabilities maintained while meeting GDPR requirements for data minimization. + +### Use Case 2: Privacy-First Logging +**Scenario**: An organization needs to maintain detailed request logs for debugging but wants to minimize PII in log files. +**How it works**: Cosmo Router's default redact mode removes IP addresses from all request logs. 
Logs contain all other useful debugging information (user agent, request path, response status, latency) without exposing user IP addresses. +**Outcome**: Comprehensive request logging for operational needs without PII exposure. + +### Use Case 3: Anonymous User Analytics +**Scenario**: A SaaS platform wants to analyze user behavior patterns across their GraphQL API without storing identifiable information. +**How it works**: Configure IP anonymization with hash mode. IP addresses are converted to consistent hashes, allowing user session tracking and behavior analysis without storing the actual IP address. +**Outcome**: Rich analytics on user behavior patterns while maintaining user privacy and compliance. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router intercepts all incoming requests and applies IP anonymization before the IP address is used in any telemetry, logging, or analytics. Two methods are available: + +- **Redact**: Completely removes the IP address, replacing it with a placeholder value +- **Hash**: Applies a one-way hash function to the IP address, producing a consistent anonymous identifier + +### Key Technical Features +- Enabled by default with redact mode +- Configurable via YAML configuration or environment variables +- Applied consistently across all telemetry exports (OTEL traces, metrics) +- Applied to request logging +- No plaintext IP addresses in exported data + +### Configuration Example +```yaml +compliance: + anonymize_ip: + enabled: true + method: redact # or "hash" +``` + +### Integration Points +- OpenTelemetry trace exports +- OpenTelemetry metric exports +- Request logging system +- Cosmo Analytics + +### Requirements & Prerequisites +- Cosmo Router deployment +- No additional configuration required for default behavior + +--- + +## Documentation References + +- Primary docs: `/docs/router/compliance-and-data-management` +- Router configuration: `/docs/router/configuration` +- Security overview: 
`/docs/security-and-compliance` + +--- + +## Keywords & SEO + +### Primary Keywords +- IP anonymization +- GraphQL privacy +- GDPR compliant API + +### Secondary Keywords +- PII protection +- Privacy by design +- Data anonymization + +### Related Search Terms +- How to anonymize IP addresses in GraphQL +- GDPR GraphQL logging +- Privacy compliant telemetry + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/compliance/variable-export-control.md b/capabilities/compliance/variable-export-control.md new file mode 100644 index 00000000..d82290d6 --- /dev/null +++ b/capabilities/compliance/variable-export-control.md @@ -0,0 +1,160 @@ +# Variable Export Control + +Control which GraphQL variables are exported in telemetry for privacy and security. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-compliance-004` | +| **Category** | Compliance | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-compliance-001`, `cap-compliance-002`, `cap-compliance-003` | + +--- + +## Quick Reference + +### Name +Variable Export Control + +### Tagline +Opt-in variable export for secure query replay. + +### Elevator Pitch +Cosmo provides granular control over whether GraphQL variables are included in telemetry exports. By default, variables are excluded to prevent sensitive data from being captured in traces. When you need query replay capabilities for debugging, you can explicitly opt in to variable export, maintaining a security-first approach while enabling powerful debugging workflows. + +--- + +## Problem & Solution + +### The Problem +GraphQL variables often contain sensitive request data such as user inputs, authentication tokens, personal information, or business-critical parameters. 
When telemetry data (traces, logs) captures these variables, it can create compliance risks, security vulnerabilities, and privacy violations. Organizations need to balance the debugging value of seeing exact query parameters against the risk of exposing sensitive data in observability systems. + +### The Solution +Cosmo Router excludes GraphQL variables from telemetry exports by default, ensuring sensitive data never leaves your infrastructure through observability channels. When debugging requires exact query reproduction, you can explicitly enable variable export through configuration. This opt-in approach maintains security by default while providing flexibility when needed. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Variables captured by default in traces | Variables excluded by default | +| Risk of sensitive data in observability systems | Security-first telemetry configuration | +| Manual filtering of variable data | Automatic exclusion, opt-in inclusion | +| Difficult to reproduce queries without variables | Enable variable export when needed for debugging | + +--- + +## Key Benefits + +1. **Secure by Default**: Variables are excluded from telemetry exports by default, preventing accidental data exposure +2. **Explicit Opt-In**: Enable variable export only when needed, maintaining conscious control over data exposure +3. **Query Replay Support**: When enabled, variables are captured to support exact query reproduction in Cosmo Studio +4. **Compliance Friendly**: Default behavior aligns with data minimization principles required by GDPR and other regulations +5. 
**Simple Configuration**: Single configuration option controls variable export behavior + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer, DevOps Engineer +- **Pain Points**: Balancing debugging capabilities with security requirements; preventing sensitive data from appearing in logs and traces; maintaining compliance while enabling observability +- **Goals**: Secure telemetry configuration by default; ability to enable detailed debugging when needed + +### Secondary Personas +- Security engineers auditing telemetry data exposure +- Compliance officers ensuring data minimization +- Developers needing query replay for debugging + +--- + +## Use Cases + +### Use Case 1: Production Environment with Sensitive Data +**Scenario**: A financial services company processes transactions through their GraphQL API, with variables containing account numbers, amounts, and user identifiers. +**How it works**: Deploy Cosmo Router with default settings. Variables are automatically excluded from all telemetry exports. Traces capture operation names, timing, and structure without exposing sensitive transaction data. +**Outcome**: Full observability into API performance and errors without risk of sensitive financial data appearing in telemetry systems. + +### Use Case 2: Debugging Environment with Query Replay +**Scenario**: A development team needs to reproduce and debug failing queries from production, requiring access to exact variable values. +**How it works**: Enable variable export in a dedicated debugging environment or during specific debugging sessions. Variables are captured in traces, enabling the Studio's query replay feature to reproduce exact queries. +**Outcome**: Developers can reproduce exact production scenarios for debugging while maintaining secure defaults in production. 
+ +### Use Case 3: Compliance Audit for Data Handling +**Scenario**: An organization needs to demonstrate to auditors that their API infrastructure does not capture or transmit sensitive customer data unnecessarily. +**How it works**: Show auditors the default Cosmo Router configuration where variable export is disabled. Demonstrate that operation content is normalized (user data removed) and variables are excluded from all telemetry. +**Outcome**: Clear audit trail showing data minimization practices are enforced by default at the infrastructure level. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router processes GraphQL requests and exports telemetry data to OpenTelemetry-compatible systems. By default, the `export_graphql_variables` setting is disabled, meaning GraphQL variables are stripped from trace spans before export. Operation content is also normalized to remove any embedded user data. When explicitly enabled, variables are included in trace attributes, enabling query replay features in Cosmo Studio. 
+ +### Configuration Example +```yaml +version: "1" +telemetry: + tracing: + export_graphql_variables: true # Default: false +``` + +Environment variable: `TRACING_EXPORT_GRAPHQL_VARIABLES` + +### Key Technical Features +- Default exclusion of variables from OTEL traces +- Operation content normalization (user data removed) +- Single configuration toggle for variable export +- Environment variable support for deployment flexibility +- Integration with Cosmo Studio query replay + +### Integration Points +- OpenTelemetry trace exports +- Cosmo Studio distributed tracing +- Studio query replay feature + +### Requirements & Prerequisites +- Cosmo Router deployment +- Telemetry export configuration +- Studio access for query replay features (when variables enabled) + +--- + +## Documentation References + +- Primary docs: `/docs/router/compliance-and-data-management` +- Tracing configuration: `/docs/router/observability/tracing` +- Studio distributed tracing: `/docs/studio/analytics/distributed-tracing` +- Router configuration: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL variable privacy +- Telemetry data control +- OTEL variable filtering + +### Secondary Keywords +- Query replay security +- Trace data minimization +- GraphQL tracing privacy + +### Related Search Terms +- How to exclude variables from GraphQL traces +- Secure GraphQL telemetry +- GDPR compliant GraphQL tracing + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/deployment/cluster-management.md b/capabilities/deployment/cluster-management.md new file mode 100644 index 00000000..e9dabb70 --- /dev/null +++ b/capabilities/deployment/cluster-management.md @@ -0,0 +1,140 @@ +# Cluster Management + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-deploy-008` | +| **Category** | Deployment | +| **Status** | GA | 
+| **Availability** | Pro / Enterprise | +| **Related Capabilities** | `cap-deploy-001`, `cap-deploy-007` | + +--- + +## Quick Reference + +### Name +Router Cluster Management + +### Tagline +Monitor and track all running routers from Studio. + +### Elevator Pitch +Router Cluster Management provides centralized visibility into all your running router instances through Cosmo Studio. Monitor vital metrics like CPU and memory utilization, verify deployed graph compositions, and track router versions and uptime across your entire fleet. Automated OpenTelemetry instrumentation means zero additional setup. + +--- + +## Problem & Solution + +### The Problem +Operating a fleet of GraphQL routers across multiple environments makes it difficult to maintain visibility into overall health. Teams struggle to answer basic questions: How many routers are running? What versions are deployed? Which composition is active? Are there any resource issues? Without centralized visibility, operational issues go undetected. + +### The Solution +Cosmo Studio's Cluster Management dashboard automatically displays all running router instances using data from built-in OpenTelemetry instrumentation. View real-time health status, resource utilization, deployed versions, and graph compositions from a single interface. Group routers by logical clusters and drill into individual instance details. + +### Before & After + +| Before Cluster Management | With Cluster Management | +|--------------------------|------------------------| +| Manual tracking of router instances | Automatic discovery and display | +| Scattered monitoring across tools | Centralized dashboard in Studio | +| Unknown deployment versions | Clear version visibility per instance | +| Resource issues discovered late | Real-time CPU/memory monitoring | +| No composition verification | Confirm deployed graph composition | + +--- + +## Key Benefits + +1. 
**Automatic Discovery**: Routers automatically appear in the dashboard through OpenTelemetry instrumentation +2. **Real-Time Vitals**: Monitor CPU and memory utilization with trend indicators +3. **Version Visibility**: See deployed router versions across your entire fleet +4. **Composition Verification**: Confirm which graph composition each router is running +5. **Logical Clustering**: Group routers by cluster name for organized fleet management + +--- + +## Target Audience + +### Primary Persona +- **Role**: SRE, Platform Engineer, Operations Engineer +- **Pain Points**: Lack of visibility into router fleet, difficulty tracking deployments, manual inventory management +- **Goals**: Centralized operational view, proactive issue detection, deployment verification + +### Secondary Personas +- Engineering Managers needing deployment status for planning +- Security Engineers auditing deployed versions +- DevOps Engineers verifying rollouts + +--- + +## Use Cases + +### Use Case 1: Deployment Verification +**Scenario**: Verify that a new router version has rolled out across all production instances +**How it works**: Open Cluster Management dashboard, filter by production cluster, confirm all instances show new version +**Outcome**: Confident verification that deployment completed successfully + +### Use Case 2: Resource Issue Detection +**Scenario**: Proactively identify routers experiencing resource pressure +**How it works**: Monitor dashboard for high CPU/memory utilization, observe trend arrows, investigate affected instances +**Outcome**: Early detection and resolution of resource issues before they impact users + +### Use Case 3: Multi-Cluster Fleet Overview +**Scenario**: Operations team needs visibility across dev, staging, and production routers +**How it works**: Configure routers with appropriate CLUSTER_NAME, view all clusters in dashboard, drill into specific clusters +**Outcome**: Unified operational view across all environments from single interface + 
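+The cluster grouping used in Use Case 3 is driven entirely by environment variables on the router process. A minimal sketch, assuming a Compose-style deployment (the image tag and all values are placeholders; the variable names are the ones documented for the router):
+
+```yaml
+# Sketch: one router instance reporting into a named cluster.
+# TELEMETRY_SERVICE_NAME, CLUSTER_NAME, and INSTANCE_ID are the documented
+# variables; every value below is illustrative.
+services:
+  router:
+    image: ghcr.io/wundergraph/cosmo/router:latest  # illustrative tag
+    environment:
+      TELEMETRY_SERVICE_NAME: "checkout-router"
+      CLUSTER_NAME: "production-eu"
+      INSTANCE_ID: "router-eu-1"
+```
+
+With this in place, the instance should surface under the `production-eu` cluster in the Studio dashboard once it begins reporting telemetry.
+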
+--- + +## Technical Summary + +### How It Works +Routers with version 0.66.1+ automatically send periodic telemetry data to Cosmo Cloud via OpenTelemetry instrumentation. This data includes uptime, resource utilization, version information, and deployment details. Cosmo Studio aggregates this data to display all running instances. Instances that don't report within 45 seconds are considered offline. + +### Key Technical Features +- Automatic OpenTelemetry-based data collection +- Service name configuration via `TELEMETRY_SERVICE_NAME` +- Instance ID configuration via `INSTANCE_ID` environment variable +- Cluster grouping via `CLUSTER_NAME` environment variable +- CPU and memory utilization with trend indicators +- Uptime tracking (process and graph composition) +- Online/offline status detection (45-second threshold) + +### Integration Points +- OpenTelemetry (automatic instrumentation) +- Cosmo Studio dashboard +- Environment variable configuration + +### Requirements & Prerequisites +- Router version 0.66.1 or later +- Network connectivity from router to Cosmo Cloud +- Optional: Environment variables for service name, instance ID, and cluster name + +--- + +## Documentation References + +- Primary docs: `/docs/studio/cluster-management` +- Router configuration: `/docs/router/configuration` +- OpenTelemetry setup: `/docs/router/observability` + +--- + +## Keywords & SEO + +### Primary Keywords +- Router cluster management +- GraphQL fleet monitoring +- Router health monitoring + +### Secondary Keywords +- Federation fleet management +- Router instance tracking +- Cosmo Studio monitoring + +### Related Search Terms +- Monitor GraphQL routers +- Router fleet visibility +- Track router deployments diff --git a/capabilities/deployment/cosmo-cloud.md b/capabilities/deployment/cosmo-cloud.md new file mode 100644 index 00000000..8f6afc26 --- /dev/null +++ b/capabilities/deployment/cosmo-cloud.md @@ -0,0 +1,138 @@ +# Cosmo Cloud + +## Metadata + +| Field | Value | 
+|-------|-------| +| **Capability ID** | `cap-deploy-001` | +| **Category** | Deployment | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-deploy-002`, `cap-deploy-008` | + +--- + +## Quick Reference + +### Name +Cosmo Cloud + +### Tagline +Fully managed GraphQL federation platform. + +### Elevator Pitch +Cosmo Cloud is a fully managed SaaS platform that handles all critical components of the Cosmo Platform, eliminating infrastructure worries so you can focus on building. With a generous free tier of 10 million monthly requests and options for Hybrid SaaS and On-Premises deployments, it scales with your needs. + +--- + +## Problem & Solution + +### The Problem +Managing a GraphQL federation platform requires significant operational expertise, infrastructure investment, and ongoing maintenance. Teams spend valuable engineering time on infrastructure management instead of building features. Scaling, security updates, high availability, and compliance requirements add complexity that distracts from core business activities. + +### The Solution +Cosmo Cloud provides a fully managed service that operates all critical components of the Cosmo Platform. You only need to run your routers while WunderGraph handles the control plane, studio, CDN, and all supporting infrastructure. This allows teams to concentrate on building their GraphQL APIs without infrastructure overhead. 
+ +### Before & After + +| Before Cosmo Cloud | With Cosmo Cloud | +|-------------------|------------------| +| Deploy and manage control plane, databases, CDN | Managed infrastructure handled by WunderGraph | +| Configure high availability and disaster recovery | Built-in reliability and redundancy | +| Handle security patches and upgrades | Automatic platform updates | +| Scale infrastructure manually | Auto-scaling based on demand | +| Dedicated DevOps resources for maintenance | Focus entirely on API development | + +--- + +## Key Benefits + +1. **Zero Infrastructure Management**: All critical platform components are managed for you, eliminating operational overhead +2. **Generous Free Tier**: Start with 10 million monthly requests at no cost, making it accessible for teams of all sizes +3. **Enterprise-Ready Options**: Custom plans available for Hybrid SaaS, On-Premises deployments, and extended data retention +4. **Compliance Support**: Options for strict compliance requirements and data sovereignty needs +5. **Focus on Building**: Redirect engineering resources from infrastructure to feature development + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer, Engineering Manager +- **Pain Points**: Limited DevOps resources, need to move fast without infrastructure complexity, compliance requirements +- **Goals**: Ship GraphQL APIs quickly, minimize operational burden, ensure platform reliability + +### Secondary Personas +- API Developers who want to focus on schema design and resolver implementation +- Startup CTOs who need enterprise-grade infrastructure without dedicated ops teams +- Enterprise architects evaluating managed vs. 
self-hosted federation solutions + +--- + +## Use Cases + +### Use Case 1: Startup Rapid Development +**Scenario**: A startup wants to adopt GraphQL federation but lacks dedicated DevOps resources +**How it works**: Sign up for Cosmo Cloud, create federated graphs through the Studio interface, deploy subgraphs, and let WunderGraph handle all infrastructure +**Outcome**: Production-ready GraphQL federation in days instead of weeks, with room to grow within the free tier + +### Use Case 2: Enterprise Compliance Deployment +**Scenario**: An enterprise needs GraphQL federation with strict data retention and compliance requirements +**How it works**: Contact WunderGraph for a custom enterprise plan with extended data retention, dedicated support, and compliance certifications +**Outcome**: Enterprise-grade federation platform that meets regulatory requirements without in-house infrastructure investment + +### Use Case 3: Hybrid SaaS Architecture +**Scenario**: A company wants managed control plane benefits but needs to run routers in their own environment for data locality +**How it works**: Use Cosmo Cloud for the control plane and studio while deploying routers in your own infrastructure (AWS, GCP, Azure, on-premises) +**Outcome**: Best of both worlds with managed platform operations and data control in your environment + +--- + +## Technical Summary + +### How It Works +Cosmo Cloud provides a hosted control plane that manages your federated graph configurations, schema registry, and composition. The Studio interface enables graph management, monitoring, and analytics. You deploy routers in your preferred environment that connect to Cosmo Cloud to fetch configurations and report telemetry. 
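+
+The fetch-and-report loop described above is configured on the router side. A hedged sketch of a router `config.yaml` (key names should be verified against the router configuration reference; values are placeholders):
+
+```yaml
+version: "1"
+graph:
+  token: "${GRAPH_API_TOKEN}"  # authenticates config fetches and telemetry reporting
+poll_interval: 10s             # assumed knob for how often new configurations are fetched
+```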
+ +### Key Technical Features +- Hosted control plane with Platform API and Node API +- Web-based Studio for management and monitoring +- CDN-backed configuration distribution +- Integrated schema registry and composition +- Real-time analytics and observability + +### Integration Points +- Router deployment in any environment (cloud, on-premises, edge) +- CI/CD integration via CLI (wgc) and Terraform provider +- OpenTelemetry-compatible observability + +### Requirements & Prerequisites +- Account on cosmo.wundergraph.com +- Router deployed in your environment +- Network connectivity between router and Cosmo Cloud + +--- + +## Documentation References + +- Primary docs: `/docs/deployments-and-hosting/cosmo-cloud` +- Getting started: `/docs/getting-started` +- Router configuration: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- Managed GraphQL federation +- GraphQL platform as a service +- Hosted federation control plane + +### Secondary Keywords +- GraphQL SaaS +- Managed API gateway +- Federation hosting + +### Related Search Terms +- Apollo GraphOS alternative +- Managed supergraph platform +- GraphQL federation hosting diff --git a/capabilities/deployment/docker.md b/capabilities/deployment/docker.md new file mode 100644 index 00000000..6a6be100 --- /dev/null +++ b/capabilities/deployment/docker.md @@ -0,0 +1,151 @@ +# Docker + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-deploy-005` | +| **Category** | Deployment | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-deploy-002`, `cap-deploy-003` | + +--- + +## Quick Reference + +### Name +Docker Deployment + +### Tagline +Run the complete Cosmo stack with docker-compose. + +### Elevator Pitch +Get the complete Cosmo Platform running on your local machine in minutes using Docker Compose. 
Clone the repository, run a single command, and have a fully functional federation environment for development, testing, and evaluation. Ideal for exploring Cosmo features before committing to a production deployment. + +--- + +## Problem & Solution + +### The Problem +Teams evaluating GraphQL federation platforms need a way to quickly run the full stack locally without complex infrastructure setup. Setting up individual components manually is time-consuming and error-prone. Developers need a consistent local environment that mirrors the production architecture for effective development and testing. + +### The Solution +Cosmo provides an official Docker Compose configuration that starts the entire platform stack with a single command. Follow the Getting Started guide to have a working federation environment in minutes. This approach is perfect for demos, local development, proof-of-concept work, and CI testing. + +### Before & After + +| Before Docker Compose | With Docker Compose | +|----------------------|---------------------| +| Manual component installation | Single command startup | +| Complex dependency management | Pre-configured containers | +| Hours of setup time | Running in 3 minutes | +| Environment inconsistencies | Reproducible local environment | +| Difficulty evaluating platform | Quick hands-on exploration | + +--- + +## Key Benefits + +1. **Rapid Setup**: Get the complete Cosmo Platform running in minutes, not hours +2. **Zero Configuration**: Pre-configured docker-compose file handles all component orchestration +3. **Complete Environment**: Run all platform components locally for realistic testing +4. **Evaluation Ready**: Perfect for demos, proofs-of-concept, and platform evaluation +5. 
**Development Workflow**: Consistent local environment for feature development + +--- + +## Target Audience + +### Primary Persona +- **Role**: Developer, Platform Engineer, Technical Evaluator +- **Pain Points**: Need to quickly evaluate federation platforms, complex local setup requirements, time pressure for POCs +- **Goals**: Fast hands-on experience with the platform, consistent development environment + +### Secondary Personas +- Solution Architects evaluating Cosmo for their organization +- DevOps Engineers testing CI/CD pipelines locally +- Technical Writers and Developer Advocates creating tutorials + +--- + +## Use Cases + +### Use Case 1: Platform Evaluation +**Scenario**: A technical lead needs to evaluate Cosmo for their organization +**How it works**: Clone the Cosmo repository, run the Getting Started command, explore Studio, create federated graphs, test queries +**Outcome**: Comprehensive hands-on evaluation of the platform in under an hour + +### Use Case 2: Local Development Environment +**Scenario**: A developer needs to work on subgraph changes with the full federation stack +**How it works**: Start the Docker Compose stack, configure local subgraph to connect to local federation, develop and test +**Outcome**: Rapid iteration on federation features with full local visibility + +### Use Case 3: CI Integration Testing +**Scenario**: Test federation changes in CI before deploying to staging +**How it works**: Spin up Docker Compose stack in CI environment, run integration tests against local federation +**Outcome**: Catch federation issues early in the development cycle + +--- + +## Technical Summary + +### How It Works +The Docker Compose configuration defines all Cosmo Platform components and their dependencies. Running `docker-compose up` starts containers for the control plane, studio, router, metrics collection, and all required storage backends (PostgreSQL, ClickHouse, Redis, etc.). 
Components are pre-configured to communicate with each other on a Docker network. + +### Key Technical Features +- Complete platform stack in containers +- Pre-configured networking between components +- Volume mounts for data persistence +- Environment variable configuration +- Full docker-compose.full.yml available in repository + +### Integration Points +- Docker Desktop or Docker Engine +- Local development tools and IDEs +- CI/CD systems supporting Docker +- Local subgraph services + +### Requirements & Prerequisites +- Docker Desktop or Docker Engine installed +- Docker Compose v2.x +- Git for cloning the repository +- Sufficient system resources (RAM, CPU) +- Ports available for platform components + +--- + +## Competitive Positioning + +### Important Notes +- The Docker Compose stack is intended for development and evaluation, not production deployments +- For production deployments, use Kubernetes with Helm charts or contact WunderGraph for deployment assistance +- For managed production environments, consider Cosmo Cloud + +--- + +## Documentation References + +- Primary docs: `/docs/deployments-and-hosting/docker` +- Getting started: `https://github.com/wundergraph/cosmo#demo-cosmo-on-your-machine-in-3-minutes` +- Docker Compose file: `https://github.com/wundergraph/cosmo/blob/main/docker-compose.full.yml` +- Repository: `https://github.com/wundergraph/cosmo` + +--- + +## Keywords & SEO + +### Primary Keywords +- Cosmo Docker +- GraphQL federation Docker +- Local federation development + +### Secondary Keywords +- Docker Compose GraphQL +- Federation local environment +- GraphQL development setup + +### Related Search Terms +- Run Cosmo locally +- GraphQL federation Docker Compose +- Local federation testing environment diff --git a/capabilities/deployment/kubernetes.md b/capabilities/deployment/kubernetes.md new file mode 100644 index 00000000..5998c305 --- /dev/null +++ b/capabilities/deployment/kubernetes.md @@ -0,0 +1,144 @@ +# Kubernetes (Helm) + 
+## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-deploy-003` | +| **Category** | Deployment | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-deploy-002`, `cap-deploy-004`, `cap-deploy-005` | + +--- + +## Quick Reference + +### Name +Kubernetes (Helm) + +### Tagline +Production-grade Helm charts for any Kubernetes cluster. + +### Elevator Pitch +Deploy the complete Cosmo Platform to any Kubernetes service using production-ready Helm charts. Whether you're using EKS, AKS, GKE, or an on-premises cluster, the Cosmo Helm chart packages everything you need for a reliable, maintainable, and scalable federation deployment. + +--- + +## Problem & Solution + +### The Problem +Deploying a complex platform like GraphQL federation to Kubernetes requires careful orchestration of multiple services, databases, and configurations. Creating and maintaining custom Kubernetes manifests is time-consuming, error-prone, and difficult to keep consistent across environments. Teams need a battle-tested deployment method that works across different Kubernetes providers. + +### The Solution +Cosmo provides a production-grade Helm chart that packages the entire platform as a collection of well-configured sub-charts. Deploy locally with Minikube for development or to any cloud Kubernetes service for production. Auto-generated documentation ensures you always have accurate configuration options, and the modular architecture lets you customize each component. 
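+
+Customization happens through Helm values rather than hand-written manifests. A hedged sketch of an overrides file (the top-level keys mirror the sub-chart names listed in this document, but exact keys should be checked against the chart's auto-generated README):
+
+```yaml
+# values.override.yaml -- illustrative only
+router:
+  replicas: 3           # scale the data plane independently of the control plane
+studio:
+  enabled: true
+postgresql:
+  enabled: false        # disable the bundled Bitnami chart when using an external database
+```
+
+Applied with something like `helm install cosmo ./cosmo -f values.override.yaml` (chart path illustrative).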
+ +### Before & After + +| Before Cosmo Helm Chart | With Cosmo Helm Chart | +|------------------------|----------------------| +| Write custom Kubernetes manifests | Single helm install command | +| Manual configuration of each component | Pre-configured defaults with overrides | +| Inconsistent deployments across environments | Reproducible deployments everywhere | +| Difficult upgrades and rollbacks | Helm-managed version control | +| Undocumented configuration options | Auto-generated configuration docs | + +--- + +## Key Benefits + +1. **Universal Kubernetes Support**: Deploy to EKS, AKS, GKE, or any Kubernetes cluster including Minikube +2. **Production-Ready**: Battle-tested chart structure with sensible defaults for production workloads +3. **Modular Architecture**: Six sub-charts (Controlplane, GraphQL Metrics, OTEL Collector, Studio, Router, CDN) can be configured independently +4. **Integrated Storage**: Pre-configured integration with Bitnami charts for PostgreSQL, Keycloak, ClickHouse, Minio, and Redis +5. 
**Auto-Generated Documentation**: Configuration options are documented automatically with every update + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer, DevOps Engineer, SRE +- **Pain Points**: Complex multi-service deployments, configuration drift, maintaining consistency across environments +- **Goals**: Reliable, repeatable deployments with minimal operational overhead + +### Secondary Personas +- Developers needing local federation environments for testing +- Infrastructure architects designing enterprise deployment strategies +- CI/CD engineers automating deployment pipelines + +--- + +## Use Cases + +### Use Case 1: Production EKS Deployment +**Scenario**: Deploy Cosmo to Amazon EKS for a production environment +**How it works**: Configure the Helm chart values for production (external databases, TLS, resource limits), run `helm install` against your EKS cluster +**Outcome**: Production-ready federation platform running on managed Kubernetes with high availability + +### Use Case 2: Local Development Environment +**Scenario**: Developers need a local federation environment matching production +**How it works**: Start Minikube, deploy Cosmo using the Helm chart with development-focused values +**Outcome**: Full federation platform running locally for rapid development and testing + +### Use Case 3: Multi-Environment CI Pipeline +**Scenario**: Deploy consistently across dev, staging, and production +**How it works**: Maintain environment-specific values files, use Helm in CI/CD pipeline to deploy to each environment +**Outcome**: Identical deployment process across all environments with environment-specific configurations + +--- + +## Technical Summary + +### How It Works +The Cosmo Helm chart is an umbrella chart containing six sub-charts for each platform component. Storage components are integrated through external Bitnami Helm charts. Run `helm install` with your customized values file to deploy the entire stack. 
Each sub-chart can be independently configured or disabled if using external services. + +### Key Technical Features +- Umbrella chart architecture with independent sub-charts +- Controlplane, GraphQL Metrics, OTEL Collector, Studio, Router, CDN components +- Bitnami integration for PostgreSQL, Keycloak, ClickHouse, Minio, Redis +- Values files for different environments +- Helm lifecycle hooks for migrations + +### Integration Points +- Amazon EKS, Azure AKS, Google GKE +- Minikube for local development +- External databases and storage services +- Ingress controllers and load balancers +- CI/CD systems (GitHub Actions, GitLab CI, Jenkins) + +### Requirements & Prerequisites +- Kubernetes cluster (1.19+) +- Helm 3.x installed +- kubectl configured for your cluster +- Sufficient cluster resources (varies by component configuration) +- Optional: External storage services for production + +--- + +## Documentation References + +- Primary docs: `/docs/deployments-and-hosting/kubernetes` +- Helm chart overview: `/docs/deployments-and-hosting/kubernetes/helm-chart` +- Local development: `https://github.com/wundergraph/cosmo/blob/main/helm/README.md` +- Chart documentation: `https://github.com/wundergraph/cosmo/blob/main/helm/cosmo/README.md` +- Sub-charts: `https://github.com/wundergraph/cosmo/tree/main/helm/cosmo/charts` + +--- + +## Keywords & SEO + +### Primary Keywords +- Cosmo Helm chart +- GraphQL federation Kubernetes +- Kubernetes federation deployment + +### Secondary Keywords +- Helm chart federation +- EKS GraphQL deployment +- AKS federation platform + +### Related Search Terms +- Deploy GraphQL federation to Kubernetes +- Helm chart GraphQL gateway +- Production Kubernetes GraphQL diff --git a/capabilities/deployment/router-compatibility-versions.md b/capabilities/deployment/router-compatibility-versions.md new file mode 100644 index 00000000..a9db4fc3 --- /dev/null +++ b/capabilities/deployment/router-compatibility-versions.md @@ -0,0 +1,138 @@ +# Router 
Compatibility Versions + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-deploy-007` | +| **Category** | Deployment | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-deploy-001`, `cap-deploy-008` | + +--- + +## Quick Reference + +### Name +Router Compatibility Versions + +### Tagline +Safe upgrades with version handshake protection. + +### Elevator Pitch +Router Compatibility Versions ensure that your router and its execution configuration are always compatible. When breaking changes occur between composition and router execution, Cosmo manages versioned configurations so routers only receive compatible configurations. Upgrade routers confidently knowing incompatible configurations will be rejected. + +--- + +## Problem & Solution + +### The Problem +In a federated architecture, the router depends on execution configurations generated by the composition process. Breaking changes between how configurations are structured and how routers interpret them can cause production incidents. Teams need a way to upgrade routers and compositions safely without risking incompatibility. + +### The Solution +Cosmo implements a version handshake between router execution configurations and routers. Each configuration includes a `compatibilityVersion` that the router validates at startup. If the version exceeds the router's threshold, it produces an error and refuses to start. This prevents running routers with incompatible configurations. + +### Before & After + +| Before Compatibility Versions | With Compatibility Versions | +|------------------------------|----------------------------| +| Risk of incompatible configs | Version handshake validation | +| Silent configuration failures | Clear error messages on mismatch | +| Complex upgrade coordination | Safe incremental upgrades | +| Single configuration storage | Versioned configuration storage | + +--- + +## Key Benefits + +1. 
**Safe Upgrades**: Version handshake prevents routers from running incompatible configurations +2. **Clear Failure Mode**: Routers fail fast with explicit error messages when versions mismatch +3. **Independent Storage**: Different router compatibility versions stored at separate CDN addresses +4. **Gradual Migration**: Run routers on different versions simultaneously during migrations +5. **Backwards Compatibility**: Older configurations remain accessible for routers that haven't upgraded + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer, SRE +- **Pain Points**: Risky upgrades, fear of configuration incompatibility, coordination overhead during migrations +- **Goals**: Confident, safe router upgrades with rollback capability + +### Secondary Personas +- Release Engineers managing federation deployments +- DevOps Engineers automating upgrade pipelines +- Engineering Managers planning upgrade rollouts + +--- + +## Use Cases + +### Use Case 1: Safe Production Upgrade +**Scenario**: Upgrade production routers to support new federation features +**How it works**: Deploy new router version (with higher compatibility threshold) alongside existing routers, verify functionality, gradually shift traffic, retire old routers +**Outcome**: Zero-downtime upgrade with fallback option at each stage + +### Use Case 2: Canary Deployment +**Scenario**: Test new router version with production traffic before full rollout +**How it works**: Deploy single new-version router to canary pool, monitor behavior, new router automatically receives version-appropriate configuration +**Outcome**: Real-world validation of new router version without affecting all traffic + +### Use Case 3: Multi-Version Fleet Management +**Scenario**: Run different router versions across environments during extended migration +**How it works**: Staging runs newer router version, production runs previous version, each receives appropriate configuration version +**Outcome**: Flexible 
upgrade timeline with environment-appropriate configurations
+
+---
+
+## Technical Summary
+
+### How It Works
+When Cosmo composes a federated graph, it generates an execution configuration with a `compatibilityVersion` property containing two colon-separated version identifiers. Each router version defines an internal compatibility threshold. At startup, the router validates that the configuration's version doesn't exceed its threshold. Cosmo Cloud stores configurations discretely by version, with routers polling the address matching their threshold.
+
+### Key Technical Features
+- `compatibilityVersion` property formatted as two colon-separated version identifiers
+- Router startup validation with clear error messages
+- Discrete storage per compatibility version
+- Automatic routing of configuration requests by version
+- CLI commands for version management and listing
+
+### Integration Points
+- Cosmo CDN for versioned configuration storage
+- Router startup validation
+- CLI commands: `wgc federated-graph version set`, `compatibility-version list`
+
+### Requirements & Prerequisites
+- Understanding of current router version and its compatibility threshold
+- Awareness of when new compatibility versions are released
+- Coordination plan for upgrades (if required)
+
+---
+
+## Documentation References
+
+- Primary docs: `/docs/concepts/router-compatibility-versions`
+- Upgrading the router: `/docs/router/upgrading-the-router`
+- Router configuration: `/docs/router/configuration#execution-config`
+- CLI reference: `/docs/cli/federated-graph`
+
+---
+
+## Keywords & SEO
+
+### Primary Keywords
+- Router compatibility versions
+- Federation version management
+- Router upgrade safety
+
+### Secondary Keywords
+- Configuration versioning
+- Safe router upgrades
+- Version handshake
+
+### Related Search Terms
+- Upgrade GraphQL router safely
+- Federation configuration compatibility
+- Router version management
diff --git a/capabilities/deployment/self-hosted.md b/capabilities/deployment/self-hosted.md
new file mode 100644
index 00000000..182ab504
---
/dev/null +++ b/capabilities/deployment/self-hosted.md @@ -0,0 +1,142 @@ +# Self-Hosted Deployment + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-deploy-002` | +| **Category** | Deployment | +| **Status** | GA | +| **Availability** | Enterprise | +| **Related Capabilities** | `cap-deploy-001`, `cap-deploy-003`, `cap-deploy-004`, `cap-deploy-005` | + +--- + +## Quick Reference + +### Name +Self-Hosted Deployment + +### Tagline +Full data sovereignty with on-premises deployment. + +### Elevator Pitch +Unlike other GraphQL federation platforms, WunderGraph Cosmo can be deployed entirely self-hosted, giving you complete control over your infrastructure and full data sovereignty. Deploy to any Kubernetes service including EKS, AKS, GKE, or your own on-premises cluster using production-grade Helm charts. + +--- + +## Problem & Solution + +### The Problem +Many organizations have strict requirements around data sovereignty, compliance, and infrastructure control that prevent them from using cloud-hosted platforms. They need the benefits of a modern GraphQL federation platform while keeping all data and infrastructure within their own environment. This is especially critical for financial services, healthcare, government, and other regulated industries. + +### The Solution +Cosmo provides a fully self-hosted deployment option where you manage and deploy the entire platform in your own environment. This gives you complete data sovereignty while still accessing all the features of the Cosmo Platform. Production-grade Helm charts make deployment to any Kubernetes cluster manageable and reliable. 
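+
+For fully isolated environments, the same Helm chart can be pointed at an internal registry so that no image pull leaves your network. A hedged sketch (the `global.image.registry` key follows a common Bitnami-style convention and is an assumption here; verify it against the Cosmo chart's documented values):
+
+```yaml
+# values.airgapped.yaml -- illustrative only
+global:
+  image:
+    registry: registry.internal.example.com  # images mirrored into your own registry
+```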
+ +### Before & After + +| Before Cosmo Self-Hosted | With Cosmo Self-Hosted | +|-------------------------|------------------------| +| Limited to cloud-only federation platforms | Deploy federation in any environment | +| Data leaves your network | Full data sovereignty within your infrastructure | +| Dependent on vendor availability | Complete control over uptime and maintenance windows | +| Compliance blockers for regulated industries | Meet strict regulatory requirements | +| One-size-fits-all configuration | Customize infrastructure to your needs | + +--- + +## Key Benefits + +1. **Full Data Sovereignty**: All data remains within your infrastructure, meeting the strictest compliance requirements +2. **Infrastructure Control**: Deploy to your choice of Kubernetes service (EKS, AKS, GKE) or on-premises clusters +3. **Production-Grade Deployment**: Use battle-tested Helm charts for reliable, repeatable deployments +4. **Key Differentiator**: One of the few GraphQL federation platforms offering true self-hosted deployment +5. 
**Flexible Scaling**: Scale components independently based on your workload requirements + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer, Infrastructure Architect +- **Pain Points**: Strict compliance requirements, data sovereignty needs, vendor lock-in concerns, regulatory constraints +- **Goals**: Deploy production-grade federation while maintaining full infrastructure control + +### Secondary Personas +- Security Officers ensuring compliance with data residency requirements +- CTOs at regulated enterprises (finance, healthcare, government) +- Engineering Managers seeking predictable infrastructure costs + +--- + +## Use Cases + +### Use Case 1: Financial Services Deployment +**Scenario**: A bank needs GraphQL federation but cannot send data outside their private cloud +**How it works**: Deploy the complete Cosmo stack to their private Kubernetes cluster using Helm charts, configure integration with existing authentication and monitoring systems +**Outcome**: Production GraphQL federation with all data staying within the bank's secure infrastructure + +### Use Case 2: Air-Gapped Environment +**Scenario**: A government contractor needs federation in a network with no external connectivity +**How it works**: Pull container images and Helm charts, deploy to an isolated Kubernetes cluster, configure all components for internal-only operation +**Outcome**: Fully functional federation platform operating without any external dependencies + +### Use Case 3: Multi-Region Private Deployment +**Scenario**: A global enterprise needs federation deployed across multiple private data centers +**How it works**: Deploy Cosmo stacks to Kubernetes clusters in each region, configure for high availability and disaster recovery +**Outcome**: Globally distributed federation with data locality maintained in each region + +--- + +## Technical Summary + +### How It Works +Cosmo's self-hosted deployment uses an umbrella Helm chart that packages all 
platform components. The chart includes sub-charts for Controlplane, GraphQL Metrics, OTEL Collector, Studio, Router, and CDN. Supporting services (PostgreSQL, Keycloak, ClickHouse, Minio, Redis) are managed through Bitnami Helm charts. Deploy the entire stack with a single `helm install` command. + +### Key Technical Features +- Umbrella Helm chart with modular sub-charts +- Integration with external Helm charts for supporting services +- Support for any Kubernetes distribution +- Configurable for development (Minikube) through production environments +- Auto-generated configuration documentation + +### Integration Points +- Kubernetes (EKS, AKS, GKE, on-premises) +- External authentication providers via Keycloak +- Existing monitoring and logging infrastructure +- CI/CD pipelines for infrastructure deployment + +### Requirements & Prerequisites +- Kubernetes cluster (1.19+) +- Helm 3.x +- Sufficient cluster resources for all components +- Network connectivity between components +- Optional: External PostgreSQL, ClickHouse, Redis for production + +--- + +## Documentation References + +- Primary docs: `/docs/deployments-and-hosting/intro` +- Kubernetes deployment: `/docs/deployments-and-hosting/kubernetes` +- Helm chart guide: `/docs/deployments-and-hosting/kubernetes/helm-chart` +- GitHub repository: `https://github.com/wundergraph/cosmo/tree/main/helm/cosmo` + +--- + +## Keywords & SEO + +### Primary Keywords +- Self-hosted GraphQL federation +- On-premises federation platform +- Data sovereignty GraphQL + +### Secondary Keywords +- Private cloud GraphQL +- Enterprise federation deployment +- Air-gapped GraphQL platform + +### Related Search Terms +- Apollo GraphOS alternative self-hosted +- Deploy federation on-premises +- GraphQL federation Kubernetes diff --git a/capabilities/deployment/storage-providers.md b/capabilities/deployment/storage-providers.md new file mode 100644 index 00000000..a7704648 --- /dev/null +++ b/capabilities/deployment/storage-providers.md
@@ -0,0 +1,141 @@ +# Storage Providers + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-deploy-006` | +| **Category** | Deployment | +| **Status** | GA | +| **Availability** | Pro / Enterprise | +| **Related Capabilities** | `cap-deploy-001`, `cap-deploy-002` | + +--- + +## Quick Reference + +### Name +Storage Providers + +### Tagline +Control your data with custom artifact storage. + +### Elevator Pitch +Store router execution configurations and persisted operations on your own infrastructure using Amazon S3 or any S3-compatible storage (like Minio). Maintain full control over your data while still benefiting from Cosmo Cloud features. Remove dependencies on external services and meet strict data residency requirements. + +--- + +## Problem & Solution + +### The Problem +Organizations using Cosmo Cloud may need to keep certain artifacts within their own infrastructure for data sovereignty, compliance, or performance reasons. Relying on external CDNs for critical configuration files creates dependencies that may not be acceptable for regulated industries or air-gapped environments. + +### The Solution +Cosmo's storage providers feature allows you to configure Amazon S3 or any S3-compatible storage as the source for router execution configurations and persisted operations. The router fetches configurations from your storage instead of Cosmo's CDN, giving you complete control over artifact storage and distribution. + +### Before & After + +| Before Custom Storage | With Custom Storage | +|----------------------|---------------------| +| Artifacts stored on Cosmo CDN | Artifacts in your own S3 buckets | +| External dependency for router operations | Self-contained router infrastructure | +| Limited data locality control | Full data residency compliance | +| Single point of configuration retrieval | Configurable fallback storage options | + +--- + +## Key Benefits + +1. 
**Data Sovereignty**: Keep all router artifacts within your own infrastructure and jurisdiction +2. **CDN Independence**: Remove dependencies on Cosmo Cloud for router operations +3. **S3 Compatibility**: Works with AWS S3, Minio, and any S3-compatible storage service +4. **IAM Integration**: Use AWS IAM roles on EC2/EKS without managing access keys +5. **Fallback Configuration**: Configure backup storage providers for high availability + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer, Infrastructure Architect, Security Engineer +- **Pain Points**: Data residency requirements, external service dependencies, compliance constraints +- **Goals**: Full control over infrastructure while using managed federation features + +### Secondary Personas +- Security Officers ensuring data stays within approved boundaries +- DevOps Engineers optimizing artifact delivery performance +- Enterprise Architects designing hybrid cloud solutions + +--- + +## Use Cases + +### Use Case 1: Data Residency Compliance +**Scenario**: A European company must ensure router configurations stay within EU data centers +**How it works**: Configure S3 bucket in EU region, update CI/CD to push artifacts to S3, configure router to pull from S3 +**Outcome**: Full Cosmo Cloud functionality with artifacts stored exclusively in EU infrastructure + +### Use Case 2: Air-Gapped Environment Integration +**Scenario**: Deploy routers in a network segment without external internet access +**How it works**: Set up internal Minio instance, configure CI pipeline to push artifacts, point routers to internal storage +**Outcome**: Routers operate independently of external services using internally stored configurations + +### Use Case 3: High Availability with Fallback +**Scenario**: Ensure routers can always fetch configurations even if primary storage is unavailable +**How it works**: Configure primary S3 provider and fallback storage (CDN or secondary S3), router automatically fails 
over +**Outcome**: Improved reliability with automatic fallback to backup configuration source + +--- + +## Technical Summary + +### How It Works +Define storage providers in the router's `config.yaml` file with connection details for your S3 buckets. Configure execution config and persisted operations to reference these providers. During CI/CD, use `wgc router fetch` to download configurations and upload to your S3. The router polls your storage for updates and hot-reloads without impacting traffic. + +### Key Technical Features +- S3 and S3-compatible storage support (AWS S3, Minio) +- IAM role support for EC2/EKS deployments +- Configurable polling intervals (default 10 seconds) +- Hot-reload on configuration updates +- Fallback storage configuration +- Persisted operations storage with SHA256 naming + +### Integration Points +- Amazon S3 +- Minio (self-hosted S3-compatible storage) +- Any S3-compatible object storage +- AWS IAM roles for authentication +- CI/CD pipelines for artifact publishing + +### Requirements & Prerequisites +- S3-compatible storage service +- Access credentials or IAM role configuration +- CI/CD pipeline integration for artifact publishing +- Network connectivity from router to storage + +--- + +## Documentation References + +- Primary docs: `/docs/router/storage-providers` +- Router configuration: `/docs/router/configuration` +- Execution config: `/docs/router/configuration#execution-config` + +--- + +## Keywords & SEO + +### Primary Keywords +- Cosmo storage providers +- Router artifact storage +- S3 federation configuration + +### Secondary Keywords +- GraphQL router S3 +- Custom CDN federation +- Minio GraphQL storage + +### Related Search Terms +- Store router config in S3 +- Federation data residency +- Self-hosted router artifacts diff --git a/capabilities/deployment/terraform.md b/capabilities/deployment/terraform.md new file mode 100644 index 00000000..ffab828a --- /dev/null +++ b/capabilities/deployment/terraform.md @@ -0,0 
+1,142 @@ +# Terraform + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-deploy-004` | +| **Category** | Deployment | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-deploy-001`, `cap-deploy-003`, `cap-deploy-005` | + +--- + +## Quick Reference + +### Name +Terraform Provider for Cosmo + +### Tagline +Infrastructure as Code for GraphQL federation. + +### Elevator Pitch +Manage your Cosmo Cloud resources programmatically using Terraform. Define your GraphQL infrastructure as code, track changes in version control, automate provisioning, and ensure consistent deployments across environments. The official Cosmo Terraform provider supports namespaces, federated graphs, subgraphs, feature flags, contracts, and more. + +--- + +## Problem & Solution + +### The Problem +Managing GraphQL federation resources manually through UIs or CLI commands leads to configuration drift, inconsistent environments, and difficulty tracking changes. Teams need to reproduce environments reliably, preview changes before applying them, and collaborate on infrastructure updates. Manual processes don't scale and create operational risk. + +### The Solution +The Cosmo Terraform provider brings Infrastructure as Code practices to GraphQL federation. Define your federated graphs, subgraphs, feature flags, and contracts in declarative configuration files. Version control your infrastructure, preview changes with `terraform plan`, and apply them consistently across environments with `terraform apply`. 
+ +### Before & After + +| Before Terraform | With Terraform | +|-----------------|----------------| +| Manual CLI or UI configuration | Declarative configuration files | +| Configuration drift between environments | Identical infrastructure everywhere | +| No change tracking | Full version control history | +| Risky blind deployments | Preview changes before applying | +| Solo infrastructure management | Team collaboration on infra code | + +--- + +## Key Benefits + +1. **Infrastructure as Code**: Define your entire Cosmo infrastructure in declarative, version-controlled configuration files +2. **State Management**: Terraform tracks the state of your resources, detecting drift and ensuring consistency +3. **Change Preview**: Use `terraform plan` to see exactly what changes will be applied before execution +4. **Multi-Environment Support**: Easily manage dev, staging, and production with environment-specific configurations +5. **Automation Ready**: Integrate with CI/CD pipelines for fully automated infrastructure management + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer, DevOps Engineer, Infrastructure Engineer +- **Pain Points**: Manual resource management, environment inconsistency, lack of change tracking, deployment risk +- **Goals**: Automated, reproducible infrastructure with full visibility into changes + +### Secondary Personas +- Engineering Managers wanting governance over infrastructure changes +- SREs implementing GitOps workflows +- Developers needing consistent local and cloud environments + +--- + +## Use Cases + +### Use Case 1: GitOps Federation Management +**Scenario**: Implement GitOps workflow for federation infrastructure +**How it works**: Store Terraform configurations in Git, use pull requests for changes, run `terraform plan` in CI for review, apply on merge +**Outcome**: All infrastructure changes reviewed, approved, and tracked with full audit history + +### Use Case 2: Multi-Environment Provisioning 
+**Scenario**: Maintain identical federated graphs across dev, staging, and production +**How it works**: Define base configuration with environment-specific variable files, run Terraform for each environment +**Outcome**: Consistent federation setup across all environments with environment-specific overrides + +### Use Case 3: AWS Fargate Router Deployment +**Scenario**: Deploy highly available routers to AWS Fargate +**How it works**: Use the AWS Fargate Terraform module, configure TLS with Route53, run `terraform apply` +**Outcome**: Production-ready router deployment across multiple availability zones with automatic TLS + +--- + +## Technical Summary + +### How It Works +Configure the Cosmo Terraform provider with your API key, then define resources using HCL (HashiCorp Configuration Language). Terraform communicates with Cosmo Cloud APIs to create, update, and delete resources based on your configuration. State is tracked locally or in remote backends, enabling drift detection and change management. 
+ +### Key Technical Features +- Official WunderGraph Terraform provider +- Support for namespaces, federated graphs, subgraphs, monographs +- Feature flags and feature subgraph management +- Contract (schema contracts) configuration +- Router token management +- AWS Fargate deployment module + +### Integration Points +- Terraform Cloud and Enterprise +- CI/CD systems (GitHub Actions, GitLab CI, Jenkins) +- Remote state backends (S3, GCS, Azure Blob) +- AWS services via Fargate module + +### Requirements & Prerequisites +- Terraform 1.0.0 or later +- Cosmo Cloud account +- API key for authentication +- Optional: AWS account for Fargate module + +--- + +## Documentation References + +- Primary docs: `/docs/deployments-and-hosting/terraform` +- AWS Fargate module: `/docs/deployments-and-hosting/terraform/aws-fargate` +- Provider documentation: `https://registry.terraform.io/providers/wundergraph/cosmo/latest/docs` +- Examples: `https://github.com/wundergraph/terraform-provider-cosmo/tree/main/examples` + +--- + +## Keywords & SEO + +### Primary Keywords +- Cosmo Terraform provider +- GraphQL federation IaC +- Infrastructure as Code GraphQL + +### Secondary Keywords +- Terraform GraphQL +- Federation automation +- GitOps GraphQL federation + +### Related Search Terms +- Automate GraphQL federation +- Terraform federated graph +- AWS Fargate GraphQL router diff --git a/capabilities/developer-experience/breaking-change-overrides.md b/capabilities/developer-experience/breaking-change-overrides.md new file mode 100644 index 00000000..f4a29f03 --- /dev/null +++ b/capabilities/developer-experience/breaking-change-overrides.md @@ -0,0 +1,155 @@ +# Breaking Change Overrides + +Manual override for approved breaking changes in schema checks. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-dx-009` | +| **Category** | Developer Experience | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-dx-005`, `cap-dx-007` | + +--- + +## Quick Reference + +### Name +Breaking Change Overrides + +### Tagline +Approve intentional breaking changes safely. + +### Elevator Pitch +Breaking Change Overrides let teams approve specific breaking changes that have been intentionally reviewed and deemed safe. When a schema check fails due to breaking changes affecting known operations, teams can override the check for those specific operations. Future checks automatically pass for approved changes, enabling controlled schema evolution without blocking CI/CD pipelines. + +--- + +## Problem & Solution + +### The Problem +Schema checks are essential for catching breaking changes, but not all breaking changes are bad. Sometimes a type change is intentional and approved after consumer coordination. Without overrides, teams must either disable checks entirely (risky) or maintain workarounds that circumvent the safety system. + +### The Solution +Breaking Change Overrides provide granular control over schema check outcomes. When a check fails due to breaking changes affecting specific operations, teams can mark those changes as safe for those operations. Future checks respect these overrides while continuing to catch new, unreviewed breaking changes. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Approved breaking changes block CI/CD | Override approved changes per operation | +| All-or-nothing check enforcement | Granular operation-level control | +| Workarounds to bypass checks | Proper approval workflow | +| No visibility into approved exceptions | All overrides visible in one place | + +--- + +## Key Benefits + +1. **Granular Control**: Override specific operations, not entire checks +2. 
**Future-Proof Approvals**: Overrides apply to future checks automatically +3. **Ignore All Option**: One-click to ignore all current and future changes for an operation +4. **Central Visibility**: View all overrides across the namespace in one place +5. **Traceability**: Link from overrides to metrics and traces for usage verification + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Tech Lead +- **Pain Points**: Legitimate breaking changes blocking deployment; all-or-nothing check policies +- **Goals**: Enable safe schema evolution; maintain check integrity; approve changes properly + +### Secondary Personas +- API architects managing schema governance +- DevOps engineers maintaining CI/CD pipelines +- Developers working on intentional type changes + +--- + +## Use Cases + +### Use Case 1: Coordinated Type Migration +**Scenario**: A field type is being changed from String to Int after coordinating with all consumers, but the schema check fails +**How it works**: The engineer views the failed check, sees the affected operations, and marks the changes as safe for those specific operations. The check can be re-run (or future checks pass automatically). 
+**Outcome**: Intentional migration proceeds; safety checks remain active for uncoordinated changes + +### Use Case 2: Deprecation and Removal +**Scenario**: A deprecated field is being removed after the deprecation period, but one internal operation still uses it and will be updated separately +**How it works**: The team marks the breaking change as safe for the specific internal operation, allowing the removal to proceed while the operation is updated in a separate timeline +**Outcome**: Schema cleanup proceeds without blocking on internal tooling updates + +### Use Case 3: Bulk Override for Known Operations +**Scenario**: A major refactoring affects multiple operations that have all been reviewed and approved +**How it works**: The team uses "Ignore All" to override all breaking changes for each affected operation, approving both current and future changes +**Outcome**: Large migration proceeds smoothly with documented approvals + +--- + +## Technical Summary + +### How It Works +When a schema check detects breaking changes, it associates each change with the operations it affects. The override UI allows marking specific operation/change combinations as safe. These overrides are stored per namespace and evaluated during future checks. The overrides dashboard provides a central view of all active overrides. 
+ +### Key Technical Features +- Per-operation override configuration +- "Ignore All" option for operation-level blanket override +- Overrides active across all graphs in a namespace +- Central override management dashboard +- Links to metrics and traces from override view +- Override timestamps and configuration details + +### Important Notes +- Applying overrides does not change the outcome of the current check run +- Only future checks respect newly configured overrides +- Overrides should be used judiciously and with proper review + +### Integration Points +- Cosmo Studio (check results page) +- Schema check pipeline +- Metrics and tracing system (for usage verification) + +### Requirements & Prerequisites +- Cosmo account +- Schema checks enabled +- Namespace configured + +--- + +## Documentation References + +- Primary docs: `/docs/studio/overrides` +- Schema checks: `/docs/cli/subgraph/check` +- Changelog: `/docs/studio/changelog` + +--- + +## Keywords & SEO + +### Primary Keywords +- Breaking change override +- Schema check exceptions +- Approved breaking changes + +### Secondary Keywords +- Schema governance +- Check bypass +- Change approval workflow + +### Related Search Terms +- Override GraphQL breaking change check +- Approve breaking changes GraphQL +- Schema check exception handling + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/developer-experience/changelog.md b/capabilities/developer-experience/changelog.md new file mode 100644 index 00000000..064e1d66 --- /dev/null +++ b/capabilities/developer-experience/changelog.md @@ -0,0 +1,151 @@ +# Changelog + +Track all graph modifications with detailed attribution and chronological history. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-dx-005` | +| **Category** | Developer Experience | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-dx-004`, `cap-dx-009` | + +--- + +## Quick Reference + +### Name +Changelog + +### Tagline +Complete history of every schema change. + +### Elevator Pitch +Cosmo's Changelog provides a detailed, chronological history of all schema changes to your federated graph. See exactly what types, fields, and directives were added or removed, when changes occurred, and track the evolution of your API over time. Color-coded additions and deletions make it easy to understand the impact of each change at a glance. + +--- + +## Problem & Solution + +### The Problem +Tracking schema evolution across a federated graph is difficult. Teams lack visibility into what changed, when it changed, and who made the change. Debugging issues requires piecing together information from multiple sources, and understanding the impact of historical changes means digging through git history across multiple repositories. + +### The Solution +The Changelog automatically captures and displays every schema modification in chronological order. Additions appear in green, deletions in red, providing instant visual understanding. Each entry shows the specific elements affected - types, fields, directives - and when the change was made. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Git archaeology across multiple repos | Single chronological view of all changes | +| No federated-level change visibility | See changes to the composed schema | +| Manual change tracking | Automatic capture of all modifications | +| Text-based diff hunting | Color-coded additions and deletions | + +--- + +## Key Benefits + +1. **Chronological History**: All changes ordered by time, most recent first +2. 
**Visual Clarity**: Green for additions, red for deletions - instant understanding +3. **Complete Coverage**: Types, fields, directives, and all schema elements tracked +4. **Impact Assessment**: Understand exactly what each change affected +5. **Automatic Capture**: No manual tracking required; changes recorded automatically + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Tech Lead +- **Pain Points**: Understanding schema evolution; debugging issues caused by schema changes; lack of change visibility +- **Goals**: Track schema changes; understand when breaking changes occurred; maintain schema governance + +### Secondary Personas +- API developers debugging unexpected behavior +- Engineering managers reviewing API evolution +- Compliance teams auditing API changes + +--- + +## Use Cases + +### Use Case 1: Debugging Production Issues +**Scenario**: A production issue started occurring yesterday and the team suspects a recent schema change is the cause +**How it works**: The team opens the Changelog, filters to changes from the past two days, and identifies that a field type was changed from non-nullable to nullable +**Outcome**: Root cause identified quickly; team can revert or fix the change + +### Use Case 2: API Evolution Review +**Scenario**: A tech lead needs to understand how the API has evolved over the past quarter for a planning meeting +**How it works**: The tech lead browses the Changelog, seeing all additions (new capabilities) and deletions (deprecated features removed) over the period +**Outcome**: Clear picture of API evolution supports informed planning decisions + +### Use Case 3: Change Impact Analysis +**Scenario**: Before removing a deprecated field, the team wants to review all related changes that have been made +**How it works**: Using the Changelog, the team finds when the field was first deprecated, what related changes were made, and confirms the deprecation period has been sufficient +**Outcome**: 
Informed decision to proceed with removal, knowing the full history + +--- + +## Technical Summary + +### How It Works +The Changelog automatically captures schema changes when compositions occur. Each composition is compared against the previous version to identify additions and deletions. Changes are stored and presented in a chronological list, with detailed information about what elements were affected. + +### Key Technical Features +- Automatic change detection on composition +- Color-coded diff visualization (green/red) +- Chronological ordering (newest first) +- Type, field, and directive level tracking +- Composition-level change grouping + +### Change Categories Tracked +- Types (added/removed) +- Fields within types (added/removed) +- Directives (added/removed) +- Arguments (added/removed) + +### Integration Points +- Cosmo Studio +- Composition pipeline + +### Requirements & Prerequisites +- Federated graph deployed to Cosmo +- Compositions running through Cosmo + +--- + +## Documentation References + +- Primary docs: `/docs/studio/changelog` + +--- + +## Keywords & SEO + +### Primary Keywords +- Schema changelog +- API change history +- GraphQL version history + +### Secondary Keywords +- Schema evolution tracking +- Change management +- API audit trail + +### Related Search Terms +- Track GraphQL schema changes +- GraphQL change history +- Schema modification log + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/developer-experience/custom-playground-scripts.md b/capabilities/developer-experience/custom-playground-scripts.md new file mode 100644 index 00000000..8ad48922 --- /dev/null +++ b/capabilities/developer-experience/custom-playground-scripts.md @@ -0,0 +1,155 @@ +# Custom Playground Scripts + +Pre-flight and operation scripts with dynamic variables for authentication, validation, and advanced workflows. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-dx-002` | +| **Category** | Developer Experience | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-dx-001`, `cap-dx-003` | + +--- + +## Quick Reference + +### Name +Custom Playground Scripts + +### Tagline +Automate authentication and workflows in your playground. + +### Elevator Pitch +Custom Scripts enable developers to run JavaScript code before and after GraphQL operations in the Cosmo Playground. Handle OAuth token refresh, inject dynamic headers, validate responses, and transform data - all without leaving the playground. Scripts support environment variables, external API calls, and cryptographic operations. + +--- + +## Problem & Solution + +### The Problem +Testing authenticated GraphQL APIs is cumbersome. Developers must manually obtain tokens, copy them into headers, and repeat this process whenever tokens expire. Response validation requires external tools, and there's no way to chain operations or transform data within the playground environment. + +### The Solution +Cosmo's Custom Scripts provide a programmable layer around playground operations. Pre-flight scripts run globally across all tabs (perfect for authentication), while pre-operation and post-operation scripts run per-tab for specific workflows. Environment variables keep secrets out of scripts, and the built-in CryptoJS library enables secure token handling. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manually copy/paste tokens for each request | Pre-flight scripts auto-refresh tokens | +| No response validation in playground | Post-operation scripts validate responses | +| Secrets hardcoded or exposed | Environment variables stored securely in browser | +| Switch tools for complex workflows | Complete workflows within the playground | + +--- + +## Key Benefits + +1. 
**Automated Authentication**: Pre-flight scripts handle OAuth flows, token refresh, and header injection automatically +2. **Response Validation**: Post-operation scripts verify responses meet expectations before moving forward +3. **Secure Secret Management**: Environment variables keep sensitive data out of scripts and stored locally in your browser +4. **External API Integration**: Fetch data from external services and incorporate it into your GraphQL workflows +5. **Cryptographic Operations**: Built-in CryptoJS support for encryption, decryption, and token handling + +--- + +## Target Audience + +### Primary Persona +- **Role**: API Developer / Frontend Developer +- **Pain Points**: Repetitive authentication setup; manual token management; inability to validate responses inline +- **Goals**: Streamline API testing; automate repetitive tasks; validate API behavior efficiently + +### Secondary Personas +- QA engineers testing authenticated endpoints +- DevOps teams creating reproducible test scenarios +- Security engineers testing auth flows + +--- + +## Use Cases + +### Use Case 1: OAuth Token Automation +**Scenario**: An API requires Bearer tokens that expire every 15 minutes, and developers waste time manually refreshing them +**How it works**: A pre-flight script calls the OAuth endpoint with client credentials from environment variables, receives a new token, and stores it in an environment variable. The header `{{token}}` syntax automatically injects the fresh token into every request. +**Outcome**: Zero manual token management; developers can focus on actual API testing + +### Use Case 2: Response Validation +**Scenario**: A team needs to ensure that user queries always return the expected fields before proceeding with dependent operations +**How it works**: A post-operation script checks that `playground.response.body.data.user` exists and contains required fields. If validation fails, it logs a warning to the console. 
+**Outcome**: Immediate feedback on API response structure without switching to external tools + +### Use Case 3: Data Transformation and Logging +**Scenario**: A developer needs to anonymize PII in responses before sharing screenshots or recordings +**How it works**: A post-operation script accesses the response, replaces sensitive fields like email with masked values, and logs the transformed response to the console +**Outcome**: Safe sharing of API responses without exposing user data + +--- + +## Technical Summary + +### How It Works +Scripts are JavaScript code blocks executed at specific points in the request lifecycle: +1. **Pre-Flight Scripts**: Run first, across all playground tabs. Ideal for authentication. +2. **Pre-Operation Scripts**: Run per-tab, after pre-flight but before the request. Tab-specific header injection or variable setup. +3. **Post-Operation Scripts**: Run per-tab, after the response. Validation, transformation, and logging. + +Scripts access the `playground` API object for environment variables, request/response bodies, and CryptoJS. 
+ +### Key Technical Features +- Three script types: pre-flight, pre-operation, post-operation +- `playground.env.get/set` for environment variable management +- `playground.request.body` for request inspection +- `playground.response.body` for response inspection +- `playground.CryptoJS` for cryptographic operations +- `{{variable}}` syntax for header injection +- External fetch API support + +### Integration Points +- Cosmo Studio Playground +- External OAuth/auth providers +- Any REST API accessible from the browser + +### Requirements & Prerequisites +- Cosmo Studio account +- Scripts are stored at the organization level +- Environment variables are browser-local + +--- + +## Documentation References + +- Primary docs: `/docs/studio/playground/custom-scripts` +- Playground overview: `/docs/studio/playground` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL playground scripts +- API authentication automation +- Pre-request scripts + +### Secondary Keywords +- OAuth token refresh +- Response validation +- Playground environment variables + +### Related Search Terms +- Automate GraphQL authentication +- GraphQL playground pre-flight scripts +- Postman-like scripts for GraphQL + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/developer-experience/graph-pruning.md b/capabilities/developer-experience/graph-pruning.md new file mode 100644 index 00000000..09757571 --- /dev/null +++ b/capabilities/developer-experience/graph-pruning.md @@ -0,0 +1,158 @@ +# Graph Pruning + +Detect unused fields and enforce deprecation policies to maintain a clean schema. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-dx-008` | +| **Category** | Developer Experience | +| **Status** | GA | +| **Availability** | Pro / Enterprise | +| **Related Capabilities** | `cap-dx-004`, `cap-dx-007` | + +--- + +## Quick Reference + +### Name +Graph Pruning + +### Tagline +Keep your schema clean with usage-based analysis. + +### Elevator Pitch +Graph Pruning analyzes real traffic to identify unused fields, track deprecated fields still in use, and enforce deprecation-before-deletion policies. Stop accumulating dead code in your schema and make informed decisions about field removal based on actual usage data. Maintain a lean, efficient API that's easier to understand and maintain. + +--- + +## Problem & Solution + +### The Problem +GraphQL schemas accumulate unused fields over time. Teams add fields speculatively, features get removed but fields remain, and deprecated fields linger indefinitely. Without usage data, teams can't safely remove fields, and schemas become bloated and confusing. + +### The Solution +Graph Pruning combines schema analysis with real traffic data to identify unused and deprecated fields. Configurable rules flag issues during schema checks, with grace periods to avoid false positives. Teams can enforce policies requiring deprecation before deletion, ensuring consumers have time to migrate. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Unknown which fields are actually used | Usage-based identification of unused fields | +| Deprecated fields never removed | Tracking of deprecated field usage | +| Fields deleted without warning | Required deprecation before deletion | +| Schema bloat over time | Continuous pruning enforcement | + +--- + +## Key Benefits + +1. **Usage-Based Analysis**: Identify truly unused fields based on real traffic, not guesswork +2. 
**Deprecated Field Tracking**: See which deprecated fields are still in use and by how much +3. **Deletion Safeguards**: Require deprecation before deletion to protect consumers +4. **Configurable Grace Periods**: Avoid false positives for new fields with time-based thresholds +5. **Actionable Enforcement**: Integrate with schema checks to block or warn on violations + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Architect +- **Pain Points**: Schema bloat; fear of breaking changes; inability to remove unused fields safely +- **Goals**: Maintain a clean schema; remove dead code safely; enforce deprecation policies + +### Secondary Personas +- Engineering managers concerned about technical debt +- Developer experience teams improving API usability +- Developers wanting clear, focused schemas + +--- + +## Use Cases + +### Use Case 1: Identifying Dead Code +**Scenario**: A schema has grown to hundreds of fields over several years, and the team suspects many are unused +**How it works**: Graph Pruning is enabled with the `UNUSED_FIELDS` rule. During schema checks, any fields with zero usage in the configured period are flagged. +**Outcome**: Team identifies 47 unused fields, prioritizes removal, and reduces schema size by 15% + +### Use Case 2: Safe Deprecation Workflow +**Scenario**: A team wants to enforce that fields must be deprecated for at least 30 days before deletion +**How it works**: The `REQUIRE_DEPRECATION_BEFORE_DELETION` rule is enabled. If a schema check attempts to remove a field that wasn't previously marked as deprecated, the check fails. +**Outcome**: Consumers always have advance warning of field removals; no surprise breaking changes + +### Use Case 3: Deprecated Field Cleanup +**Scenario**: Fields have been deprecated for months, but the team doesn't know if they're safe to remove +**How it works**: The `DEPRECATED_FIELDS` rule flags deprecated fields along with their current usage. 
Fields with zero usage are identified as safe to remove. +**Outcome**: Team removes 12 deprecated fields with zero usage, cleaning up the schema + +--- + +## Technical Summary + +### How It Works +Graph Pruning rules run during schema check operations. The linter analyzes the schema and queries the analytics pipeline for field usage data. Rules compare the schema state (new, deprecated, deleted) against usage patterns within configured time windows. + +### Available Rules + +1. **UNUSED_FIELDS**: Identifies fields with no usage within the check period +2. **DEPRECATED_FIELDS**: Flags deprecated fields that still appear in the schema +3. **REQUIRE_DEPRECATION_BEFORE_DELETION**: Fails checks when fields are deleted without prior deprecation + +### Configuration Options + +**Severity Level:** +- Error: Violations fail the check operation +- Warning: Violations are flagged but don't fail + +**Grace Period:** Time after schema publication before rules are enforced (prevents false positives for new fields) + +**Schema Usage Check Period:** Time window for usage analysis (Enterprise: configurable; other plans: based on billing plan limits) + +### Integration Points +- Cosmo Studio (configuration UI) +- CLI `wgc subgraph check` command +- Analytics pipeline (usage data) + +### Requirements & Prerequisites +- Cosmo Pro or Enterprise plan +- Analytics data collection enabled +- Namespace configured with Graph Pruning + +--- + +## Documentation References + +- Primary docs: `/docs/studio/graph-pruning` +- Schema checks: `/docs/cli/subgraph/check` +- Schema explorer (usage view): `/docs/studio/schema-explorer` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL unused fields +- Schema cleanup +- Deprecation enforcement + +### Secondary Keywords +- Field usage analysis +- Schema pruning +- Dead code removal + +### Related Search Terms +- Find unused GraphQL fields +- Safe field deprecation GraphQL +- GraphQL schema cleanup + +--- + +## Version History + +| Date | 
Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/developer-experience/graphiql-playground.md b/capabilities/developer-experience/graphiql-playground.md new file mode 100644 index 00000000..41ced857 --- /dev/null +++ b/capabilities/developer-experience/graphiql-playground.md @@ -0,0 +1,146 @@ +# GraphiQL Playground++ + +Enhanced GraphQL IDE with Advanced Request Tracing (ART) visualization for testing and optimizing federated queries. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-dx-001` | +| **Category** | Developer Experience | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-dx-002`, `cap-dx-003`, `cap-dx-006` | + +--- + +## Quick Reference + +### Name +GraphiQL Playground++ + +### Tagline +Debug federation with visual query execution tracing. + +### Elevator Pitch +Cosmo's enhanced GraphiQL Playground provides developers with visual representations of query execution plans, detailed timing information, and subgraph-level inputs and outputs. Understand exactly how your federated queries execute across services with tree view and waterfall visualizations. + +--- + +## Problem & Solution + +### The Problem +Developers working with federated GraphQL architectures struggle to understand how their queries are executed across multiple subgraphs. Traditional GraphQL IDEs only show the final result, leaving developers blind to performance bottlenecks, parallel execution opportunities, and the actual data flow between services. + +### The Solution +Cosmo's Playground++ extends the standard GraphiQL experience with Advanced Request Tracing (ART) visualization. By including the `X-WG-TRACE` header, developers get detailed visual breakdowns of query execution including timing per subgraph, parallel vs sequential execution paths, and the data transformations at each step. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Only see final query response | Visual tree and waterfall views of execution | +| No visibility into subgraph performance | Precise timing for each subgraph call | +| Guessing at parallel execution | Clear visualization of parallel vs sequential paths | +| Manual debugging of slow queries | Instant identification of bottlenecks | + +--- + +## Key Benefits + +1. **Visual Execution Insight**: Tree view and waterfall visualizations show exactly how queries flow through your federated graph +2. **Performance Debugging**: Identify slow subgraphs and understand where time is spent in query execution +3. **Parallel Execution Visibility**: See which subgraph calls execute in parallel vs sequentially +4. **Subgraph-Level Details**: View inputs and outputs for each subgraph call in the execution chain +5. **Zero Configuration**: Works out of the box with the Cosmo Router using a simple header + +--- + +## Target Audience + +### Primary Persona +- **Role**: GraphQL Developer / Backend Engineer +- **Pain Points**: Debugging slow federated queries; understanding how queries are routed across subgraphs +- **Goals**: Optimize query performance; understand federation behavior; debug production issues quickly + +### Secondary Personas +- Platform engineers optimizing federated graph performance +- DevOps teams investigating latency issues +- New team members learning the federated architecture + +--- + +## Use Cases + +### Use Case 1: Performance Optimization +**Scenario**: A product catalog query is taking 2 seconds to respond and the team needs to identify the bottleneck +**How it works**: Developer runs the query in Playground++ with the `X-WG-TRACE` header, views the waterfall visualization, and immediately sees that the inventory subgraph is taking 1.5 seconds +**Outcome**: Targeted optimization of the inventory subgraph reduces overall query time by 75% + +### Use Case 2: Understanding 
Query Execution +**Scenario**: A new developer joins the team and needs to understand how a complex query executes across 5 subgraphs +**How it works**: The developer runs the query with tracing enabled and uses the tree view to see the complete execution plan, including which fields come from which subgraphs +**Outcome**: Developer quickly understands the federation architecture and can make informed decisions about query design + +### Use Case 3: Debugging Parallel Execution +**Scenario**: A query that should be fast is unexpectedly slow; the team suspects parallel execution isn't working as expected +**How it works**: Using the waterfall view, the team sees that calls they expected to be parallel are actually sequential due to field dependencies +**Outcome**: Query restructured to enable proper parallelization, cutting response time in half + +--- + +## Technical Summary + +### How It Works +The Playground++ integrates with Cosmo Router's Advanced Request Tracing (ART) feature. When the `X-WG-TRACE` header is included in a request, the router captures detailed execution metadata and returns it alongside the response. The Playground parses this trace data and renders it in two visualization modes: tree view (hierarchical execution) and waterfall view (timeline-based). 
+ +### Key Technical Features +- Tree view showing hierarchical query execution +- Waterfall view showing parallel execution timing +- Subgraph-level timing metrics +- Input/output data inspection for each subgraph call +- Integrated with Advanced Request Tracing (ART) + +### Integration Points +- Cosmo Router (requires ART support) +- Cosmo Studio web interface + +### Requirements & Prerequisites +- Cosmo Router deployed with ART enabled +- `X-WG-TRACE` header included in playground requests + +--- + +## Documentation References + +- Primary docs: `/docs/studio/playground` +- ART documentation: `/docs/router/advanced-request-tracing-art` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL playground +- Query execution visualization +- Federation debugging + +### Secondary Keywords +- GraphiQL enhanced +- Request tracing +- Subgraph performance + +### Related Search Terms +- How to debug federated GraphQL queries +- GraphQL query performance visualization +- Federation query execution plan + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/developer-experience/lint-policies.md b/capabilities/developer-experience/lint-policies.md new file mode 100644 index 00000000..5cc1dfdc --- /dev/null +++ b/capabilities/developer-experience/lint-policies.md @@ -0,0 +1,169 @@ +# Lint Policies + +Customizable schema linting rules to enforce conventions and best practices. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-dx-007` | +| **Category** | Developer Experience | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-dx-008`, `cap-dx-009` | + +--- + +## Quick Reference + +### Name +Lint Policies + +### Tagline +Enforce GraphQL schema conventions automatically. 
+ +### Elevator Pitch +Cosmo's Lint Policies enable teams to enforce GraphQL schema conventions and best practices automatically. Configure rules for naming conventions, field ordering, documentation requirements, and deprecation handling. Lint checks run on every schema check operation, catching issues before they reach production and ensuring consistency across your entire federated graph. + +--- + +## Problem & Solution + +### The Problem +GraphQL schemas across large teams become inconsistent over time. Naming conventions vary, types lack documentation, fields are deprecated without reasons, and the codebase becomes harder to maintain. Manual code review catches some issues but is inconsistent and time-consuming. + +### The Solution +Lint Policies provide configurable rules that run automatically on every schema check. Teams define their conventions once, set severity levels (error or warning), and the linter enforces them consistently. Violations are caught early in the development process, before schemas are published. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Inconsistent naming across subgraphs | Enforced naming conventions | +| Manual code review for style issues | Automated linting on every check | +| Missing documentation discovered late | Required descriptions caught early | +| Deprecated fields without context | Required deprecation reasons and dates | + +--- + +## Key Benefits + +1. **Consistent Naming**: Enforce camelCase fields, PascalCase types, UPPER_CASE enums automatically +2. **Documentation Requirements**: Ensure all types have descriptions before publishing +3. **Deprecation Standards**: Require reasons and dates for deprecated fields +4. **Configurable Severity**: Set rules as errors (block publish) or warnings (inform only) +5. 
**Namespace-Level Control**: Different policies for different environments + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Architect +- **Pain Points**: Inconsistent schema styles; undocumented types; deprecated fields without context +- **Goals**: Enforce team conventions; maintain schema quality; reduce review burden + +### Secondary Personas +- Tech leads ensuring code quality +- Developer experience teams establishing standards +- New team members learning conventions + +--- + +## Use Cases + +### Use Case 1: Establishing Team Standards +**Scenario**: A platform team wants to ensure all new schema types follow the team's naming conventions +**How it works**: The team enables lint policies with rules like `TYPE_NAMES_SHOULD_BE_PASCAL_CASE` and `FIELD_NAMES_SHOULD_BE_CAMEL_CASE` set to error severity. Any schema check with violations fails with clear error messages. +**Outcome**: All new schema additions follow conventions; no manual review needed for style issues + +### Use Case 2: Requiring Documentation +**Scenario**: An organization's API governance requires all types to have descriptions for consumer clarity +**How it works**: The team enables `ALL_TYPES_REQUIRE_DESCRIPTION` as an error. Schema checks fail if any type (object, interface, enum, scalar, input, union) lacks a description comment. +**Outcome**: All published types have documentation; API consumers can understand the schema + +### Use Case 3: Controlled Deprecation +**Scenario**: The team wants to ensure deprecated fields include context for consumers and a planned removal date +**How it works**: Rules `REQUIRE_DEPRECATION_REASON` and `REQUIRE_DEPRECATION_DATE` are enabled. Any `@deprecated` directive must include both a reason and date argument. +**Outcome**: Consumers know why fields are deprecated and when they'll be removed + +--- + +## Technical Summary + +### How It Works +Lint policies are configured per namespace in Cosmo Studio. 
When the linter is enabled, rules are evaluated during every `wgc subgraph check` operation. The linter parses the schema and evaluates it against the configured rules, reporting violations with their configured severity. + +### Available Rule Categories + +**Naming Convention Rules:** +- `FIELD_NAMES_SHOULD_BE_CAMEL_CASE` +- `TYPE_NAMES_SHOULD_BE_PASCAL_CASE` +- `SHOULD_NOT_HAVE_TYPE_PREFIX/SUFFIX` +- `SHOULD_NOT_HAVE_INPUT_PREFIX` +- `SHOULD_HAVE_INPUT_SUFFIX` +- `SHOULD_NOT_HAVE_ENUM_PREFIX/SUFFIX` +- `SHOULD_NOT_HAVE_INTERFACE_PREFIX/SUFFIX` +- `ENUM_VALUES_SHOULD_BE_UPPER_CASE` + +**Alphabetical Sort Rules:** +- `ORDER_FIELDS` +- `ORDER_ENUM_VALUES` +- `ORDER_DEFINITIONS` + +**Other Rules:** +- `ALL_TYPES_REQUIRE_DESCRIPTION` +- `DISALLOW_CASE_INSENSITIVE_ENUM_VALUES` +- `NO_TYPENAME_PREFIX_IN_TYPE_FIELDS` +- `REQUIRE_DEPRECATION_REASON` +- `REQUIRE_DEPRECATION_DATE` + +### Severity Levels +- **Error**: Violations cause the check operation to fail +- **Warning**: Violations are flagged but don't fail the check + +### Integration Points +- Cosmo Studio (configuration UI) +- CLI `wgc subgraph check` command +- CI/CD pipelines + +### Requirements & Prerequisites +- Cosmo account +- Namespace configured in Cosmo + +--- + +## Documentation References + +- Primary docs: `/docs/studio/policies` +- Linter rules reference: `/docs/studio/lint-policy/linter-rules` +- Schema checks: `/docs/cli/subgraph/check` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL schema linting +- Schema conventions +- API style guide + +### Secondary Keywords +- Naming conventions enforcement +- Schema documentation requirements +- Deprecation policies + +### Related Search Terms +- GraphQL linter rules +- Enforce schema naming conventions +- GraphQL style guide automation + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git 
a/capabilities/developer-experience/query-plan-visualization.md b/capabilities/developer-experience/query-plan-visualization.md new file mode 100644 index 00000000..d98e1e78 --- /dev/null +++ b/capabilities/developer-experience/query-plan-visualization.md @@ -0,0 +1,146 @@ +# Query Plan Visualization + +Visual query execution plans for debugging and understanding federated query routing. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-dx-006` | +| **Category** | Developer Experience | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-dx-001` | + +--- + +## Quick Reference + +### Name +Query Plan Visualization + +### Tagline +See how your queries execute across subgraphs. + +### Elevator Pitch +Cosmo's Query Plan Visualization shows developers exactly how the router will execute a federated GraphQL query. View the query plan directly in the playground to understand subgraph routing, execution order, and data fetching strategies before your query even runs. Debug query behavior and optimize performance with complete visibility into federation mechanics. + +--- + +## Problem & Solution + +### The Problem +Federated GraphQL queries are executed across multiple subgraphs, but developers have no visibility into how the router plans this execution. When queries behave unexpectedly or perform poorly, there's no way to understand the underlying routing logic. Debugging requires guesswork and trial-and-error. + +### The Solution +Query Plan Visualization exposes the router's internal query plan directly in the playground. By including the `X-WG-Include-Query-Plan` header, developers receive the complete execution plan in the response extensions. The plan shows which subgraphs will be called, in what order, and how data will be fetched and merged. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Query routing is a black box | Full visibility into execution plan | +| Debugging requires guesswork | See exact subgraph calls before execution | +| No insight into fetch strategy | Understand parallel vs sequential execution | +| Performance issues hard to diagnose | Identify routing inefficiencies upfront | + +--- + +## Key Benefits + +1. **Execution Transparency**: See exactly which subgraphs will be called and in what order +2. **Pre-Execution Analysis**: View the plan without actually making subgraph requests using `X-WG-Skip-Loader` +3. **Performance Insight**: Understand parallel vs sequential execution before running queries +4. **Debug Without Traces**: Analyze plans without generating trace data using `X-WG-Disable-Tracing` +5. **Playground Integration**: View plans directly in Cosmo Studio's playground interface + +--- + +## Target Audience + +### Primary Persona +- **Role**: GraphQL Developer / Platform Engineer +- **Pain Points**: Understanding federation routing; debugging unexpected query behavior; optimizing query performance +- **Goals**: Understand how queries execute; identify optimization opportunities; debug federation issues + +### Secondary Personas +- Performance engineers analyzing query efficiency +- Developers learning federation concepts +- DevOps teams troubleshooting production queries + +--- + +## Use Cases + +### Use Case 1: Understanding Query Routing +**Scenario**: A developer is new to federation and wants to understand how a complex query will be routed across 4 subgraphs +**How it works**: The developer writes the query in the playground, adds the `X-WG-Include-Query-Plan` header, and runs the query. The extensions field in the response contains the complete query plan showing each subgraph call. 
+**Outcome**: Developer understands federation routing without reading federation internals documentation + +### Use Case 2: Pre-Execution Performance Analysis +**Scenario**: Before running an expensive query in production, an engineer wants to understand its execution plan without generating traffic +**How it works**: The engineer uses both `X-WG-Include-Query-Plan` and `X-WG-Skip-Loader` headers. This returns the query plan but skips actual subgraph requests (data returns as null). +**Outcome**: Complete execution plan analysis with zero production impact + +### Use Case 3: Debugging Query Inefficiencies +**Scenario**: A query is making more subgraph calls than expected, and the team needs to understand why +**How it works**: The team examines the query plan to see the exact sequence of fetches. They discover that a field dependency is causing an extra round-trip that could be eliminated with schema changes. +**Outcome**: Schema optimized to reduce subgraph calls and improve query performance + +--- + +## Technical Summary + +### How It Works +When the `X-WG-Include-Query-Plan` header is included in a request, the Cosmo Router includes the query plan in the response's extensions field. The plan describes the fetch operations, their dependencies, and which subgraphs they target. Additional headers provide control over execution behavior. 
+ +### Key Technical Features +- `X-WG-Include-Query-Plan`: Request query plan in response extensions +- `X-WG-Skip-Loader`: Skip subgraph requests, return null data (for plan-only analysis) +- `X-WG-Disable-Tracing`: Exclude from tracing (avoid trace noise) +- Plan shows fetch operations and dependencies +- Integrated visualization in Cosmo Studio playground + +### Integration Points +- Cosmo Router +- Cosmo Studio Playground + +### Requirements & Prerequisites +- Cosmo Router deployed +- Access to Cosmo Studio playground + +--- + +## Documentation References + +- Primary docs: `/docs/router/query-plan` +- Playground overview: `/docs/studio/playground` + +--- + +## Keywords & SEO + +### Primary Keywords +- Query plan visualization +- GraphQL execution plan +- Federation query debugging + +### Secondary Keywords +- Subgraph routing +- Query optimization +- Federation debugging + +### Related Search Terms +- How to see GraphQL query plan +- Debug federated GraphQL queries +- Understand federation execution + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/developer-experience/schema-explorer.md b/capabilities/developer-experience/schema-explorer.md new file mode 100644 index 00000000..d8b5fe4b --- /dev/null +++ b/capabilities/developer-experience/schema-explorer.md @@ -0,0 +1,149 @@ +# Schema Explorer + +Interactive schema browsing with search, usage tracking, and authentication visibility. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-dx-004` | +| **Category** | Developer Experience | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-dx-005`, `cap-dx-008` | + +--- + +## Quick Reference + +### Name +Schema Explorer + +### Tagline +Navigate your entire federated schema interactively. 
+ +### Elevator Pitch +Cosmo's Schema Explorer provides an interactive interface to browse your entire federated GraphQL schema. Navigate between types, view field details, search instantly with keyboard shortcuts, and see real-world usage data for every field. Identify deprecated fields, understand authentication requirements, and explore your schema without switching contexts. + +--- + +## Problem & Solution + +### The Problem +Understanding a large federated GraphQL schema is challenging. Developers struggle to find types, track field usage, and understand which fields require authentication. Schema documentation is often outdated, and there's no easy way to see how the schema has evolved or which fields are safe to deprecate. + +### The Solution +The Schema Explorer provides a living, interactive view of your federated schema. Navigate from Query to nested types with clicks, search for any type with `Cmd/Ctrl + K`, and see real usage metrics alongside field definitions. View all deprecated fields in one place with their usage data, and instantly identify which fields require authentication scopes. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Read raw SDL files to understand schema | Interactive navigation between types | +| No visibility into field usage | Usage metrics for every field | +| Search through files for types | Instant search with keyboard shortcut | +| Unclear which fields need auth | Authentication requirements displayed inline | + +--- + +## Key Benefits + +1. **Interactive Navigation**: Click through types, interfaces, unions, and enums to explore your schema naturally +2. **Instant Search**: `Cmd/Ctrl + K` opens search modal to jump to any type immediately +3. **Usage Tracking**: See real-world usage data for every field, powered by analytics +4. **Deprecated Fields View**: All deprecated fields in one place with their usage metrics +5. 
**Authentication Visibility**: View `@authenticated` and `@requiresScopes` directives with scope details + +--- + +## Target Audience + +### Primary Persona +- **Role**: GraphQL Developer / API Consumer +- **Pain Points**: Finding types in large schemas; understanding field usage; knowing auth requirements +- **Goals**: Quickly understand available fields; make informed decisions about field usage; know what auth is needed + +### Secondary Personas +- Frontend developers discovering available API fields +- Platform engineers reviewing schema structure +- API designers planning schema evolution + +--- + +## Use Cases + +### Use Case 1: Schema Discovery +**Scenario**: A frontend developer needs to find all available fields for building a user profile page +**How it works**: The developer opens Schema Explorer, uses `Cmd + K` to search for "User", and navigates through the User type to see all available fields, their types, and descriptions +**Outcome**: Developer quickly identifies the exact fields needed without reading SDL files or asking teammates + +### Use Case 2: Safe Deprecation Planning +**Scenario**: A platform team wants to deprecate a field but needs to know if it's still in use +**How it works**: The team views the deprecated fields list, checks usage metrics for the field in question, and sees it still has significant traffic +**Outcome**: Team decides to communicate deprecation to consumers before removal, avoiding breaking changes + +### Use Case 3: Understanding Authentication Requirements +**Scenario**: A developer is building a feature and needs to know which fields require specific auth scopes +**How it works**: The developer opens the Authenticated Types and Fields view, finds the relevant types, and clicks "View scopes" to see the required scopes for each protected field +**Outcome**: Developer implements correct auth handling before making API calls, avoiding auth errors + +--- + +## Technical Summary + +### How It Works +The Schema Explorer 
parses the composed federated schema and renders it as an interactive UI. Each type links to its field types, enabling click-through navigation. Usage data is pulled from Cosmo's analytics pipeline and displayed alongside field definitions. Authentication directives are extracted from the router schema and presented in dedicated views. + +### Key Technical Features +- Full type navigation: objects, interfaces, enums, unions, inputs +- Field details including arguments and descriptions +- `Cmd/Ctrl + K` global search +- Schema field usage metrics integration +- Deprecated fields aggregated view +- `@authenticated` and `@requiresScopes` directive visibility +- Scope requirements expandable per field + +### Integration Points +- Cosmo Studio +- Cosmo Analytics (for usage data) +- Router Schema (for auth directives) + +### Requirements & Prerequisites +- Federated graph deployed to Cosmo +- Analytics enabled for usage tracking (optional but recommended) + +--- + +## Documentation References + +- Primary docs: `/docs/studio/schema-explorer` +- Field usage analytics: `/docs/studio/analytics/schema-field-usage` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL schema explorer +- Schema documentation +- Field usage tracking + +### Secondary Keywords +- Interactive schema browser +- Deprecated field management +- GraphQL authentication directives + +### Related Search Terms +- How to explore GraphQL schema +- GraphQL field usage analytics +- Schema documentation tool + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/developer-experience/shared-playground-state.md b/capabilities/developer-experience/shared-playground-state.md new file mode 100644 index 00000000..17694089 --- /dev/null +++ b/capabilities/developer-experience/shared-playground-state.md @@ -0,0 +1,155 @@ +# Shared Playground State + +Shareable playground sessions for team 
collaboration and reproducible debugging. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-dx-003` | +| **Category** | Developer Experience | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-dx-001`, `cap-dx-002` | + +--- + +## Quick Reference + +### Name +Shared Playground State + +### Tagline +Share complete GraphQL sessions with a single URL. + +### Elevator Pitch +Cosmo's Shared Playground State lets you share complete GraphQL playground sessions - including queries, variables, and headers - with teammates using a single URL. No more copying and pasting queries in Slack or explaining how to reproduce an issue. Share the exact context and collaborate instantly. + +--- + +## Problem & Solution + +### The Problem +Sharing GraphQL queries for collaboration or debugging is tedious. Developers copy queries into chat, forget to include variables, lose header configurations, and spend time re-explaining context. Bug reports lack reproducibility, and onboarding new developers means walking them through query setup manually. + +### The Solution +Shared Playground State generates a URL that encodes the complete playground session. Recipients open the link and see the exact query, variables, and headers the sender configured. The session opens in a new tab, ready to execute, with full context preserved. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Copy/paste queries across chat tools | Share a single URL | +| Variables and headers often missing | Complete context included | +| Bug reports lack reproducibility | Exact reproduction every time | +| Manual onboarding for query examples | Link directly to example queries | + +--- + +## Key Benefits + +1. **One-Click Sharing**: Generate a shareable URL with a single click from the playground toolbar +2. **Complete Context**: Include operations, variables, and headers in the shared link +3. 
**Instant Reproduction**: Recipients see the exact session state without manual setup +4. **Team Collaboration**: Speed up debugging, code reviews, and knowledge sharing +5. **Documentation Integration**: Link to example queries in internal documentation or wikis + +--- + +## Target Audience + +### Primary Persona +- **Role**: GraphQL Developer / Support Engineer +- **Pain Points**: Explaining query context repeatedly; bug reports that can't be reproduced; slow onboarding +- **Goals**: Collaborate efficiently; share reproducible examples; accelerate debugging + +### Secondary Personas +- Technical writers creating API documentation +- Developer advocates sharing examples +- Engineering managers reviewing query patterns + +--- + +## Use Cases + +### Use Case 1: Bug Reproduction +**Scenario**: A developer discovers a query returning unexpected results and needs to share the exact scenario with a teammate +**How it works**: The developer clicks the Share icon in the playground toolbar, selects which elements to include (operation, variables, headers), and copies the generated link. The teammate opens the link and sees the exact query ready to run. 
+**Outcome**: Bug is reproduced instantly; debugging time reduced significantly + +### Use Case 2: Developer Onboarding +**Scenario**: A new team member needs to learn how to query the product catalog with proper authentication headers +**How it works**: A senior developer creates a reference query with correct headers and variables, generates a share link, and adds it to the team's internal documentation +**Outcome**: New developers have working examples they can execute immediately, reducing ramp-up time + +### Use Case 3: Support Collaboration +**Scenario**: A customer reports an API issue and the support team needs to hand off to engineering with full context +**How it works**: Support recreates the customer's query in the playground, generates a share link with all relevant context, and includes it in the engineering ticket +**Outcome**: Engineers can reproduce the issue immediately without back-and-forth questions + +--- + +## Technical Summary + +### How It Works +When sharing a playground session, the selected state (operation, variables, headers) is compressed and encoded into the URL. The URL can be shared with anyone who has access to the same Cosmo Studio organization. When opened, the playground restores the encoded state into a new tab. 
+ +### Key Technical Features +- Selective sharing: choose which elements to include +- GraphQL operations always included (required) +- Variables and headers optional +- Compressed URL encoding +- New tab restoration + +### What's Not Included +- Pre-flight scripts +- Pre-operation scripts +- Post-operation scripts + +### Security Considerations +- Headers are encoded but accessible to anyone with the link +- Avoid including sensitive credentials in shared headers + +### Integration Points +- Cosmo Studio Playground +- Any URL-sharing mechanism (Slack, email, documentation) + +### Requirements & Prerequisites +- Cosmo Studio account +- Recipients need access to the same organization/graph + +--- + +## Documentation References + +- Primary docs: `/docs/studio/playground/shared-playground-state` +- Playground overview: `/docs/studio/playground` + +--- + +## Keywords & SEO + +### Primary Keywords +- Share GraphQL queries +- Playground collaboration +- GraphQL session sharing + +### Secondary Keywords +- Reproducible bug reports +- Team collaboration GraphQL +- GraphQL onboarding + +### Related Search Terms +- How to share GraphQL playground state +- Collaborate on GraphQL queries +- Share GraphQL request context + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/extensibility/custom-modules.md b/capabilities/extensibility/custom-modules.md new file mode 100644 index 00000000..31e7c9b0 --- /dev/null +++ b/capabilities/extensibility/custom-modules.md @@ -0,0 +1,157 @@ +# Custom Modules (Go) + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ext-001` | +| **Category** | Extensibility | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-ext-002` | + +--- + +## Quick Reference + +### Name +Custom Modules (Go) + +### Tagline +Extend router functionality with pure Go 
code. + +### Elevator Pitch +Custom Modules allow you to extend the Cosmo Router by writing pure Go code that hooks into the request lifecycle. Implement custom authentication, caching, logging, header manipulation, and more without complex scripting or external proxies. Leverage the entire Go ecosystem to customize exactly how your GraphQL gateway behaves. + +--- + +## Problem & Solution + +### The Problem +Teams often need to customize their GraphQL gateway beyond what standard configuration allows. Whether implementing custom authentication logic, adding proprietary caching layers, integrating with internal systems, or enforcing company-specific policies, organizations find themselves stuck between limited configuration options and building a gateway from scratch. External proxy solutions add latency and operational complexity. + +### The Solution +Cosmo's Custom Modules provide a clean extension API that lets you write pure Go code to intercept and modify requests at every stage of the request lifecycle. Multiple hook interfaces give you precise control over when your code executes, from early request validation to post-subgraph response handling. A single compilation command produces your extended router binary. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| External proxy layers adding latency | Direct in-router extensions with zero network hop | +| Complex scripting languages with limited capabilities | Full Go ecosystem and native performance | +| Limited hook points requiring workarounds | Six distinct interfaces for precise lifecycle control | +| Difficult module testing and debugging | Standard Go testing and debugging workflows | + +--- + +## Key Benefits + +1. **Native Performance**: Extensions run as compiled Go code within the router process, eliminating external proxy overhead and network latency. +2. 
**Full Request Lifecycle Control**: Six hook interfaces let you intercept requests at exactly the right moment - from early authentication to post-subgraph response processing. +3. **Leverage Go Ecosystem**: Use any Go library for authentication, caching, metrics, or integration with your existing systems. +4. **Type-Safe Development**: Go's strong typing catches errors at compile time, and the well-defined interfaces make extension development straightforward. +5. **Production-Ready Patterns**: Access to request context, GraphQL operation details, authentication info, and query plan statistics enables sophisticated production use cases. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Backend Developer +- **Pain Points**: Need to customize gateway behavior beyond configuration; current solutions require external proxies or complex workarounds; want to integrate with existing Go-based infrastructure. +- **Goals**: Implement custom authentication, add proprietary caching, enforce company policies, integrate with internal systems. + +### Secondary Personas +- Security Engineers implementing custom authentication and authorization logic +- DevOps Engineers adding custom logging, metrics, and observability +- Integration Specialists connecting the gateway to internal enterprise systems + +--- + +## Use Cases + +### Use Case 1: Custom Authentication Logic +**Scenario**: A company uses a proprietary identity system that doesn't follow standard OAuth/JWT patterns and needs to authenticate requests before they reach subgraphs. +**How it works**: Implement `RouterOnRequestHandler` to intercept requests before the router's built-in authentication. Extract credentials, validate against the internal system, and either allow the request to proceed or return an early error response. +**Outcome**: Seamless integration with existing identity infrastructure without modifying subgraphs or adding external proxy layers. 
+ +### Use Case 2: Response Caching Layer +**Scenario**: An e-commerce platform wants to cache certain expensive GraphQL queries at the gateway level to reduce subgraph load during high-traffic periods. +**How it works**: Use `EnginePreOriginHandler` to check cache before subgraph requests and `EnginePostOriginHandler` to populate cache after responses. Access operation hash and query plan stats to make intelligent caching decisions. +**Outcome**: Significant reduction in subgraph load and improved response times for cacheable operations. + +### Use Case 3: Request Validation and Rate Limiting +**Scenario**: A SaaS platform needs to enforce per-tenant rate limits and validate that operations conform to tenant-specific policies. +**How it works**: Implement `RouterMiddlewareHandler` to access the GraphQL operation details and query plan statistics. Use `QueryPlanStats` to estimate operation cost based on subgraph fetches, then apply tenant-specific rate limiting logic. +**Outcome**: Fair resource allocation across tenants with protection against expensive queries, all enforced at the gateway layer. + +### Use Case 4: Header Propagation and Transformation +**Scenario**: A microservices environment requires specific headers to be propagated to subgraphs, with transformations based on the target service. +**How it works**: Use `EnginePreOriginHandler` to access `ctx.ActiveSubgraph()` and conditionally add, modify, or remove headers based on the destination subgraph. +**Outcome**: Clean header management without subgraph modifications, enabling consistent tracing and authentication across services. + +--- + +## Technical Summary + +### How It Works +Custom Modules are pure Go code that implement one or more of six predefined interfaces. When you build your router with custom modules, they're compiled into the router binary. At runtime, modules are instantiated and their handlers are called at the appropriate points in the request lifecycle. 
Modules can be prioritized to control loading order, and configuration values can be passed via the router's YAML config file. + +### Key Technical Features +- Six hook interfaces: `RouterOnRequestHandler`, `RouterMiddlewareHandler`, `EnginePreOriginHandler`, `EnginePostOriginHandler`, `Provisioner`, `Cleaner` +- Access to GraphQL operation details: name, type, hash, content, query plan stats +- Request context for sharing data across handlers +- Subgraph information access including name, ID, and URL +- Authentication information access and modification +- Configurable module priority for controlled loading order +- YAML-based module configuration with struct tag mapping + +### Integration Points +- Go ecosystem libraries and frameworks +- External authentication systems +- Caching systems (Redis, Memcached, etc.) +- Logging and metrics platforms +- Internal enterprise APIs + +### Requirements & Prerequisites +- Go development environment +- Familiarity with Go interfaces and HTTP handlers +- Access to router source for building custom binary + +--- + +## Documentation References + +- Primary docs: `/docs/router/custom-modules` +- Examples repository: https://github.com/wundergraph/router-examples +- Custom module with tests: https://github.com/wundergraph/cosmo/tree/main/router/cmd/custom +- Custom JWT example: https://github.com/wundergraph/cosmo/tree/main/router/cmd/custom-jwt +- ADR for future module system: https://github.com/wundergraph/cosmo/blob/main/adr/custom-modules-v1.md + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL router extension +- Custom Go modules +- Router middleware + +### Secondary Keywords +- GraphQL gateway customization +- Request lifecycle hooks +- Router plugins + +### Related Search Terms +- How to extend GraphQL router +- Custom authentication GraphQL gateway +- Go GraphQL middleware +- Router request interceptor + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 
1.0 | Initial capability documentation | diff --git a/capabilities/extensibility/subgraph-check-extensions.md b/capabilities/extensibility/subgraph-check-extensions.md new file mode 100644 index 00000000..75c6e624 --- /dev/null +++ b/capabilities/extensibility/subgraph-check-extensions.md @@ -0,0 +1,156 @@ +# Subgraph Check Extensions + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-ext-002` | +| **Category** | Extensibility | +| **Status** | GA | +| **Availability** | Pro / Enterprise | +| **Related Capabilities** | `cap-ext-001` | + +--- + +## Quick Reference + +### Name +Subgraph Check Extensions + +### Tagline +Custom validation logic for schema changes on your terms. + +### Elevator Pitch +Subgraph Check Extensions let you hook into Cosmo's schema check pipeline with your own validation logic. When developers propose schema changes, your custom endpoint receives detailed information about the change and can return additional lint issues or errors that block deployment. Enforce company standards, integrate with external systems, and ensure every schema change meets your requirements before it reaches production. + +--- + +## Problem & Solution + +### The Problem +Organizations have unique policies and standards for their GraphQL schemas that go beyond generic linting rules. They may need to enforce naming conventions, validate against external systems, coordinate between teams when breaking changes occur, or integrate schema validation with internal compliance tools. Built-in validation tools, while powerful, cannot anticipate every organization's specific requirements. + +### The Solution +Cosmo's Subgraph Check Extensions send a webhook to your custom endpoint whenever a schema check runs. Your service receives comprehensive information about the proposed changes - including the schema SDL, detected lint issues, schema changes, and affected operations. 
You can return custom lint issues that appear in the check results, or return errors that cause the check to fail. This enables unlimited customization of the validation pipeline while keeping your logic on your own infrastructure. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Generic linting rules that miss company-specific requirements | Custom validation enforcing exact organizational standards | +| No integration with internal compliance systems | Webhook-based integration with any external system | +| Breaking changes deployed without team coordination | Automated notifications to affected teams before deployment | +| Schema standards enforced through manual review | Automated enforcement with clear feedback in CI/CD | + +--- + +## Key Benefits + +1. **Unlimited Custom Validation**: Write any validation logic in any language - check naming conventions, enforce deprecation policies, validate against external systems, or implement custom business rules. +2. **Seamless CI/CD Integration**: Extensions run automatically during schema checks, blocking non-compliant changes before they can be merged or deployed. +3. **Full Context Available**: Receive the complete picture - schema SDL (before and after), lint issues, schema changes, affected operations, VCS context, and more. +4. **Secure Communication**: HMAC signatures verify that requests originate from Cosmo and haven't been tampered with, ensuring your endpoint processes only legitimate requests. +5. **Flexible Response Options**: Return custom lint issues that appear alongside built-in checks, or return errors that explicitly fail the check - you control the severity. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Governance Lead +- **Pain Points**: Cannot enforce organization-specific schema standards automatically; need to integrate schema validation with internal tools; manual review process doesn't scale. 
+- **Goals**: Automate enforcement of API standards; integrate with internal compliance and notification systems; reduce manual review burden while maintaining quality. + +### Secondary Personas +- Security Engineers validating schema changes against security policies +- Architecture Teams ensuring schemas follow design patterns +- DevOps Engineers integrating schema checks with CI/CD pipelines + +--- + +## Use Cases + +### Use Case 1: Custom Naming Convention Enforcement +**Scenario**: A company requires all GraphQL types and fields to follow specific naming conventions (e.g., camelCase for fields, PascalCase for types, specific prefixes for certain domains). +**How it works**: Configure the extension to receive schema SDL. Your service parses the schema, validates naming patterns, and returns `LintIssue` objects with `lintRuleType`, `severity`, `message`, and `issueLocation` for any violations. +**Outcome**: Developers receive immediate, precise feedback on naming violations directly in their check results, with exact line and column numbers for each issue. + +### Use Case 2: Breaking Change Notifications +**Scenario**: When a backend team introduces a breaking change, the frontend team needs to be notified so they can prepare for the update. +**How it works**: Enable "Include Schema Changes" in the configuration. Your service analyzes the schema changes for breaking modifications, and when detected, sends notifications to the affected teams via Slack, email, or internal systems. +**Outcome**: Cross-team coordination happens automatically, reducing surprise breaking changes and enabling smoother deployments. + +### Use Case 3: External Compliance Validation +**Scenario**: All API changes must be validated against an internal compliance system before deployment to production environments. 
+**How it works**: Your extension endpoint forwards the schema information to the compliance system, awaits validation results, and translates any compliance failures into check errors that block deployment. +**Outcome**: Compliance validation is seamlessly integrated into the development workflow, catching issues before they reach production. + +### Use Case 4: Deprecation Policy Enforcement +**Scenario**: The organization requires that deprecated fields remain available for a minimum period and that deprecation reasons follow a specific format including sunset dates. +**How it works**: Parse the schema SDL to find `@deprecated` directives, validate that deprecation reasons contain required information, and check against historical data to ensure minimum deprecation periods. +**Outcome**: Consistent deprecation practices across all subgraphs, with clear communication to API consumers about upcoming changes. + +--- + +## Technical Summary + +### How It Works +When a subgraph check runs in a namespace with extensions enabled, Cosmo sends a POST request to your configured endpoint. The JSON payload contains detailed information about the check, including organization context, namespace, VCS information (when applicable), affected graphs, and subgraph details. A downloadable URL provides bulk data including SDL versions and lint issues. Your endpoint responds with 204 (no action needed) or 200 with optional lint issues and errors. HMAC signatures in the `X-Cosmo-Signature-256` header allow you to verify request authenticity. 
+ +### Key Technical Features +- Configurable data inclusion: SDL, lint issues, graph pruning issues, schema changes, affected operations +- Rich payload with organization, namespace, VCS context, and subgraph information +- Bulk data file accessible for 5 minutes containing detailed schema and lint information +- HMAC-SHA256 signature verification for secure webhook communication +- Flexible response: 204 for pass-through, 200 with custom lint issues and/or errors +- Lint issues include precise location (line, column) for highlighting in UI + +### Integration Points +- Any HTTP endpoint capable of receiving webhooks +- VCS systems (GitHub, GitLab, etc.) through VCS context data +- Internal compliance and governance systems +- Team notification systems (Slack, Teams, email) +- Custom linting and validation frameworks + +### Requirements & Prerequisites +- HTTP endpoint accessible from Cosmo's control plane +- Secret key for HMAC signature verification +- Namespace-level configuration in Cosmo Studio + +--- + +## Documentation References + +- Primary docs: `/docs/studio/subgraph-check-extensions` +- Request payload structure: `/docs/studio/sce/request-payload-structure` +- Response structure: `/docs/studio/sce/response-structure` +- Handler example: `/docs/studio/sce/handler-example` +- File content details: `/docs/studio/sce/file-content` + +--- + +## Keywords & SEO + +### Primary Keywords +- Schema validation webhook +- Custom schema checks +- GraphQL linting extension + +### Secondary Keywords +- Subgraph validation +- Schema governance +- API compliance automation + +### Related Search Terms +- Custom GraphQL schema validation +- Schema check webhook integration +- Automated API governance +- GraphQL CI/CD validation + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/feature-flags/feature-flags.md 
b/capabilities/feature-flags/feature-flags.md new file mode 100644 index 00000000..8fc646b9 --- /dev/null +++ b/capabilities/feature-flags/feature-flags.md @@ -0,0 +1,228 @@ +# Feature Flags + +--- + +## Metadata + +| Field | Value | |-------|-------| | **Capability ID** | `cap-feature-flags` | | **Category** | Feature Flags | | **Status** | GA | | **Availability** | Pro, Enterprise | | **Related Capabilities** | `cap-federation`, `cap-schema-registry` | + +--- + +## Quick Reference + +### Name +Feature Flags & Progressive Delivery + +### Tagline +Gradually roll out GraphQL changes with runtime feature toggles. + +### Elevator Pitch +Feature Flags enable you to release schema changes and experimental features incrementally to a subset of your consumer traffic, rather than all clients immediately. Using feature subgraphs as toggle-able replacements for base subgraphs, you can control which users see new features based on headers, JWT claims, or cookies—all without deploying new router versions. + +--- + +## Problem & Solution + +### The Problem +Releasing new features or schema changes in a federated GraphQL architecture is risky. A single breaking change or performance regression can affect all clients immediately. Teams lack the ability to test changes with real production traffic, gradually roll out features to specific user segments, or quickly disable problematic features without a full redeployment. This leads to slower release cycles, increased risk, and difficulty coordinating changes across multiple subgraphs. + +### The Solution +Cosmo's Feature Flags provide runtime toggles that let you activate alternative subgraph implementations—called feature subgraphs—for specific requests.
Based on request context (headers, JWT claims, or cookies), different users can see different graph compositions. This enables gradual rollouts, A/B testing, shadow mode comparisons, and instant rollbacks—all without changing deployed infrastructure. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Deploy changes to all users at once | Gradually roll out to 1%, 10%, 50%, then 100% of users | +| No way to test schema changes in production safely | Shadow mode testing with real traffic before enabling | +| Breaking changes require emergency rollbacks | Disable feature flag instantly without redeployment | +| Separate staging environments for each feature | Shared staging with per-developer feature isolation | + +--- + +## Key Benefits + +1. **Zero-Downtime Feature Rollout**: Enable or disable features for specific user segments without deploying new router versions or modifying infrastructure. + +2. **Safe Schema Evolution**: Test schema changes with real production traffic in shadow mode before exposing them to users, comparing correctness and performance. + +3. **Instant Rollback**: If a feature causes issues, disable the feature flag immediately—no deployment or code changes required. + +4. **Personalized Experiences**: Serve different graph compositions to different users based on headers, JWT claims, or cookies, enabling A/B testing and personalization. + +5. **Shared Staging Environments**: Multiple developers can test their features in isolation on a shared staging environment using unique feature flag identifiers. 
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Team Lead +- **Pain Points**: Coordinating schema changes across teams is complex; rollbacks require redeployments; no way to gradually test changes in production +- **Goals**: Ship features faster with lower risk; enable teams to work independently; maintain API stability during transitions + +### Secondary Personas +- Backend developers wanting to test changes with real traffic before full rollout +- Engineering managers needing visibility into feature rollout progress +- DevOps engineers looking for safer deployment strategies + +--- + +## Use Cases + +### Use Case 1: Monolith to Federation Migration +**Scenario**: A team is migrating from a monolithic GraphQL API to a federated architecture and needs to validate that the new subgraphs perform correctly before switching traffic. + +**How it works**: +1. Create a feature subgraph that overrides fields from the monolith using the `@override` directive +2. Enable shadow mode to route traffic to both implementations and compare results +3. Gradually increase traffic percentage to the new subgraph while monitoring performance +4. Once confident, publish the schema change without the feature flag + +**Outcome**: Migration completed with zero downtime and full confidence in correctness and performance parity. + +### Use Case 2: Experimental Feature Rollout +**Scenario**: A product team wants to release a new recommendations engine to premium users first, then expand based on feedback. + +**How it works**: +1. Create a feature subgraph with the new recommendations implementation +2. Create a feature flag that activates based on a JWT claim indicating premium subscription +3. Monitor performance and gather feedback from the premium user segment +4. 
Expand to additional user segments by updating the feature flag criteria + +**Outcome**: New feature validated with real users before broad rollout, with ability to iterate based on feedback. + +### Use Case 3: Developer Staging Isolation +**Scenario**: Multiple developers need to test their changes in a shared staging environment without affecting each other's work. + +**How it works**: +1. Each developer creates a feature subgraph for their changes +2. Developers set a unique feature flag header or cookie in their client when testing +3. Their requests use their feature subgraph while others use the base graph +4. Features can be tested end-to-end without dedicated infrastructure per developer + +**Outcome**: Faster development cycles with reduced infrastructure costs and no staging environment conflicts. + +--- + +## Competitive Positioning + +### Key Differentiators +1. **Native Federation Integration**: Feature flags work at the subgraph composition level, not just field resolution—enabling true schema evolution +2. **Flexible Activation Methods**: Activate via headers, JWT claims, or cookies—supporting diverse architectural patterns +3. 
**Shadow Mode Comparison**: Compare feature subgraph results against base implementation before enabling + +### Comparison with Alternatives + +| Aspect | Cosmo Feature Flags | Traditional Feature Flags | Manual Traffic Splitting | +|--------|---------------------|---------------------------|--------------------------| +| Schema-aware | Yes | No | No | +| Federation-native | Yes | No | Partial | +| Runtime activation | Headers, JWT, Cookies | SDK calls | Load balancer rules | +| Rollback speed | Instant | Code change required | Configuration change | +| Shadow mode | Built-in | Custom implementation | Not available | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We already use LaunchDarkly/Split" | Cosmo Feature Flags are schema-aware and work at the federation composition level—complementing app-level feature flags rather than replacing them | +| "This adds complexity" | Feature flags reduce complexity by eliminating the need for separate deployments, environments, and rollback procedures | +| "How do we know which flag is active?" | Full observability integration shows which feature flags are active for each request in traces and analytics | + +--- + +## Technical Summary + +### How It Works +Feature subgraphs are alternative implementations of base subgraphs in your federated graph. When a feature flag is enabled and activated (via header, JWT claim, or cookie), the router composes the graph using the feature subgraph instead of the base subgraph for that request. This happens at the routing layer, so no changes to subgraph code are required beyond publishing the alternative schema. 
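The per-request activation described above requires nothing from the client beyond a header. A minimal sketch with curl — the endpoint and flag name are placeholders, and `X-Feature-Flag` is the header name used in the Cosmo docs for header-based activation (verify it against your router version):

```bash
# Route this request through the "new-recommendations" feature subgraph
# (placeholder flag name); omit the header to hit the base graph.
curl http://router.example.com/graphql \
  -H 'Content-Type: application/json' \
  -H 'X-Feature-Flag: new-recommendations' \
  -d '{"query": "{ recommendations { id title } }"}'
```

Because the flag is resolved per request, two clients can hit the same router at the same time and see different compositions — which is what makes the staging-isolation use case above work without extra infrastructure.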
+ +### Key Technical Features +- Feature subgraphs as "overrides" for base subgraphs +- Label-based matching to federated graphs +- Header, JWT claim, and cookie-based activation +- Shadow mode for result comparison +- Atomic enable/disable without redeployment + +### Integration Points +- Cosmo Router (minimum v0.95.0) +- wgc CLI (minimum v0.58.0) +- Any load balancer supporting custom headers for traffic splitting +- JWT-based authentication systems + +### Requirements & Prerequisites +- Cosmo Router v0.95.0 or later +- wgc CLI v0.58.0 or later +- Existing federated graph with base subgraphs + +--- + +## Proof Points + +### Metrics & Benchmarks +- Zero additional latency for feature flag evaluation (computed at composition time) +- Instant feature flag toggle propagation (< 1 second) +- Support for unlimited feature flags per federated graph + +### Customer Quotes +> "Feature flags let us migrate our monolith to federation without any downtime or risk to our users." — Platform Engineering Team + +### Case Studies +- Monolith-to-federation migration with shadow mode validation +- Multi-team schema evolution coordination + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/concepts/feature-flags` +- CLI Feature Flags commands: `/docs/cli/feature-flags` +- CLI Feature Subgraph commands: `/docs/cli/feature-subgraph` +- Tutorial: `/docs/tutorial/gradual-and-experimental-feature-rollout-with-feature-flags` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL feature flags +- Federation feature flags +- Progressive delivery GraphQL + +### Secondary Keywords +- Feature subgraphs +- Schema evolution +- Gradual rollout + +### Related Search Terms +- GraphQL canary deployments +- Federation 
traffic splitting +- GraphQL A/B testing +- Safe schema migrations + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2026-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/federation/federation-directives.md b/capabilities/federation/federation-directives.md new file mode 100644 index 00000000..417a0971 --- /dev/null +++ b/capabilities/federation/federation-directives.md @@ -0,0 +1,201 @@ +# Federation Directives + +Extended directive support including @shareable, @authenticated, @requiresScopes, and more for advanced federation scenarios and security policies. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-fed-006` | +| **Category** | Federation | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-fed-001`, `cap-fed-002`, `cap-fed-005` | + +--- + +## Quick Reference + +### Name +Federation Directives + +### Tagline +Complete directive support for sophisticated federation patterns. + +### Elevator Pitch +WunderGraph Cosmo supports a comprehensive set of federation directives for building sophisticated distributed GraphQL architectures. From core federation patterns (@key, @external, @requires) to advanced authorization (@authenticated, @requiresScopes) and cross-subgraph field sharing (@shareable), Cosmo provides the building blocks for enterprise-grade federated APIs with built-in security. + +--- + +## Problem & Solution + +### The Problem +Building federated GraphQL architectures requires expressing complex relationships between subgraphs: entity resolution, field dependencies, cross-service sharing, and access control. Without comprehensive directive support, teams resort to workarounds, custom middleware, or are blocked from implementing required patterns. Authorization is particularly challenging, requiring custom code in every subgraph. 
+ +### The Solution +Cosmo supports the full spectrum of federation directives plus extensions for common enterprise needs: +- **Core Federation**: @key, @external, @requires, @provides, @extends for entity relationships +- **Field Sharing**: @shareable, @override for multi-subgraph field resolution +- **Visibility Control**: @inaccessible, @tag for schema filtering +- **Authorization**: @authenticated, @requiresScopes for declarative security policies +- **Subscription Filtering**: @openfed__subscriptionFilter for event filtering + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Limited directive support | Full v1 and v2 directive compatibility | +| Custom authorization middleware | Declarative @authenticated and @requiresScopes | +| Complex field sharing logic | Simple @shareable directive | +| No subscription filtering | @openfed__subscriptionFilter for real-time | + +--- + +## Key Benefits + +1. **Complete Federation Support**: All Apollo Federation v1 and v2 directives fully implemented +2. **Declarative Authorization**: @authenticated and @requiresScopes enforce security at the router level +3. **Cross-Subgraph Sharing**: @shareable enables the same field to be resolved from multiple subgraphs +4. **Interface Entities**: @interfaceObject and @key on interfaces for advanced type patterns +5. 
**Custom Extensions**: Cosmo-specific directives for subscription filtering and description configuration + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / GraphQL Architect +- **Pain Points**: Implementing complex federation patterns; securing federated graphs; sharing fields across services +- **Goals**: Build sophisticated, secure, maintainable federated architectures using declarative patterns + +### Secondary Personas +- Security engineers implementing authorization policies +- Platform engineers designing federation patterns +- DevOps engineers understanding router behavior + +--- + +## Use Cases + +### Use Case 1: Declarative Authentication +**Scenario**: An API requires certain fields to be accessible only to authenticated users. +**How it works**: Fields or types are annotated with `@authenticated`. The router automatically validates authentication tokens before resolving these fields. Unauthenticated requests receive authorization errors without hitting subgraphs. +**Outcome**: Security enforced at the router level with zero subgraph code changes. + +### Use Case 2: Scope-Based Authorization +**Scenario**: Different API consumers have different permission levels based on JWT scopes. +**How it works**: Fields are annotated with `@requiresScopes(scopes: [["read:users"], ["admin"]])`. The router validates that the request's JWT contains the required scopes before allowing access. +**Outcome**: Fine-grained, declarative authorization based on token scopes. + +### Use Case 3: Cross-Subgraph Field Resolution +**Scenario**: Multiple subgraphs can resolve the same field (e.g., a User's name), and any should be able to serve requests. +**How it works**: The field is marked `@shareable` in all subgraphs that can resolve it. The query planner chooses the optimal subgraph based on the query plan. +**Outcome**: Flexible field resolution with automatic load distribution. 
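The three use cases above map directly onto SDL annotations. A sketch of a single subgraph type combining them — the type, field, and scope names are illustrative, not from any real schema:

```graphql
type User @key(fields: "id") {
  id: ID!
  name: String! @shareable          # resolvable by any subgraph that declares it
  email: String! @authenticated     # router rejects unauthenticated requests here
  # Outer list is OR-ed, inner lists are AND-ed:
  # accessible with scope read:billing, or with scope admin.
  invoices: [Invoice!] @requiresScopes(scopes: [["read:billing"], ["admin"]])
}
```

No resolver changes are needed for the authorization directives; enforcement happens in the router before any subgraph request is made.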
+ +### Use Case 4: Entity Interface Patterns +**Scenario**: An interface needs to be resolvable across subgraphs with consistent entity resolution. +**How it works**: The interface is declared with `@key(fields: "id")`. Subgraphs that contribute fields to implementing types use `@interfaceObject` to add fields to all implementations. +**Outcome**: Clean interface-based entity patterns with proper federation support. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Complete v1 and v2 directive support in a single router +2. Extended authorization directives (@authenticated, @requiresScopes) with router-level enforcement +3. Cosmo-specific extensions for subscription filtering +4. Clear documentation of directive behavior and normalization + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Router | Other Solutions | +|--------|-------|---------------|-----------------| +| v1 Directives | Full | Partial | Varies | +| v2 Directives | Full | Full | Partial | +| @authenticated | Yes | Yes | Custom | +| @requiresScopes | Yes | Yes | Custom | +| Custom extensions | Yes | Limited | Varies | + +--- + +## Technical Summary + +### How It Works +Directives are processed during schema composition and query planning. Authorization directives (@authenticated, @requiresScopes) are evaluated at the router level before subgraph requests. Field-level directives (@shareable, @provides, @requires) influence query planning decisions. The router normalizes directive declarations across subgraphs to produce a consistent federated schema. 
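The router-level evaluation of `@requiresScopes` follows OR-of-ANDs semantics: the outer array is OR-ed and each inner array is AND-ed. A small POSIX-shell sketch of that rule — an illustration of the semantics, not the router's actual implementation:

```shell
# Sketch: @requiresScopes(scopes: [["read:users"], ["admin"]]) means
# (read:users) OR (admin); an inner list ["read:users", "read:orders"]
# would mean read:users AND read:orders.
has_required_scopes() {
  token_scopes="$1"; shift           # space-separated scopes from the JWT
  for group in "$@"; do              # each remaining argument: one AND-group
    ok=1
    for scope in $group; do
      case " $token_scopes " in
        *" $scope "*) ;;             # scope present in the token
        *) ok=0; break ;;            # one missing scope fails the whole group
      esac
    done
    [ "$ok" -eq 1 ] && return 0      # any fully-satisfied group grants access (OR)
  done
  return 1
}

has_required_scopes "read:users write:posts" "read:users" "admin" && echo allowed
has_required_scopes "write:posts" "read:users" "admin" || echo denied
```

Running the sketch prints `allowed` for the first request (its token carries `read:users`) and `denied` for the second (neither group is satisfied).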
+ +### Key Technical Features + +**Core Federation Directives:** +- `@key`: Declares entity primary keys (supports composite keys and resolvable argument) +- `@external`: Marks fields owned by other subgraphs +- `@requires`: Declares field dependencies from other subgraphs +- `@provides`: Indicates conditionally available fields +- `@extends`: Marks type extensions (v1 compatibility) + +**Federation v2 Directives:** +- `@shareable`: Enables multi-subgraph field resolution +- `@inaccessible`: Hides fields from client schema +- `@override`: Migrates field ownership between subgraphs +- `@interfaceObject`: Contributes fields to interface implementers +- `@tag`: Attaches metadata for contracts and tooling + +**Authorization Directives:** +- `@authenticated`: Requires valid authentication +- `@requiresScopes`: Requires specific JWT scopes (AND/OR logic) + +**Cosmo Extensions:** +- `@openfed__subscriptionFilter`: Filters subscription events +- `@openfed__configureDescription`: Controls description propagation +- `@semanticNonNull`: Indicates semantic non-nullability + +### Integration Points +- Composition engine for directive validation +- Router for authorization enforcement +- Query planner for field resolution decisions +- Authentication providers for token validation + +### Requirements & Prerequisites +- Subgraph schemas with proper directive declarations +- For authorization: JWT provider configuration on router +- For subscriptions: Event-driven federation setup + +--- + +## Documentation References + +- Primary docs: `/docs/federation/federation-directives-index` +- @shareable: `/docs/federation/directives/shareable` +- @authenticated: `/docs/federation/directives/authenticated` +- @requiresScopes: `/docs/federation/directives/requiresscopes` +- Compatibility matrix: `/docs/federation/federation-compatibility-matrix` +- Authentication setup: `/docs/router/authentication-and-authorization` + +--- + +## Keywords & SEO + +### Primary Keywords +- Federation 
directives +- GraphQL authorization directives +- @shareable directive + +### Secondary Keywords +- @authenticated GraphQL +- @requiresScopes +- Federation v2 directives + +### Related Search Terms +- GraphQL federation @key directive +- Declarative GraphQL authorization +- Cross-subgraph field sharing + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/federation/graphql-federation.md b/capabilities/federation/graphql-federation.md new file mode 100644 index 00000000..1eb01961 --- /dev/null +++ b/capabilities/federation/graphql-federation.md @@ -0,0 +1,171 @@ +# GraphQL Federation v1 & v2 + +Full support for both Apollo Federation protocol versions with a mature, highly-optimized GraphQL engine. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-fed-001` | +| **Category** | Federation | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-fed-002`, `cap-fed-006`, `cap-fed-007` | + +--- + +## Quick Reference + +### Name +GraphQL Federation v1 & v2 + +### Tagline +Run any Federation version with full compatibility. + +### Elevator Pitch +WunderGraph Cosmo provides complete compatibility with both Apollo Federation v1 and v2 protocols, enabling teams to unify their GraphQL microservices into a single, cohesive API. Built on a mature, highly-optimized GraphQL engine implemented in Go, it delivers enterprise-grade performance while supporting all federation directives and features. + +--- + +## Problem & Solution + +### The Problem +Organizations adopting GraphQL Federation face a critical choice: which federation version to use, and whether their tooling will support both existing v1 implementations and new v2 features. Many teams have invested in v1 schemas and cannot immediately migrate, while new projects want access to v2's enhanced capabilities. 
Running incompatible federation versions creates fragmentation and blocks unified API strategies. + +### The Solution +Cosmo's Router supports both Federation v1 and v2 protocols out of the box. Teams can run mixed environments, gradually migrate from v1 to v2, or start fresh with v2 features. The router automatically handles directive compatibility and query planning across both versions, removing migration friction and enabling unified graph strategies. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Forced to choose between v1 or v2 exclusively | Support both versions simultaneously | +| Complex migration projects required | Gradual migration at your own pace | +| Limited directive support | Full directive compatibility matrix | +| Performance concerns with federation overhead | Highly-optimized Go-based query planner | + +--- + +## Key Benefits + +1. **Full Protocol Compatibility**: Support for all v1 directives (@extends, @external, @key, @provides, @requires, @tag) and v2 additions (@inaccessible, @override, @shareable, @authenticated, @requiresScopes) +2. **Zero Lock-in**: Apache 2.0 licensed router means no vendor dependency and full code transparency +3. **Enterprise Performance**: Go-based implementation provides superior performance for high-throughput federation scenarios +4. **Mixed Version Support**: Run v1 and v2 subgraphs together in the same federated graph +5. 
**Future-Ready**: Continuous updates to support new federation specifications as they emerge + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Architect +- **Pain Points**: Need to unify multiple GraphQL services; concerned about federation version compatibility; want flexibility to evolve without rewrites +- **Goals**: Build a scalable, unified GraphQL API layer that supports both current and future team needs + +### Secondary Personas +- Backend developers building federated subgraphs +- DevOps engineers managing GraphQL infrastructure +- Engineering managers evaluating federation platforms + +--- + +## Use Cases + +### Use Case 1: Gradual v1 to v2 Migration +**Scenario**: A company has 15 subgraphs built on Federation v1 and wants to adopt v2 features for new services without disrupting existing infrastructure. +**How it works**: Teams continue running v1 subgraphs unchanged while new services use v2 directives. The Cosmo Router composes both versions into a unified graph, handling compatibility automatically. +**Outcome**: New features available immediately; legacy services migrate at a comfortable pace with zero downtime. + +### Use Case 2: Greenfield Federation Deployment +**Scenario**: A startup is building a new microservices architecture and wants to use the latest Federation v2 features from day one. +**How it works**: Teams define subgraphs using v2 directives like @shareable for cross-service field resolution and @authenticated for security policies. The router composes and serves the federated graph with full v2 support. +**Outcome**: Modern federation architecture with advanced authorization, field sharing, and composition capabilities. + +### Use Case 3: Multi-Team Schema Ownership +**Scenario**: Different teams own different subgraphs and have varying levels of GraphQL expertise; some prefer simpler v1 patterns. +**How it works**: Each team uses the federation version that matches their expertise. 
The platform team manages the composed graph without forcing version standardization. +**Outcome**: Reduced friction between teams while maintaining a unified API for consumers. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Open-source Apache 2.0 licensed router with no feature restrictions +2. Go-based implementation for superior performance versus Node.js alternatives +3. Simultaneous v1/v2 support without migration requirements +4. Integrated with full Cosmo platform (Studio, CLI, observability) + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Router | Other Solutions | +|--------|-------|---------------|-----------------| +| License | Apache 2.0 | Elastic License | Varies | +| v1 Support | Full | Limited | Partial | +| v2 Support | Full | Full | Partial | +| Performance | High (Go) | High (Rust) | Medium | +| Self-hosted | Yes | Yes | Varies | + +--- + +## Technical Summary + +### How It Works +The Cosmo Router fetches the latest valid router configuration from the CDN and creates a highly-optimized query planner. This query planner is cached across requests for performance. The router periodically checks the CDN for updates and reconfigures its engine on the fly, ensuring zero-downtime schema updates. 
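The fetch-and-poll loop described above is driven by a small amount of router configuration. A minimal `config.yaml` sketch — the field names follow the router docs but should be treated as assumptions and checked against your router version:

```yaml
version: "1"
graph:
  # Token issued by the control plane; links this router to its federated graph
  token: "${GRAPH_API_TOKEN}"
listen_addr: "0.0.0.0:3002"   # default listen address
poll_interval: 10s            # how often to poll the CDN for a newer composed config
```

Everything else — query planning, caching, hot reconfiguration — happens automatically once the router can reach the CDN with a valid token.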
+ +### Key Technical Features +- Complete Federation v1 directive support: @extends, @external, @key (including composite keys), @provides, @requires, @tag +- Complete Federation v2 directive support: @inaccessible, @override, @shareable, @authenticated, @requiresScopes, @interfaceObject +- Interface entity support with @key on INTERFACE (v2.3) +- Resolvable key argument support (v2.0) +- Built on [graphql-go-tools](https://github.com/wundergraph/graphql-go-tools) - a mature, battle-tested GraphQL engine + +### Integration Points +- Control Plane for router registration and health monitoring +- CDN for router configuration distribution +- Studio for graph visualization and management +- Observability stack for tracing and metrics + +### Requirements & Prerequisites +- Cosmo Control Plane access (Cloud or self-hosted) +- Subgraphs implementing Federation v1 or v2 protocols +- Router deployment infrastructure (Docker, Kubernetes, etc.) + +--- + +## Documentation References + +- Primary docs: `/docs/router/intro` +- Compatibility matrix: `/docs/federation/federation-compatibility-matrix` +- Directives reference: `/docs/federation/federation-directives-index` +- Router configuration: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL Federation +- Apollo Federation alternative +- Federation v2 + +### Secondary Keywords +- Federated GraphQL +- GraphQL microservices +- Subgraph composition + +### Related Search Terms +- GraphQL federation router +- Federation v1 to v2 migration +- Open source federation gateway + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/federation/monograph-support.md b/capabilities/federation/monograph-support.md new file mode 100644 index 00000000..e41fe380 --- /dev/null +++ b/capabilities/federation/monograph-support.md @@ -0,0 +1,208 @@ +# Monograph Support + +Single-service 
GraphQL without federation complexity, with the option to migrate to federation when ready. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-fed-008` | +| **Category** | Federation | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-fed-001`, `cap-fed-007`, `cap-fed-005` | + +--- + +## Quick Reference + +### Name +Monograph Support + +### Tagline +Start simple, scale to federation when you are ready. + +### Elevator Pitch +Monographs provide a streamlined path to using Cosmo for single-service GraphQL APIs. Get all the benefits of the Cosmo platform (schema registry, checks, analytics, contracts) without federation complexity. When your architecture grows, seamlessly migrate to a federated graph without changing your infrastructure or losing your schema history. + +--- + +## Problem & Solution + +### The Problem +Not every GraphQL deployment needs federation from day one. Many teams start with a single GraphQL service and want professional tooling for schema management, analytics, and security. However, most federation-focused platforms require federation overhead even for simple deployments. Teams face a choice: use basic tooling now and migrate later, or adopt federation complexity before it is needed. + +### The Solution +Cosmo's Monograph support provides a first-class experience for single-service GraphQL. Create a monograph, publish your schema, and access all Cosmo platform features (Studio, schema checks, analytics, contracts) without configuring federation. When you are ready to scale to multiple services, a single CLI command migrates your monograph to a federated graph, preserving your schema history and configurations. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Limited tooling for non-federated GraphQL | Full platform features for monographs | +| Painful migration to federation later | Single-command migration when ready | +| No schema registry for simple APIs | Complete schema management | +| Basic analytics only | Full analytics and observability | + +--- + +## Key Benefits + +1. **Zero Federation Overhead**: Deploy single-service GraphQL without federation complexity +2. **Full Platform Access**: Schema registry, checks, analytics, and contracts available for monographs +3. **Seamless Migration**: Convert to federated graph with one CLI command when ready +4. **Contract Support**: Create filtered schema contracts even for single-service APIs +5. **Future-Proof**: Start simple with a clear path to scale + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / API Developer +- **Pain Points**: Wants professional GraphQL tooling but does not need federation yet; concerned about future migration complexity +- **Goals**: Get started quickly with good tooling and a clear path to scale + +### Secondary Personas +- Startup CTOs evaluating GraphQL platforms +- Teams modernizing from REST to GraphQL +- Small teams with single-service architectures + +--- + +## Use Cases + +### Use Case 1: Quick API Deployment +**Scenario**: A startup wants to deploy a GraphQL API with professional tooling but has only one service. +**How it works**: The team creates a monograph with `wgc monograph create production --routing-url http://router.example.com --graph-url http://api.example.com`. They publish their schema and start using Studio for schema management and analytics. +**Outcome**: Full Cosmo platform benefits without federation complexity. + +### Use Case 2: API with Multiple Audiences +**Scenario**: A single GraphQL service needs to expose different schemas to internal and external consumers. 
+**How it works**: The monograph schema uses @tag directives. Schema Contracts are created to filter the schema for different audiences. Each contract has its own router deployment. +**Outcome**: Multi-audience support without federation, using schema contracts. + +### Use Case 3: Migration to Federation +**Scenario**: A growing application needs to split into multiple services for scalability. +**How it works**: When ready, the team runs `wgc monograph migrate production`. The monograph becomes a federated graph with the original schema as its first subgraph. New subgraphs can now be added. +**Outcome**: Seamless transition from monograph to federation with preserved history. + +### Use Case 4: Schema Validation Workflow +**Scenario**: A team wants pre-deployment schema validation even for their single-service API. +**How it works**: The CI pipeline runs `wgc monograph check production --schema ./schema.graphql` before deployment. Breaking changes and lint issues are caught before reaching production. +**Outcome**: Safe schema evolution with automated validation. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Full platform features available for non-federated deployments +2. Single-command migration to federation +3. Schema Contracts work identically for monographs +4. No artificial limitations compared to federated graphs +5. Internal subgraph automatically managed + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Studio | DIY Solution | +|--------|-------|---------------|--------------| +| Non-federated support | Full | Limited | Full | +| Schema management | Yes | Partial | Manual | +| Migration to federation | Built-in | Manual | Major effort | +| Analytics | Yes | Limited | Custom | +| Schema contracts | Yes | No | Custom | + +--- + +## Technical Summary + +### How It Works +A monograph is essentially a federated graph with a single, automatically managed subgraph. 
When you create a monograph, Cosmo internally creates both the graph and its subgraph. You interact only with the monograph, and the internal subgraph is transparent. All publishing, checking, and management happens at the monograph level. + +The routing URL is where clients connect to the Cosmo Router, while the graph URL is your actual GraphQL server endpoint. The router proxies requests to your server, providing the full benefits of the Cosmo Router (security, analytics, caching) without federation overhead. + +### Key Technical Features + +**Create Operations:** +```bash +# Create a monograph +wgc monograph create production \ + --routing-url http://router.example.com/graphql \ + --graph-url https://api.example.com/graphql +``` + +**Publish Operations:** +```bash +# Publish schema to monograph +wgc monograph publish production --schema ./schema.graphql +``` + +**Check Operations:** +```bash +# Check schema before publishing +wgc monograph check production --schema ./schema.graphql +``` + +**Migration:** +```bash +# Migrate to federated graph +wgc monograph migrate production +``` + +**Additional Options:** +- `--subscription-url`: Separate URL for subscription requests +- `--subscription-protocol`: ws (default), sse, or sse_post +- `--admission-webhook-url`: Admission control webhook +- `--readme`: Documentation file for the monograph + +### Integration Points +- CLI (`wgc monograph *`) for all operations +- Studio for schema viewing and analytics +- Router for serving GraphQL requests +- Schema Contracts for filtered schemas + +### Requirements & Prerequisites +- GraphQL server accessible from router +- Router deployment infrastructure +- CLI authenticated with monograph permissions + +--- + +## Documentation References + +- Primary docs: `/docs/cli/monograph` +- Create command: `/docs/cli/monograph/create` +- Publish command: `/docs/cli/monograph/publish` +- Check command: `/docs/cli/monograph/check` +- Migrate command: `/docs/cli/monograph/migrate` +- Schema 
contracts: `/docs/concepts/schema-contracts` + +--- + +## Keywords & SEO + +### Primary Keywords +- Monograph GraphQL +- Single-service GraphQL +- GraphQL without federation + +### Secondary Keywords +- GraphQL migration +- Simple GraphQL deployment +- GraphQL starter + +### Related Search Terms +- Start with GraphQL +- GraphQL single service +- Migrate to federation + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/federation/schema-checks.md b/capabilities/federation/schema-checks.md new file mode 100644 index 00000000..86b09ed8 --- /dev/null +++ b/capabilities/federation/schema-checks.md @@ -0,0 +1,183 @@ +# Schema Checks + +Pre-deployment validation including composition errors, breaking changes, operation checks, and lint rules to ensure safe schema evolution. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-fed-003` | +| **Category** | Federation | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-fed-002`, `cap-fed-004`, `cap-fed-007` | + +--- + +## Quick Reference + +### Name +Schema Checks + +### Tagline +Validate schema changes before they reach production. + +### Elevator Pitch +Schema Checks provide comprehensive pre-deployment validation for GraphQL schema changes. Before any modification reaches production, Cosmo validates composition compatibility, detects breaking changes, analyzes real client traffic to assess impact, and enforces lint rules. This multi-layered approach catches issues in CI/CD pipelines, enabling confident, safe schema evolution. + +--- + +## Problem & Solution + +### The Problem +Schema changes in federated GraphQL can have far-reaching consequences. A field removal might break mobile apps still using older queries. A type change could cause composition failures across multiple subgraphs. 
Traditional testing approaches miss these issues because they lack visibility into real client usage patterns and cross-subgraph dependencies. + +### The Solution +Cosmo's Schema Checks run four types of validation before any schema change is published: + +1. **Composition Errors**: Validates the schema can compose with all other subgraphs +2. **Breaking Change Detection**: Identifies changes that could break existing client operations +3. **Operation Checks**: Analyzes real client traffic to determine if breaking changes affect active operations +4. **Lint Checks**: Enforces schema design standards and best practices + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Breaking changes discovered in production | Breaking changes caught in CI/CD | +| No visibility into client impact | Real traffic analysis shows affected operations | +| Manual schema review for quality | Automated lint enforcement | +| Composition failures after deployment | Composition validated before publish | + +--- + +## Key Benefits + +1. **Four-Layer Validation**: Comprehensive checks covering composition, breaking changes, traffic impact, and lint rules +2. **Traffic-Aware Analysis**: Operation Checks use real client traffic data to determine actual impact of breaking changes +3. **CI/CD Integration**: Run checks automatically in pull request workflows with GitHub integration +4. **Override Capability**: Force necessary breaking changes with proper documentation when needed +5. 
**Historical Tracking**: Complete history of all checks performed with pass/fail status and details + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / API Developer +- **Pain Points**: Fear of breaking production clients; uncertainty about schema change impact; manual review processes slowing development +- **Goals**: Ship schema changes confidently with automated safety nets + +### Secondary Personas +- Platform engineers building CI/CD pipelines +- QA engineers validating API changes +- Engineering managers tracking change risk + +--- + +## Use Cases + +### Use Case 1: Pull Request Validation +**Scenario**: A developer submits a PR that removes a deprecated field from a subgraph schema. +**How it works**: The CI pipeline runs `wgc subgraph check` with the proposed schema. Cosmo validates composition, detects the field removal as a breaking change, and checks 7 days of client traffic. If no active clients use the field, the check passes. If clients are affected, the PR shows exactly which operations would break. +**Outcome**: Developer knows the impact before merging; safe changes proceed automatically; risky changes require explicit review. + +### Use Case 2: Safe Breaking Changes with Override +**Scenario**: A team needs to remove a field that some clients still use, but the change is necessary for a major refactor. +**How it works**: The schema check fails due to active client usage. The team uses the GitHub integration to manually override the check after coordinating with affected clients. The override is documented in the check history. +**Outcome**: Necessary breaking changes can proceed with proper governance and audit trail. + +### Use Case 3: Schema Quality Enforcement +**Scenario**: An organization wants to enforce naming conventions and description requirements across all subgraphs. +**How it works**: Lint rules are configured in the namespace policies. 
Every schema check validates against these rules, flagging fields without descriptions, inconsistent naming, or other policy violations. +**Outcome**: Consistent, high-quality schema design across all teams and subgraphs. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Real traffic analysis with Operation Checks (not just static analysis) +2. Native GitHub integration for PR-based workflows +3. Configurable check timeframes and policies +4. Combined composition, breaking change, traffic, and lint checks in one flow + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Studio | DIY Solution | +|--------|-------|---------------|--------------| +| Composition validation | Yes | Yes | Partial | +| Breaking change detection | Yes | Yes | Manual | +| Traffic-based checks | Yes | Yes | No | +| Lint enforcement | Yes | Limited | Custom | +| GitHub integration | Native | Yes | Custom | +| Self-hosted | Yes | No | Yes | + +--- + +## Technical Summary + +### How It Works +The `wgc subgraph check` command sends the proposed schema to the control plane for validation. The check process: + +1. Attempts composition with all other subgraphs in matching federated graphs +2. Compares the resulting schema against the current production schema for breaking changes +3. If breaking changes exist, queries the analytics database for client operations using affected fields (default: 7-day window) +4. Validates the schema against configured lint rules +5. 
Returns pass/fail status with detailed results for each check type + +### Key Technical Features +- VCS context integration (author, commit, branch) for traceability +- Configurable traffic analysis windows via namespace policies +- Support for checking new subgraphs that don't exist yet +- Deletion impact analysis with `--delete` flag +- Warning suppression for known issues + +### Integration Points +- CLI (`wgc subgraph check`) for running checks +- GitHub integration for PR workflows +- Studio for viewing check history and details +- Analytics pipeline for operation traffic data + +### Requirements & Prerequisites +- Subgraph registered in namespace (or labels provided for new subgraphs) +- Router sending traffic to Cosmo Cloud for Operation Checks +- CLI authenticated with check permissions + +--- + +## Documentation References + +- Primary docs: `/docs/studio/schema-checks` +- CLI reference: `/docs/cli/subgraph/check` +- GitHub integration: `/docs/tutorial/pr-based-workflow-for-federation` +- Namespace policies: `/docs/studio/policies` + +--- + +## Keywords & SEO + +### Primary Keywords +- Schema checks +- GraphQL breaking changes +- Schema validation + +### Secondary Keywords +- Operation checks +- GraphQL CI/CD +- Schema lint + +### Related Search Terms +- GraphQL breaking change detection +- Federation schema validation +- GraphQL PR checks + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/federation/schema-composition.md b/capabilities/federation/schema-composition.md new file mode 100644 index 00000000..f21f6cd6 --- /dev/null +++ b/capabilities/federation/schema-composition.md @@ -0,0 +1,171 @@ +# Schema Composition + +Automatic composition of federated graphs from multiple subgraphs with comprehensive error detection and version tracking. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-fed-002` | +| **Category** | Federation | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-fed-001`, `cap-fed-003`, `cap-fed-004` | + +--- + +## Quick Reference + +### Name +Schema Composition + +### Tagline +Automatically compose subgraphs into a unified federated graph. + +### Elevator Pitch +WunderGraph Cosmo's Schema Composition automatically merges multiple subgraph schemas into a single, cohesive federated graph. Every composition is tracked with detailed version history, error reporting, and change detection, giving teams complete visibility into their graph's evolution and ensuring routers always serve valid, optimized configurations. + +--- + +## Problem & Solution + +### The Problem +In federated GraphQL architectures, combining multiple subgraph schemas into a working supergraph is complex and error-prone. Schema conflicts, missing fields, incompatible types, and broken references can all cause composition failures. Without proper tooling, teams discover these issues only after deployment, leading to production incidents and frustrated developers trying to trace issues across multiple services. + +### The Solution +Cosmo's composition engine automatically validates and merges subgraph schemas whenever changes are published. Every composition attempt is recorded with detailed inputs, outputs, and any errors encountered. Teams can track the complete history of their federated graph, understand what changed between versions, and know exactly which version is currently served by their routers. 
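+
+The publish-triggered flow can be sketched from the CLI; the subgraph name, namespace, and schema path below are illustrative, not prescribed:
+
+```bash
+# Publishing a subgraph triggers automatic composition with matching federated graphs
+wgc subgraph publish reviews --namespace production --schema ./reviews.graphql
+
+# On success, the new router configuration is pushed to the CDN;
+# on failure, routers keep serving the last valid composition.
+```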
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manual schema merging with potential errors | Automatic composition on every publish | +| No visibility into composition history | Complete audit trail of all compositions | +| Difficult to identify what changed | Visual diff between schema versions | +| Unclear which version routers are using | Clear indication of active router configuration | + +--- + +## Key Benefits + +1. **Automatic Composition**: Schemas are composed automatically when subgraphs are published, with no manual intervention required +2. **Comprehensive Error Detection**: Composition errors are caught and reported before they can affect production routers +3. **Version History**: Every composition is tracked, enabling rollback and audit capabilities +4. **Change Visibility**: Visual diffs show exactly what changed in the federated graph after each composition +5. **Router Synchronization**: Clear visibility into which composed schema version is active on routers + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / GraphQL Architect +- **Pain Points**: Managing schema evolution across multiple teams; ensuring composition validity before deployment; tracking changes for compliance +- **Goals**: Maintain a stable, well-documented federated graph that evolves safely over time + +### Secondary Personas +- Backend developers publishing subgraph changes +- DevOps engineers monitoring graph health +- Compliance officers requiring audit trails + +--- + +## Use Cases + +### Use Case 1: Continuous Schema Evolution +**Scenario**: A development team publishes multiple subgraph updates daily and needs confidence that each change composes correctly. +**How it works**: When a developer runs `wgc subgraph publish`, Cosmo automatically attempts composition with all other subgraphs. If successful, the new composed schema is made available to routers. 
If not, detailed errors are logged and the existing valid schema remains active. +**Outcome**: Continuous delivery of schema changes with automatic validation and zero-downtime updates. + +### Use Case 2: Debugging Composition Failures +**Scenario**: A new subgraph update fails to compose, and the team needs to understand why. +**How it works**: The Compositions page in Studio shows the failed composition attempt with detailed error messages indicating exactly which types, fields, or directives caused the conflict. Developers can see the input schemas and understand the root cause. +**Outcome**: Rapid identification and resolution of schema conflicts with clear, actionable error messages. + +### Use Case 3: Schema Version Audit +**Scenario**: A compliance review requires documentation of all schema changes over the past quarter. +**How it works**: The Compositions page provides a complete history of all composition attempts with timestamps, triggering users, and resulting schemas. Teams can export this information and compare any two versions. +**Outcome**: Full audit trail for compliance requirements with minimal manual effort. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Integrated with full lifecycle management (checks, registry, contracts) +2. Visual Studio interface for exploring composition history +3. Clear router synchronization status +4. Detailed error messages with actionable guidance + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Studio | DIY Solution | +|--------|-------|---------------|--------------| +| Automatic composition | Yes | Yes | Manual | +| Composition history | Full | Limited | None | +| Visual diffs | Yes | Yes | No | +| Router sync visibility | Yes | Yes | No | +| Self-hosted option | Yes | No | Yes | + +--- + +## Technical Summary + +### How It Works +When a subgraph schema is published via the CLI or API, the control plane triggers a composition process. 
This process validates the new schema against all other subgraphs in the federated graph, checking for type conflicts, missing references, and federation directive compliance. If composition succeeds, the new router configuration is pushed to the CDN. Routers periodically check for updates and hot-reload the new configuration. + +### Key Technical Features +- Real-time composition on subgraph publish +- Detailed composition error reporting with field-level precision +- Schema diff visualization showing additions, removals, and modifications +- Composition trigger tracking (who, when, what) +- Router configuration version tracking + +### Integration Points +- CLI (`wgc subgraph publish`) for triggering compositions +- Studio for viewing composition history and errors +- CDN for distributing composed schemas to routers +- Schema Checks for pre-publish validation + +### Requirements & Prerequisites +- At least one subgraph created in the namespace +- Federated graph configured with label matchers +- CLI authenticated with appropriate permissions + +--- + +## Documentation References + +- Primary docs: `/docs/studio/compositions` +- Publishing schemas: `/docs/cli/subgraph/publish` +- Schema checks: `/docs/studio/schema-checks` +- Federated graph setup: `/docs/cli/federated-graph/create` + +--- + +## Keywords & SEO + +### Primary Keywords +- Schema composition +- GraphQL federation composition +- Supergraph composition + +### Secondary Keywords +- Subgraph merging +- Schema validation +- Federation errors + +### Related Search Terms +- GraphQL schema composition errors +- Federated graph composition +- Subgraph composition validation + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/federation/schema-contracts.md b/capabilities/federation/schema-contracts.md new file mode 100644 index 00000000..a58958bd --- /dev/null +++ 
b/capabilities/federation/schema-contracts.md @@ -0,0 +1,187 @@ +# Schema Contracts + +Filter graph sections for different audiences using @tag directives to create tailored API experiences while maintaining a single source graph. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-fed-005` | +| **Category** | Federation | +| **Status** | GA | +| **Availability** | Pro / Enterprise | +| **Related Capabilities** | `cap-fed-001`, `cap-fed-002`, `cap-fed-006` | + +--- + +## Quick Reference + +### Name +Schema Contracts + +### Tagline +One graph, multiple tailored API experiences. + +### Elevator Pitch +Schema Contracts enable you to create filtered versions of your federated graph for different audiences. Using simple @tag directives, you can exclude sensitive fields from public APIs, create partner-specific views, or simplify schemas for specific use cases. Maintain one source of truth while serving multiple tailored API experiences through separate router deployments. + +--- + +## Problem & Solution + +### The Problem +As federated graphs grow, they serve increasingly diverse audiences: internal teams, external partners, public consumers, and different tenants. Each audience has different access needs and should only see relevant parts of the schema. Without proper tooling, teams either expose too much (security risk) or maintain multiple duplicate graphs (maintenance nightmare). + +### The Solution +Schema Contracts allow you to annotate your schema with @tag directives and then create filtered views (contracts) that include or exclude tagged elements. Each contract gets its own router deployment serving a tailored schema, while all contracts automatically stay in sync with the source graph. One graph to maintain, multiple APIs to serve. 
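+
+A minimal sketch of the annotation side (type, field, and tag names are illustrative):
+
+```graphql
+type User {
+  id: ID!
+  name: String!
+  # Excluded from the public contract via tag-based filtering
+  ssn: String @tag(name: "internal")
+}
+```
+
+A contract filtering that tag could then be created with `wgc contract create`, e.g. `wgc contract create my-graph-public --source my-graph --exclude internal`; this is a plausible invocation based on the `--exclude` option, so check the CLI reference for exact flags.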
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Duplicate graphs for different audiences | Single source graph with filtered contracts | +| Manual schema maintenance across versions | Automatic synchronization on source changes | +| Risk of exposing sensitive fields | Explicit tag-based filtering | +| Complex multi-graph deployment | Contract-based router deployment | + +--- + +## Key Benefits + +1. **Multi-Audience Support**: Serve different schemas to internal, partner, and public consumers from one source +2. **Security by Design**: Explicitly filter sensitive fields using @tag directives and exclude patterns +3. **Automatic Synchronization**: Contracts recompose automatically when the source graph changes +4. **Independent Routing**: Each contract has its own router with separate analytics and persisted operations +5. **Simplified Maintenance**: Update the source graph once; all contracts update accordingly + +--- + +## Target Audience + +### Primary Persona +- **Role**: API Product Manager / Platform Engineer +- **Pain Points**: Serving different API versions to different consumers; protecting internal fields from external access; maintaining multiple graph versions +- **Goals**: Deliver tailored API experiences efficiently while maintaining security and reducing maintenance burden + +### Secondary Personas +- Security engineers concerned about data exposure +- Partner integration teams needing customized APIs +- Enterprise architects managing multi-tenant platforms + +--- + +## Use Cases + +### Use Case 1: Public vs. Internal API +**Scenario**: A company wants to expose a public GraphQL API while keeping internal fields (like social security numbers) hidden from external consumers. +**How it works**: Internal fields are tagged with `@tag(name: "internal")` or `@tag(name: "sensitive")`. A public contract is created that excludes these tags. The public router serves only the filtered schema. 
+**Outcome**: External consumers see a clean, safe API while internal applications access the full graph. + +### Use Case 2: Partner-Specific APIs +**Scenario**: Different partners need access to different subsets of the API based on their integration agreements. +**How it works**: Fields are tagged with partner-specific identifiers like `@tag(name: "partner-a")`. Separate contracts are created for each partner, including only their relevant tags. +**Outcome**: Each partner gets a customized API experience without maintaining separate graphs. + +### Use Case 3: Legacy System Integration +**Scenario**: A company is modernizing its API but needs to maintain backward compatibility with legacy consumers during migration. +**How it works**: New fields are tagged with `@tag(name: "v2")`. A legacy contract excludes v2 tags, serving the original schema to legacy consumers. A modern contract includes all fields. +**Outcome**: Gradual migration without breaking legacy integrations. + +### Use Case 4: Multi-Tenant Isolation +**Scenario**: A SaaS platform serves multiple tenants who should only see their relevant schema portions. +**How it works**: Tenant-specific features are tagged appropriately. Contracts are created per tenant, filtering to their authorized feature set. +**Outcome**: Tenant data isolation at the schema level with customized feature visibility. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Simple @tag-based annotation (no complex configuration) +2. Automatic recomposition on source changes +3. Independent router deployments per contract +4. Full schema checks applied to contracts automatically +5. 
Works with both federated graphs and monographs + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Contracts | DIY Solution | +|--------|-------|------------------|--------------| +| Tag-based filtering | Yes | Yes | Custom | +| Automatic sync | Yes | Yes | Manual | +| Independent routers | Yes | Yes | Custom | +| Schema checks | Included | Separate | Manual | +| Self-hosted | Yes | No | Yes | + +--- + +## Technical Summary + +### How It Works +Schema Contracts use @tag directives to mark schema elements for filtering. When creating a contract via CLI, you specify which tags to exclude. The control plane generates two schemas: + +1. **Router Schema**: Used internally for query planning, includes @inaccessible fields needed for federation +2. **Client Schema**: Exposed via introspection, excludes all filtered elements + +Contracts recompose automatically when: +- The contract is created +- Any subgraph is created, updated, moved, or deleted +- The source graph moves to a new namespace +- Label matchers on the source graph change +- A monograph source publishes a new schema + +### Key Technical Features +- @tag directive on objects, interfaces, inputs, types, and fields +- Exclude patterns for filtering (e.g., `--exclude sensitive --exclude private`) +- Inherited labels from source graph (cannot be independently modified) +- Automatic namespace following of source graph +- Same graph type as source (federated or monograph) + +### Integration Points +- CLI (`wgc contract create`, `wgc contract update`) for management +- Studio for viewing contract schemas and compositions +- Router deployment for serving contract schemas +- Schema Checks automatically validate contracts + +### Requirements & Prerequisites +- Source graph (federated graph or monograph) with @tag annotations +- CLI authenticated with contract management permissions +- Router deployment infrastructure for each contract + +--- + +## Documentation References + +- Primary docs: 
`/docs/concepts/schema-contracts` +- Studio guide: `/docs/studio/schema-contracts` +- CLI reference: `/docs/cli/schema-contracts` +- Contract creation: `/docs/cli/schema-contracts/create` + +--- + +## Keywords & SEO + +### Primary Keywords +- Schema contracts +- GraphQL API versioning +- Multi-tenant GraphQL + +### Secondary Keywords +- Schema filtering +- API segmentation +- Tag-based schema + +### Related Search Terms +- GraphQL multiple audiences +- Filter GraphQL schema +- GraphQL public vs internal API + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/federation/schema-registry.md b/capabilities/federation/schema-registry.md new file mode 100644 index 00000000..e9aa5787 --- /dev/null +++ b/capabilities/federation/schema-registry.md @@ -0,0 +1,170 @@ +# Schema Registry + +Centralized schema management with version history, comparison tools, and easy access to both federated and subgraph schemas. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-fed-004` | +| **Category** | Federation | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-fed-002`, `cap-fed-003`, `cap-fed-007` | + +--- + +## Quick Reference + +### Name +Schema Registry + +### Tagline +Your single source of truth for all GraphQL schemas. + +### Elevator Pitch +The Schema Registry provides a centralized, version-controlled repository for all your GraphQL schemas. View the current state of your federated graph, explore individual subgraph schemas, track changes over time, and export schemas for development tools. It is the authoritative source for understanding your graph's structure and evolution. + +--- + +## Problem & Solution + +### The Problem +In distributed GraphQL architectures, schemas are scattered across multiple repositories and services. 
Developers struggle to find the current production schema, understand how types are defined across subgraphs, or get an authoritative view of the federated graph. Without a central registry, teams work with outdated schemas, miss type definitions, and lack visibility into the complete API surface. + +### The Solution +Cosmo's Schema Registry maintains the authoritative version of all schemas in one place. The composed federated graph schema and each individual subgraph schema are accessible through Studio, with copy and download capabilities for integration with development tools. Last-updated timestamps ensure developers always know when schemas changed. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Schemas scattered across repositories | Single source of truth in Studio | +| Unclear which schema version is active | Current production schema always visible | +| Manual schema file management | One-click copy or download | +| No visibility into subgraph schemas | View any subgraph schema instantly | + +--- + +## Key Benefits + +1. **Centralized Access**: View the complete federated graph schema and all subgraph schemas in one place +2. **Always Current**: Registry reflects the latest successfully composed schema +3. **Export Capabilities**: Copy to clipboard or download as `.graphql` files for tooling integration +4. **Subgraph Visibility**: Dropdown selection to view any subgraph's individual schema +5. 
**Change Tracking**: Last-updated timestamps show when schemas were modified + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / Frontend Developer +- **Pain Points**: Finding the current production schema; understanding type definitions across services; keeping local development in sync with production +- **Goals**: Quickly access accurate schema information for development and debugging + +### Secondary Personas +- API consumers needing schema documentation +- Technical writers documenting APIs +- Solution architects reviewing API design + +--- + +## Use Cases + +### Use Case 1: Development Environment Setup +**Scenario**: A new developer needs to set up their local environment with the current production schema for code generation. +**How it works**: The developer opens Studio, navigates to the Schema Registry, and downloads the federated graph schema as a `.graphql` file. They import this into their code generation tooling. +**Outcome**: Local development environment matches production exactly with minimal setup time. + +### Use Case 2: Debugging Type Definitions +**Scenario**: A frontend developer needs to understand how a particular type is defined and which subgraph owns it. +**How it works**: The developer views the federated schema to see the complete type definition. They then use the subgraph dropdown to check each relevant subgraph's contribution to that type. +**Outcome**: Clear understanding of type ownership and field definitions for debugging. + +### Use Case 3: API Documentation +**Scenario**: A technical writer needs to document the current API for external consumers. +**How it works**: The writer accesses the Schema Registry, copies the complete federated schema, and uses it as the source for API documentation tools like GraphQL Voyager or SpectaQL. +**Outcome**: Accurate, up-to-date API documentation generated from the authoritative schema source. 
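+
+The export step in these workflows can be sketched from the CLI (graph and namespace names are illustrative, and this assumes the `federated-graph fetch` command prints the SDL to stdout):
+
+```bash
+# Save the current federated schema for codegen or documentation tooling
+wgc federated-graph fetch my-graph --namespace production > schema.graphql
+```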
+ +--- + +## Competitive Positioning + +### Key Differentiators +1. Integrated with composition, checks, and contracts for full lifecycle management +2. Both federated and individual subgraph schema visibility +3. Simple copy/download workflow for tooling integration +4. Clear last-updated timestamps for version awareness + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Studio | DIY Solution | +|--------|-------|---------------|--------------| +| Federated schema view | Yes | Yes | Manual | +| Subgraph schema view | Yes | Yes | Manual | +| Download capability | Yes | Yes | Manual | +| Version timestamps | Yes | Yes | Custom | +| Self-hosted option | Yes | No | Yes | + +--- + +## Technical Summary + +### How It Works +The Schema Registry displays the most recent successfully composed schema for your federated graph. When a composition succeeds (triggered by subgraph publishing), the new schema becomes the active version in the registry. The registry maintains both the Router Schema (used for query planning, includes @inaccessible fields) and the Client Schema (exposed via introspection). 
+ +### Key Technical Features +- SDL (Schema Definition Language) view of complete schemas +- Subgraph selector for viewing individual service schemas +- One-click copy to clipboard +- Download as `.graphql` file +- Last-updated timestamp display + +### Integration Points +- Composition engine for schema updates +- Studio UI for visualization and access +- CDN for router configuration (uses registry schemas) +- Code generation tools (via downloaded schemas) + +### Requirements & Prerequisites +- Federated graph with at least one successful composition +- Studio access for viewing schemas +- Appropriate namespace permissions + +--- + +## Documentation References + +- Primary docs: `/docs/studio/schema-registry` +- Composition overview: `/docs/studio/compositions` +- Schema explorer: `/docs/studio/schema-explorer` + +--- + +## Keywords & SEO + +### Primary Keywords +- Schema registry +- GraphQL schema management +- Schema versioning + +### Secondary Keywords +- Schema repository +- GraphQL schema storage +- Federated schema + +### Related Search Terms +- GraphQL schema version control +- Central schema repository +- Schema documentation + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/federation/subgraph-management.md b/capabilities/federation/subgraph-management.md new file mode 100644 index 00000000..1e3ea714 --- /dev/null +++ b/capabilities/federation/subgraph-management.md @@ -0,0 +1,223 @@ +# Subgraph Management + +Create, publish, update, and delete subgraphs with full lifecycle management through CLI and Studio interfaces. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-fed-007` | +| **Category** | Federation | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-fed-001`, `cap-fed-002`, `cap-fed-003` | + +--- + +## Quick Reference + +### Name +Subgraph Management + +### Tagline +Complete lifecycle control for your federated services. + +### Elevator Pitch +Subgraph Management provides comprehensive tools for managing the building blocks of your federated graph. From initial creation through publishing, updating, and eventual deprecation, Cosmo's CLI and Studio give teams complete control over subgraph lifecycles. Labels enable flexible composition rules, while namespace isolation supports multi-environment workflows. + +--- + +## Problem & Solution + +### The Problem +Managing subgraphs in a federated architecture involves multiple operations: registering new services, publishing schema changes, updating routing URLs, managing labels for composition, and safely retiring old services. Without unified tooling, teams juggle multiple systems, risk configuration drift, and lack visibility into subgraph states across environments. + +### The Solution +Cosmo provides a complete subgraph lifecycle management solution: +- **Create**: Register new subgraphs with labels and routing URLs +- **Publish**: Push schema changes with automatic composition +- **Check**: Validate changes before publishing +- **Update**: Modify metadata like URLs and labels +- **Delete**: Safely remove subgraphs with impact analysis + +All operations are available via CLI for automation and Studio for visual management. 
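+
+In CI, these operations compose into a simple gate; the subgraph name and schema path below are illustrative:
+
+```bash
+# Validate the proposed schema first; publish only if the check passes
+wgc subgraph check products --schema ./schema.graphql \
+  && wgc subgraph publish products --schema ./schema.graphql
+```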
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manual subgraph registration | CLI/API-driven creation | +| Ad-hoc schema file distribution | Centralized publish workflow | +| Unclear subgraph-to-graph relationships | Label-based composition rules | +| Risky subgraph removal | Pre-deletion impact analysis | + +--- + +## Key Benefits + +1. **Full Lifecycle Control**: Create, publish, update, delete operations for complete management +2. **Label-Based Composition**: Flexible labeling system matches subgraphs to federated graphs +3. **Namespace Isolation**: Separate environments (dev, staging, prod) with namespace support +4. **Pre-Change Validation**: Check command validates impact before publishing +5. **Automation-Friendly**: CLI and API access for CI/CD integration + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / Platform Engineer +- **Pain Points**: Managing multiple subgraphs across environments; ensuring safe schema deployments; tracking subgraph configurations +- **Goals**: Efficiently manage subgraph lifecycle with confidence in deployment safety + +### Secondary Personas +- DevOps engineers automating deployments +- Team leads tracking subgraph ownership +- Release managers coordinating schema changes + +--- + +## Use Cases + +### Use Case 1: New Subgraph Registration +**Scenario**: A team is launching a new microservice that needs to join the federated graph. +**How it works**: The team runs `wgc subgraph create products --label team=backend --routing-url http://products:4001/graphql`. This registers the subgraph with its routing URL and labels. They then publish the initial schema with `wgc subgraph publish products --schema ./schema.graphql`. +**Outcome**: New service is registered and composed into the federated graph automatically. + +### Use Case 2: Safe Schema Update +**Scenario**: A developer needs to add new fields and deprecate old ones in an existing subgraph. 
+**How it works**: The developer first runs `wgc subgraph check products --schema ./new-schema.graphql` to validate the changes won't break existing clients. If checks pass, they run `wgc subgraph publish products --schema ./new-schema.graphql`. +**Outcome**: Schema changes deployed safely with pre-validation preventing breaking changes. + +### Use Case 3: Environment Promotion +**Scenario**: A schema change tested in staging needs to be promoted to production. +**How it works**: The same schema file is published to the production namespace: `wgc subgraph publish products --namespace production --schema ./schema.graphql`. Labels match production federated graphs for automatic composition. +**Outcome**: Consistent schema promotion across environments with namespace isolation. + +### Use Case 4: Subgraph Retirement +**Scenario**: A legacy subgraph is being deprecated and needs to be removed from the federated graph. +**How it works**: The team first runs `wgc subgraph check products --delete` to see the impact on all connected federated graphs. After confirming no critical dependencies, they run `wgc subgraph delete products`. +**Outcome**: Safe removal with pre-deletion impact analysis preventing accidental breakage. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Combined create-and-publish in single command for streamlined workflows +2. Label-based composition for flexible graph membership +3. Pre-deletion impact analysis with `--delete` check +4. Event-Driven Graph support with `--edg` flag +5. 
Subscription protocol configuration per subgraph + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Studio | DIY Solution | +|--------|-------|---------------|--------------| +| CLI management | Full | Full | Custom | +| Label-based composition | Yes | No (variants) | Custom | +| Pre-delete checks | Yes | Limited | Manual | +| Subscription config | Per-subgraph | Global | Custom | +| Self-hosted | Yes | No | Yes | + +--- + +## Technical Summary + +### How It Works +Subgraphs are registered in the control plane with unique names within their namespace. Each subgraph has: +- **Routing URL**: Where the router sends requests +- **Labels**: Key-value pairs for federated graph matching +- **Subscription settings**: Protocol and URL configuration +- **Schema**: The GraphQL SDL published via CLI + +When a subgraph is published, the control plane triggers composition with all federated graphs whose label matchers include the subgraph's labels. + +### Key Technical Features + +**Create Operations:** +```bash +# Regular subgraph +wgc subgraph create products --label team=A --routing-url http://localhost:4001/graphql + +# Event-Driven Graph +wgc subgraph create events --label team=A --edg +``` + +**Publish Operations:** +```bash +# Publish to existing subgraph +wgc subgraph publish products --schema ./schema.graphql + +# Create and publish in one step +wgc subgraph publish products --schema ./schema.graphql --routing-url http://localhost:4001/graphql --label team=A +``` + +**Update Operations:** +```bash +# Update routing URL +wgc subgraph update products -r http://new-domain.com/graphql + +# Update labels +wgc subgraph update products --label team=B department=eng +``` + +**Delete Operations:** +```bash +# Check deletion impact +wgc subgraph check products --delete + +# Delete subgraph +wgc subgraph delete products +``` + +### Integration Points +- CLI (`wgc subgraph *`) for all operations +- Studio for visual management and schema viewing +- Composition engine 
for automatic graph updates +- Schema checks for pre-publish validation + +### Requirements & Prerequisites +- Namespace access with appropriate permissions +- Routing URL accessible from router (for non-EDG subgraphs) +- Schema file in GraphQL SDL format + +--- + +## Documentation References + +- Primary docs: `/docs/cli/subgraph` +- Create command: `/docs/cli/subgraph/create` +- Publish command: `/docs/cli/subgraph/publish` +- Update command: `/docs/cli/subgraph/update` +- Delete command: `/docs/cli/subgraph/delete` +- Check command: `/docs/cli/subgraph/check` + +--- + +## Keywords & SEO + +### Primary Keywords +- Subgraph management +- GraphQL subgraph +- Federation subgraph + +### Secondary Keywords +- Subgraph lifecycle +- GraphQL microservices +- Federated service management + +### Related Search Terms +- Create GraphQL subgraph +- Publish subgraph schema +- Manage federation subgraphs + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/grpc/cosmo-connect.md b/capabilities/grpc/cosmo-connect.md new file mode 100644 index 00000000..5aa07ace --- /dev/null +++ b/capabilities/grpc/cosmo-connect.md @@ -0,0 +1,209 @@ +# Cosmo Connect + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-grpc-001` | +| **Category** | gRPC | +| **Status** | Beta | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-grpc-002`, `cap-grpc-003` | + +--- + +## Quick Reference + +### Name +Cosmo Connect + +### Tagline +Federate without boundaries: integrate any backend into your supergraph. + +### Elevator Pitch +Cosmo Connect enables GraphQL Federation without requiring backend teams to run GraphQL servers or frameworks. 
By compiling GraphQL into gRPC, it moves the complexity of the query language into the Router, allowing teams to implement familiar gRPC contracts in any supported language while gaining all the benefits of federation. + +--- + +## Problem & Solution + +### The Problem +Organizations wanting to adopt GraphQL Federation face a significant barrier: backend teams must learn GraphQL and migrate their existing REST, gRPC, SOAP, or legacy services to a Federation-compatible framework. This requirement creates friction, slows adoption, and limits the languages and frameworks teams can use. Poor GraphQL server library support in certain ecosystems further compounds the challenge. + +### The Solution +Cosmo Connect eliminates this barrier by allowing teams to define an Apollo-compatible Subgraph Schema, compile it into a protobuf definition, and implement it using any gRPC stack (Go, Java, C#, Python, Rust, and many others). No GraphQL knowledge or specific framework is required. The Cosmo Router handles all GraphQL query planning, batching, and response aggregation, while backend teams work with familiar request/response semantics. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Backend teams must learn GraphQL to participate in federation | Teams implement familiar gRPC contracts without GraphQL knowledge | +| Limited language options due to poor GraphQL library support | Any language with gRPC support can participate in the supergraph | +| Migrating legacy systems requires full GraphQL subgraph rewrites | Wrap existing APIs (REST, SOAP) without building full subgraphs | +| Each subgraph framework has varying spec compliance and quality | Strongly-typed proto definitions guarantee correct implementations | + +--- + +## Key Benefits + +1. **Federation Without GraphQL Servers**: Backend teams implement gRPC contracts instead of GraphQL resolvers, eliminating the need for GraphQL expertise across all teams. + +2. 
**Language Flexibility**: Leverage gRPC code generation across nearly all ecosystems, including those with limited GraphQL server library support. + +3. **Reduced Migration Effort**: Wrap existing APIs (REST, SOAP, legacy systems) without writing full subgraphs, lowering the cost of moving from monoliths to federation. + +4. **All Cosmo Platform Benefits**: Breaking change detection, centralized telemetry, governance, and observability work out of the box. + +5. **No N+1 Problems**: Unlike declarative approaches, Cosmo Connect leverages the Router's DataLoader capabilities which batch requests by default. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Architect +- **Pain Points**: Difficulty getting all backend teams to adopt GraphQL; existing investments in REST/gRPC services; heterogeneous tech stacks across teams +- **Goals**: Unify APIs under a single GraphQL gateway without forcing organizational-wide technology changes; accelerate federation adoption + +### Secondary Personas +- Backend developers who prefer working with gRPC over GraphQL +- Engineering managers looking to reduce migration costs and timelines +- Teams maintaining legacy systems that need integration with modern APIs + +--- + +## Use Cases + +### Use Case 1: Integrating Legacy REST APIs +**Scenario**: A financial services company has dozens of critical REST APIs that cannot be rewritten but need to be exposed through a unified GraphQL API. +**How it works**: Define a GraphQL subgraph schema representing the REST API's capabilities, generate protobuf definitions, implement a thin gRPC adapter that calls the REST endpoints, and connect it to the federated graph. +**Outcome**: Legacy APIs become first-class citizens in the GraphQL federation without rewriting any existing business logic. 
+ +### Use Case 2: Multi-Language Microservices +**Scenario**: An e-commerce platform has services written in Java, Python, and Go, and each team wants to use their preferred language. +**How it works**: Each team defines their subgraph schema, generates protobuf definitions, and implements the gRPC service in their language of choice. The Router handles all cross-service coordination. +**Outcome**: Teams maintain autonomy over their technology choices while contributing to a unified API experience. + +### Use Case 3: AI-Assisted Adapter Generation +**Scenario**: A development team needs to quickly expose multiple internal APIs through GraphQL during a hackathon. +**How it works**: Define subgraph schemas, provide OpenAPI documents or curl commands to an AI coding assistant (Cursor, Copilot, Windsurf), and let the LLM generate adapter code against the strongly-typed proto definitions. +**Outcome**: Multiple API integrations completed in hours instead of weeks, with type-safe guarantees. + +--- + +## Competitive Positioning + +### Key Differentiators +1. **No GraphQL Runtime Required**: Unlike Apollo Federation which requires GraphQL servers for each subgraph, Cosmo Connect uses gRPC natively. +2. **LLM-Friendly Architecture**: Strongly-typed proto definitions enable AI coding assistants to generate adapter code reliably. +3. **Built-in Batching**: DataLoader integration eliminates N+1 problems that plague declarative connector approaches. 
+ +### Comparison with Alternatives + +| Aspect | Cosmo Connect | Apollo Connectors | Traditional Subgraphs | +|--------|---------------|-------------------|----------------------| +| Backend Knowledge Required | gRPC only | REST/HTTP mapping | GraphQL + Federation | +| Language Support | Any gRPC language | N/A (declarative) | Limited by library quality | +| Performance | Native gRPC batching | N+1 prone | Framework-dependent | +| Migration Effort | Low (wrap existing) | Medium | High (rewrite) | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Our team already knows GraphQL" | Cosmo Connect is fully compatible with existing Apollo Federation subgraphs - use both approaches together. | +| "gRPC adds complexity" | gRPC is simpler than GraphQL for most backend teams, and code generation handles the complexity. | +| "We need real-time subscriptions" | Subscriptions support is on the roadmap; use traditional subgraphs for subscription-heavy services today. | + +--- + +## Technical Summary + +### How It Works +Cosmo Connect works by compiling GraphQL schemas into Protocol Buffer definitions. Developers define a standard Apollo-compatible subgraph schema, then use the Cosmo CLI to generate protobuf files and mapping configurations. Backend teams implement the generated gRPC service interfaces in their preferred language. At runtime, the Cosmo Router translates GraphQL operations into gRPC calls, batches requests via DataLoader, and assembles responses. 
+ +### Key Technical Features +- Schema-first GraphQL to protobuf compilation +- Automatic code generation for multiple languages +- DataLoader integration for request batching +- Hot-reload support for plugins +- Full Apollo Federation compatibility +- Support for entities, keys, and cross-service field resolution + +### Integration Points +- Cosmo Router (required) +- Cosmo CLI (wgc) for code generation +- Any gRPC-compatible language runtime +- Existing REST, SOAP, or database backends + +### Requirements & Prerequisites +- Cosmo Router deployment +- Cosmo CLI installed +- gRPC runtime for your chosen language +- Basic understanding of Protocol Buffers + +--- + +## Proof Points + +### Metrics & Benchmarks +- Eliminates GraphQL framework overhead with native gRPC communication +- DataLoader batching prevents N+1 query patterns +- Hot-reload capability enables zero-downtime plugin updates + +### Case Studies +- See the [Cosmo Plugin Demo](https://github.com/wundergraph/cosmo-plugin-demo) for a complete working example + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/connect/overview` +- Router Plugins: `/docs/router/gRPC/plugins` +- gRPC Services: `/docs/router/gRPC/grpc-services` +- gRPC Concepts: `/docs/router/gRPC/concepts` +- Tutorial (Plugins): `/docs/tutorial/using-grpc-plugins` +- Tutorial (Services): `/docs/tutorial/grpc-service-quickstart` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL to gRPC +- GraphQL Federation without GraphQL +- gRPC GraphQL integration + +### Secondary Keywords +- Protocol Buffer GraphQL +- Federation API gateway +- Subgraph alternatives + +### Related Search Terms +- How to add REST API to GraphQL Federation +- GraphQL Federation without 
rewriting services +- gRPC microservices GraphQL +- Apollo Federation alternatives + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/grpc/grpc-services.md b/capabilities/grpc/grpc-services.md new file mode 100644 index 00000000..8760216d --- /dev/null +++ b/capabilities/grpc/grpc-services.md @@ -0,0 +1,213 @@ +# gRPC Services + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-grpc-003` | +| **Category** | gRPC | +| **Status** | Beta | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-grpc-001`, `cap-grpc-002` | + +--- + +## Quick Reference + +### Name +gRPC Services + +### Tagline +Independent gRPC services for distributed GraphQL Federation. + +### Elevator Pitch +gRPC Services enable you to deploy independent microservices that integrate into your GraphQL Federation through gRPC protocol. Define a GraphQL schema, generate Protocol Buffer definitions, and implement in any gRPC-supported language. Services scale independently, deploy anywhere in your infrastructure, and maintain full team autonomy while participating in the unified supergraph. + +--- + +## Problem & Solution + +### The Problem +Organizations with distributed teams and microservices architectures need their services to scale independently, deploy across different environments, and use different languages based on team expertise. Traditional GraphQL Federation approaches either couple all services together or require every team to adopt GraphQL frameworks, limiting flexibility and creating organizational friction. + +### The Solution +gRPC Services are standalone microservices that communicate with the Cosmo Router over the network using standard gRPC protocol. Each service can be deployed, scaled, and managed independently in any infrastructure. 
Teams implement services in any gRPC-supported language (Python, Java, Go, C#, Node.js, Rust, etc.) while the Router handles all GraphQL translation and coordination. Field resolvers enable custom resolution logic with automatic batching to prevent N+1 problems. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| All services must use GraphQL frameworks | Services implement gRPC contracts in any language | +| Services coupled to gateway deployment | Independent deployment and release cycles | +| Single scaling model for all services | Each service scales based on its own requirements | +| Centralized team must manage integrations | Teams own their services end-to-end | + +--- + +## Key Benefits + +1. **Language Flexibility**: Implement services in any language that supports gRPC - Python, Java, C#, Node.js, Rust, Go, and many others. Choose the best language for each service's requirements. + +2. **Independent Scaling**: Scale each service independently based on its specific load patterns and resource requirements without affecting the router or other services. + +3. **Team Autonomy**: Different teams can own and operate their services independently using their preferred languages, frameworks, and deployment strategies with separate release cycles. + +4. **Distributed Architecture**: Deploy services across different environments, datacenters, or cloud regions. Services can live anywhere in your infrastructure. + +5. **Field Resolvers with Batching**: Implement custom field resolution logic with automatic DataLoader batching. Computed fields, complex transformations, and external data integration without N+1 problems. 
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Architect / Engineering Manager +- **Pain Points**: Need to integrate services from multiple teams with different tech stacks; services have different scaling requirements; teams want ownership over their deployments +- **Goals**: Create a unified API experience while maintaining microservices independence and team autonomy + +### Secondary Personas +- Backend developers in multi-language organizations +- DevOps engineers managing distributed service deployments +- Teams migrating from monoliths to microservices + +--- + +## Use Cases + +### Use Case 1: Multi-Language Microservices Architecture +**Scenario**: A large organization has services written by different teams in Python (data science), Java (enterprise), Go (infrastructure), and Node.js (frontend BFF). +**How it works**: Each team defines their GraphQL subgraph schema, generates protobuf definitions using `wgc grpc-service generate`, implements the gRPC service in their preferred language, and deploys independently. The Router discovers and connects to all services over the network. +**Outcome**: Unified GraphQL API across all services while teams maintain complete technology independence and deployment autonomy. + +### Use Case 2: Independent Service Scaling +**Scenario**: An e-commerce platform's product catalog service handles 100x more traffic than the order management service during sales events. +**How it works**: Both services are deployed as independent gRPC services with their own auto-scaling policies. During high-traffic events, the catalog service scales horizontally while the order service maintains baseline capacity. +**Outcome**: Optimal resource utilization with services scaling independently based on actual demand, reducing infrastructure costs. + +### Use Case 3: Custom Field Resolution +**Scenario**: A service needs to provide computed fields that aggregate data from multiple sources based on field arguments. 
+**How it works**: Define field resolvers using the `@connect__fieldResolver` directive with context parameters. The generated protobuf includes RPC methods for each field resolver. Implement the resolution logic with access to both field arguments and parent context. The router automatically batches requests across entities. +**Outcome**: Complex computed fields (popularity scores, aggregations, transformations) with optimal performance and no N+1 problems. + +--- + +## Competitive Positioning + +### Key Differentiators +1. **True Language Agnosticism**: Unlike plugins (Go/TypeScript only), gRPC Services support any language with gRPC support. +2. **Field Resolvers with Batching**: Custom field resolution with automatic request batching, unlike declarative approaches prone to N+1. +3. **Schema-First Contract**: GraphQL schema serves as the source of truth with automatically generated, strongly-typed protobuf contracts. + +### Comparison with Alternatives + +| Aspect | gRPC Services | Router Plugins | Traditional Subgraphs | +|--------|---------------|----------------|----------------------| +| Language Support | Any gRPC language | Go, TypeScript (Bun) | Limited by framework | +| Deployment | Distributed microservices | Co-located with router | Distributed | +| Scaling | Independent per service | Coupled to router | Independent | +| Team Autonomy | High | Low | Medium | +| Latency | Network overhead | Minimal (IPC) | Network + GraphQL | +| Field Resolvers | Full support with batching | Full support | Framework-dependent | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "gRPC adds network latency" | For performance-critical paths, use Router Plugins. gRPC Services are optimized for distributed architectures where independence matters more than minimal latency. | +| "We already have GraphQL subgraphs" | gRPC Services are 100% compatible with existing Apollo Federation subgraphs - run both in the same federated graph. 
| +| "Managing distributed services is complex" | gRPC Services follow standard microservices patterns - use your existing Kubernetes, service mesh, and observability tools. | + +--- + +## Technical Summary + +### How It Works +gRPC Services are independent deployments that expose gRPC endpoints implementing the generated protobuf service definitions. The Cosmo Router discovers services, establishes network connections, and translates GraphQL operations into gRPC calls. Services handle their own scaling, monitoring, and lifecycle management. Field resolvers execute as dedicated RPC methods with automatic request batching via DataLoader. + +### Key Technical Features +- **Protocol Buffer Generation**: Automatic generation from GraphQL schemas using `wgc grpc-service generate` +- **Field Resolvers**: Custom resolution logic via `@connect__fieldResolver` directive with context and argument support +- **Automatic Batching**: DataLoader integration batches field resolver requests across entities +- **Entity Lookups**: Support for single keys, multiple keys, and compound keys in federation +- **Type Support**: Full support for scalars, enums, interfaces, unions, recursive types, nested objects, and complex lists +- **Schema Linting**: Validation against gRPC compatibility requirements with clear error/warning reporting + +### Integration Points +- Cosmo Router (network communication) +- Cosmo CLI (wgc) for code generation +- Any gRPC-compatible language runtime +- Standard service discovery mechanisms +- Kubernetes, Docker, or any deployment platform + +### Requirements & Prerequisites +- Cosmo Router deployment +- Cosmo CLI (wgc) installed +- gRPC runtime for your chosen language +- Network connectivity between router and services +- Standard service deployment infrastructure + +--- + +## Proof Points + +### Metrics & Benchmarks +- Field resolver batching eliminates N+1 query patterns +- Independent scaling enables optimal resource utilization +- Protocol Buffer binary 
protocol provides efficient serialization + +### Case Studies +- See the [gRPC Service Quickstart Tutorial](/tutorial/grpc-service-quickstart) for implementation examples + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/gRPC/grpc-services` +- Cosmo Connect overview: `/docs/connect/grpc-services` +- gRPC Concepts: `/docs/router/gRPC/concepts` +- Field Resolvers: `/docs/router/gRPC/field-resolvers` +- GraphQL Support: `/docs/router/gRPC/graphql-support` +- Tutorial: `/docs/tutorial/grpc-service-quickstart` + +--- + +## Keywords & SEO + +### Primary Keywords +- gRPC GraphQL Federation +- GraphQL microservices +- Protocol Buffer GraphQL + +### Secondary Keywords +- Distributed GraphQL services +- GraphQL field resolvers +- Multi-language GraphQL Federation + +### Related Search Terms +- How to integrate gRPC with GraphQL +- GraphQL Federation microservices architecture +- Independent GraphQL subgraph scaling +- Custom GraphQL field resolution + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/grpc/router-plugins.md b/capabilities/grpc/router-plugins.md new file mode 100644 index 00000000..a3a7a1f8 --- /dev/null +++ b/capabilities/grpc/router-plugins.md @@ -0,0 +1,215 @@ +# Router Plugins + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-grpc-002` | +| **Category** | gRPC | +| **Status** | Beta | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-grpc-001`, `cap-grpc-003` | + +--- + +## Quick Reference + +### Name +Router Plugins + +### Tagline +Co-located gRPC extensions for high-performance GraphQL
Federation. + +### Elevator Pitch +Router Plugins are local processes managed by the Cosmo Router that extend your federated graph with custom functionality. Built on HashiCorp's battle-tested go-plugin framework, they provide the simplest deployment model with the highest performance, complete with Go and TypeScript SDKs that include HTTP client utilities, distributed tracing, and structured logging. + +--- + +## Problem & Solution + +### The Problem +Organizations adopting GraphQL Federation often need to integrate external APIs, legacy systems, or custom business logic into their supergraph. Traditional approaches require deploying and managing separate GraphQL subgraph services, adding infrastructure complexity, operational overhead, and network latency. Teams also struggle with the varying quality and spec compliance of different subgraph framework implementations. + +### The Solution +Router Plugins run as local processes managed by the Cosmo Router, eliminating the need for separate service deployments. They communicate via high-performance gRPC with critical fault isolation - if a plugin crashes, it won't bring down the router. The Go and TypeScript SDKs provide production-ready utilities including HTTP clients with middleware support, automatic distributed tracing integration, and structured logging that integrates seamlessly with the router's observability stack. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Deploy separate subgraph services for each integration | Run plugins alongside the router with unified deployment | +| Manage multiple CI/CD pipelines and infrastructure | Single deployment unit with hot-reload support | +| Network latency between router and subgraphs | Direct inter-process communication for minimal latency | +| Varying subgraph framework quality and spec compliance | Strongly-typed proto definitions guarantee correctness | + +--- + +## Key Benefits + +1. 
**Simplified Architecture**: Maintain fewer components with unified deployment and monitoring. Run multiple plugins on the same Router instance. + +2. **Maximum Performance**: Achieve significantly lower latency with direct gRPC-based inter-process communication. Network and GraphQL framework overhead is eliminated. + +3. **Production-Ready SDKs**: Go SDK includes HTTP client with middleware, automatic tracing propagation, and structured logging. TypeScript (Bun) support provides similar capabilities. + +4. **Fault Isolation**: Plugins run as separate processes - if a plugin crashes, the router continues operating and can restart the plugin. + +5. **Hot Reload Support**: Update plugins without service interruption. The router manages plugin lifecycle including hot-reloading. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Backend Developer +- **Pain Points**: Too much infrastructure to manage; slow iteration cycles for API integrations; need lowest possible latency +- **Goals**: Rapidly integrate external APIs into the supergraph with minimal operational overhead + +### Secondary Personas +- DevOps engineers seeking to reduce infrastructure complexity +- Go developers comfortable with the language and ecosystem +- Teams using AI coding assistants who benefit from strongly-typed interfaces + +--- + +## Use Cases + +### Use Case 1: Rapid API Integration with AI Assistance +**Scenario**: A team needs to expose a third-party REST API through their GraphQL supergraph during a sprint. +**How it works**: Define a GraphQL schema representing the API, run `wgc router plugin init` to scaffold the project, provide the OpenAPI spec to an AI coding assistant, and let it generate the adapter code against the strongly-typed proto definitions. Use `wgc router plugin build` and `make` to build, compose, and serve. +**Outcome**: A production-ready integration completed in hours with automatic tracing and logging, zero infrastructure to deploy. 
+ +### Use Case 2: Performance-Critical Data Access +**Scenario**: An e-commerce platform needs sub-millisecond access to product inventory data from a Redis cache. +**How it works**: Implement a router plugin that directly queries Redis using the Go SDK's HTTP client (or native Redis client). The plugin runs in the same process group as the router, eliminating network hops. +**Outcome**: Inventory queries complete in microseconds, enabling real-time stock updates without the overhead of a separate service. + +### Use Case 3: Legacy System Wrapper +**Scenario**: A financial institution has a SOAP service that must be integrated into the modern GraphQL API without modification. +**How it works**: Create a plugin that wraps the SOAP service, using the SDK's HTTP client to make SOAP calls. The strongly-typed proto interface ensures the GraphQL schema matches the implementation. Tracing automatically captures the full request flow. +**Outcome**: Legacy SOAP service seamlessly integrated with full observability, no changes to existing systems. + +--- + +## Competitive Positioning + +### Key Differentiators +1. **HashiCorp go-plugin Foundation**: Built on the same framework powering Vault and Terraform, with millions of production deployments. +2. **Integrated Observability**: Automatic tracing propagation and structured logging without additional configuration. +3. **LLM-Optimized Development**: Proto-based code generation creates a strongly-typed foundation that AI tools can effectively understand and extend. 
+ +### Comparison with Alternatives + +| Aspect | Router Plugins | Apollo Connectors | Standalone Subgraphs | +|--------|----------------|-------------------|---------------------| +| Deployment | Co-located with router | N/A (declarative) | Separate services | +| Latency | Minimal (IPC) | Network + parsing | Network + GraphQL | +| Language Support | Go, TypeScript (Bun) | N/A | Many (varying quality) | +| Observability | Built-in tracing/logging | Manual setup | Manual setup | +| Batching | DataLoader built-in | N+1 prone | Framework-dependent | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We need language flexibility" | Use gRPC Services for multi-language needs; plugins are optimized for Go/TypeScript teams. | +| "What if a plugin crashes?" | Plugins run as separate processes with fault isolation - the router continues operating and can restart plugins. | +| "We need independent scaling" | For high-scale scenarios, gRPC Services offer independent deployment and scaling. | + +--- + +## Technical Summary + +### How It Works +Router Plugins are separate processes that communicate with the Cosmo Router via gRPC using HashiCorp's go-plugin framework. At startup, plugins register with the router and their schemas integrate into the federated graph. When GraphQL requests arrive, the router routes relevant portions to appropriate plugins over the IPC channel. The router manages plugin lifecycle including startup, health checking, and hot-reloading. 
+ +### Key Technical Features +- **HTTP Client SDK**: Fluent API with middleware support (auth, user-agent, custom), generic response handling, and automatic tracing integration +- **Distributed Tracing**: OpenTelemetry-based tracing with automatic context propagation from router through plugins to downstream services +- **Structured Logging**: Integration with router's zap logger, context injection, panic recovery with stack traces +- **Health Checks**: Built-in gRPC health check protocol support +- **Hot Reload**: Update plugins without service interruption +- **Cosmo Cloud Registry**: Push plugins directly to the platform without re-deploying the router + +### Integration Points +- Cosmo Router (manages plugin lifecycle) +- Cosmo CLI (wgc) for scaffolding, building, and testing +- Cosmo Cloud Plugin Registry for deployment +- OpenTelemetry for distributed tracing +- Any HTTP/REST/SOAP backend via SDK HTTP client + +### Requirements & Prerequisites +- Cosmo Router deployment +- Cosmo CLI (wgc) installed +- Go compiler or Bun runtime (installed automatically by CLI) +- Protobuf compiler (installed automatically by CLI) + +--- + +## Proof Points + +### Metrics & Benchmarks +- Eliminates network latency with inter-process communication +- HashiCorp go-plugin framework powers production systems like Vault and Terraform +- DataLoader batching prevents N+1 query patterns +- Hot-reload enables zero-downtime updates + +### Case Studies +- See the [Cosmo Plugin Demo](https://github.com/wundergraph/cosmo-plugin-demo) for a complete working example with Users plugin and Products subgraph + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/gRPC/plugins` +- Cosmo Connect overview: 
`/docs/connect/plugins` +- Go Plugin SDK: `/docs/router/gRPC/plugins/go-plugin/overview` +- HTTP Client: `/docs/router/gRPC/plugins/go-plugin/http-client` +- Telemetry: `/docs/router/gRPC/plugins/go-plugin/telemetry` +- Logging: `/docs/router/gRPC/plugins/go-plugin/logging` +- TypeScript Plugin: `/docs/router/gRPC/plugins/ts-plugin/overview` +- CLI Commands: `/docs/cli/router/plugin` +- Tutorial: `/docs/tutorial/using-grpc-plugins` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL router plugins +- gRPC GraphQL integration +- In-process GraphQL extensions + +### Secondary Keywords +- HashiCorp go-plugin GraphQL +- GraphQL Federation plugins +- Cosmo Router extensions + +### Related Search Terms +- How to extend GraphQL gateway +- GraphQL subgraph alternatives +- High-performance GraphQL resolvers +- GraphQL API integration patterns + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/index.md b/capabilities/index.md new file mode 100644 index 00000000..2005ec34 --- /dev/null +++ b/capabilities/index.md @@ -0,0 +1,276 @@ +# Cosmo Capabilities Index + +> **Status**: Draft for review - 94 capabilities identified (consolidated) +> +> This index lists all capabilities extracted from the documentation. Review and refine before applying the full template to each capability. 
+ +--- + +## Federation & Schema Composition + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **GraphQL Federation v1 & v2** | Full support for both Federation protocol versions | `/docs/router/intro.mdx`, `/docs/federation/federation-compatibility-matrix.mdx` | +| **Schema Composition** | Automatic composition of federated graphs from multiple subgraphs with error detection | `/docs/studio/compositions.mdx`, `/docs/studio/schema-checks.mdx` | +| **Schema Checks** | Pre-deployment validation including composition errors, breaking changes, and lint rules | `/docs/studio/schema-checks.mdx`, `/docs/cli/subgraph/check.mdx` | +| **Schema Registry** | Centralized schema management with version history and comparison | `/docs/studio/schema-registry.mdx` | +| **Schema Contracts** | Filter graph sections for different audiences using @tag directives | `/docs/concepts/schema-contracts.mdx`, `/docs/studio/schema-contracts.mdx`, `/docs/cli/schema-contracts.mdx` | +| **Federation Directives** | Extended directive support (@shareable, @authenticated, @requiresScopes, etc.) 
| `/docs/federation/federation-directives-index.mdx`, `/docs/federation/directives/` | +| **Subgraph Management** | Create, publish, update, delete subgraphs with full lifecycle management | `/docs/cli/subgraph.mdx`, `/docs/studio/` | +| **Monograph Support** | Single-service GraphQL without federation complexity | `/docs/cli/monograph.mdx` | + +--- + +## Router & Query Execution + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **GraphQL Federation Router** | High-performance Go-based router with query planning and optimization | `/docs/router/intro.mdx` | +| **Query Planning** | Intelligent query plan generation for federated execution | `/docs/router/query-plan.mdx`, `/docs/router/query-plan/batch-generate-query-plans.mdx` | +| **Query Batching** | Execute multiple operations in a single request with configurable concurrency | `/docs/router/query-batching.mdx` | +| **Router Configuration** | YAML-based config with env var expansion and JSON schema validation | `/docs/router/configuration.mdx`, `/docs/router/configuration/config-design.mdx` | +| **Config Hot Reload** | Update router configuration at runtime without downtime | `/docs/router/deployment/config-hot-reload.mdx` | +| **Development Mode** | Development-optimized settings with verbose error output | `/docs/router/development.mdx`, `/docs/router/development/development-mode.mdx` | + +--- + +## Real-Time & Subscriptions + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **GraphQL Subscriptions** | Multiple protocol support (graphql-ws, SSE, multipart HTTP) with connection multiplexing | `/docs/router/subscriptions.mdx` | +| **Cosmo Streams (EDFS)** | Event-driven federated subscriptions with Kafka, NATS, and Redis integration | `/docs/router/cosmo-streams.mdx`, `/docs/federation/event-driven-federated-subscriptions.mdx` | + +--- + +## Traffic Management & Reliability + +| Capability | Description | Source Files | 
+|------------|-------------|--------------| +| **Traffic Shaping** | Comprehensive traffic control with retries, timeouts, and circuit breakers | `/docs/router/traffic-shaping.mdx` | +| **Retry Mechanism** | Configurable retry policies with exponential backoff | `/docs/router/traffic-shaping/retry.mdx` | +| **Timeout Configuration** | Request and per-subgraph timeout management | `/docs/router/traffic-shaping/timeout.mdx` | +| **Circuit Breaker** | Fault tolerance with automatic circuit state management | `/docs/router/traffic-shaping/circuit-breaker.mdx` | + +--- + +## Performance & Caching + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Persisted Operations** | Pre-register operations for security and performance | `/docs/router/persisted-queries.mdx`, `/docs/router/persisted-queries/persisted-operations.mdx` | +| **Automatic Persisted Queries (APQ)** | Hash-based query execution with automatic caching | `/docs/router/persisted-queries/automatic-persisted-queries-apq.mdx` | +| **Cache Warmer** | Pre-warm query plan cache for optimal performance | `/docs/concepts/cache-warmer.mdx` | +| **Cache Control** | CDN-friendly cache header management | `/docs/router/proxy-capabilities/adjusting-cache-control.mdx` | +| **Performance Debugging** | Tools for identifying and resolving performance bottlenecks | `/docs/router/performance-debugging.mdx` | + +--- + +## Observability + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **OpenTelemetry (OTEL)** | Full OTEL support for tracing and metrics with HTTP/gRPC exporters | `/docs/router/open-telemetry.mdx`, `/docs/router/open-telemetry/custom-attributes.mdx` | +| **Distributed Tracing** | End-to-end request tracing across federated services | `/docs/studio/analytics/distributed-tracing.mdx` | +| **Advanced Request Tracing (ART)** | Detailed execution plan tracing with Playground visualization | 
`/docs/router/advanced-request-tracing-art.mdx` | +| **Prometheus Metrics** | R.E.D method metrics (Rate, Errors, Duration) with custom labels | `/docs/router/metrics-and-monitoring.mdx`, `/docs/router/metrics-and-monitoring/prometheus-metric-reference.mdx` | +| **Grafana Integration** | Pre-built dashboards for metrics visualization | `/docs/router/metrics-and-monitoring/grafana.mdx` | +| **OTEL Collector Integration** | Collector setup for data aggregation | `/docs/router/open-telemetry/setup-opentelemetry-collector.mdx` | +| **Access Logs** | Configurable request logging to stdout or file | `/docs/router/access-logs.mdx` | +| **Profiling (pprof)** | CPU, memory, goroutine, and block profiling | `/docs/router/profiling.mdx` | + +--- + +## Analytics & Insights + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Analytics Dashboard** | Request metrics with filtering, grouping, and date range selection | `/docs/studio/analytics.mdx` | +| **Metrics Analytics** | Request rate, error tracking, and latency analysis | `/docs/studio/analytics/metrics.mdx` | +| **Trace Analytics** | Individual trace inspection with timeline visualization | `/docs/studio/analytics/traces.mdx` | +| **Schema Field Usage** | Track field popularity and detect unused fields | `/docs/studio/analytics/schema-field-usage.mdx` | +| **Client Identification** | Track client versions and usage patterns | `/docs/studio/analytics/client-identification.mdx` | +| **Operations Tracking** | Monitor and analyze registered operations | `/docs/studio/operations.mdx` | + +--- + +## Security + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **JWT Authentication** | JWKS-based JWT validation with multiple providers | `/docs/router/authentication-and-authorization.mdx` | +| **Authorization Directives** | Field-level auth with @authenticated and @requiresScopes | `/docs/federation/directives/authenticated.mdx`, 
`/docs/federation/directives/requiresscopes.mdx` | +| **TLS/HTTPS** | Encrypted communication with certificate management | `/docs/router/security/tls.mdx` | +| **Config Signing** | HMAC-SHA256 signature verification for tamper prevention | `/docs/router/security/config-validation-and-signing.mdx` | +| **Security Hardening** | Best practices for production deployments | `/docs/router/security/hardening-guide.mdx` | +| **Introspection Control** | Disable schema introspection in production | `/docs/router/security.mdx` | +| **Subgraph Error Propagation** | Control error exposure with sensitive data masking | `/docs/router/subgraph-error-propagation.mdx` | + +--- + +## Access Control & Identity + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Role-Based Access Control (RBAC)** | Granular permission management by role | `/docs/studio/rbac.mdx` | +| **Groups & Group Rules** | User group management with SSO rule mapping | `/docs/studio/groups.mdx`, `/docs/studio/groups/group-rules.mdx` | +| **API Keys** | Granular API key permissions with resource-level access | `/docs/studio/api-keys.mdx`, `/docs/studio/api-keys/api-key-permissions.mdx`, `/docs/studio/api-keys/api-key-resources.mdx` | +| **Single Sign-On (SSO)** | OIDC support for Okta, Auth0, Keycloak, Microsoft Entra | `/docs/studio/sso.mdx` | +| **SCIM Provisioning** | Automated user provisioning and deprovisioning | `/docs/studio/scim.mdx`, `/docs/studio/scim/okta.mdx` | +| **Audit Logging** | Complete audit trail of all user and API actions | `/docs/studio/audit-log.mdx` | +| **User Invitations** | Team member onboarding and collaboration | `/docs/studio/invitations.mdx` | +| **Session Management** | User session tracking and activity monitoring | `/docs/studio/sessions.mdx` | + +--- + +## Compliance & Data Privacy + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Compliance Certifications** | SOC 2 Type II, ISO 27001, 
GDPR, HIPAA support | `/docs/security-and-compliance.mdx`, `/docs/router/compliance-and-data-management.mdx` | +| **IP Anonymization** | Redact or hash IP addresses for privacy | `/docs/router/compliance-and-data-management.mdx` | +| **Advanced Data Privacy** | Field-level data obfuscation with custom renderers | `/docs/router/advanced-data-privacy.mdx` | +| **Variable Export Control** | Control which variables are exported in telemetry | `/docs/router/compliance-and-data-management.mdx` | + +--- + +## Developer Experience + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **GraphiQL Playground++** | Enhanced GraphQL IDE with ART visualization | `/docs/studio/playground.mdx` | +| **Custom Playground Scripts** | Pre/post-request hooks with dynamic variables | `/docs/studio/playground/custom-scripts.mdx` | +| **Shared Playground State** | Shareable sessions for team collaboration | `/docs/studio/playground/shared-playground-state.mdx` | +| **Schema Explorer** | Interactive schema browsing with search | `/docs/studio/schema-explorer.mdx` | +| **Changelog** | Track all graph modifications with attribution | `/docs/studio/changelog.mdx` | +| **Query Plan Visualization** | Visual query execution plans for debugging | `/docs/router/query-plan.mdx` | +| **Lint Policies** | Customizable schema linting rules | `/docs/studio/policies.mdx`, `/docs/studio/lint-policy/linter-rules.mdx` | +| **Graph Pruning** | Detect unused fields and enforce deprecations | `/docs/studio/graph-pruning.mdx` | +| **Breaking Change Overrides** | Manual override for approved breaking changes | `/docs/studio/overrides.mdx` | + +--- + +## Feature Flags & Progressive Delivery + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Feature Flags** | Runtime feature toggling with feature subgraphs for gradual rollout via headers, JWT claims, or cookies | `/docs/concepts/feature-flags.mdx`, `/docs/cli/feature-flags.mdx`, 
`/docs/cli/feature-subgraph.mdx` | + +--- + +## gRPC Integration (Cosmo Connect) + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Cosmo Connect** | GraphQL-to-gRPC protocol translation | `/docs/connect/overview.mdx` | +| **Router Plugins** | In-process plugins with Go and TypeScript SDKs (HTTP client, telemetry, logging) | `/docs/connect/plugins.mdx`, `/docs/router/gRPC/plugins.mdx`, `/docs/router/gRPC/plugins/go-plugin/overview.mdx`, `/docs/router/gRPC/plugins/go-plugin/http-client.mdx`, `/docs/router/gRPC/plugins/go-plugin/telemetry.mdx`, `/docs/router/gRPC/plugins/go-plugin/logging.mdx`, `/docs/router/gRPC/plugins/ts-plugin/overview.mdx` | +| **gRPC Services** | Independent gRPC service deployment with Protocol Buffers and field resolvers | `/docs/connect/grpc-services.mdx`, `/docs/router/gRPC/grpc-services.mdx`, `/docs/router/gRPC/concepts.mdx`, `/docs/router/gRPC/field-resolvers.mdx`, `/docs/router/gRPC/graphql-support.mdx` | + +--- + +## Extensibility + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Custom Modules (Go)** | Pure Go extensions with multiple hook interfaces | `/docs/router/custom-modules.mdx` | +| **Subgraph Check Extensions** | Custom validation logic for schema checks | `/docs/studio/subgraph-check-extensions.mdx` | + +--- + +## AI & LLMs + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **MCP (Model Context Protocol)** | Expose persisted operations to LLMs through MCP | `/docs/router/mcp.mdx` | + +--- + +## Proxy & Request Handling + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Request Header Operations** | Inject and manipulate request headers | `/docs/router/proxy-capabilities/request-headers-operations.mdx` | +| **Response Header Operations** | Control response headers to clients | `/docs/router/proxy-capabilities/response-header-operations.mdx` | +| 
**Forward Client Extensions** | Propagate extension fields to subgraphs | `/docs/router/proxy-capabilities/forward-client-extensions.mdx` | +| **Override Subgraph Config** | Dynamic runtime subgraph configuration | `/docs/router/proxy-capabilities/override-subgraph-config.mdx` | +| **File Upload** | GraphQL multipart request spec for file uploads | `/docs/router/file-upload.mdx` | + +--- + +## Deployment & Operations + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Cosmo Cloud** | Fully managed SaaS platform | `/docs/deployments-and-hosting/cosmo-cloud.mdx` | +| **Self-Hosted Deployment** | On-premise or private cloud deployment | `/docs/deployments-and-hosting/intro.mdx` | +| **Kubernetes (Helm)** | Helm charts for K8s deployment (EKS, AKS, GKE) | `/docs/deployments-and-hosting/kubernetes.mdx`, `/docs/deployments-and-hosting/kubernetes/helm-chart.mdx` | +| **Terraform** | Infrastructure as Code for AWS Fargate and more | `/docs/deployments-and-hosting/terraform.mdx`, `/docs/deployments-and-hosting/terraform/aws-fargate.mdx` | +| **Docker** | Container-based deployment | `/docs/deployments-and-hosting/docker.mdx` | +| **Storage Providers** | CDN, S3, and S3-compatible artifact storage | `/docs/router/storage-providers.mdx` | +| **Router Compatibility Versions** | Managed version compatibility | `/docs/concepts/router-compatibility-versions.mdx` | +| **Cluster Management** | Multi-cluster router administration | `/docs/studio/cluster-management.mdx` | + +--- + +## CLI (wgc) + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Cosmo CLI** | Comprehensive CLI for managing namespaces, subgraphs, federated graphs, router, plugins, gRPC services, operations, proposals, and authentication | `/docs/cli/intro.mdx`, `/docs/cli/essentials.mdx`, `/docs/cli/namespace.mdx`, `/docs/cli/subgraph.mdx`, `/docs/cli/federated-graph.mdx`, `/docs/cli/monograph.mdx`, `/docs/cli/router.mdx`, 
`/docs/cli/router/plugin.mdx`, `/docs/cli/grpc-service.mdx`, `/docs/cli/operations.mdx`, `/docs/cli/proposal.mdx`, `/docs/cli/auth.mdx` | + +--- + +## Notifications & Integrations + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Alerts & Notifications** | Multi-channel alerting for schema changes and errors | `/docs/studio/alerts-and-notifications.mdx` | +| **Webhook Notifications** | Custom webhook integration for events | `/docs/studio/alerts-and-notifications/webhooks.mdx` | +| **Slack Integration** | Direct Slack channel notifications | `/docs/studio/alerts-and-notifications/slack-integration.mdx` | + +--- + +## Migration & Compatibility + +| Capability | Description | Source Files | +|------------|-------------|--------------| +| **Apollo Migration** | Migration guides from Apollo Federation | `/docs/studio/migrate-from-apollo.mdx` | +| **Apollo Router Migration** | Migration path from Apollo Router | `/docs/studio/migrate-from-apollo.mdx` | +| **Federation Compatibility** | Federation v1/v2 compatibility matrix | `/docs/federation/federation-compatibility-matrix.mdx` | + +--- + +## Summary + +| Category | Count | +|----------|-------| +| Federation & Schema Composition | 8 | +| Router & Query Execution | 6 | +| Real-Time & Subscriptions | 2 | +| Traffic Management & Reliability | 4 | +| Performance & Caching | 5 | +| Observability | 8 | +| Analytics & Insights | 6 | +| Security | 7 | +| Access Control & Identity | 8 | +| Compliance & Data Privacy | 4 | +| Developer Experience | 9 | +| Feature Flags & Progressive Delivery | 1 | +| gRPC Integration (Cosmo Connect) | 3 | +| Extensibility | 2 | +| AI & LLMs | 1 | +| Proxy & Request Handling | 5 | +| Deployment & Operations | 8 | +| CLI (wgc) | 1 | +| Notifications & Integrations | 3 | +| Migration & Compatibility | 3 | +| **Total** | **94** | + +--- + +## Next Steps + +1. **Review this index** - Validate capability names and groupings +2. 
**Prioritize** - Identify high-value capabilities for detailed documentation +3. **Apply template** - Create full capability docs using `template.md` +4. **Cross-reference** - Ensure all source files are correctly linked \ No newline at end of file diff --git a/capabilities/migration/apollo-migration.md b/capabilities/migration/apollo-migration.md new file mode 100644 index 00000000..0ad03925 --- /dev/null +++ b/capabilities/migration/apollo-migration.md @@ -0,0 +1,156 @@ +# Apollo GraphOS Migration + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-mig-001` | +| **Category** | Migration | +| **Status** | GA | +| **Availability** | Free | +| **Related Capabilities** | `cap-mig-002`, `cap-mig-003` | + +--- + +## Quick Reference + +### Name +Apollo GraphOS Migration + +### Tagline +Migrate from Apollo GraphOS with a single click. + +### Elevator Pitch +WunderGraph Cosmo provides a seamless one-click migration path from Apollo GraphOS. Simply provide your Graph API Key, and Cosmo automatically imports your federated graph configuration, subgraphs, and schema—getting you up and running in seconds without manual reconfiguration. + +--- + +## Problem & Solution + +### The Problem +Organizations using Apollo GraphOS who want to switch to WunderGraph Cosmo face the daunting task of manually recreating their entire federated graph setup. This includes re-registering all subgraphs, reconfiguring routing rules, and ensuring schema compatibility—a process that can take days or weeks and introduces significant risk of errors. + +### The Solution +Cosmo's Apollo Migration feature automates the entire transition process. By providing your Apollo Graph API Key, Cosmo fetches your existing graph configuration and recreates it automatically. Your API key is never stored—it's only used temporarily during the migration process, ensuring security while delivering convenience. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Days of manual subgraph registration | One-click automated migration | +| Risk of configuration errors | Accurate replication of existing setup | +| Downtime during transition | Seamless parallel operation possible | +| Manual schema transfer | Automatic schema import | + +--- + +## Key Benefits + +1. **One-Click Migration**: Complete your migration in seconds, not days, with a fully automated process +2. **Zero Configuration Risk**: Automatic import ensures your federated graph is replicated exactly as configured in Apollo +3. **Secure Process**: Your API key is never stored—only used temporarily for the migration fetch +4. **No Downtime Required**: Run both platforms in parallel during your transition period +5. **Immediate Productivity**: Start using Cosmo Studio features immediately after migration + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / DevOps Lead +- **Pain Points**: Concerned about migration complexity, timeline, and potential disruption to production services +- **Goals**: Minimize migration risk while maximizing speed; maintain service continuity during transition + +### Secondary Personas +- Engineering Managers evaluating platform alternatives +- API architects seeking better federation tooling +- Teams looking to reduce GraphQL platform costs + +--- + +## Use Cases + +### Use Case 1: Full Platform Migration +**Scenario**: A company decides to move from Apollo GraphOS to Cosmo for better pricing and features +**How it works**: Navigate to Cosmo Studio, click "Migrate from Apollo", enter Graph API Key and variant name, click "Migrate" +**Outcome**: Complete federated graph with all subgraphs imported and ready for use within seconds + +### Use Case 2: Parallel Evaluation +**Scenario**: An engineering team wants to evaluate Cosmo alongside their existing Apollo setup +**How it works**: Use the migration tool to create an 
identical setup in Cosmo without affecting Apollo production; compare features and performance +**Outcome**: Risk-free evaluation with production-identical configuration + +### Use Case 3: Multi-Variant Migration +**Scenario**: A team manages multiple graph variants (dev, staging, prod) in Apollo +**How it works**: Run the migration process for each variant, specifying the appropriate variant name each time +**Outcome**: Complete environment parity across all variants in Cosmo + +--- + +## Competitive Positioning + +### Key Differentiators +1. Automated one-click migration vs. manual reconfiguration +2. Secure, temporary API key usage—no credentials stored +3. Complete graph configuration import including all subgraphs + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Is my API key safe?" | Your API key is never stored—it's only used temporarily to fetch migration data | +| "Will my schemas transfer correctly?" | The migration automatically imports your complete schema configuration | +| "What about our multiple variants?" | You can migrate each variant separately, maintaining environment isolation | + +--- + +## Technical Summary + +### How It Works +The migration process uses your Apollo Graph API Key to authenticate and fetch your federated graph configuration from Apollo GraphOS. This includes subgraph definitions, schema information, and graph metadata. Cosmo then recreates this configuration in your Cosmo organization, establishing equivalent subgraphs and federated graph structure. 
+ +### Key Technical Features +- OAuth-based secure API key authentication +- Complete subgraph configuration import +- Schema and directive preservation +- Variant-specific migration support + +### Requirements & Prerequisites +- Active Apollo GraphOS account with configured graphs +- Graph API Key with read permissions +- Cosmo account (free tier or above) + +--- + +## Documentation References + +- Primary docs: `/docs/studio/migrate-from-apollo` +- Getting started: `/docs/getting-started` +- Federation overview: `/docs/federation` + +--- + +## Keywords & SEO + +### Primary Keywords +- Apollo migration +- GraphOS migration +- Apollo to Cosmo + +### Secondary Keywords +- Federation migration +- GraphQL platform migration +- Apollo alternative + +### Related Search Terms +- How to migrate from Apollo GraphOS +- Apollo Federation migration guide +- Switch from Apollo to WunderGraph + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/migration/apollo-router-migration.md b/capabilities/migration/apollo-router-migration.md new file mode 100644 index 00000000..fb95278d --- /dev/null +++ b/capabilities/migration/apollo-router-migration.md @@ -0,0 +1,163 @@ +# Apollo Router Migration + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-mig-002` | +| **Category** | Migration | +| **Status** | GA | +| **Availability** | Free | +| **Related Capabilities** | `cap-mig-001`, `cap-mig-003` | + +--- + +## Quick Reference + +### Name +Apollo Router Migration + +### Tagline +Replace Apollo Router with Cosmo Router seamlessly. + +### Elevator Pitch +Transitioning from Apollo Router to Cosmo Router is straightforward thanks to full Federation compatibility. Once you've migrated your graph configuration using Cosmo's one-click migration, simply deploy the Cosmo Router with your new configuration—no subgraph changes required. 
+ +--- + +## Problem & Solution + +### The Problem +Teams running Apollo Router want to switch to Cosmo Router but worry about compatibility issues, configuration differences, and potential service disruptions. The thought of reconfiguring routing rules, updating subgraph connections, and ensuring query compatibility creates hesitation around making the switch. + +### The Solution +Cosmo Router is designed with Federation compatibility at its core, supporting both Federation v1 and v2 directives. Combined with the one-click graph migration from Apollo GraphOS, teams can transition their router infrastructure without modifying subgraphs or client applications. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Concerns about directive compatibility | Full Federation v1 and v2 support | +| Manual router configuration | Automated configuration from migration | +| Uncertain query behavior | Compatible query execution model | +| Subgraph modification fears | Zero subgraph changes required | + +--- + +## Key Benefits + +1. **Drop-in Replacement**: Cosmo Router supports the same Federation directives, making it a compatible replacement for Apollo Router +2. **Configuration Automation**: Graph migration automatically generates Cosmo Router configuration +3. **No Subgraph Changes**: Your existing subgraphs work without modification +4. **Client Compatibility**: GraphQL clients continue working without updates +5. 
**Enhanced Features**: Gain access to Cosmo-specific features like advanced observability and Cosmo Studio integration + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Infrastructure Lead +- **Pain Points**: Router migration complexity, compatibility concerns, potential production disruption +- **Goals**: Seamless router transition with zero downtime and minimal risk + +### Secondary Personas +- DevOps engineers managing GraphQL infrastructure +- Backend developers working with federated services +- SREs responsible for GraphQL service reliability + +--- + +## Use Cases + +### Use Case 1: Router Replacement +**Scenario**: A team wants to replace Apollo Router with Cosmo Router for better observability +**How it works**: After migrating the graph configuration, deploy Cosmo Router with the generated config; update load balancer to point to new router +**Outcome**: Seamless transition with no client-side changes and immediate access to Cosmo observability features + +### Use Case 2: Gradual Traffic Migration +**Scenario**: An organization wants to minimize risk by gradually shifting traffic from Apollo Router to Cosmo Router +**How it works**: Deploy Cosmo Router alongside Apollo Router; use load balancer to split traffic; gradually increase Cosmo Router percentage +**Outcome**: Risk-mitigated migration with ability to roll back instantly if issues arise + +### Use Case 3: Development Environment First +**Scenario**: A team wants to validate Cosmo Router in development before production migration +**How it works**: Migrate development variant first; run integration tests; validate behavior; proceed to staging and production +**Outcome**: Confidence in migration through validated behavior in lower environments + +--- + +## Competitive Positioning + +### Key Differentiators +1. Full Federation v1 and v2 compatibility ensures seamless migration +2. Integrated migration path with Cosmo Studio for configuration management +3. 
Enhanced observability features available immediately after migration + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Will our queries still work?" | Yes—Cosmo Router supports the same Federation query execution model | +| "Do we need to update our subgraphs?" | No—your existing Federation-compatible subgraphs work without changes | +| "What about our custom directives?" | Core Federation directives are fully supported; custom directives can be evaluated | + +--- + +## Technical Summary + +### How It Works +Cosmo Router implements the Federation specification, supporting all standard Federation v1 and v2 directives. After migrating your graph configuration from Apollo GraphOS, Cosmo generates the router configuration automatically. The router handles query planning, subgraph orchestration, and response aggregation compatibly with Apollo Router behavior. + +### Key Technical Features +- Federation v1 and v2 directive support +- Compatible query planning and execution +- Automatic configuration generation from migration +- Same subgraph communication protocols (HTTP/GraphQL) + +### Integration Points +- Existing Federation subgraphs (no changes required) +- GraphQL clients (no changes required) +- Load balancers and API gateways +- Cosmo Studio for management and observability + +### Requirements & Prerequisites +- Completed graph migration from Apollo GraphOS (or manual Cosmo setup) +- Container orchestration platform (Kubernetes, Docker, etc.) 
+- Network access to subgraph endpoints + +--- + +## Documentation References + +- Migration guide: `/docs/studio/migrate-from-apollo` +- Router documentation: `/docs/router` +- Router configuration: `/docs/router/configuration` +- Federation compatibility: `/docs/federation/federation-compatibility-matrix` + +--- + +## Keywords & SEO + +### Primary Keywords +- Apollo Router migration +- Router replacement +- Cosmo Router + +### Secondary Keywords +- GraphQL router migration +- Federation router +- Apollo Router alternative + +### Related Search Terms +- How to replace Apollo Router +- Migrate from Apollo Router to Cosmo +- Apollo Router vs Cosmo Router + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/migration/federation-compatibility.md b/capabilities/migration/federation-compatibility.md new file mode 100644 index 00000000..2aef89a4 --- /dev/null +++ b/capabilities/migration/federation-compatibility.md @@ -0,0 +1,174 @@ +# Federation Compatibility + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-mig-003` | +| **Category** | Migration | +| **Status** | GA | +| **Availability** | Free | +| **Related Capabilities** | `cap-mig-001`, `cap-mig-002` | + +--- + +## Quick Reference + +### Name +Federation v1/v2 Compatibility + +### Tagline +Full compatibility with Apollo Federation v1 and v2. + +### Elevator Pitch +WunderGraph Cosmo provides comprehensive compatibility with Apollo Federation specifications, supporting both v1 and v2 directives. Teams can migrate existing federated graphs without modifying subgraph schemas, ensuring a smooth transition path while gaining access to Cosmo's enhanced features. 
+ +--- + +## Problem & Solution + +### The Problem +Teams with existing Apollo Federation implementations need assurance that their schemas, directives, and subgraph configurations will work correctly when migrating to a new platform. Incomplete directive support could mean extensive schema rewrites, breaking changes, and extended migration timelines. + +### The Solution +Cosmo implements comprehensive Federation compatibility, supporting all standard Federation v1 directives and the vast majority of v2 directives through version 2.5. This means your existing federated schemas work out of the box, with no modifications required for core Federation functionality. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Uncertainty about directive support | Clear compatibility matrix | +| Fear of schema rewrites | Existing schemas work unchanged | +| Version compatibility concerns | Support for v1 through v2.5 | +| Migration risk | Validated compatibility | + +--- + +## Key Benefits + +1. **Complete v1 Support**: All Federation v1 directives (@key, @extends, @external, @provides, @requires, @tag) fully supported +2. **Extensive v2 Support**: Comprehensive support for v2 directives through version 2.5 including @shareable, @inaccessible, @override, @authenticated, @requiresScopes +3. **Zero Schema Changes**: Migrate existing federated schemas without modification +4. **Future-Ready**: Ongoing development to support new Federation features as they're released +5. **Transparent Roadmap**: Clear documentation of supported vs. 
planned directives + +--- + +## Target Audience + +### Primary Persona +- **Role**: API Architect / Backend Developer +- **Pain Points**: Schema compatibility concerns, directive support uncertainty, migration complexity +- **Goals**: Ensure existing Federation schemas work without modification; understand exactly what's supported + +### Secondary Personas +- Platform engineers evaluating migration feasibility +- Technical leads assessing platform compatibility +- DevOps teams planning migration timelines + +--- + +## Use Cases + +### Use Case 1: Federation v1 Migration +**Scenario**: A team running Federation v1 with @key, @extends, @external, @provides, and @requires directives +**How it works**: All v1 directives are fully supported; schemas migrate directly without changes +**Outcome**: Complete schema compatibility with zero modifications required + +### Use Case 2: Federation v2 with Authorization +**Scenario**: A team using Federation v2.5 with @authenticated and @requiresScopes for field-level authorization +**How it works**: Cosmo supports both @authenticated and @requiresScopes directives natively +**Outcome**: Authorization logic continues working identically after migration + +### Use Case 3: Mixed Version Subgraphs +**Scenario**: An organization with some subgraphs using v1 directives and others using v2 +**How it works**: Cosmo's composition engine handles mixed-version subgraphs, supporting all directives appropriately +**Outcome**: No need to standardize subgraph versions before migration + +--- + +## Competitive Positioning + +### Key Differentiators +1. Transparent compatibility matrix with clear support status for each directive +2. Support for latest Federation 2.5 features including authorization directives +3. 
Composite key support for complex entity resolution scenarios + +### Comparison with Alternatives + +| Directive Category | Cosmo | Federation Spec | +|-------------------|-------|-----------------| +| Core v1 (@key, @extends, @external, @provides, @requires) | Full Support | Reference | +| v2.0 (@shareable, @inaccessible, @override) | Full Support | Reference | +| v2.3 (@interfaceObject, @key on INTERFACE) | Full Support | Reference | +| v2.5 (@authenticated, @requiresScopes) | Full Support | Reference | +| @composeDirective | Planned | v2.1 | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Do you support composite keys?" | Yes—@key with composite keys is fully supported | +| "What about @interfaceObject?" | Fully supported as of Federation 2.3 compatibility | +| "Is @shareable supported?" | Yes—full support for shareable fields across subgraphs | + +--- + +## Technical Summary + +### How It Works +Cosmo's composition engine processes subgraph schemas and applies Federation directives according to the specification. The router generates optimized query plans that respect entity resolution, key fields, and cross-subgraph references defined by Federation directives. 
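The composition behavior above can be illustrated with two subgraphs contributing to one entity, one written in v1 extension style and one using v2 directives. A sketch with invented type and field names (a real v2 subgraph would also `@link` the Federation spec; omitted here for brevity):

```graphql
# Subgraph A (Federation v2 style): owns the Product entity.
type Product @key(fields: "id") {
  id: ID!
  name: String! @shareable
  price: Float @requiresScopes(scopes: [["read:pricing"]])
}

# Subgraph B (Federation v1 style): extends Product with reviews.
type Product @key(fields: "id") @extends {
  id: ID! @external
  reviews: [Review!]!
}

type Review {
  rating: Int!
  body: String
}
```

Cosmo's composition engine resolves both declarations into a single `Product` type in the federated graph, with `reviews` served by Subgraph B via the `id` key field.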
+ +### Key Technical Features +- **Federation v1 Directives**: @extends, @external, @key (including composite keys), @provides, @requires, @tag +- **Federation v2.0 Directives**: @inaccessible, @override, @shareable, @key "resolvable" argument, @link +- **Federation v2.1 Directives**: @requires "fields" argument (supported), @composeDirective (planned) +- **Federation v2.3 Directives**: @key on INTERFACE, @interfaceObject +- **Federation v2.5 Directives**: @authenticated, @requiresScopes + +### Integration Points +- Apollo Federation-compatible subgraphs +- Any GraphQL server implementing Federation spec +- Schema registries and CI/CD pipelines + +### Requirements & Prerequisites +- Subgraphs implementing Federation v1 or v2 specification +- Valid Federation schema with proper directive usage +- Cosmo account for graph management + +--- + +## Documentation References + +- Compatibility matrix: `/docs/federation/federation-compatibility-matrix` +- @shareable directive: `/docs/federation/directives/shareable` +- @authenticated directive: `/docs/federation/directives/authenticated` +- @requiresScopes directive: `/docs/federation/directives/requiresscopes` +- Federation overview: `/docs/federation` + +--- + +## Keywords & SEO + +### Primary Keywords +- Federation compatibility +- Apollo Federation support +- Federation v2 compatibility + +### Secondary Keywords +- GraphQL Federation directives +- Federation migration +- Subgraph compatibility + +### Related Search Terms +- Does Cosmo support Federation v2 +- Federation directive compatibility matrix +- Apollo Federation alternatives with v2 support + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/notifications/alerts-notifications.md b/capabilities/notifications/alerts-notifications.md new file mode 100644 index 00000000..e2598bab --- /dev/null +++ 
b/capabilities/notifications/alerts-notifications.md @@ -0,0 +1,199 @@ +# Alerts & Notifications + +Multi-channel alerting system for schema changes and graph events in your federated GraphQL platform. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-notifications-001` | +| **Category** | Notifications | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-notifications-002`, `cap-notifications-003` | + +--- + +## Quick Reference + +### Name +Alerts & Notifications + +### Tagline +Stay informed on every schema change instantly. + +### Elevator Pitch +Cosmo's Alerts & Notifications system keeps your team informed about critical changes to your federated graphs in real-time. Whether through webhooks for custom integrations or direct Slack notifications, you never miss important schema updates that could impact your API consumers. + +--- + +## Problem & Solution + +### The Problem +In federated GraphQL architectures, schema changes happen frequently as multiple teams evolve their subgraphs independently. Without proper alerting, teams miss critical updates, breaking changes go unnoticed, and coordination between API producers and consumers becomes chaotic. Manual monitoring is impractical at scale, and traditional logging tools lack GraphQL-specific context. + +### The Solution +Cosmo provides a unified notification hub that automatically detects and alerts on schema changes across your entire federated graph. Configure multiple notification channels per organization, subscribe to specific events, and ensure the right people know about changes the moment they happen—all without writing custom monitoring code. 
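When a team does consume the webhook channel programmatically, verifying authenticity is the only code strictly required. A minimal Node.js sketch, assuming the signature header carries the hex-encoded HMAC-SHA256 of the raw request body (as described in the Technical Summary below):

```javascript
import crypto from 'crypto';

// Verify the X-Cosmo-Signature-256 header against the raw request body.
// timingSafeEqual avoids leaking information through comparison timing.
function isValidSignature(rawBody, signatureHeader, secret) {
  const expected = crypto
    .createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex');
  const computed = Buffer.from(expected, 'utf8');
  const received = Buffer.from(signatureHeader ?? '', 'utf8');
  return (
    computed.length === received.length &&
    crypto.timingSafeEqual(computed, received)
  );
}
```

Reject any request that fails this check before parsing or acting on the payload.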
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manually checking for schema changes | Automatic instant notifications | +| Missing critical breaking changes | Real-time alerts on every schema update | +| No visibility into who changed what | Event payloads include actor information | +| Siloed knowledge in individual teams | Organization-wide notification channels | + +--- + +## Key Benefits + +1. **Real-Time Awareness**: Receive instant notifications the moment your federated graph schema changes +2. **Multi-Channel Flexibility**: Choose between webhooks for custom integrations or native Slack integration for direct team communication +3. **Event Filtering**: Subscribe only to the events that matter to your team, reducing noise +4. **Secure Delivery**: HMAC signature verification ensures webhook payloads are authentic and untampered +5. **Team Coordination**: Keep all stakeholders informed automatically, improving cross-team alignment + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Product Owner +- **Pain Points**: Difficulty keeping track of schema changes across multiple subgraphs; teams working in silos without visibility into each other's changes +- **Goals**: Maintain awareness of all graph changes; ensure smooth coordination between API producers and consumers + +### Secondary Personas +- Backend developers who need to know when dependent schemas change +- DevOps/SRE teams monitoring production graph health +- Engineering managers tracking graph evolution + +--- + +## Use Cases + +### Use Case 1: Cross-Team Schema Coordination +**Scenario**: Multiple teams contribute subgraphs to a shared federated graph, and downstream teams need to know when upstream schemas change. +**How it works**: Configure Slack notifications for the `FEDERATED_GRAPH_SCHEMA_UPDATED` event. 
When any subgraph publishes changes that update the federated schema, all subscribed channels receive an alert with the graph name, namespace, and whether errors occurred. +**Outcome**: Teams are immediately aware of changes and can proactively update their integrations before issues arise in production. + +### Use Case 2: Custom CI/CD Pipeline Integration +**Scenario**: An organization wants to trigger automated tests or deployment workflows whenever the federated graph schema updates. +**How it works**: Set up a webhook endpoint that receives schema change events. The webhook payload includes the federated graph ID, name, namespace, and error status. The CI/CD system parses this payload and triggers appropriate downstream jobs. +**Outcome**: Fully automated response to schema changes, ensuring tests run and documentation updates without manual intervention. + +### Use Case 3: Production Change Auditing +**Scenario**: A compliance-focused organization needs to track and audit all schema changes with actor attribution. +**How it works**: Configure webhooks to capture all schema update events. Each event includes an optional `actor_id` field identifying who made the change. The webhook endpoint logs these events to an audit system. +**Outcome**: Complete audit trail of all schema changes with attribution, satisfying compliance requirements. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Native integration with both webhooks and Slack, not requiring third-party middleware +2. GraphQL-specific event types designed for federated architectures +3. 
Organization-level configuration supporting multiple notification channels simultaneously + +### Comparison with Alternatives + +| Aspect | Cosmo | Generic Webhook Services | Manual Monitoring | +|--------|-------|--------------------------|-------------------| +| GraphQL-specific events | Yes | No | N/A | +| Native Slack integration | Yes | Requires middleware | No | +| HMAC verification | Built-in | Varies | N/A | +| Zero setup code | Yes | Requires integration code | High effort | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We already have monitoring tools" | Cosmo's notifications are GraphQL-aware, providing schema-specific context that generic monitoring tools miss | +| "Too many notifications will create noise" | Event filtering lets you subscribe only to specific events, and you can configure multiple channels for different audiences | + +--- + +## Technical Summary + +### How It Works +Cosmo monitors your federated graph for schema changes and other significant events. When an event occurs, the system dispatches notifications to all configured channels (webhooks and Slack integrations). Webhook deliveries include HMAC signatures for verification, and Slack messages are sent directly to your chosen channels via the Slack API. + +### Key Technical Features +- Event-driven architecture with immediate dispatch +- HMAC-SHA256 signature verification for webhooks via `X-Cosmo-Signature-256` header +- Structured JSON payloads with versioned event schemas +- Multiple webhook and Slack integrations per organization +- Namespace-aware event filtering + +### Integration Points +- Custom webhook endpoints (any HTTP endpoint) +- Slack workspaces and channels +- CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, etc.) 
+- Audit and logging systems + +### Requirements & Prerequisites +- Active Cosmo organization +- For webhooks: publicly accessible HTTPS endpoint +- For Slack: Slack workspace with permission to add integrations + +--- + +## Proof Points + +### Metrics & Benchmarks +- Near-instant notification delivery (typically under 1 second) +- Support for multiple simultaneous notification channels +- Cryptographically secure webhook verification + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/studio/alerts-and-notifications` +- Webhooks guide: `/docs/studio/alerts-and-notifications/webhooks` +- Slack integration: `/docs/studio/alerts-and-notifications/slack-integration` +- Platform webhooks: `/docs/control-plane/webhooks` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL notifications +- Federated graph alerts +- Schema change notifications + +### Secondary Keywords +- GraphQL webhooks +- API schema monitoring +- Federation alerts + +### Related Search Terms +- How to get notified of GraphQL schema changes +- Federated GraphQL monitoring +- Apollo Federation alternative notifications +- GraphQL Slack integration + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/notifications/slack-integration.md b/capabilities/notifications/slack-integration.md new file mode 100644 index 00000000..82f9e61d --- /dev/null +++ b/capabilities/notifications/slack-integration.md @@ -0,0 +1,200 @@ +# Slack Integration + +Direct Slack channel notifications for federated graph events without any middleware. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-notifications-003` | +| **Category** | Notifications | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-notifications-001`, `cap-notifications-002` | + +--- + +## Quick Reference + +### Name +Slack Integration + +### Tagline +Graph updates delivered directly to your Slack channels. + +### Elevator Pitch +Cosmo's native Slack Integration brings federated graph notifications directly into your team's communication hub. No middleware, no custom bots—just authorize the integration, select your channel, and start receiving instant alerts about schema changes. Keep your entire team informed without leaving Slack. + +--- + +## Problem & Solution + +### The Problem +Development teams live in Slack. When federated graph schemas change, the people who need to know are scattered across channels, time zones, and responsibilities. Building custom Slack bots or routing webhooks through middleware adds complexity, maintenance burden, and potential points of failure. Teams end up missing critical updates or spending hours building integrations that should be standard. + +### The Solution +Cosmo provides native Slack integration that connects directly to your workspace with a few clicks. Authorize the WunderGraph Cosmo app, select your notification channel, choose which events to receive, and you're done. No code to write, no infrastructure to maintain—just reliable, instant notifications where your team already works. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Building custom Slack bots | Native one-click integration | +| Routing webhooks through middleware | Direct Slack API connection | +| Maintaining bot infrastructure | Zero maintenance overhead | +| Manual notification setup per channel | Organization-wide configuration | + +--- + +## Key Benefits + +1. 
**Zero-Code Setup**: Connect to Slack in minutes with OAuth-based authorization, no development required +2. **Native Integration**: Direct connection to Slack's API, not a workaround through webhooks or middleware +3. **Channel Flexibility**: Send notifications to any channel in your workspace +4. **Event Selection**: Subscribe only to the events your team cares about +5. **Team Visibility**: Keep everyone informed in the collaboration tool they already use daily + +--- + +## Target Audience + +### Primary Persona +- **Role**: Engineering Manager / Team Lead +- **Pain Points**: Keeping the team informed about graph changes without creating noise; avoiding custom integration maintenance +- **Goals**: Streamline team communication; ensure critical updates reach the right people + +### Secondary Personas +- Platform engineers who want simple notification setup +- Developers who prefer receiving updates in Slack over email +- DevOps teams managing notification infrastructure + +--- + +## Use Cases + +### Use Case 1: Team-Wide Schema Change Awareness +**Scenario**: A platform team wants all engineers to see when the federated graph schema updates so they can check for impacts on their services. +**How it works**: Set up Slack integration pointing to a shared engineering channel. Subscribe to `FEDERATED_GRAPH_SCHEMA_UPDATED` events. When any schema change occurs, the entire team sees it instantly in their daily communication channel. +**Outcome**: No more "I didn't know that changed" moments—everyone is informed simultaneously. + +### Use Case 2: Dedicated Alerts Channel +**Scenario**: An organization wants to separate graph notifications from regular team chat to maintain focus while ensuring visibility. +**How it works**: Create a dedicated `#graph-alerts` channel in Slack. Configure the Cosmo Slack integration to post all notifications to this channel. Team members can join to stay informed or configure Slack notifications based on their role. 
+**Outcome**: Clean separation between alerts and discussion, with full visibility available to those who need it. + +### Use Case 3: Multi-Channel Notification Routing +**Scenario**: Different teams need to know about changes to different federated graphs in a multi-graph organization. +**How it works**: Set up multiple Slack integrations, each with a descriptive name and pointing to different team channels. Configure each integration to receive events relevant to that team's graphs. +**Outcome**: Targeted notifications reach the right teams without flooding everyone with irrelevant updates. + +--- + +## Competitive Positioning + +### Key Differentiators +1. True native Slack integration using official Slack OAuth, not webhook-to-Slack bridges +2. Purpose-built for GraphQL federation events with meaningful, formatted messages +3. Multiple named integrations per organization for complex notification routing + +### Comparison with Alternatives + +| Aspect | Cosmo Slack Integration | Webhook + Slack Middleware | Custom Slack Bot | +|--------|-------------------------|---------------------------|------------------| +| Setup time | 2 minutes | 30+ minutes | Hours/days | +| Code required | None | Some | Significant | +| Maintenance | None | Medium | High | +| Official Slack OAuth | Yes | No | Yes | +| GraphQL-aware messages | Yes | Requires custom formatting | Requires implementation | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We already have a custom Slack bot" | Cosmo's integration is purpose-built for federation events and requires zero maintenance—let your bot handle other tasks | +| "We need custom message formatting" | For advanced customization, use webhooks with your own formatting; for standard use cases, the native integration provides clear, consistent messages | +| "What permissions does it need?" 
| The integration only requests permission to post to the channel you select—no access to messages, files, or other channels | + +--- + +## Technical Summary + +### How It Works +Cosmo's Slack Integration uses Slack's official OAuth 2.0 flow. When you click "Integrate," you're redirected to Slack to authorize the WunderGraph Cosmo app and select a channel. Once authorized, you name the integration and select which events to receive. Cosmo then posts formatted notifications directly to your chosen channel via the Slack API whenever subscribed events occur. + +### Key Technical Features +- Official Slack OAuth 2.0 authorization flow +- Channel selection during setup +- Named integrations for easy management +- Event type subscription filtering +- Formatted messages with graph details + +### Integration Points +- Any Slack workspace (standard or Enterprise Grid) +- Public or private channels (based on authorization) +- Integration with existing Slack workflows and notification settings + +### Requirements & Prerequisites +- Slack workspace with permission to add integrations +- Channel where you want to receive notifications +- Permission to authorize the WunderGraph Cosmo app + +--- + +## Proof Points + +### Metrics & Benchmarks +- Setup completed in under 2 minutes +- Zero lines of code required +- Instant delivery to Slack channels +- Support for multiple integrations per organization + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/studio/alerts-and-notifications/slack-integration` +- Overview: `/docs/studio/alerts-and-notifications` +- Webhooks alternative: `/docs/studio/alerts-and-notifications/webhooks` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL Slack notifications +- 
Slack integration for API changes +- Federation schema Slack alerts + +### Secondary Keywords +- GraphQL team notifications +- API change Slack bot +- Federated graph Slack integration + +### Related Search Terms +- How to get GraphQL schema changes in Slack +- Slack notifications for API updates +- Federation monitoring Slack +- GraphQL alerts Slack channel + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/notifications/webhook-notifications.md b/capabilities/notifications/webhook-notifications.md new file mode 100644 index 00000000..07032186 --- /dev/null +++ b/capabilities/notifications/webhook-notifications.md @@ -0,0 +1,222 @@ +# Webhook Notifications + +Custom webhook integration for receiving Cosmo events in your own systems and workflows. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-notifications-002` | +| **Category** | Notifications | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-notifications-001`, `cap-notifications-003` | + +--- + +## Quick Reference + +### Name +Webhook Notifications + +### Tagline +Receive graph events anywhere via secure webhooks. + +### Elevator Pitch +Cosmo Webhook Notifications let you integrate federated graph events directly into your existing systems and workflows. Set up secure, verified webhook endpoints to receive real-time notifications about schema changes, trigger CI/CD pipelines, update documentation, or feed custom dashboards—all with cryptographic verification to ensure data integrity. + +--- + +## Problem & Solution + +### The Problem +Organizations need to react programmatically to changes in their federated GraphQL graphs. When schemas update, downstream systems—CI/CD pipelines, documentation generators, monitoring dashboards, and custom tooling—need to be notified automatically. 
Without a native webhook system, teams resort to polling APIs or building custom event infrastructure, wasting engineering time and introducing latency. + +### The Solution +Cosmo's Webhook Notifications provide a first-class event delivery system that pushes graph events to any HTTP endpoint in real-time. Each webhook request includes HMAC signature verification, ensuring payloads are authentic and haven't been tampered with. Configure multiple webhooks per organization, each subscribing to specific events, and integrate seamlessly with your existing infrastructure. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Polling APIs for changes | Real-time push notifications | +| Building custom event infrastructure | Out-of-the-box webhook system | +| No verification of event authenticity | HMAC-SHA256 signature verification | +| Single integration point | Multiple webhooks with independent configurations | + +--- + +## Key Benefits + +1. **Real-Time Event Delivery**: Receive notifications immediately when events occur, no polling required +2. **Cryptographic Verification**: HMAC-SHA256 signatures ensure webhook payloads are authentic and untampered +3. **Flexible Integration**: Connect to any system that can receive HTTP requests—CI/CD, monitoring, custom tools +4. **Structured Payloads**: Versioned JSON payloads with consistent schema for reliable parsing +5. 
**Multiple Webhooks**: Configure multiple endpoints per organization with independent event subscriptions + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / DevOps Engineer +- **Pain Points**: Need to trigger automated workflows when graph changes occur; require secure, verifiable event delivery +- **Goals**: Automate responses to schema changes; integrate Cosmo events into existing infrastructure + +### Secondary Personas +- Backend developers building custom tooling around the federated graph +- Security engineers requiring audit trails and verified event sources +- SRE teams integrating with monitoring and alerting systems + +--- + +## Use Cases + +### Use Case 1: Automated CI/CD Triggers +**Scenario**: An organization wants to automatically run integration tests whenever the federated graph schema updates. +**How it works**: Configure a webhook pointing to a CI/CD trigger endpoint (e.g., GitHub Actions webhook, Jenkins trigger). Subscribe to the `FEDERATED_GRAPH_SCHEMA_UPDATED` event. When the schema updates, the webhook fires and triggers the test pipeline. +**Outcome**: Every schema change automatically triggers integration tests, catching issues before they reach production. + +### Use Case 2: Documentation Auto-Generation +**Scenario**: API documentation needs to stay synchronized with the current federated schema. +**How it works**: Set up a webhook endpoint that receives schema update events. The endpoint triggers a documentation generation workflow that fetches the latest schema and regenerates API docs. +**Outcome**: Documentation is always current, reducing support burden and improving developer experience for API consumers. + +### Use Case 3: Change Audit Logging +**Scenario**: A financial services company requires complete audit trails of all API schema changes for compliance. +**How it works**: Configure a webhook to send all schema events to a secure audit logging service. 
Each payload includes the federated graph details, error status, and actor ID. The logging service stores these with timestamps for compliance reporting. +**Outcome**: Complete, verifiable audit trail satisfying regulatory requirements with cryptographic proof of event authenticity. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Built specifically for GraphQL federation events, providing rich, contextual payloads +2. HMAC signature verification included by default, not an add-on +3. Organization-level management supporting multiple independent webhook configurations + +### Comparison with Alternatives + +| Aspect | Cosmo Webhooks | Generic Event Systems | Custom Solutions | +|--------|----------------|----------------------|------------------| +| GraphQL-specific events | Yes | No | Requires implementation | +| Built-in HMAC verification | Yes | Varies | Requires implementation | +| Setup time | Minutes | Hours | Days/weeks | +| Maintenance burden | None | Medium | High | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We need custom event formats" | Cosmo's versioned JSON payloads are standard and easily transformed in your receiving endpoint | +| "How do we know events are legitimate?" | Every webhook request includes HMAC-SHA256 signature in the X-Cosmo-Signature-256 header for cryptographic verification | +| "What if our endpoint is temporarily down?" | Configure multiple webhooks as backup, and implement standard retry logic in your infrastructure | + +--- + +## Technical Summary + +### How It Works +When you create a webhook in Cosmo, you provide an endpoint URL and a secret key. When subscribed events occur, Cosmo sends an HTTP POST request to your endpoint with a JSON payload. The request includes an `X-Cosmo-Signature-256` header containing an HMAC-SHA256 signature computed using your secret, allowing you to verify the payload's authenticity. 
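After verification, handling an event is a matter of routing on its name. A hedged sketch — the object shape below is illustrative, assembled from the payload fields this page describes (version, event name, federated graph id/name/namespace, error status, actor ID), and is not the authoritative schema:

```javascript
// Illustrative event -- field names are assumptions, not the official payload schema.
const exampleEvent = {
  version: 1,
  event: 'FEDERATED_GRAPH_SCHEMA_UPDATED',
  payload: {
    federated_graph: { id: 'graph-1', name: 'production-graph', namespace: 'default' },
    errors: false,
    actor_id: 'user_123',
  },
};

// Route a verified event to the handler registered for its event name.
function dispatch(event, handlers) {
  const handler = handlers[event.event];
  return handler ? handler(event.payload) : null;
}

// Example: turn a schema-update event into a CI trigger or a log line.
const message = dispatch(exampleEvent, {
  FEDERATED_GRAPH_SCHEMA_UPDATED: (p) =>
    `Schema updated: ${p.federated_graph.name} (ns: ${p.federated_graph.namespace})` +
    (p.errors ? ' [composition errors]' : ''),
});
```

Unrecognized event names fall through to `null`, so a receiver stays forward-compatible as new event types are added.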
+ +### Key Technical Features +- HTTP POST delivery to any accessible endpoint +- HMAC-SHA256 signature verification via `X-Cosmo-Signature-256` header +- Versioned JSON payloads (current version: 1) +- Event types: `FEDERATED_GRAPH_SCHEMA_UPDATED` +- Payload includes: federated graph details (id, name, namespace), error status, actor ID + +### Integration Points +- CI/CD systems (GitHub Actions, GitLab CI, Jenkins, CircleCI) +- Monitoring platforms (Datadog, PagerDuty, custom dashboards) +- Documentation generators +- Audit and compliance systems +- Custom internal tooling + +### Requirements & Prerequisites +- Publicly accessible HTTPS endpoint +- Ability to verify HMAC-SHA256 signatures +- Secret key for webhook verification (you provide this during setup) + +--- + +## Proof Points + +### Metrics & Benchmarks +- Sub-second event delivery +- Standard HMAC-SHA256 cryptographic verification +- Support for unlimited webhooks per organization + +### Code Example: Signature Verification + +```javascript +import crypto from 'crypto'; + +function verifySignature(body, receivedSignature, secret) { + const computedSignature = crypto + .createHmac('sha256', secret) + .update(body) + .digest('hex'); + + // Compare in constant time to avoid leaking signature bytes via timing. + const expected = Buffer.from(computedSignature, 'hex'); + const received = Buffer.from(receivedSignature ?? '', 'hex'); + return expected.length === received.length && + crypto.timingSafeEqual(expected, received); +} + +// Usage: verify against the raw request body bytes, not a re-serialized +// object, so the HMAC input matches exactly what Cosmo sent. +const isVerified = verifySignature( + rawBody, // e.g. captured via your framework's raw-body middleware + req.headers['x-cosmo-signature-256'], + YOUR_SECRET +); +``` + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/studio/alerts-and-notifications/webhooks` +- Overview: `/docs/studio/alerts-and-notifications` +- Platform webhooks: `/docs/control-plane/webhooks` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL webhooks +- Federation event webhooks +- Schema change
webhooks + +### Secondary Keywords +- GraphQL event notifications +- API webhook integration +- Federated graph events + +### Related Search Terms +- How to receive GraphQL schema change events +- Webhook integration for GraphQL federation +- Secure webhook verification GraphQL +- HMAC webhook signature verification + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/observability/access-logs.md b/capabilities/observability/access-logs.md new file mode 100644 index 00000000..d71c095d --- /dev/null +++ b/capabilities/observability/access-logs.md @@ -0,0 +1,233 @@ +# Access Logs + +Configurable request logging to stdout or file with custom fields. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-obs-007` | +| **Category** | Observability | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-obs-001`, `cap-obs-002` | + +--- + +## Quick Reference + +### Name +Access Logs + +### Tagline +Detailed request logging for traffic analysis. + +### Elevator Pitch +Access logs provide comprehensive visibility into every request flowing through your Cosmo Router. Log to stdout for container environments or to files for high-load scenarios. Customize logged fields to capture request headers, GraphQL operation details, and timing information - all in structured JSON format ready for analysis. + +--- + +## Problem & Solution + +### The Problem +Teams need detailed logs of GraphQL traffic for debugging, auditing, and compliance purposes. Standard HTTP logs lack GraphQL context like operation names, types, and error details. High-traffic systems need efficient file-based logging, while container deployments need stdout output. Custom fields are needed for correlation with internal systems. 
+ +### The Solution +Cosmo Router's access logs capture comprehensive request information including GraphQL-specific context. Teams can log to stdout for containerized environments or to files for high-load production systems. Custom fields extract values from headers, request context, or expressions, enabling rich correlation and debugging capabilities. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Generic HTTP logs only | GraphQL-aware request logging | +| No operation context | Operation name, type, hash logged | +| Fixed log format | Customizable fields from headers/context | +| stdout only | stdout or file output options | + +--- + +## Key Benefits + +1. **GraphQL Context**: Log operation names, types, hashes, and timing breakdowns +2. **Flexible Output**: stdout for containers, file output for high-load scenarios +3. **Custom Fields**: Extract values from headers, context, or expressions +4. **Subgraph Logging**: Separate logs for router and subgraph requests +5. **Structured Format**: JSON output for easy parsing and analysis + +--- + +## Target Audience + +### Primary Persona +- **Role**: DevOps Engineer / SRE +- **Pain Points**: Lack of GraphQL context in logs; difficulty correlating requests; compliance requirements +- **Goals**: Comprehensive audit trail; efficient debugging; traffic analysis + +### Secondary Personas +- Security teams requiring audit logs +- Backend developers debugging issues +- Compliance officers reviewing access patterns + +--- + +## Use Cases + +### Use Case 1: Request Debugging +**Scenario**: A specific GraphQL operation is failing intermittently and the team needs to understand the pattern. +**How it works**: Enable access logs with custom fields for operation name, error codes, and service names. Filter logs by operation to analyze failure patterns. +**Outcome**: Discovered that failures correlate with specific client versions; targeted fix deployed. 
+ +### Use Case 2: Compliance Auditing +**Scenario**: Security team requires logs of all API access with client identification. +**How it works**: Configure access logs with fields extracted from authentication headers and client identifiers. Store logs to file for retention. +**Outcome**: Complete audit trail for compliance review with client attribution. + +### Use Case 3: Performance Analysis +**Scenario**: Team needs to understand operation timing breakdown for optimization. +**How it works**: Enable custom fields for parsing, planning, normalization, and validation times. Analyze logs to identify slow operations. +**Outcome**: Identified operations with excessive planning time; optimized schema structure. + +--- + +## Competitive Positioning + +### Key Differentiators +1. GraphQL-native log fields (operation name, type, hash) +2. Timing breakdown (parsing, planning, normalization, validation) +3. Subgraph-level logging for federated requests +4. Expression-based custom fields for complex extractions + +### Comparison with Alternatives + +| Aspect | Cosmo | Generic Logging | Custom Solution | +|--------|-------|-----------------|-----------------| +| GraphQL Context | Native | None | Manual | +| Custom Fields | Header + Expression | Limited | Manual | +| Subgraph Logs | Built-in | N/A | Manual | +| Performance | Optimized | Varies | Varies | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Logging impacts performance" | File-based logging is optimized for high load; configurable fields minimize overhead | +| "We use centralized logging" | stdout output integrates with any log aggregator; JSON format enables easy parsing | +| "Too much data to store" | Configurable fields let you capture only what you need; log level controls verbosity | + +--- + +## Technical Summary + +### How It Works +Access logs capture request information at configurable points in the request lifecycle. 
Logs can be output to stdout (default) or to a file for high-load scenarios. Custom fields extract values from request headers, context (populated during request processing), or expressions. Structured JSON format enables easy parsing by log aggregation systems. + +### Key Technical Features + +**Default Fields:** +- `hostname`, `pid`, `request_id`, `trace_id` +- `status`, `method`, `path`, `query` +- `ip` (redacted by default), `user_agent` +- `config_version`, `latency`, `log_type` + +**Custom Context Fields:** +- `operation_name`, `operation_type`, `operation_hash` +- `operation_sha256`, `persisted_operation_sha256` +- `operation_parsing_time`, `operation_planning_time` +- `operation_normalization_time`, `operation_validation_time` +- `graphql_error_codes`, `graphql_error_service_names` +- `request_error`, `response_error_message` + +**Configuration Example:** +```yaml +access_logs: + enabled: true + level: info + output: + file: + enabled: true + path: /var/log/gateway/access.log + mode: '0644' + router: + fields: + - key: "operationName" + value_from: + context_field: operation_name + - key: "service" + value_from: + request_header: "x-service" +``` + +### Integration Points +- Log aggregation systems (ELK, Splunk, etc.) +- Container logging (stdout for Docker, Kubernetes) +- File-based log rotation tools +- SIEM systems for security analysis + +### Requirements & Prerequisites +- Cosmo Router 0.118.0+ for access logs +- Cosmo Router 0.146.0+ for subgraph access logs +- Cosmo Router 0.186.0+ for expression fields +- Write access for file-based logging + +--- + +## Proof Points + +### Metrics & Benchmarks +- File-based logging recommended for high-load scenarios +- Configurable file permissions (default: 0640) +- Structured JSON output for efficient parsing + +### Customer Quotes +> "The timing breakdown fields in access logs helped us identify that query planning was our bottleneck. We couldn't have found this without the GraphQL-specific context." 
- Backend Engineer + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/access-logs` +- Expression fields: `/docs/router/configuration/template-expressions` +- Configuration reference: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL access logs +- Request logging GraphQL +- Federation access logs + +### Secondary Keywords +- GraphQL audit logging +- Custom log fields GraphQL +- Subgraph request logging + +### Related Search Terms +- How to log GraphQL requests +- GraphQL operation logging +- Federation request tracing logs +- Custom fields access logs + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/observability/advanced-request-tracing.md b/capabilities/observability/advanced-request-tracing.md new file mode 100644 index 00000000..96e78999 --- /dev/null +++ b/capabilities/observability/advanced-request-tracing.md @@ -0,0 +1,201 @@ +# Advanced Request Tracing (ART) + +Detailed execution plan tracing with Playground visualization for deep debugging. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-obs-003` | +| **Category** | Observability | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-obs-002`, `cap-obs-001` | + +--- + +## Quick Reference + +### Name +Advanced Request Tracing (ART) + +### Tagline +See exactly how the Router resolves every request. 
+ +### Elevator Pitch +Advanced Request Tracing (ART) reveals the complete execution plan for every GraphQL request, showing exactly how the Router resolves queries across your federated graph. Debug complex queries, understand fetch patterns, and optimize performance with detailed timing and data flow visibility - directly in the GraphQL Playground. + +--- + +## Problem & Solution + +### The Problem +When GraphQL queries behave unexpectedly in a federated architecture, developers struggle to understand why. They cannot see the execution plan the router generates, what subgraph requests are made, in what order, or how data flows between fetches. Without this visibility, optimizing query performance or debugging unexpected behavior becomes guesswork. + +### The Solution +ART renders the complete execution plan as JSON in the GraphQL response extensions. Developers see exactly what fetches the router performs (parallel, serial, entity, batch), the actual requests sent to subgraphs, input/output data for each operation, and precise timing information. This deep visibility enables rapid debugging and optimization directly from the Playground. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Guessing how queries are executed | Complete execution plan visibility | +| Unknown subgraph request patterns | See parallel, serial, and batch fetches | +| No insight into data flow | View input/output for each fetch | +| Unclear latency sources | Precise timing for every operation | + +--- + +## Key Benefits + +1. **Complete Execution Visibility**: See the full execution plan including fetch order and dependencies +2. **Subgraph Request Details**: View actual requests sent to each subgraph with input data +3. **Performance Analysis**: Precise timing for planning and each load operation +4. **Production-Safe Debugging**: Secure mechanism enables debugging production routers from Studio +5. 
**Playground Integration**: Results displayed directly in the GraphQL Playground interface + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / GraphQL Engineer +- **Pain Points**: Cannot understand how federated queries are executed; difficulty optimizing complex queries +- **Goals**: Debug unexpected behavior; optimize query performance; understand federation execution + +### Secondary Personas +- Platform engineers troubleshooting production issues +- DevOps engineers investigating slow requests +- Technical leads reviewing query patterns + +--- + +## Use Cases + +### Use Case 1: Query Optimization +**Scenario**: A product page query takes 2 seconds but the team cannot identify the bottleneck. +**How it works**: Enable ART in Playground, execute the query, examine the trace to see that 5 serial fetches are being made when they could be parallelized. +**Outcome**: Schema restructuring enables parallel fetches, reducing latency to 400ms. + +### Use Case 2: Understanding Federation Behavior +**Scenario**: A developer is new to federation and wants to understand how their query is being resolved. +**How it works**: Enable ART for their query, see the execution plan showing entity fetches, key fields used, and data flow between subgraphs. +**Outcome**: Developer gains understanding of federation execution patterns and designs more efficient queries. + +### Use Case 3: Production Debugging +**Scenario**: A specific query works in staging but fails intermittently in production. +**How it works**: Use the secure ART mechanism to trace the query against production router, compare execution plans between environments. +**Outcome**: Discover that production has different subgraph response shapes causing data mapping issues. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Complete execution plan visibility including fetch dependencies +2. Secure production debugging via Cosmo Studio integration +3. 
Granular control over what trace data is included +4. Direct integration with GraphQL Playground + +### Comparison with Alternatives + +| Aspect | Cosmo ART | Query Plans | Manual Logging | +|--------|-----------|-------------|----------------| +| Execution Plan | Full detail | Partial | None | +| Fetch Details | Input/Output | Limited | Manual | +| Production Safe | Yes | N/A | Risk | +| Setup Required | Enabled by default | Schema changes | Code changes | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Security concern in production" | ART requires secure authentication via control plane connection; never exposes data to unauthorized users | +| "Performance overhead" | ART is only active when explicitly requested via header; no overhead on normal requests | +| "Too much information" | Configurable exclusions let you omit planner stats, input data, or output data as needed | + +--- + +## Technical Summary + +### How It Works +When ART is enabled via the `X-WG-Trace` header or `wg_trace` query parameter, the Router captures detailed execution information and includes it in the response under `extensions.trace`. This includes the execution plan structure, fetch types (parallel, serial, entity, batch), actual subgraph requests, input/output data, and timing information. 
+ +### Key Technical Features +- Enabled by default (can be disabled via `ENGINE_ENABLE_REQUEST_TRACING=false`) +- Activated per-request via header (`X-WG-Trace=true`) or query parameter (`wg_trace=true`) +- Configurable exclusions: planner stats, raw input, rendered input, output, load stats +- Secure production access requires control plane connection (Router 0.42.3+) +- Debug mode available in development via `DEV_MODE=true` + +### Integration Points +- GraphQL Playground for visualization +- Cosmo Studio for secure production access +- Control Plane for authentication + +### Requirements & Prerequisites +- Cosmo Router 0.42.3+ for production debugging +- Control Plane connection for secure access +- Development mode (`DEV_MODE=true`) for local debugging + +--- + +## Proof Points + +### Metrics & Benchmarks +- Zero overhead when not activated +- Complete trace data in response extensions +- Works with all GraphQL operation types (except subscriptions currently) + +### Customer Quotes +> "ART changed how we debug federation issues. Being able to see exactly what requests go to which subgraphs saved us countless hours." 
- GraphQL Platform Engineer + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/advanced-request-tracing-art` +- Configuration reference: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL execution plan +- Federation request tracing +- Query debugging GraphQL + +### Secondary Keywords +- GraphQL performance debugging +- Federated query optimization +- Subgraph request visibility + +### Related Search Terms +- How to debug GraphQL federation +- GraphQL query execution plan +- Federation fetch optimization +- GraphQL slow query debugging + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/observability/distributed-tracing.md b/capabilities/observability/distributed-tracing.md new file mode 100644 index 00000000..4779717b --- /dev/null +++ b/capabilities/observability/distributed-tracing.md @@ -0,0 +1,204 @@ +# Distributed Tracing + +End-to-end request tracing across federated services with visualization in Cosmo Studio. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-obs-002` | +| **Category** | Observability | +| **Status** | GA | +| **Availability** | Pro / Enterprise | +| **Related Capabilities** | `cap-obs-001`, `cap-obs-003` | + +--- + +## Quick Reference + +### Name +Distributed Tracing + +### Tagline +Debug federation issues in minutes, not hours. + +### Elevator Pitch +Distributed Tracing provides end-to-end visibility into every GraphQL request as it flows through your federated graph. 
Instantly identify slow subgraphs, pinpoint errors, and understand the complete request lifecycle - all from a single dashboard in Cosmo Studio. + +--- + +## Problem & Solution + +### The Problem +When a GraphQL query fails or performs slowly in a federated architecture, developers waste hours trying to identify which subgraph is responsible. With requests potentially touching dozens of services, traditional logging and monitoring tools lack the context needed to correlate events across service boundaries. Teams spend more time debugging than building features. + +### The Solution +Cosmo's Distributed Tracing automatically instruments your entire federated graph, capturing detailed timing and context for every operation. Each request gets a unique trace ID that follows it through the Router and into each subgraph, making it trivial to identify exactly where issues occur. The Studio UI provides interactive visualization with span details, attributes, and error information. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Hours spent correlating logs across services | Single trace view shows complete request path | +| Guessing which subgraph caused latency | Precise timing breakdown per subgraph | +| No visibility into resolver-level performance | Span-level execution insights | +| Manual trace ID propagation | Automatic trace context propagation | + +--- + +## Key Benefits + +1. **Reduce MTTR by 80%**: Pinpoint the exact subgraph and resolver causing issues within minutes +2. **Proactive Performance Optimization**: Identify slow paths before users complain +3. **Zero-Code Instrumentation**: Works automatically with the Cosmo Router +4. **OpenTelemetry Compatible**: Export traces to your existing observability stack +5. 
**Auto-Refresh UI**: Dashboard updates every 10 seconds for real-time monitoring + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: On-call debugging is painful; lack of visibility into federated requests; correlating errors across services +- **Goals**: Reduce incident response time; maintain SLAs; proactive issue detection + +### Secondary Personas +- Backend developers debugging performance issues +- Engineering managers tracking system health +- DevOps engineers monitoring production systems + +--- + +## Use Cases + +### Use Case 1: Production Incident Response +**Scenario**: A critical checkout API starts returning errors intermittently during peak traffic. +**How it works**: Filter traces by error status in Studio, see the exact subgraph returning errors, view the error message, extension codes, and full stack trace in the span details. +**Outcome**: Root cause identified in 5 minutes instead of 2 hours - the inventory subgraph was timing out due to database connection exhaustion. + +### Use Case 2: Performance Optimization +**Scenario**: Users report slow page loads on the product catalog, but the team cannot identify the bottleneck. +**How it works**: Analyze traces for the product query in Studio, identify the inventory subgraph adding 800ms latency, drill into specific span to see the slow database query. +**Outcome**: Targeted optimization of database indexes reduces latency by 60%. + +### Use Case 3: New Deployment Validation +**Scenario**: After deploying a new version of the reviews subgraph, the team wants to verify performance hasn't regressed. +**How it works**: Compare trace durations before and after deployment, filter by config version to isolate the new deployment's behavior. +**Outcome**: Early detection of a regression allows rollback before users are impacted. + +--- + +## Competitive Positioning + +### Key Differentiators +1. 
GraphQL-native trace visualization with operation context +2. Automatic instrumentation requiring no code changes +3. Integrated with Cosmo Studio for unified experience +4. Supports trace context propagation across service boundaries + +### Comparison with Alternatives + +| Aspect | Cosmo | Generic APM | DIY Tracing | +|--------|-------|-------------|-------------| +| GraphQL-Aware | Yes | No | Manual | +| Setup Effort | Zero-config | Integration required | Significant | +| Federation Support | Native | Limited | Manual | +| Unified Dashboard | Yes | Separate tool | Custom | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We already have Jaeger/Datadog" | Export traces to both Cosmo Cloud and your existing backend simultaneously | +| "Tracing adds latency" | Sampling and async export ensure minimal overhead on request path | +| "Our services already have tracing" | Cosmo propagates trace context so your existing spans are correlated with router spans | + +--- + +## Technical Summary + +### How It Works +The Cosmo Router automatically creates spans for each request phase: parsing, validation, planning, and execution. When the router fetches data from subgraphs, it propagates the trace context using W3C Trace Context headers. Subgraph spans are collected and displayed alongside router spans in the Studio UI, providing a complete view of the request lifecycle. 
+ +### Key Technical Features +- W3C Trace Context propagation (default) +- Optional Jaeger, B3, and Baggage propagation support +- Request trace ID in response headers (configurable) +- GraphQL variables export for query replay +- Span attributes with operation details +- Error capture with stack traces + +### Integration Points +- Cosmo Studio for visualization +- Any OpenTelemetry-compatible backend +- Existing APM tools via trace context propagation +- GraphQL Playground for query replay + +### Requirements & Prerequisites +- Cosmo Router with tracing enabled (default) +- Subgraphs should propagate trace context for complete traces +- Studio access for visualization + +--- + +## Proof Points + +### Metrics & Benchmarks +- Auto-refresh interval: 10 seconds +- Trace retention: Based on plan tier +- Zero additional code required for basic tracing + +### Customer Quotes +> "Before Cosmo, debugging federated queries was like finding a needle in a haystack. Now I can see exactly which subgraph is causing issues in seconds." 
- Senior Backend Engineer + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/studio/analytics/distributed-tracing` +- Tracing configuration: `/docs/router/open-telemetry` +- Trace propagation: `/docs/router/open-telemetry#trace-propagation` + +--- + +## Keywords & SEO + +### Primary Keywords +- Distributed tracing GraphQL +- GraphQL federation debugging +- Request tracing visualization + +### Secondary Keywords +- GraphQL performance monitoring +- Federated graph observability +- Trace context propagation + +### Related Search Terms +- How to debug federated GraphQL +- GraphQL slow query analysis +- End-to-end request tracing +- Subgraph latency debugging + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/observability/grafana-integration.md b/capabilities/observability/grafana-integration.md new file mode 100644 index 00000000..65e537d8 --- /dev/null +++ b/capabilities/observability/grafana-integration.md @@ -0,0 +1,212 @@ +# Grafana Integration + +Pre-built dashboards for visualizing Cosmo Router metrics. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-obs-005` | +| **Category** | Observability | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-obs-004`, `cap-obs-001` | + +--- + +## Quick Reference + +### Name +Grafana Integration + +### Tagline +Production-ready dashboards for GraphQL observability. + +### Elevator Pitch +Get immediate visibility into your federated GraphQL system with pre-built Grafana dashboards. 
Track cache efficiency, monitor Go runtime health, and visualize traffic patterns without building dashboards from scratch. Use the provided templates as a foundation and customize to match your specific monitoring needs. + +--- + +## Problem & Solution + +### The Problem +Teams deploying federated GraphQL need monitoring dashboards but building effective visualizations takes significant time and expertise. They need to understand which metrics matter, how to query them effectively, and how to present the data meaningfully. Without good dashboards, teams fly blind or waste time recreating common visualizations. + +### The Solution +Cosmo provides pre-built Grafana dashboards that visualize the most important router metrics out of the box. These dashboards cover cache performance, Go runtime health, and network traffic patterns. Teams can use them directly or customize them for their specific needs, accelerating time-to-visibility. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Build dashboards from scratch | Pre-built dashboards ready to use | +| Trial and error with queries | Proven PromQL queries included | +| Generic monitoring only | GraphQL-specific visualizations | +| Hours of dashboard development | Minutes to production visibility | + +--- + +## Key Benefits + +1. **Immediate Visibility**: Pre-built dashboards provide instant monitoring capability +2. **Best Practice Queries**: PromQL queries optimized for GraphQL observability +3. **Customizable Templates**: Use as foundation and extend for your needs +4. **Dual Data Sources**: Pre-configured for both Prometheus and ClickHouse +5. 
**Cache Insights**: Dedicated dashboard for GraphQL operation cache performance + +--- + +## Target Audience + +### Primary Persona +- **Role**: DevOps Engineer / SRE +- **Pain Points**: Time-consuming dashboard creation; uncertainty about what to monitor +- **Goals**: Rapid production visibility; effective monitoring setup + +### Secondary Personas +- Platform engineers building monitoring infrastructure +- Engineering managers needing operational visibility +- Developers debugging performance issues + +--- + +## Use Cases + +### Use Case 1: Quick Production Setup +**Scenario**: Team is deploying Cosmo Router to production and needs monitoring immediately. +**How it works**: Clone the Cosmo repository, run `make infra-debug-up` to start Grafana and Prometheus, configure the router metrics endpoint, and access pre-built dashboards. +**Outcome**: Complete monitoring visibility within 30 minutes of deployment. + +### Use Case 2: Cache Performance Optimization +**Scenario**: Team suspects cache inefficiency is causing unnecessary subgraph load. +**How it works**: Use the Router Cache Metrics dashboard to view hit ratios for execution, validation, and normalization caches. Identify low hit rates and adjust cache configuration. +**Outcome**: Cache hit rate improved from 60% to 95%, reducing subgraph requests by 40%. + +### Use Case 3: Memory Leak Investigation +**Scenario**: Router instances are experiencing memory growth over time. +**How it works**: Use the Go Runtime Metrics dashboard to track heap allocations, GC cycles, and goroutine counts. Correlate memory growth with specific traffic patterns or operations. +**Outcome**: Identified a subscription leak causing goroutine growth; fixed in application code. + +--- + +## Competitive Positioning + +### Key Differentiators +1. GraphQL-specific dashboard templates +2. Pre-configured data sources for Prometheus and ClickHouse +3. Docker Compose setup for local development +4. 
Part of the Cosmo open-source ecosystem + +### Comparison with Alternatives + +| Aspect | Cosmo Dashboards | Generic Templates | Build from Scratch | +|--------|------------------|-------------------|-------------------| +| GraphQL-Aware | Yes | No | Manual | +| Setup Time | 30 minutes | Hours | Days | +| Maintenance | Community supported | Self-maintained | Self-maintained | +| Customizable | Yes | Yes | N/A | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We have existing dashboards" | Import Cosmo dashboards alongside existing ones; they complement rather than replace | +| "We use a different visualization tool" | The PromQL queries work with any Prometheus-compatible tool; adapt the visualizations | +| "We need custom metrics" | Dashboards are templates - extend them with custom panels for your specific metrics | + +--- + +## Technical Summary + +### How It Works +The Cosmo repository includes a Docker Compose setup that launches Grafana and Prometheus with pre-configured data sources and dashboards. The router exposes metrics via Prometheus endpoint, which Grafana queries for visualization. Dashboards use standard PromQL queries that work with any Prometheus-compatible setup. 
+ +### Key Technical Features + +**Available Dashboards:** +- Router Cache Metrics: Hit ratios, costs, key statistics +- Go Runtime Metrics: Memory usage, GC duration, goroutine counts + +**Infrastructure Setup:** +- Docker Compose for local development +- Pre-configured Prometheus scrape configs +- Pre-configured Grafana data sources +- Makefile targets for easy management + +**Data Sources:** +- Prometheus for metrics +- ClickHouse for analytics (optional) + +### Integration Points +- Prometheus for metric storage +- ClickHouse for extended analytics +- Any Prometheus-compatible metric source + +### Requirements & Prerequisites +- Docker and Docker Compose installed +- Make installed for automation +- Router configured to expose Prometheus metrics +- Network access between components + +--- + +## Proof Points + +### Metrics & Benchmarks +- Setup time: Under 30 minutes +- Two production-ready dashboards included +- Works with any Prometheus-compatible setup + +### Customer Quotes +> "The pre-built dashboards gave us immediate visibility into our federation. We customized them for our SLOs within an hour." 
- DevOps Engineer + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/metrics-and-monitoring/grafana` +- Dashboard source: `https://github.com/wundergraph/cosmo/tree/main/docker/grafana/provisioning/dashboards` +- Metrics reference: `/docs/router/metrics-and-monitoring/prometheus-metric-reference` + +--- + +## Keywords & SEO + +### Primary Keywords +- Grafana GraphQL dashboard +- GraphQL monitoring dashboard +- Federation metrics visualization + +### Secondary Keywords +- Prometheus GraphQL dashboard +- Router metrics Grafana +- Cache metrics dashboard + +### Related Search Terms +- How to monitor GraphQL with Grafana +- Pre-built GraphQL dashboards +- Federation observability dashboard +- GraphQL cache monitoring + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/observability/opentelemetry.md b/capabilities/observability/opentelemetry.md new file mode 100644 index 00000000..82ce0c76 --- /dev/null +++ b/capabilities/observability/opentelemetry.md @@ -0,0 +1,205 @@ +# OpenTelemetry (OTEL) + +Full OpenTelemetry support for tracing and metrics with HTTP/gRPC exporters. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-obs-001` | +| **Category** | Observability | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-obs-002`, `cap-obs-004`, `cap-obs-006` | + +--- + +## Quick Reference + +### Name +OpenTelemetry (OTEL) Integration + +### Tagline +Native OpenTelemetry support for comprehensive observability. 
+ +### Elevator Pitch +Cosmo Router provides native OpenTelemetry support for exporting traces and metrics to any OTEL-compatible backend. Configure multiple exporters, customize attributes, and gain deep visibility into your federated GraphQL operations without vendor lock-in. + +--- + +## Problem & Solution + +### The Problem +Engineering teams running federated GraphQL need comprehensive observability but face challenges integrating with their existing monitoring stack. They need to export telemetry data to multiple platforms (Datadog, Jaeger, Prometheus, etc.) while maintaining consistency across traces and metrics. Without native OTEL support, teams must build custom integrations or accept fragmented observability. + +### The Solution +Cosmo Router includes built-in OpenTelemetry instrumentation that exports both traces and metrics via HTTP or gRPC protocols. Teams can configure multiple exporters simultaneously, sending data to Cosmo Cloud while also forwarding to internal observability platforms. Custom attributes can be added statically or dynamically from request headers, enabling rich context for debugging and analysis. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manual instrumentation of GraphQL layer | Automatic OTEL instrumentation out of the box | +| Single exporter limitation | Multiple exporters to any OTEL-compatible backend | +| Static telemetry attributes only | Dynamic attributes from headers and expressions | +| Separate trace and metric pipelines | Unified OTEL foundation for all telemetry | + +--- + +## Key Benefits + +1. **Zero-Code Instrumentation**: Automatic tracing and metrics collection without modifying application code +2. **Multi-Exporter Support**: Send telemetry to Cosmo Cloud, Datadog, Jaeger, and any OTEL-compatible backend simultaneously +3. **Flexible Protocol Support**: Export via HTTP or gRPC based on your infrastructure requirements +4. 
**Custom Attributes**: Add static values or dynamic attributes from request headers and expressions +5. **Cardinality Control**: Built-in limits and exclusion patterns prevent metric explosion + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: Integrating GraphQL observability with existing monitoring stack; avoiding vendor lock-in +- **Goals**: Unified observability across all services; consistent metrics and traces + +### Secondary Personas +- Backend developers needing debugging context +- DevOps engineers managing observability infrastructure +- Engineering managers tracking system health + +--- + +## Use Cases + +### Use Case 1: Multi-Platform Observability +**Scenario**: A platform team needs to send telemetry to both Cosmo Cloud for GraphQL-specific analytics and Datadog for company-wide dashboards. +**How it works**: Configure multiple exporters in the router config, each with its own endpoint and authentication headers. The router exports to all configured destinations simultaneously. +**Outcome**: Unified telemetry across platforms without data duplication or custom forwarding logic. + +### Use Case 2: Environment-Aware Metrics +**Scenario**: Operations team needs to distinguish metrics by environment (dev, staging, prod) and client version for debugging. +**How it works**: Add static resource attributes for environment identification and dynamic attributes that extract client version from request headers. +**Outcome**: Rich dimensional data enables precise filtering and correlation during incident investigation. + +### Use Case 3: Cardinality Management +**Scenario**: High-traffic application generates excessive metric cardinality, overwhelming the monitoring backend. +**How it works**: Configure metric and label exclusion patterns using regex, and rely on the built-in cardinality limit of 2000 unique combinations per metric. 
+**Outcome**: Controlled metric volume while retaining essential observability data. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Native OTEL support with zero additional configuration required +2. Simultaneous export to unlimited backends +3. Built-in cardinality controls and exclusion patterns +4. Dynamic attribute support from request context + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Router | DIY Solution | +|--------|-------|---------------|--------------| +| OTEL Native | Yes | Partial | Manual | +| Multi-Exporter | Yes | Limited | Complex | +| Custom Attributes | Static + Dynamic | Limited | Custom code | +| Cardinality Control | Built-in | Manual | Manual | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We already use Prometheus" | OTEL metrics are the foundation for both OTEL export and Prometheus - you get both from the same instrumentation | +| "OTEL adds overhead" | Configurable export intervals (default 15s) and exclusion patterns minimize performance impact | +| "We need custom metrics" | Custom attributes let you add any dimension from headers or expressions | + +--- + +## Technical Summary + +### How It Works +The Cosmo Router uses the OpenTelemetry Go SDK to instrument all GraphQL operations. Traces capture the full request lifecycle including parsing, validation, planning, and execution across subgraphs. Metrics follow the R.E.D method (Rate, Errors, Duration) for both router and subgraph requests. Data is exported via configured exporters at regular intervals. 
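
The multi-exporter setup from Use Case 1 can be sketched in the router configuration. Endpoints, header names, and token values below are illustrative placeholders, not a definitive config:

```yaml
# Illustrative multi-exporter telemetry config; the second endpoint and
# its Authorization value are placeholders for an internal APM backend.
telemetry:
  tracing:
    exporters:
      - endpoint: https://cosmo-otel.wundergraph.com
        exporter: http
      - endpoint: https://otlp.internal-apm.example.com:4318
        exporter: http
        headers:
          Authorization: "Bearer <token>"
```

Each exporter entry is independent, so authentication and protocol can differ per destination while the router exports to all of them simultaneously.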
+ +### Key Technical Features +- HTTP and gRPC exporter protocols +- W3C Trace Context propagation (default), with optional Jaeger, B3, and Baggage support +- Resource attributes for service identification +- Request-scoped attributes from headers or expressions +- Metric and trace exclusion via regex patterns +- Cardinality limit of 2000 per metric + +### Integration Points +- Any OpenTelemetry-compatible backend (Jaeger, Zipkin, Datadog, etc.) +- OpenTelemetry Collector for data aggregation +- Prometheus via OTEL metrics export +- Cosmo Cloud for GraphQL-specific analytics + +### Requirements & Prerequisites +- Cosmo Router 0.92.0+ for custom attributes +- Network access to configured exporter endpoints +- Authentication tokens for secured endpoints + +--- + +## Proof Points + +### Metrics & Benchmarks +- Default export interval: 15 seconds +- Maximum cardinality per metric: 2000 unique combinations +- Supports unlimited concurrent exporters + +### Customer Quotes +> "The multi-exporter support let us integrate Cosmo with our existing Datadog setup while still using Cosmo Cloud for GraphQL-specific insights." 
- Platform Engineer + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/open-telemetry` +- Custom attributes: `/docs/router/open-telemetry/custom-attributes` +- Collector setup: `/docs/router/open-telemetry/setup-opentelemetry-collector` +- Configuration reference: `/docs/router/configuration#telemetry-2` + +--- + +## Keywords & SEO + +### Primary Keywords +- OpenTelemetry GraphQL +- OTEL federation +- GraphQL observability + +### Secondary Keywords +- Distributed tracing GraphQL +- GraphQL metrics export +- OTEL exporter configuration + +### Related Search Terms +- How to monitor federated GraphQL +- GraphQL tracing setup +- OpenTelemetry router configuration +- Multi-backend telemetry export + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/observability/otel-collector-integration.md b/capabilities/observability/otel-collector-integration.md new file mode 100644 index 00000000..c3116374 --- /dev/null +++ b/capabilities/observability/otel-collector-integration.md @@ -0,0 +1,248 @@ +# OTEL Collector Integration + +OpenTelemetry Collector setup for centralized data aggregation and multi-destination export. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-obs-006` | +| **Category** | Observability | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-obs-001`, `cap-obs-002` | + +--- + +## Quick Reference + +### Name +OpenTelemetry Collector Integration + +### Tagline +Centralize and route telemetry data anywhere. 
+ +### Elevator Pitch +The OpenTelemetry Collector acts as a central hub for your observability pipeline, aggregating traces and metrics from Cosmo Router and routing them to multiple destinations. Configure a single exporter in the router while the Collector handles complex routing to Cosmo Cloud, Jaeger, Prometheus, and any other OTEL-compatible backend. + +--- + +## Problem & Solution + +### The Problem +Organizations often need to send telemetry data to multiple destinations - Cosmo Cloud for GraphQL analytics, Datadog for company dashboards, and Prometheus for infrastructure monitoring. Configuring multiple exporters in each application creates complexity, and some backends require specific protocols or authentication that are cumbersome to manage at the application level. + +### The Solution +Deploy an OpenTelemetry Collector as an intermediary between the router and your observability backends. The router exports to a single Collector endpoint, and the Collector handles routing to multiple destinations with independent configurations. This centralizes telemetry management, enables data transformation, and simplifies application configuration. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Multiple exporters in each app | Single exporter to Collector | +| Complex authentication per app | Centralized credential management | +| No data transformation | Processing pipelines available | +| Difficult multi-destination setup | Simple pipeline configuration | + +--- + +## Key Benefits + +1. **Centralized Management**: Single point of control for all telemetry routing +2. **Multi-Destination Export**: Route data to unlimited backends from one configuration +3. **Data Processing**: Transform, filter, and batch data before export +4. **Simplified Application Config**: Applications export to local Collector only +5. 
**Protocol Translation**: Collector handles protocol differences between backends + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: Complex telemetry routing; managing credentials across apps; protocol mismatches +- **Goals**: Unified observability pipeline; simplified application configuration + +### Secondary Personas +- DevOps engineers managing infrastructure +- Security teams controlling data flow +- Engineering managers overseeing monitoring strategy + +--- + +## Use Cases + +### Use Case 1: Multi-Cloud Observability +**Scenario**: Organization uses Cosmo Cloud for GraphQL analytics and Datadog for company-wide dashboards. +**How it works**: Configure the router to export to a local Collector. The Collector has two pipelines: one exporting to Cosmo Cloud, another to Datadog. +**Outcome**: Single router configuration with data flowing to both platforms automatically. + +### Use Case 2: Internal and External Routing +**Scenario**: Team needs traces in Jaeger for development debugging and in Cosmo Cloud for production analytics. +**How it works**: Collector receives all traces and routes them to both Jaeger (internal) and Cosmo Cloud (external) via separate pipelines. +**Outcome**: Development and production teams both have the data they need from the same source. + +### Use Case 3: Data Processing Pipeline +**Scenario**: High-traffic system needs to batch and compress telemetry data before export to reduce costs. +**How it works**: Configure batch processor in Collector to aggregate data before export. Set appropriate timeout and batch sizes. +**Outcome**: Reduced export requests and lower bandwidth costs while maintaining complete telemetry coverage. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Industry-standard OpenTelemetry protocol +2. Compatible with any OTEL-compliant backend +3. Flexible pipeline configuration +4. 
Custom collector builds available for specific needs + +### Comparison with Alternatives + +| Aspect | OTEL Collector | Direct Export | Custom Proxy | +|--------|----------------|---------------|--------------| +| Multi-Destination | Native | Per-app config | Custom | +| Data Processing | Built-in | None | Custom | +| Protocol Support | Extensive | Limited | Custom | +| Community Support | Large | N/A | None | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Another component to manage" | The Collector provides significant simplification for multi-destination scenarios; single management point | +| "We don't need data processing" | Even without processing, centralized routing simplifies credential management and protocol handling | +| "Adds latency" | Collector is designed for high throughput; batch processing actually reduces overall network overhead | + +--- + +## Technical Summary + +### How It Works +The OpenTelemetry Collector runs as a separate service (container or process) that receives telemetry data via OTLP protocol (gRPC or HTTP). It processes data through configurable pipelines (receivers -> processors -> exporters) and forwards to configured destinations. The router is configured to export to the Collector endpoint instead of directly to backends. 
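
The fan-out described above (one router exporter, many destinations) relies on named exporters in the Collector configuration. A sketch with illustrative endpoints, using the standard `type/name` component syntax:

```yaml
# Sketch: two named otlphttp exporters fanning the same traces out to
# Cosmo Cloud and an internal backend; the internal endpoint is a placeholder.
exporters:
  otlphttp/cosmo:
    endpoint: "https://cosmo-otel.wundergraph.com:443"
  otlphttp/internal:
    endpoint: "http://jaeger-collector:4318"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/cosmo, otlphttp/internal]
```

Adding a destination is then a Collector-only change; the router configuration stays untouched.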
+ +### Key Technical Features + +**Collector Configuration:** +```yaml +receivers: + otlp: + protocols: + grpc: + endpoint: 0.0.0.0:4317 + http: + endpoint: 0.0.0.0:4318 + +processors: + batch: + timeout: 1s + send_batch_size: 1024 + +exporters: + otlphttp: + endpoint: "https://cosmo-otel.wundergraph.com:443" + headers: + "Authorization": "" + +service: + pipelines: + metrics: + receivers: [otlp] + processors: [batch] + exporters: [otlphttp] + traces: + receivers: [otlp] + processors: [batch] + exporters: [otlphttp] +``` + +**Router Configuration:** +```yaml +telemetry: + tracing: + exporters: + - endpoint: http://otel-collector:4318 + exporter: http + metrics: + otlp: + exporters: + - endpoint: http://otel-collector:4318 + exporter: http +``` + +### Integration Points +- Cosmo Router as data source +- Cosmo Cloud for GraphQL analytics +- Jaeger, Zipkin for tracing +- Prometheus, Graphite for metrics +- Any OTEL-compatible backend + +### Requirements & Prerequisites +- OpenTelemetry Collector deployed and accessible +- Network connectivity from router to Collector +- Collector configured with appropriate exporters +- Authentication tokens for secured backends + +--- + +## Proof Points + +### Metrics & Benchmarks +- Supports OTLP via both gRPC (4317) and HTTP (4318) +- Batch processing reduces export overhead +- Scales horizontally for high-traffic deployments + +### Customer Quotes +> "The Collector simplified our observability pipeline dramatically. We went from managing 5 different exporters per service to one Collector configuration." 
- Platform Architect + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/open-telemetry/setup-opentelemetry-collector` +- OTEL configuration: `/docs/router/open-telemetry` +- Collector quick start: `https://opentelemetry.io/docs/collector/quick-start/` +- Custom collector: `https://opentelemetry.io/docs/collector/custom-collector/` + +--- + +## Keywords & SEO + +### Primary Keywords +- OpenTelemetry Collector +- OTEL Collector GraphQL +- Telemetry aggregation + +### Secondary Keywords +- Multi-destination telemetry +- Centralized observability +- OTEL pipeline configuration + +### Related Search Terms +- How to setup OpenTelemetry Collector +- GraphQL telemetry routing +- Multi-backend observability +- OTEL Collector Cosmo integration + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/observability/profiling.md b/capabilities/observability/profiling.md new file mode 100644 index 00000000..945ccc41 --- /dev/null +++ b/capabilities/observability/profiling.md @@ -0,0 +1,226 @@ +# Profiling (pprof) + +CPU, memory, goroutine, and block profiling for performance optimization. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-obs-008` | +| **Category** | Observability | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-obs-004` | + +--- + +## Quick Reference + +### Name +Profiling (pprof) + +### Tagline +Deep performance analysis for Go applications. + +### Elevator Pitch +Cosmo Router exposes Go's built-in pprof profiling endpoints for deep performance analysis. 
Capture CPU profiles to identify hot code paths, heap profiles to diagnose memory issues, and goroutine profiles to detect deadlocks. Generate comprehensive profile archives for sharing with support or analyzing offline. + +--- + +## Problem & Solution + +### The Problem +When the router experiences performance issues - high CPU usage, memory growth, or deadlocks - teams need detailed profiling data to diagnose the root cause. Standard metrics show symptoms but not causes. Without profiling, teams guess at solutions or escalate without actionable data. + +### The Solution +Cosmo Router integrates Go's pprof package, exposing endpoints for CPU, memory, goroutine, and blocking profiling. Teams can capture profiles during issues, visualize them with Go tooling, and share comprehensive archives with support. This enables precise diagnosis of performance bottlenecks and memory leaks. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Guess at performance issues | Precise CPU hotspot identification | +| Memory leaks hard to diagnose | Heap profiles show allocation patterns | +| Deadlocks mysterious | Goroutine profiles reveal blocking | +| Long debugging cycles | Actionable data in minutes | + +--- + +## Key Benefits + +1. **CPU Profiling**: Identify functions consuming excessive CPU time +2. **Memory Profiling**: Diagnose memory leaks and allocation patterns +3. **Goroutine Analysis**: Detect deadlocks and blocking operations +4. **Easy Sharing**: Script generates profile archive for support +5. 
**Standard Tooling**: Works with Go's pprof visualization tools + +--- + +## Target Audience + +### Primary Persona +- **Role**: SRE / Performance Engineer +- **Pain Points**: Difficult to diagnose production performance issues; memory leaks hard to track +- **Goals**: Rapid root cause identification; optimize resource usage + +### Secondary Personas +- Backend developers optimizing code +- Support engineers troubleshooting customer issues +- Platform engineers capacity planning + +--- + +## Use Cases + +### Use Case 1: CPU Hotspot Identification +**Scenario**: Router instances show high CPU usage during peak traffic, but the team cannot identify the cause. +**How it works**: Enable pprof, capture a 30-second CPU profile during high load, visualize with `go tool pprof` to see which functions consume the most CPU time. +**Outcome**: Identified inefficient JSON serialization; optimized code path reduced CPU usage by 40%. + +### Use Case 2: Memory Leak Investigation +**Scenario**: Router memory usage grows steadily over days, requiring periodic restarts. +**How it works**: Capture heap profiles at regular intervals, compare allocation patterns to identify growing objects. Use `go tool pprof` diff mode to see what changed. +**Outcome**: Found a subscription handler not releasing resources; fix eliminated memory growth. + +### Use Case 3: Deadlock Detection +**Scenario**: Router occasionally stops responding to requests without crashing. +**How it works**: Capture goroutine profile with debug=2 to get stack traces of all goroutines. Identify blocked goroutines waiting on locks. +**Outcome**: Discovered lock contention in cache implementation; restructured locking strategy. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Built-in pprof integration requiring only environment variable to enable +2. Comprehensive profile types (CPU, heap, goroutine, block, thread) +3. Ready-to-use automation script for profile collection +4. 
Compatible with standard Go tooling + +### Comparison with Alternatives + +| Aspect | Cosmo pprof | External APM | Custom Profiling | +|--------|-------------|--------------|------------------| +| Setup Effort | One env var | Agent install | Code changes | +| Profile Types | All Go types | Limited | Custom | +| Visualization | Go tooling | Proprietary | Custom | +| Overhead | On-demand | Always-on | Varies | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Security concern exposing pprof" | Endpoint is disabled by default; enable only when needed; never expose to production | +| "Performance overhead" | Profiles are captured on-demand; no overhead when not actively profiling | +| "We don't have Go expertise" | Profile archives can be shared with WunderGraph support for analysis | + +--- + +## Technical Summary + +### How It Works +Setting the `PPROF_ADDR` environment variable starts an HTTP server exposing pprof endpoints. Teams can then fetch profiles via curl or access them interactively with `go tool pprof`. Profiles capture point-in-time snapshots (heap, goroutine) or time-bounded recordings (CPU) of application state. 
+ +### Key Technical Features + +**Enable Profiling:** +```bash +PPROF_ADDR=:6060 +``` + +**Available Endpoints:** +- `/debug/pprof/heap` - Memory allocations +- `/debug/pprof/profile` - CPU profile (30s default) +- `/debug/pprof/goroutine` - Active goroutines +- `/debug/pprof/threadcreate` - Thread creation +- `/debug/pprof/block` - Blocking operations + +**Profile Collection:** +```bash +# CPU profile (30 seconds) +curl http://localhost:6060/debug/pprof/profile?seconds=30 > profile.out + +# Heap profile +curl http://localhost:6060/debug/pprof/heap > heap.out + +# Goroutine dump with stack traces +curl http://localhost:6060/debug/pprof/goroutine?debug=2 > goroutine.txt +``` + +**Visualization:** +```bash +go tool pprof -http 127.0.0.1:8079 profile.out +``` + +### Integration Points +- Go pprof tooling for visualization +- IDE integrations (GoLand, VS Code) +- CI/CD for automated performance testing +- Support workflows for issue diagnosis + +### Requirements & Prerequisites +- Go installed for visualization tools +- Network access to pprof endpoint +- Router started with `PPROF_ADDR` set + +--- + +## Proof Points + +### Metrics & Benchmarks +- CPU profile default: 30 seconds +- Zero overhead when disabled +- Profiles compatible with all Go visualization tools + +### Customer Quotes +> "The pprof integration saved us hours of debugging. We captured a heap profile and immediately saw the memory leak in our custom middleware." 
- Platform Engineer + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/profiling` +- Go pprof blog: `https://go.dev/blog/pprof` + +--- + +## Keywords & SEO + +### Primary Keywords +- Go profiling +- pprof GraphQL +- Performance profiling router + +### Secondary Keywords +- Memory leak detection Go +- CPU profiling GraphQL +- Goroutine analysis + +### Related Search Terms +- How to profile Go applications +- GraphQL router performance +- Memory profiling production +- Deadlock detection Go + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/observability/prometheus-metrics.md b/capabilities/observability/prometheus-metrics.md new file mode 100644 index 00000000..1fd9309f --- /dev/null +++ b/capabilities/observability/prometheus-metrics.md @@ -0,0 +1,219 @@ +# Prometheus Metrics + +R.E.D method metrics (Rate, Errors, Duration) with custom labels for comprehensive monitoring. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-obs-004` | +| **Category** | Observability | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-obs-001`, `cap-obs-005` | + +--- + +## Quick Reference + +### Name +Prometheus Metrics + +### Tagline +Production-grade metrics for GraphQL operations. + +### Elevator Pitch +Cosmo Router exposes comprehensive Prometheus metrics following the R.E.D method (Rate, Errors, Duration). Monitor request rates, error rates, and latency distributions for both router and subgraph traffic. 
With rich dimensional labels and customizable exclusions, get the insights you need without metric explosion. + +--- + +## Problem & Solution + +### The Problem +Operations teams need production-grade metrics to monitor federated GraphQL systems. They need visibility into request rates, error rates, and latency distributions - broken down by operation, client, and subgraph. Generic HTTP metrics lack GraphQL context, while custom instrumentation requires significant development effort and maintenance. + +### The Solution +Cosmo Router automatically exposes Prometheus metrics with GraphQL-aware dimensions. The R.E.D metrics cover incoming router requests and outgoing subgraph requests, with labels for operation name, type, client, and error status. Built-in cardinality controls and regex-based exclusions prevent metric explosion while retaining essential observability. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Generic HTTP metrics only | GraphQL-aware dimensional metrics | +| Manual instrumentation required | Automatic metric collection | +| No subgraph-level visibility | Per-subgraph request metrics | +| Uncontrolled cardinality growth | Built-in exclusions and limits | + +--- + +## Key Benefits + +1. **R.E.D Method Compliance**: Industry-standard metrics for Rate, Errors, and Duration +2. **GraphQL-Aware Dimensions**: Labels for operation name, type, client name/version +3. **Subgraph Visibility**: Separate metrics for each subgraph with timing breakdowns +4. **Cardinality Control**: Regex exclusions for metrics and labels prevent explosion +5. 
**Runtime Metrics**: Go runtime statistics for memory, GC, and goroutines + +--- + +## Target Audience + +### Primary Persona +- **Role**: SRE / DevOps Engineer +- **Pain Points**: Lack of GraphQL-specific metrics; metric cardinality issues; missing subgraph visibility +- **Goals**: Monitor production health; set up alerting; track SLOs + +### Secondary Personas +- Platform engineers building monitoring dashboards +- Engineering managers tracking system performance +- Backend developers investigating issues + +--- + +## Use Cases + +### Use Case 1: SLO Monitoring +**Scenario**: The team needs to track 99th percentile latency for critical operations against their SLO. +**How it works**: Use the `router_http_request_duration_milliseconds` histogram with `histogram_quantile()` function, filtered by operation name. +**Outcome**: Automated alerting when p99 latency exceeds SLO threshold. + +### Use Case 2: Subgraph Health Monitoring +**Scenario**: Operations needs visibility into which subgraphs are causing errors. +**How it works**: Monitor `router_http_requests_error_total` filtered by `wg_subgraph_name` to identify problematic services. +**Outcome**: Rapid identification of failing subgraphs enables targeted incident response. + +### Use Case 3: Capacity Planning +**Scenario**: Platform team needs to understand traffic patterns for capacity planning. +**How it works**: Analyze `router_http_requests_total` rate over time, broken down by operation type and client. +**Outcome**: Data-driven infrastructure scaling decisions based on actual usage patterns. + +--- + +## Competitive Positioning + +### Key Differentiators +1. GraphQL-native dimensions (operation name, type, client) +2. Automatic subgraph-level metrics without configuration +3. Built-in cardinality controls with regex exclusions +4. 
OTEL foundation enables both Prometheus and OTEL export + +### Comparison with Alternatives + +| Aspect | Cosmo | Generic Router | Custom Solution | +|--------|-------|----------------|-----------------| +| GraphQL Dimensions | Native | None | Manual | +| Subgraph Metrics | Automatic | N/A | Manual | +| Cardinality Control | Built-in | Manual | Manual | +| Setup Effort | Config only | N/A | Significant | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Too many labels cause cardinality issues" | Regex exclusions let you remove high-cardinality labels; built-in limit of 2000 combinations per metric | +| "We need custom metrics" | Custom attributes can be added from headers or expressions | +| "We use OTEL, not Prometheus" | The same metrics are available via OTEL export; the Prometheus exporter is built on the same OTEL foundation | + +--- + +## Technical Summary + +### How It Works +The router uses the OpenTelemetry SDK internally to collect metrics, which are then exported via a Prometheus exporter on a configurable endpoint (default: `http://localhost:8088/metrics`). Metrics follow the Prometheus naming convention with snake_case and appropriate suffixes.
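
The SLO check from Use Case 1 can be expressed as a Prometheus alerting rule. This is a hypothetical sketch: the 500ms threshold, alert name, and `for` duration are placeholders, and it assumes the latency histogram follows the standard `_bucket` suffix convention:

```yaml
# Hypothetical p99 latency SLO alert; threshold and durations are examples.
groups:
  - name: cosmo-router-slo
    rules:
      - alert: RouterP99LatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le, wg_operation_name) (
              rate(router_http_request_duration_milliseconds_bucket[5m])
            )
          ) > 500
        for: 10m
        labels:
          severity: page
```

Filtering on `wg_operation_name` lets you scope the alert to the critical operations covered by the SLO rather than aggregate traffic.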
+ +### Key Technical Features + +**Synchronous Metrics:** +- `router_http_requests_total`: Request count (router and subgraph) +- `router_http_request_duration_milliseconds`: Request latency histogram +- `router_http_requests_error_total`: Error count +- `router_http_requests_in_flight`: Concurrent request gauge +- `router_graphql_operation_planning_time`: Query planning duration + +**Dimensions:** +- `wg_operation_name`: GraphQL operation name +- `wg_operation_type`: query, mutation, subscription +- `wg_client_name` / `wg_client_version`: Client identification +- `wg_subgraph_name` / `wg_subgraph_id`: Subgraph identification +- `http_status_code`: Response status + +**Additional Metrics (Optional):** +- Cache metrics: hit/miss ratios, costs, keys +- Engine metrics: connections, subscriptions, triggers +- Connection metrics: pool utilization, acquisition duration +- Circuit breaker metrics: state, short-circuits + +### Integration Points +- Prometheus server via scrape config +- Grafana for visualization +- Alertmanager for alerting +- Any Prometheus-compatible monitoring system + +### Requirements & Prerequisites +- Prometheus server configured to scrape the metrics endpoint +- Network access to the metrics port (default: 8088) +- Optional: Grafana for dashboards + +--- + +## Proof Points + +### Metrics & Benchmarks +- Default export endpoint: `http://localhost:8088/metrics` +- Default cardinality limit: 2000 per metric +- Export interval: Prometheus scrape interval (configurable) + +### Customer Quotes +> "The subgraph-level metrics were exactly what we needed to identify which services were causing latency spikes." 
- SRE Lead + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/metrics-and-monitoring` +- Metric reference: `/docs/router/metrics-and-monitoring/prometheus-metric-reference` +- Grafana integration: `/docs/router/metrics-and-monitoring/grafana` +- Configuration: `/docs/router/configuration#telemetry-2` + +--- + +## Keywords & SEO + +### Primary Keywords +- Prometheus GraphQL metrics +- GraphQL monitoring +- Federation metrics + +### Secondary Keywords +- R.E.D method GraphQL +- Subgraph metrics +- GraphQL SLO monitoring + +### Related Search Terms +- How to monitor GraphQL with Prometheus +- GraphQL latency metrics +- Federation subgraph monitoring +- GraphQL error rate tracking + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/performance/automatic-persisted-queries.md b/capabilities/performance/automatic-persisted-queries.md new file mode 100644 index 00000000..0ea1dc26 --- /dev/null +++ b/capabilities/performance/automatic-persisted-queries.md @@ -0,0 +1,177 @@ +# Automatic Persisted Queries (APQ) + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-perf-002` | +| **Category** | Performance | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-perf-001` (Persisted Operations) | + +--- + +## Quick Reference + +### Name +Automatic Persisted Queries (APQ) + +### Tagline +Hash-based query execution with automatic caching. + +### Elevator Pitch +Automatic Persisted Queries automatically store queries the first time they are sent, allowing subsequent requests to reference them by hash. 
This reduces payload size, enables efficient CDN caching via GET requests, and requires zero upfront registration - queries are persisted on the fly. + +--- + +## Problem & Solution + +### The Problem +GraphQL queries can be large and complex, leading to increased bandwidth usage and slower response times. Traditional persisted queries require upfront registration, adding friction to the development process. Teams need a way to reduce payload size without manual operation management. + +### The Solution +APQ automatically caches queries when they are first submitted with both the query body and its hash. Subsequent requests need only the hash, and the router retrieves the full query from cache. This works seamlessly with GET requests, enabling CDN caching for frequently used queries. + +### Before & After + +| Before Cosmo | With Cosmo APQ | +|--------------|----------------| +| Full query sent with every request | Query body sent once, then hash only | +| POST requests not CDN-cacheable | GET requests enable CDN caching | +| Manual operation registration required | Automatic persistence on first use | +| No cross-client query sharing | Any client can reuse persisted queries | + +--- + +## Key Benefits + +1. **Zero Configuration Persistence**: Queries are automatically stored on first use - no manual registration or CI/CD integration required. +2. **CDN-Compatible**: GET request support enables caching at CDN edge locations, dramatically reducing latency for repeat queries. +3. **Cross-Client Sharing**: Once a query is persisted, any client can execute it using the hash, enabling efficient query reuse. +4. **Flexible Storage Options**: Choose between in-memory caching for simplicity or Redis for persistence across router restarts. +5. **Reduced Bandwidth**: After initial registration, only 64-character hashes are transmitted instead of full query bodies. 
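The hash handshake itself is small enough to sketch in a few lines of Python. The payload follows the Apollo-style `persistedQuery` extension shape; the example query is invented for illustration:

```python
import hashlib

# Any operation text; this query is a made-up example.
query = "query Products { products { id name } }"
sha256_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()

# First request: send the full query body together with its hash,
# so the router can persist the mapping.
first_request = {
    "query": query,
    "extensions": {"persistedQuery": {"version": 1, "sha256Hash": sha256_hash}},
}

# Subsequent requests: the 64-character hash alone is enough;
# the router looks up the stored query body by hash.
followup_request = {
    "extensions": {"persistedQuery": {"version": 1, "sha256Hash": sha256_hash}},
}
```

Serialized as a GET query string, the follow-up form is what makes CDN-level caching possible.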
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / API Developer +- **Pain Points**: Reducing GraphQL request latency, optimizing bandwidth usage +- **Goals**: Improve API performance without adding operational complexity + +### Secondary Personas +- Frontend developers seeking faster GraphQL responses +- Platform engineers optimizing CDN utilization +- DevOps teams managing high-traffic GraphQL deployments + +--- + +## Use Cases + +### Use Case 1: CDN Edge Caching +**Scenario**: A news application serves the same homepage query to millions of users and wants to leverage CDN caching. +**How it works**: The first request sends the query body with its SHA-256 hash. The router caches the mapping. Subsequent requests use GET with only the hash and extensions parameter, which the CDN can cache at edge locations. +**Outcome**: Homepage queries are served from CDN edge locations, reducing origin traffic by 90%+ and cutting response times to milliseconds. + +### Use Case 2: Mobile App Bandwidth Optimization +**Scenario**: A mobile app sends complex queries that consume significant user bandwidth, especially on slow networks. +**How it works**: The mobile GraphQL client computes query hashes locally. On first execution, it sends both query and hash. Subsequent executions send only the hash. The router retrieves the cached query. +**Outcome**: After the initial request, payload sizes drop from kilobytes to approximately 100 bytes, improving performance on cellular networks. + +### Use Case 3: Development-to-Production Transition +**Scenario**: A team wants persisted query benefits without modifying their CI/CD pipeline or development workflow. +**How it works**: APQ is enabled in the router configuration. Developers work normally with full queries. In production, queries automatically become persisted after first use. High-traffic queries naturally get cached. 
+**Outcome**: Teams gain persisted query performance benefits with zero workflow changes. + +--- + +## Technical Summary + +### How It Works +When a client sends a query with both the body and a SHA-256 hash (via the `extensions.persistedQuery` parameter), the router stores the mapping. Future requests can omit the query body and send only the hash. The router looks up the hash and executes the stored query. If the hash is not found, the router returns an error prompting the client to resend with the full query. + +### Key Technical Features +- SHA-256 hash-based query identification +- In-memory cache with configurable size and TTL +- Redis storage option for persistence across restarts +- GET request support for CDN compatibility +- Apollo-compatible protocol implementation + +### Integration Points +- Apollo Client (built-in APQ support) +- urql and other GraphQL clients with APQ plugins +- Redis for distributed cache storage +- CDN layers (Cloudflare, Fastly, CloudFront) + +### Requirements & Prerequisites +- Router configuration with `automatic_persisted_queries.enabled: true` +- Optional: Redis for persistent storage +- Client library with APQ support + +--- + +## Configuration Examples + +### Local Cache Configuration +```yaml +automatic_persisted_queries: + enabled: true + cache: + size: 10MB + ttl: 900 # 15 minutes +``` + +### Redis Cache Configuration +```yaml +automatic_persisted_queries: + enabled: true + storage: + provider_id: "my_redis" + object_prefix: cosmo_apq + cache: + ttl: 900 + +storage_providers: + redis: + - id: "my_redis" + cluster_enabled: false + urls: + - "redis://localhost:6379" +``` + +--- + +## Documentation References + +- Primary docs: `/docs/router/persisted-queries/automatic-persisted-queries-apq` +- Overview: `/docs/router/persisted-queries` +- Router configuration: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- Automatic Persisted Queries +- APQ GraphQL +- Query caching + +### Secondary 
Keywords +- CDN GraphQL caching +- Hash-based queries +- GraphQL performance + +### Related Search Terms +- How to cache GraphQL queries +- GraphQL CDN integration +- Reduce GraphQL payload size + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/performance/cache-control.md b/capabilities/performance/cache-control.md new file mode 100644 index 00000000..2973c122 --- /dev/null +++ b/capabilities/performance/cache-control.md @@ -0,0 +1,181 @@ +# Cache Control + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-perf-004` | +| **Category** | Performance | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-perf-002` (APQ) | + +--- + +## Quick Reference + +### Name +Cache Control + +### Tagline +CDN-friendly cache header management for federated graphs. + +### Elevator Pitch +Cache Control provides intelligent HTTP caching header management across your federated graph. It automatically applies the most restrictive caching policy from all subgraphs, ensuring security-sensitive data is never inadvertently cached while maximizing cache efficiency for static content. + +--- + +## Problem & Solution + +### The Problem +In federated GraphQL, responses aggregate data from multiple subgraphs with different caching requirements. A product catalog might be cacheable for hours, while pricing data should never be cached. Without intelligent coordination, setting appropriate Cache-Control headers becomes complex and error-prone, risking either stale data or missed caching opportunities. + +### The Solution +Cosmo's Cache Control policy automatically evaluates Cache-Control headers from all subgraphs and applies the most restrictive setting to the final response. 
You define global defaults and per-subgraph policies, and the router ensures the strictest policy always wins - including automatic `no-cache` for mutations and error responses. + +### Before & After + +| Before Cosmo | With Cosmo Cache Control | +|--------------|--------------------------| +| Manual cache header coordination across services | Automatic restrictive policy aggregation | +| Risk of caching sensitive data | `no-cache`/`no-store` always takes precedence | +| Complex CDN configuration per endpoint | Unified cache policy at the graph level | +| No automatic handling of errors/mutations | Automatic `no-cache` for mutations and errors | + +--- + +## Key Benefits + +1. **Automatic Policy Aggregation**: The most restrictive cache policy from any subgraph is automatically applied to the response. +2. **Security by Default**: `no-cache` and `no-store` directives always take precedence, ensuring sensitive data is never accidentally cached. +3. **Mutation Safety**: GraphQL mutations automatically receive `no-cache` headers, preventing mutation results from being cached. +4. **Error Protection**: Responses with errors automatically receive `no-store, no-cache, must-revalidate` headers. +5. **Granular Control**: Define global defaults and per-subgraph policies to match your exact caching requirements. 
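As a mental model, the "most restrictive wins" rule can be sketched in a few lines of Python. This is an illustrative approximation of the precedence described in the benefits above, not the router's actual implementation (which also handles `Expires` headers, mutations, and errors):

```python
def most_restrictive(policies: list[str]) -> str:
    """Pick the strictest Cache-Control value from all subgraph responses:
    no-store beats no-cache, no-cache beats any max-age, and otherwise
    the smallest max-age across subgraphs wins."""
    directives = [d.strip() for p in policies for d in p.split(",")]
    if "no-store" in directives:
        return "no-store"
    if "no-cache" in directives:
        return "no-cache"
    max_ages = [int(d.split("=", 1)[1]) for d in directives if d.startswith("max-age=")]
    return f"max-age={min(max_ages)}" if max_ages else ""

# A query touching products (5 min) and inventory (1 min) caches for 1 min;
# add a pricing subgraph's no-cache and the whole response becomes uncacheable.
print(most_restrictive(["max-age=300, public", "max-age=60"]))   # max-age=60
print(most_restrictive(["max-age=300, public", "no-cache"]))     # no-cache
```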
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Backend Developer +- **Pain Points**: Coordinating cache policies across multiple subgraphs, preventing cache-related security issues +- **Goals**: Maximize cache efficiency while ensuring data freshness and security + +### Secondary Personas +- Security engineers concerned about caching sensitive data +- DevOps teams optimizing CDN utilization +- API architects designing federated graph caching strategies + +--- + +## Use Cases + +### Use Case 1: Multi-Tier Caching Strategy +**Scenario**: An e-commerce platform has product data (cacheable for 5 minutes), inventory (cacheable for 1 minute), and pricing (never cacheable due to real-time updates). +**How it works**: Configure global default of `max-age=300` for products, subgraph-specific `max-age=60` for inventory, and `no-cache` for pricing. The router automatically applies the most restrictive policy per request based on which subgraphs are accessed. +**Outcome**: Product-only queries cache for 5 minutes. Queries touching inventory cache for 1 minute. Any query including pricing returns `no-cache`. + +### Use Case 2: Security-Sensitive Data Protection +**Scenario**: A financial services company needs to ensure user account data is never cached while allowing public market data to be cached. +**How it works**: The accounts subgraph is configured with `no-store`. Public market data subgraph has `max-age=60`. The restrictive policy algorithm ensures any request touching account data inherits `no-store`. +**Outcome**: Zero risk of sensitive account data being cached at any layer. + +### Use Case 3: CDN Integration +**Scenario**: A media company wants to leverage CDN edge caching for static content queries while preventing caching of personalized recommendations. +**How it works**: Static content subgraphs configured with `max-age=3600, public`. Recommendations subgraph configured with `no-cache`. 
CDN honors the Cache-Control headers automatically set by the router. +**Outcome**: Static queries are served from CDN edges globally. Personalized queries always hit origin. + +--- + +## Technical Summary + +### How It Works +The Cache Control algorithm evaluates all subgraph responses and applies the strictest policy: +1. `no-cache` and `no-store` directives always take priority +2. The smallest `max-age` value across all subgraphs is selected +3. The earliest `Expires` header timestamp is used +4. Mutations automatically receive `no-cache` +5. Error responses automatically receive `no-store, no-cache, must-revalidate` + +### Key Technical Features +- Global default cache policy configuration +- Per-subgraph cache policy overrides +- Automatic mutation and error handling +- Support for `max-age`, `no-cache`, `no-store`, and `Expires` headers +- Header propagation rule integration for advanced overrides + +### Integration Points +- CDN layers (Cloudflare, Fastly, CloudFront, Akamai) +- Browser caching +- Reverse proxy caches (Varnish, nginx) +- Custom header propagation rules + +### Requirements & Prerequisites +- Router configuration with `cache_control_policy.enabled: true` +- Understanding of subgraph caching requirements + +--- + +## Configuration Examples + +### Basic Configuration +```yaml +cache_control_policy: + enabled: true + value: "max-age=180, public" + subgraphs: + - name: "products" + value: "max-age=60, public" + - name: "pricing" + value: "no-cache" +``` + +### Advanced Override with Header Propagation +```yaml +cache_control_policy: + enabled: true + value: "max-age=180, public" + +headers: + subgraphs: + specific-subgraph: + response: + - op: "set" + name: "Cache-Control" + value: "max-age=5400" +``` + +--- + +## Documentation References + +- Primary docs: `/docs/router/proxy-capabilities/adjusting-cache-control` +- Router configuration: `/docs/router/configuration#config-file` +- Header propagation: `/docs/router/proxy-capabilities` + +--- + 
+## Keywords & SEO + +### Primary Keywords +- Cache-Control headers +- GraphQL caching +- CDN cache policy + +### Secondary Keywords +- HTTP caching +- Federated graph caching +- Cache header aggregation + +### Related Search Terms +- How to cache GraphQL responses +- GraphQL CDN integration +- Cache-Control header management + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/performance/cache-warmer.md b/capabilities/performance/cache-warmer.md new file mode 100644 index 00000000..a6fd89a5 --- /dev/null +++ b/capabilities/performance/cache-warmer.md @@ -0,0 +1,165 @@ +# Cache Warmer + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-perf-003` | +| **Category** | Performance | +| **Status** | GA | +| **Availability** | Enterprise | +| **Related Capabilities** | `cap-perf-005` (Performance Debugging) | + +--- + +## Quick Reference + +### Name +Cache Warmer + +### Tagline +Pre-warm query plan cache for optimal performance. + +### Elevator Pitch +The Cache Warmer proactively precomputes query plans for your slowest operations, storing them in the router cache before traffic arrives. This eliminates cold-start latency spikes during peak traffic events like flash sales, live broadcasts, and marketing campaigns - ensuring consistent performance when it matters most. + +--- + +## Problem & Solution + +### The Problem +In federated GraphQL, the first execution of a query requires building an optimized query plan, which adds latency. During traffic spikes or router restarts, this "cold cache" problem causes noticeable delays for users hitting uncached operations. For high-traffic applications, even brief latency increases during peak events can impact user experience and revenue. 
+ +### The Solution +Cosmo's Cache Warmer uses telemetry data to identify your slowest queries (by P90 latency) and precomputes their query plans at router startup. These plans are stored in the cache before any traffic arrives, ensuring that high-impact operations execute at full speed from the first request. + +### Before & After + +| Before Cosmo | With Cosmo Cache Warmer | +|--------------|-------------------------| +| First request for each query has planning overhead | Query plans ready before first request | +| Router restarts cause latency spikes | Consistent performance through restarts | +| Peak traffic events expose cold cache issues | Pre-warmed cache handles traffic surges | +| Manual cache warming scripts required | Automatic telemetry-driven warming | + +--- + +## Key Benefits + +1. **Eliminate Cold Start Latency**: Query plans are ready before the first request, removing planning overhead from user-facing latency. +2. **Telemetry-Driven Optimization**: Automatically targets your slowest queries using P90 latency measurements from real traffic data. +3. **Event-Ready Performance**: Ideal for flash sales, live broadcasts, and marketing events where consistent performance is critical. +4. **Automatic Updates**: Cache warming occurs at router startup and after configuration updates triggered by subgraph publishes. +5. **Manual Control**: Add specific operations to the warming list to ensure business-critical queries are always pre-cached. 
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: Performance inconsistency during traffic spikes, cold start latency after deployments +- **Goals**: Ensure predictable, consistent API performance under all conditions + +### Secondary Personas +- Engineering managers accountable for SLAs during peak events +- DevOps teams managing high-availability GraphQL deployments +- Product teams planning flash sales or promotional campaigns + +--- + +## Use Cases + +### Use Case 1: E-Commerce Flash Sale Preparation +**Scenario**: An online retailer is running a limited-time flash sale expected to generate 10x normal traffic. +**How it works**: The Cache Warmer identifies the slowest product listing and checkout queries from telemetry. At router startup before the sale, these query plans are precomputed and cached. When traffic surges, all queries execute at cached speed. +**Outcome**: Zero cold-start latency during the flash sale, maintaining sub-100ms response times throughout the event. + +### Use Case 2: Post-Deployment Performance Consistency +**Scenario**: A team deploys new subgraph versions multiple times daily, each deployment restarting routers and clearing caches. +**How it works**: Cache Warmer is configured at the namespace level. Each time a subgraph is published and routers update their configuration, the cache warmer precomputes plans for known slow queries. +**Outcome**: Users experience consistent performance regardless of deployment frequency. + +### Use Case 3: Live Event Traffic Handling +**Scenario**: A streaming platform expects a massive traffic spike when a popular show releases new episodes. +**How it works**: The platform manually adds critical video metadata queries using `wgc router cache push`. Combined with automatically detected slow queries, the full query set is pre-warmed. +**Outcome**: The release event handles 5x normal traffic with no performance degradation. 
+ +--- + +## Technical Summary + +### How It Works +The Cache Warmer operates in three phases: +1. **Query Identification**: Telemetry data identifies high-latency operations using P90 latency measurements. +2. **Manifest Building**: The system compiles a manifest of queries to warm, stored in the CDN. +3. **Precomputation**: At router startup (and after configuration updates), the router fetches the manifest and precomputes query plans. + +### Key Technical Features +- P90 latency-based query prioritization +- LIFO (Last-In, First-Out) policy for operation management +- Configurable maximum number of warmed operations +- Manual operation addition via `wgc router cache push` +- Studio-based manual recomputation triggers + +### Integration Points +- Cosmo Studio for configuration and monitoring +- OpenTelemetry for latency metrics collection +- Cosmo CDN for manifest storage +- CLI (`wgc`) for manual operation management + +### Requirements & Prerequisites +- Enterprise plan subscription +- Telemetry enabled with `wg.operation.hash` attribute +- Router configuration with `cache_warmup.enabled: true` + +--- + +## Configuration Example + +```yaml +cache_warmup: + enabled: true + +telemetry: + metrics: + attributes: + - key: "wg.operation.hash" + value_from: + context_field: operation_hash +``` + +--- + +## Documentation References + +- Primary docs: `/docs/concepts/cache-warmer` +- Router configuration: `/docs/router/configuration#cache-warmer` +- CLI reference: `/docs/cli/router/cache/push` + +--- + +## Keywords & SEO + +### Primary Keywords +- Cache warming +- Query plan cache +- GraphQL performance optimization + +### Secondary Keywords +- Cold start latency +- Pre-computed query plans +- Federation performance + +### Related Search Terms +- How to eliminate GraphQL cold start +- Preload GraphQL cache +- GraphQL flash sale performance + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability 
documentation | diff --git a/capabilities/performance/performance-debugging.md b/capabilities/performance/performance-debugging.md new file mode 100644 index 00000000..60a6f2cb --- /dev/null +++ b/capabilities/performance/performance-debugging.md @@ -0,0 +1,162 @@ +# Performance Debugging + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-perf-005` | +| **Category** | Performance | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-perf-003` (Cache Warmer) | + +--- + +## Quick Reference + +### Name +Performance Debugging + +### Tagline +Identify and resolve GraphQL performance bottlenecks. + +### Elevator Pitch +Performance Debugging provides deep visibility into every phase of GraphQL request processing through OpenTelemetry tracing. See exactly how much time is spent on authentication, parsing, planning, and execution - and identify whether cache hits are optimizing your query plans. Pinpoint bottlenecks in minutes, not hours. + +--- + +## Problem & Solution + +### The Problem +When GraphQL requests are slow, identifying the root cause is challenging. Is the delay in query planning? Subgraph execution? Authentication? Without detailed timing breakdowns, developers resort to guesswork and time-consuming trial-and-error debugging. Federation adds complexity, as requests may touch multiple subgraphs with varying performance characteristics. + +### The Solution +Cosmo's Performance Debugging leverages OpenTelemetry to provide detailed span data for every phase of request processing. View timing breakdowns in Cosmo Studio's trace view to instantly identify whether slowness stems from planning, execution, or downstream services. Track cache hit rates to ensure query plan caching is working effectively. 
+ +### Before & After + +| Before Cosmo | With Cosmo Performance Debugging | +|--------------|----------------------------------| +| Guessing where latency originates | Precise timing breakdown per phase | +| No visibility into planning vs execution time | Separate spans for planning and execution | +| Unknown if query plan cache is working | `enginePlanCacheHit` attribute shows cache status | +| Time-consuming performance investigations | Pinpoint bottlenecks from trace view | + +--- + +## Key Benefits + +1. **Phase-Level Visibility**: Separate timing for authentication, parsing/validation, planning, and execution phases. +2. **Cache Hit Tracking**: The `enginePlanCacheHit` attribute reveals whether query plans are being served from cache. +3. **ART Awareness**: Track when Advanced Request Tracing (ART) is enabled, which adds overhead that is useful for debugging but should not be left on in production. +4. **Zero-Code Instrumentation**: OpenTelemetry spans are automatically generated by the router with no additional setup. +5. **Studio Integration**: View traces directly in Cosmo Studio with visual timelines and span hierarchies. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / SRE +- **Pain Points**: Slow GraphQL queries with unknown root cause, difficulty identifying federation performance issues +- **Goals**: Quickly diagnose and resolve performance bottlenecks + +### Secondary Personas +- Platform engineers optimizing router performance +- DevOps teams monitoring SLA compliance +- Engineering managers tracking system health metrics + +--- + +## Use Cases + +### Use Case 1: Cold Cache Investigation +**Scenario**: Users report intermittent slow responses that seem to resolve after the first few requests. +**How it works**: Open Cosmo Studio trace view and filter for slow operations. Check the `enginePlanCacheHit` attribute - if it shows `false`, the query plan is being computed fresh. Multiple cache misses indicate cold cache issues or cache eviction.
+**Outcome**: Identification of query plan cache misses leading to implementation of Cache Warmer for critical queries. + +### Use Case 2: Subgraph Performance Isolation +**Scenario**: A complex query spanning five subgraphs has inconsistent latency. +**How it works**: View the Operation - Execution span in the trace. Expand to see individual subgraph fetch times. Identify the specific subgraph adding significant latency. +**Outcome**: Targeted optimization of the slow subgraph resolver, reducing overall query latency by 60%. + +### Use Case 3: Planning vs Execution Analysis +**Scenario**: A team needs to understand if slow queries are due to complex planning or slow data fetching. +**How it works**: Compare the Operation - Planning span duration against the Operation - Execution span. High planning time suggests schema or query complexity issues. High execution time points to subgraph or database performance. +**Outcome**: Data-driven decisions on whether to optimize the query structure or backend services. + +--- + +## Technical Summary + +### How It Works +The router automatically generates OpenTelemetry spans for each phase of request processing: +1. **Authenticate** (optional): Time spent validating requests against authentication providers +2. **Operation - Parse and Validate**: Parsing variables, query body, and schema validation +3. **Operation - Planning**: Building the optimized query plan, including normalization +4. **Operation - Execution**: Fetching data from subgraphs and aggregating responses + +### Key Technical Features +- OpenTelemetry-compatible span generation +- `enginePlanCacheHit` attribute for cache monitoring +- `engineRequestTracingEnabled` attribute for ART detection +- Integration with Cosmo Studio trace viewer +- Export to external observability platforms (Jaeger, Zipkin, etc.) 
+ +### Integration Points +- Cosmo Studio for native trace visualization +- OpenTelemetry collectors +- Jaeger, Zipkin, and other tracing backends +- Prometheus/Grafana for metrics dashboards + +### Requirements & Prerequisites +- Router with OpenTelemetry enabled (default configuration) +- Access to Cosmo Studio or external tracing backend +- Understanding of GraphQL request lifecycle + +--- + +## Span Reference + +| Span Name | Description | Key Attributes | +|-----------|-------------|----------------| +| **Authenticate** | Authentication provider validation (optional) | Error status indicates unauthorized | +| **Operation - Parse and Validate** | Query parsing and schema validation | - | +| **Operation - Planning** | Query plan construction | `enginePlanCacheHit`, `engineRequestTracingEnabled` | +| **Operation - Execution** | Subgraph data fetching and aggregation | Per-subgraph timing | + +--- + +## Documentation References + +- Primary docs: `/docs/router/performance-debugging` +- Tracing configuration: `/docs/router/observability/tracing` +- Advanced Request Tracing: `/docs/router/advanced-request-tracing-art` +- Studio tracing view: `/docs/studio/tracing` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL performance debugging +- Query plan tracing +- OpenTelemetry GraphQL + +### Secondary Keywords +- Federation performance +- Subgraph latency +- Cache hit monitoring + +### Related Search Terms +- How to debug slow GraphQL queries +- GraphQL tracing setup +- Federation performance optimization + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/performance/persisted-operations.md b/capabilities/performance/persisted-operations.md new file mode 100644 index 00000000..d6ee5eaf --- /dev/null +++ b/capabilities/performance/persisted-operations.md @@ -0,0 +1,147 @@ +# Persisted Operations + +## Metadata + +| Field | Value | 
+|-------|-------| +| **Capability ID** | `cap-perf-001` | +| **Category** | Performance | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-perf-002` (APQ) | + +--- + +## Quick Reference + +### Name +Persisted Operations + +### Tagline +Pre-register trusted operations for security and performance. + +### Elevator Pitch +Persisted Operations allow you to pre-register GraphQL queries, mutations, and subscriptions with your federated graph. Clients send only a hash identifier instead of the full operation body, reducing bandwidth, improving performance, and enabling a security allowlist that blocks unauthorized operations. + +--- + +## Problem & Solution + +### The Problem +In production GraphQL environments, sending full query bodies with every request wastes bandwidth and exposes your API surface to arbitrary operations. Teams struggle to control which operations can be executed against their graph, and without an allowlist mechanism, any valid GraphQL query can be run - including potentially expensive or malicious ones. + +### The Solution +Cosmo's Persisted Operations let you register trusted operations during your CI/CD pipeline. The control plane stores these operations and replicates them to the CDN for router access. Clients reference operations by hash, reducing payload size and enabling strict operation allowlists in production. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Full query bodies sent with every request | Only hash identifiers transmitted | +| Any arbitrary operation can be executed | Only registered operations allowed | +| No visibility into which operations clients use | Full audit trail of registered operations | +| Larger request payloads increase latency | Minimal payloads improve response times | + +--- + +## Key Benefits + +1.
**Reduced Bandwidth**: Clients send only a short hash instead of potentially large query bodies, significantly reducing network overhead. +2. **Enhanced Security**: Block non-registered operations to prevent unauthorized or malicious queries from executing against your graph. +3. **Audit Trail**: Track exactly which operations are registered and used by each client application. +4. **CI/CD Integration**: Register operations automatically during your release pipeline, ensuring development flexibility while maintaining production security. +5. **Client-Specific Operations**: Associate operations with specific client applications using the `graphql-client-name` header. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Security Engineer +- **Pain Points**: Controlling API access, preventing abuse, reducing attack surface +- **Goals**: Lock down production to only trusted operations while maintaining developer velocity + +### Secondary Personas +- Frontend developers who want faster API responses +- DevOps engineers integrating GraphQL security into CI/CD pipelines +- Security teams requiring operation allowlists + +--- + +## Use Cases + +### Use Case 1: Production Security Lockdown +**Scenario**: An e-commerce company wants to ensure only vetted operations run against their production federated graph. +**How it works**: During the release pipeline, the team runs `wgc operations push` to register all client operations. The router is configured with `block_non_persisted_operations: true`. Any request without a registered hash is rejected. +**Outcome**: Only pre-approved operations execute in production, eliminating the risk of arbitrary query execution. + +### Use Case 2: Bandwidth Optimization for Mobile Apps +**Scenario**: A mobile application sends complex GraphQL queries that increase data usage and latency over cellular networks. +**How it works**: The mobile client library generates operation hashes. 
Operations are pushed during the app's build process. At runtime, the app sends only the hash identifier. +**Outcome**: Request payloads shrink from kilobytes to bytes, improving load times and reducing user data consumption. + +### Use Case 3: Incremental Migration to Strict Mode +**Scenario**: A team wants to adopt persisted operations gradually without breaking existing clients. +**How it works**: First, enable `log_unknown` to identify non-persisted operations in logs. Then enable `safelist` mode to allow operations matching persisted bodies. Finally, enable full blocking once all clients are migrated. +**Outcome**: Safe, incremental rollout of operation restrictions with zero client disruption. + +--- + +## Technical Summary + +### How It Works +Operations are registered via the `wgc operations push` command during CI/CD, which sends them to the control plane. The control plane replicates operations to the Cosmo CDN. Routers fetch operations from the CDN and validate incoming requests against the registered hashes. Clients must send a SHA-256 hash that matches a registered operation. + +### Key Technical Features +- SHA-256 hash-based operation identification +- Client-specific operation namespacing via `graphql-client-name` header +- Multiple enforcement modes: log-only, safelist, full blocking +- CDN-backed operation storage for fast router access +- JSON output support for CI/CD tooling integration + +### Integration Points +- CI/CD pipelines (GitHub Actions, GitLab CI, etc.) 
+- GraphQL client libraries (Apollo, urql, Relay) +- Cosmo CDN and Control Plane + +### Requirements & Prerequisites +- Cosmo CLI (`wgc`) for pushing operations +- Client-side tooling to generate operation manifests +- Router configuration to enable persisted operations enforcement + +--- + +## Documentation References + +- Primary docs: `/docs/router/persisted-queries` +- Persisted operations guide: `/docs/router/persisted-queries/persisted-operations` +- CLI reference: `/docs/cli/operations/push` +- Tutorial: `/docs/tutorial/using-persisted-operations` +- Router security configuration: `/docs/router/configuration#security` + +--- + +## Keywords & SEO + +### Primary Keywords +- Persisted operations +- Trusted documents +- GraphQL safe-list + +### Secondary Keywords +- Operation allowlist +- Query whitelist +- GraphQL security + +### Related Search Terms +- How to secure GraphQL API +- Block arbitrary GraphQL queries +- GraphQL operation registration + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/proxy/file-upload.md b/capabilities/proxy/file-upload.md new file mode 100644 index 00000000..b65c24dd --- /dev/null +++ b/capabilities/proxy/file-upload.md @@ -0,0 +1,148 @@ +# File Upload + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-proxy-005` | +| **Category** | Proxy | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-proxy-001` (Request Header Operations) | + +--- + +## Quick Reference + +### Name +File Upload + +### Tagline +GraphQL multipart file uploads through the router. + +### Elevator Pitch +File Upload enables your federated GraphQL API to handle file uploads using the industry-standard GraphQL multipart request specification. 
Support single and multiple file uploads through your router without custom infrastructure, using the same GraphQL operations pattern your clients already know. + +--- + +## Problem & Solution + +### The Problem +File uploads in GraphQL have historically been awkward. The GraphQL specification does not define how to handle binary data, leading teams to build separate REST endpoints, use base64 encoding (inefficient), or implement proprietary solutions. In federated architectures, the challenge compounds - how do you route file uploads through the gateway to the correct subgraph while maintaining the GraphQL experience? + +### The Solution +Cosmo Router implements the GraphQL multipart request specification, the de facto standard for GraphQL file uploads. Clients send files using standard multipart/form-data encoding, with a mapping that associates files to GraphQL variables. The router handles the multipart parsing and forwards files to the appropriate subgraph. Define a simple `Upload` scalar in your schema, and you are ready to accept files. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Separate REST endpoints for file uploads | Unified GraphQL API for everything | +| Base64 encoding with 33% overhead | Efficient binary multipart transfer | +| Custom file routing logic | Automatic federation routing | +| Proprietary upload implementations | Standards-compliant specification | + +--- + +## Key Benefits + +1. **Standards Compliant**: Implements the GraphQL multipart request specification used by Apollo, Relay, and major GraphQL clients +2. **Single and Multiple Files**: Support uploading one file or many in a single mutation +3. **Unified API**: Files upload through the same GraphQL endpoint as all other operations +4. **Configurable Limits**: Control max file size, number of files, and other upload parameters +5. 
**Type-Safe**: Use the `Upload` scalar in your schema for clear API contracts + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / API Designer +- **Pain Points**: Needs file upload capability in GraphQL API; wants to avoid maintaining separate REST endpoints; requires control over upload limits +- **Goals**: Build comprehensive GraphQL APIs; handle user-generated content; maintain consistent API patterns + +### Secondary Personas +- Frontend developers integrating file uploads +- Mobile developers building image/document upload features +- Platform engineers configuring upload policies + +--- + +## Use Cases + +### Use Case 1: User Avatar Upload +**Scenario**: Your application allows users to upload profile photos through your GraphQL API. +**How it works**: Define an `Upload` scalar and an `updateAvatar(file: Upload!): User!` mutation. Clients use a multipart-capable GraphQL client to send the image file with the mutation. The router parses the multipart request and forwards it to the user service subgraph. +**Outcome**: Profile photo uploads work through the same GraphQL API as all other user operations, with consistent authentication and error handling. + +### Use Case 2: Document Attachment in Mutations +**Scenario**: Users need to attach multiple documents when creating a support ticket. +**How it works**: Define a mutation `createTicket(description: String!, attachments: [Upload!]!): Ticket!`. Clients map multiple files to the attachments variable in the multipart request. The router handles parsing and routing to the ticket service. +**Outcome**: Rich document upload experience without building separate file handling infrastructure. + +### Use Case 3: Bulk Import via File +**Scenario**: Administrators import data by uploading CSV or Excel files through the admin interface. +**How it works**: Create an `importData(file: Upload!): ImportResult!` mutation in your admin subgraph.
The router forwards the uploaded file; the subgraph parses and processes it, returning import statistics. +**Outcome**: File-based data import integrated into your GraphQL admin API with proper authentication and authorization. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router accepts `multipart/form-data` requests following the GraphQL multipart request specification. The request contains three key parts: `operations` (the GraphQL operation with file variables set to null), `map` (associations between form fields and variable paths), and the file fields themselves. The router parses this structure, associates files with their variables, and forwards the request to the appropriate subgraph. + +### Key Technical Features +- GraphQL multipart request specification compliant +- Support for single file uploads +- Support for multiple file uploads in one operation +- Configurable file size limits +- Configurable maximum number of files +- Custom `Upload` scalar type +- Automatic routing to appropriate subgraph + +### Integration Points +- Works with Apollo Client, Relay, urql, and other spec-compliant clients +- Compatible with popular backend libraries (graphql-upload, etc.)
+- Integrates with existing authentication and authorization + +### Requirements & Prerequisites +- Define `Upload` scalar in subgraph schema +- Implement file handling in subgraph resolvers +- Use multipart-capable GraphQL client +- Configure router upload limits as needed + +--- + +## Documentation References + +- Primary docs: `/docs/router/file-upload` +- Configuration options: `/docs/router/configuration#file-upload` +- GraphQL multipart spec: `https://github.com/jaydenseric/graphql-multipart-request-spec` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL file upload +- Multipart GraphQL request +- Federation file upload + +### Secondary Keywords +- GraphQL binary upload +- Upload scalar GraphQL +- GraphQL image upload + +### Related Search Terms +- How to upload files with GraphQL +- GraphQL multipart request specification +- File upload through GraphQL gateway + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/proxy/forward-client-extensions.md b/capabilities/proxy/forward-client-extensions.md new file mode 100644 index 00000000..dc1017a9 --- /dev/null +++ b/capabilities/proxy/forward-client-extensions.md @@ -0,0 +1,150 @@ +# Forward Client Extensions + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-proxy-003` | +| **Category** | Proxy | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-proxy-001` (Request Header Operations), `cap-proxy-004` (Override Subgraph Config) | + +--- + +## Quick Reference + +### Name +Forward Client Extensions + +### Tagline +Propagate extension fields from clients to subgraphs. + +### Elevator Pitch +Forward Client Extensions enables you to pass arbitrary data from clients through the router to your subgraphs using the standard GraphQL `extensions` field. 
Perfect for sending authentication tokens, feature flags, or custom metadata - and essential for subscription initialization data that cannot be sent via headers. + +--- + +## Problem & Solution + +### The Problem +Sometimes headers are not enough. Clients need to send structured data, tokens, or metadata that does not fit naturally into HTTP headers. For WebSocket-based subscriptions, the challenge is even greater - initial connection handshake data cannot be modified after connection establishment, making it impossible to send per-operation tokens via headers. Teams end up implementing custom workarounds or losing flexibility in their API design. + +### The Solution +Cosmo Router supports the `extensions` field as defined in the GraphQL over HTTP specification. By default, any `extensions` data sent by clients is automatically forwarded to all subgraphs. For subscriptions, the `extensions` field in the subscription payload provides the only reliable way to pass initialization data. This standards-compliant approach enables flexible data passing without custom infrastructure. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Custom endpoints for passing extra data | Standard GraphQL extensions field | +| No way to send per-subscription tokens | Extensions in subscription payload | +| Header limitations for structured data | JSON-structured extensions object | +| Non-standard workarounds | Spec-compliant implementation | + +--- + +## Key Benefits + +1. **Standards Compliant**: Implements the GraphQL over HTTP specification for extensions +2. **Zero Configuration**: Extensions are forwarded by default - works out of the box +3. **Subscription Support**: The only reliable method for sending per-subscription initialization data +4. **Flexible Data Structure**: Send any JSON structure, not limited to string header values +5.
**Universal Compatibility**: Works with queries, mutations, and subscriptions + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / API Designer +- **Pain Points**: Needs to pass structured data from clients to subgraphs; requires per-subscription authentication; wants to avoid custom header gymnastics +- **Goals**: Design flexible APIs; implement proper subscription authentication; pass feature flags and metadata cleanly + +### Secondary Personas +- Frontend developers sending client context +- Security engineers implementing subscription authentication +- Platform engineers designing API contracts + +--- + +## Use Cases + +### Use Case 1: Subscription Authentication Tokens +**Scenario**: Your subscription endpoints require authentication, but WebSocket connections are established before you know which subscription will run. You need to pass a token with each subscription operation. +**How it works**: Clients include a token in the subscription payload extensions: `{"extensions":{"token":"user-auth-token"}}`. The router forwards this to the subgraph handling the subscription, which validates the token. +**Outcome**: Per-subscription authentication without modifying WebSocket connection logic or compromising security. + +### Use Case 2: Feature Flags and A/B Testing +**Scenario**: Your subgraphs implement feature flags, and you need to pass the client's feature flag context with each request. +**How it works**: Clients send their feature flag state in extensions: `{"extensions":{"features":{"newCheckout":true,"betaSearch":false}}}`. Subgraphs receive this context and adjust behavior accordingly. +**Outcome**: Seamless feature flag propagation through your federated graph without header complexity. + +### Use Case 3: Client Metadata for Analytics +**Scenario**: Your backend needs client metadata (app version, platform, session ID) for analytics and debugging, structured as JSON rather than individual headers. 
+**How it works**: Clients include metadata in the extensions object: `{"extensions":{"client":{"version":"2.1.0","platform":"ios","sessionId":"abc123"}}}`. Subgraphs log or process this data as needed. +**Outcome**: Rich client context available to all subgraphs in a structured, easy-to-parse format. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router examines the `extensions` field in incoming GraphQL requests. For queries and mutations, this is part of the standard HTTP JSON body. For subscriptions using the graphql-ws protocol, extensions are included in the subscription message payload. The router automatically includes these extensions in all subgraph requests, preserving the original structure. + +### Key Technical Features +- Automatic forwarding of the `extensions` field to all subgraphs +- Support for queries, mutations, and subscriptions +- JSON structure preserved exactly as sent by client +- Compatible with graphql-ws subscription protocol +- No configuration required - enabled by default + +### Integration Points +- Works with all GraphQL client libraries that support extensions +- Compatible with graphql-ws subscription protocol +- Integrates with subgraph authorization middleware + +### Requirements & Prerequisites +- Subgraphs must be designed to read and process extensions +- Clients must use GraphQL libraries that support extensions field +- For subscriptions: graphql-ws compatible WebSocket setup + +--- + +## Documentation References + +- Primary docs: `/docs/router/proxy-capabilities/forward-client-extensions` +- Subscriptions guide: `/docs/router/subscriptions` +- Subscription extensions: `/docs/router/subscriptions#using-the-extensions-field` +- GraphQL over HTTP spec: `https://github.com/graphql/graphql-over-http` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL extensions field +- Forward extensions to subgraphs +- Subscription initialization data + +### Secondary Keywords +- GraphQL client metadata +- 
Federation extensions support +- WebSocket subscription tokens + +### Related Search Terms +- How to pass data to GraphQL subgraphs +- GraphQL subscription authentication +- GraphQL over HTTP extensions + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/proxy/override-subgraph-config.md b/capabilities/proxy/override-subgraph-config.md new file mode 100644 index 00000000..5e86fbe1 --- /dev/null +++ b/capabilities/proxy/override-subgraph-config.md @@ -0,0 +1,146 @@ +# Override Subgraph Config + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-proxy-004` | +| **Category** | Proxy | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-proxy-001` (Request Header Operations), `cap-proxy-003` (Forward Client Extensions) | + +--- + +## Quick Reference + +### Name +Override Subgraph Config + +### Tagline +Dynamic runtime subgraph configuration without redeployment. + +### Elevator Pitch +Override Subgraph Config lets you change subgraph routing URLs and subscription settings at runtime without modifying the router execution config. Perfect for Kubernetes deployments where you need cluster-local DNS names, or any scenario where network topology differs between environments. + +--- + +## Problem & Solution + +### The Problem +In production environments, the subgraph URLs registered in the control plane often differ from where the router should actually send traffic. Kubernetes clusters use internal DNS names. Development environments point to localhost. Staging uses different ports. Teams end up maintaining multiple router configs, implementing complex environment variable substitution, or building custom routing logic - all to solve a simple address translation problem. 
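To make the address-translation problem concrete: the registry might record a public URL such as `https://products.example.com/graphql`, while inside the cluster the same subgraph answers at a service-local address. The router-side override that resolves this reduces to a few lines of `config.yaml`. This is a hedged sketch - the `overrides` key layout and the subgraph name `products` are assumptions based on the options described later in this document; verify against your router version:

```yaml
# Sketch: the control plane keeps the public URL; the router talks to
# the cluster-internal address instead. Key names are assumptions.
overrides:
  subgraphs:
    products:   # hypothetical subgraph name
      routing_url: "http://products-service.default.svc.cluster.local:3002/graphql"
      subscription_url: "http://products-service.default.svc.cluster.local:3002/graphql/ws"
      subscription_protocol: "ws"
      subscription_websocket_subprotocol: "graphql-ws"
```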
+ +### The Solution +Cosmo Router's override configuration lets you define local routing URLs that take precedence over the execution config. Override just the routing URL, or fully customize subscription URLs and protocols per subgraph. The control plane maintains canonical URLs for public visibility while your router uses the addresses that match your network topology. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Multiple router configs per environment | Single config with local overrides | +| Complex URL templating systems | Simple YAML override rules | +| Mismatch between control plane and runtime | Clear separation of concerns | +| Manual coordination of URL changes | Independent environment configuration | + +--- + +## Key Benefits + +1. **Environment Flexibility**: Use different URLs per environment without changing the control plane configuration +2. **Kubernetes Native**: Leverage cluster-local DNS names while keeping public URLs in the registry +3. **Full Protocol Control**: Override not just URLs but subscription protocols and WebSocket subprotocols +4. **Simple Configuration**: Straightforward YAML syntax with per-subgraph granularity +5. **No Redeployment Required**: Change routing at runtime by updating config + +--- + +## Target Audience + +### Primary Persona +- **Role**: DevOps Engineer / Platform Engineer +- **Pain Points**: Needs different routing URLs per environment; wants to use Kubernetes internal DNS; requires subscription protocol flexibility +- **Goals**: Simplify multi-environment deployments; optimize network paths; maintain clear configuration separation + +### Secondary Personas +- Backend developers working in local development +- SREs managing production routing +- Infrastructure teams designing network architecture + +--- + +## Use Cases + +### Use Case 1: Kubernetes Internal Routing +**Scenario**: Your subgraphs are deployed in the same Kubernetes cluster as the router. 
The control plane knows the public URLs, but you want the router to use cluster-internal DNS names for lower latency and to avoid egress costs. +**How it works**: Configure overrides for each subgraph: `routing_url: http://products-service.default.svc.cluster.local:3002/graphql`. The router uses these internal URLs while the control plane maintains the external URLs for schema validation and external access. +**Outcome**: Reduced network latency, eliminated egress costs, and no NAT hairpinning - while maintaining a single source of truth in the control plane. + +### Use Case 2: Subscription Protocol Migration +**Scenario**: You are migrating a subgraph from SSE subscriptions to WebSocket-based subscriptions, but need to deploy gradually without changing the control plane config. +**How it works**: Override the subscription settings for the specific subgraph: `subscription_url`, `subscription_protocol: ws`, and `subscription_websocket_subprotocol: graphql-ws`. Test the new protocol in staging before updating the control plane. +**Outcome**: Safe protocol migration with ability to test and rollback at the router level before committing changes to the control plane. + +### Use Case 3: Local Development Environment +**Scenario**: Developers run a subset of subgraphs locally while pointing to shared staging services for others. They need to route some traffic to localhost without affecting the shared configuration. +**How it works**: Each developer maintains a local config override: `routing_url: http://localhost:4001/graphql` for the subgraphs they are developing. The router merges these overrides with the production execution config. +**Outcome**: Developers can work on individual subgraphs locally while still participating in the full federated graph. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router loads its execution config from the control plane, which includes routing URLs for all subgraphs. 
When override configuration is present, the router replaces the execution config URLs with the override values before routing requests. Overrides can specify routing URLs, subscription URLs, subscription protocols (ws, sse, sse_post), and WebSocket subprotocols (graphql-ws, graphql-transport-ws, auto). + +### Key Technical Features +- Override routing URLs per subgraph +- Override subscription URLs independently from query/mutation URLs +- Configure subscription protocol (ws, sse, sse_post) +- Set WebSocket subprotocol (graphql-ws, graphql-transport-ws, auto) +- Backward-compatible legacy syntax still supported +- Merges with execution config at runtime + +### Integration Points +- Works with all subgraph types +- Compatible with Kubernetes service discovery +- Supports all subscription protocols + +### Requirements & Prerequisites +- Cosmo Router with config.yaml access +- Understanding of your network topology +- Knowledge of subgraph subscription capabilities + +--- + +## Documentation References + +- Primary docs: `/docs/router/proxy-capabilities/override-subgraph-config` +- Configuration guide: `/docs/router/configuration#config-file` +- Subscriptions configuration: `/docs/router/subscriptions` + +--- + +## Keywords & SEO + +### Primary Keywords +- Subgraph routing override +- GraphQL federation routing +- Kubernetes GraphQL routing + +### Secondary Keywords +- Dynamic subgraph configuration +- Subscription URL override +- Federation environment configuration + +### Related Search Terms +- How to override subgraph URL in Cosmo +- Kubernetes internal routing for GraphQL +- Change subgraph address at runtime + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/proxy/request-header-operations.md b/capabilities/proxy/request-header-operations.md new file mode 100644 index 00000000..2e6ec888 --- /dev/null +++ 
b/capabilities/proxy/request-header-operations.md @@ -0,0 +1,149 @@ +# Request Header Operations + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-proxy-001` | +| **Category** | Proxy | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-proxy-002` (Response Header Operations), `cap-proxy-003` (Forward Client Extensions) | + +--- + +## Quick Reference + +### Name +Request Header Operations + +### Tagline +Inject and control headers flowing to your subgraphs. + +### Elevator Pitch +Request Header Operations gives you complete control over HTTP headers sent from the Cosmo Router to your subgraphs. Propagate client headers, set static values, or dynamically compute headers using template expressions - all through simple YAML configuration without writing code. + +--- + +## Problem & Solution + +### The Problem +In federated GraphQL architectures, subgraphs often need context from the original client request - authentication tokens, user IDs, correlation IDs, or custom metadata. Without proper header management, teams either expose all headers (security risk) or manually implement forwarding logic in each subgraph. This leads to inconsistent security postures, duplicated code, and difficulty maintaining cross-cutting concerns like tracing and authentication. + +### The Solution +Cosmo Router provides declarative header operations that let you precisely control which headers reach your subgraphs. Use exact matching or regex patterns to propagate client headers, set static values for service-to-service authentication, or use dynamic expressions to compute header values based on request context. Apply rules globally or per-subgraph, with predictable ordering and sensible defaults. 
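The declarative rules described above can be sketched in the router's `config.yaml`. The rule layout below uses the parameters named in this document (`propagate`, `named`, `matching`, `set`, `default`); treat it as a hedged example and confirm the exact keys against the router configuration reference:

```yaml
# Sketch of a global request-header policy (key names assumed from the docs).
headers:
  all:                  # applied to requests sent to every subgraph
    request:
      - op: "propagate"                     # forward one client header by exact name
        named: "Authorization"
      - op: "propagate"                     # forward anything matching a Go regex
        matching: "(?i)^X-Correlation-.*"
        default: "router-generated"         # fallback when the client omits it
      - op: "set"                           # static header the client never controls
        name: "X-Internal-Auth"
        value: "your-secret-key"
```

Because no headers are forwarded by default, a config like this doubles as the complete allowlist of what your subgraphs can see.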
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| All headers forwarded, creating security exposure | Explicit allowlist with propagate rules | +| Custom middleware needed for header manipulation | Declarative YAML configuration | +| Inconsistent header handling across services | Centralized, uniform header policies | +| No dynamic header computation | Template expressions for context-aware values | + +--- + +## Key Benefits + +1. **Security by Default**: No headers are forwarded automatically - you explicitly define what reaches subgraphs +2. **Flexible Matching**: Use exact names or regex patterns to propagate headers, with support for negation +3. **Dynamic Values**: Compute header values using template expressions with access to authentication claims and request context +4. **Per-Subgraph Control**: Apply different header rules to different subgraphs while maintaining global defaults +5. **Zero Code Required**: All configuration through YAML - no custom middleware or code changes needed + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Gateway Administrator +- **Pain Points**: Needs to enforce consistent header policies across all subgraphs; wants to avoid custom code for header manipulation; requires security control over what data reaches backend services +- **Goals**: Implement secure, maintainable header forwarding; enable authentication context propagation; support tracing and observability headers + +### Secondary Personas +- Backend developers who need specific headers in their subgraphs +- Security engineers reviewing data flow between services +- DevOps engineers implementing service mesh patterns + +--- + +## Use Cases + +### Use Case 1: Authentication Context Propagation +**Scenario**: Your subgraphs need the authenticated user's ID to authorize data access, but you use JWT authentication at the router level. 
+**How it works**: Configure a dynamic header using template expressions: `expression: "request.auth.isAuthenticated ? request.auth.claims.sub : ''"`. The router extracts the user ID from the validated JWT and forwards it as a header. +**Outcome**: Subgraphs receive user identity without parsing JWTs themselves, simplifying authorization logic and improving security. + +### Use Case 2: Service-to-Service Authentication +**Scenario**: Your subgraphs require a shared secret header to verify requests come from the trusted router, not directly from external sources. +**How it works**: Use the `set` operation to add a static secret header: `name: "X-Internal-Auth"` with `value: "your-secret-key"`. Apply this globally or to specific subgraphs. +**Outcome**: Subgraphs can verify request authenticity, enabling zero-trust networking within your infrastructure. + +### Use Case 3: Correlation ID Forwarding +**Scenario**: For distributed tracing, you need to propagate correlation IDs from client requests through to all subgraphs. +**How it works**: Configure a propagate rule with regex matching: `matching: (?i)^X-Correlation-.*` to forward all correlation headers. Set a `default` value for cases where clients don't provide one. +**Outcome**: Complete request tracing across your federated graph with consistent correlation IDs. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router intercepts all incoming requests and applies configured header rules before forwarding to subgraphs. Rules are evaluated in order, with support for propagation (forwarding client headers), setting (adding new headers), and transformation (renaming headers). Template expressions provide access to request context including authentication claims. 
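The evaluation model above - ordered rules, dynamic expressions with access to auth claims, and per-subgraph overrides - might look like this in practice. The expression string is quoted from the use cases in this document; the surrounding YAML layout is an assumption to check against the configuration reference:

```yaml
# Sketch: a global computed header plus a rule scoped to one subgraph.
headers:
  all:
    request:
      - op: "set"
        name: "X-User-Id"
        # Empty string when the request is unauthenticated.
        expression: "request.auth.isAuthenticated ? request.auth.claims.sub : ''"
  subgraphs:
    products:             # hypothetical subgraph name
      request:
        - op: "propagate" # only the products subgraph receives this header
          named: "X-Products-Feature"
```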
+ +### Key Technical Features +- Exact name matching with `named` parameter +- Regex pattern matching with `matching` parameter (Go regex syntax) +- Negation support with `negate_match` for inverse matching +- Header renaming with `rename` parameter +- Default values when headers are missing +- Template expressions for dynamic computation +- Per-subgraph override capability +- Automatic header canonicalization handling + +### Integration Points +- Works with all subgraph types (HTTP, WebSocket) +- Integrates with router authentication for accessing claims in expressions +- Compatible with Custom Modules for advanced use cases + +### Requirements & Prerequisites +- Cosmo Router with config.yaml access +- Understanding of HTTP header semantics +- For dynamic expressions: familiarity with template expression syntax + +--- + +## Documentation References + +- Primary docs: `/docs/router/proxy-capabilities/request-headers-operations` +- Configuration guide: `/docs/router/configuration#config-file` +- Template expressions: `/docs/router/configuration/template-expressions` +- Custom Modules: `/docs/router/custom-modules` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL header forwarding +- Request header propagation +- API gateway header manipulation + +### Secondary Keywords +- Subgraph authentication +- HTTP header rules +- Federation header management + +### Related Search Terms +- How to forward headers in GraphQL federation +- GraphQL router header configuration +- Propagate authentication headers to subgraphs + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/proxy/response-header-operations.md b/capabilities/proxy/response-header-operations.md new file mode 100644 index 00000000..03c662c8 --- /dev/null +++ b/capabilities/proxy/response-header-operations.md @@ -0,0 +1,148 @@ +# Response Header Operations + +## Metadata + +| Field | 
Value | +|-------|-------| +| **Capability ID** | `cap-proxy-002` | +| **Category** | Proxy | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-proxy-001` (Request Header Operations), `cap-proxy-003` (Forward Client Extensions) | + +--- + +## Quick Reference + +### Name +Response Header Operations + +### Tagline +Control response headers from subgraphs to clients. + +### Elevator Pitch +Response Header Operations gives you precise control over which headers from your subgraphs reach the client. Choose from multiple propagation algorithms to handle headers from parallel subgraph requests, set custom response headers, and maintain tight control over caching directives and security headers. + +--- + +## Problem & Solution + +### The Problem +In federated GraphQL, a single client request often fans out to multiple subgraphs that execute in parallel. Each subgraph may return its own response headers - cache directives, rate limit info, custom metadata. Without intelligent aggregation, you either lose important headers or face unpredictable behavior when the same header comes from multiple sources. Traditional solutions require complex custom code to merge headers correctly. + +### The Solution +Cosmo Router provides declarative response header operations with multiple propagation algorithms. Choose `first_write` to keep the first value, `last_write` to use the most recent, or `append` to combine values from all subgraphs. Apply rules globally or per-subgraph, use regex patterns for flexible matching, and set static headers for consistent client responses. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Random header selection from parallel responses | Deterministic algorithms (first_write, last_write, append) | +| Lost cache headers from subgraphs | Controlled propagation of cache directives | +| Custom code needed for header merging | Declarative YAML configuration | +| Inconsistent response headers | Uniform, predictable header policies | + +--- + +## Key Benefits + +1. **Deterministic Behavior**: Choose from three algorithms to handle headers from multiple subgraphs predictably +2. **Cache Control**: Properly propagate cache headers from subgraphs to enable client-side and CDN caching +3. **Security by Default**: No response headers forwarded unless explicitly configured +4. **Flexible Matching**: Use exact names or regex patterns with optional negation +5. **Per-Subgraph Rules**: Apply different propagation rules to different subgraphs when needed + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / API Gateway Administrator +- **Pain Points**: Needs predictable header behavior from federated responses; wants to enable proper caching; requires control over security headers sent to clients +- **Goals**: Implement consistent response header policies; enable CDN caching; ensure security headers reach clients correctly + +### Secondary Personas +- Frontend developers who rely on specific response headers +- Performance engineers optimizing caching strategies +- Security engineers managing response header policies + +--- + +## Use Cases + +### Use Case 1: Cache Header Propagation +**Scenario**: Your subgraphs return `Cache-Control` headers, and you need to propagate appropriate caching directives to clients and CDNs. +**How it works**: Configure a propagate rule with `named: "Cache-Control"` and `algorithm: "first_write"` (to use the first response's directive) or `last_write` (to use the final response's directive).
+**Outcome**: Clients and CDNs receive consistent cache directives, enabling proper caching and reducing unnecessary requests. + +### Use Case 2: Aggregating Rate Limit Information +**Scenario**: Multiple subgraphs return rate limit headers, and you want to provide clients with a combined view of remaining quota. +**How it works**: Use the `append` algorithm on rate limit headers: `matching: (?i)^X-RateLimit-.*` with `algorithm: "append"`. Headers from all subgraphs are combined into comma-separated values. +**Outcome**: Clients receive comprehensive rate limit information from all services in a single response. + +### Use Case 3: Security Header Injection +**Scenario**: You need to ensure certain security headers are always present in responses, regardless of what subgraphs return. +**How it works**: Use the `set` operation to add required headers: `name: "X-Content-Type-Options"` with `value: "nosniff"`. Apply globally to ensure consistent security posture. +**Outcome**: All client responses include required security headers, meeting compliance requirements without modifying subgraphs. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router collects response headers from all subgraph responses during federated execution. For each configured propagation rule, it applies the specified algorithm to determine the final header value. The `first_write` algorithm keeps the first encountered value, `last_write` uses the last value, and `append` combines all values with comma separation. Static headers via `set` are added after propagation. 
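+
+The behavior described above is expressed declaratively in the router's `config.yaml`. A minimal sketch combining propagation and static injection (the exact key layout is illustrative, not an authoritative schema - see the configuration guide referenced below):
+
+```yaml
+# Response header rules applied to responses from all subgraphs
+headers:
+  all:
+    response:
+      # Keep the first Cache-Control value encountered across parallel responses
+      - op: "propagate"
+        named: "Cache-Control"
+        algorithm: "first_write"
+      # Combine rate limit headers from every subgraph into one comma-separated value
+      - op: "propagate"
+        matching: "(?i)^X-RateLimit-.*"
+        algorithm: "append"
+      # Always add a static security header after propagation completes
+      - op: "set"
+        name: "X-Content-Type-Options"
+        value: "nosniff"
+```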
+ +### Key Technical Features +- Three propagation algorithms: `first_write`, `last_write`, `append` +- Exact name matching with `named` parameter +- Regex pattern matching with `matching` parameter +- Negation support with `negate_match` +- Header renaming with `rename` parameter +- Default values when headers are missing +- Per-subgraph rule overrides +- Automatic hop-by-hop header filtering + +### Integration Points +- Works with all response types (queries, mutations, subscriptions) +- Compatible with CDN caching strategies +- Integrates with client-side caching mechanisms + +### Requirements & Prerequisites +- Cosmo Router with config.yaml access +- Understanding of HTTP response header semantics +- Awareness of subgraph response header behavior + +--- + +## Documentation References + +- Primary docs: `/docs/router/proxy-capabilities/response-header-operations` +- Configuration guide: `/docs/router/configuration#config-file` +- Request header operations: `/docs/router/proxy-capabilities/request-headers-operations` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL response headers +- Federated response header propagation +- API gateway response headers + +### Secondary Keywords +- Cache header forwarding +- Response header aggregation +- GraphQL caching headers + +### Related Search Terms +- How to propagate headers from subgraphs +- GraphQL federation response headers +- Merge headers from multiple GraphQL services + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/real-time/cosmo-streams.md b/capabilities/real-time/cosmo-streams.md new file mode 100644 index 00000000..91cd925f --- /dev/null +++ b/capabilities/real-time/cosmo-streams.md @@ -0,0 +1,211 @@ +# Cosmo Streams (EDFS) + +Event-driven federated subscriptions with Kafka, NATS, and Redis integration. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-realtime-002` | +| **Category** | Real-Time | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-realtime-001` (GraphQL Subscriptions) | + +--- + +## Quick Reference + +### Name +Cosmo Streams (EDFS) + +### Tagline +Scalable subscriptions powered by your event infrastructure. + +### Elevator Pitch +Cosmo Streams fundamentally reimagines GraphQL subscriptions by connecting the Router directly to message brokers like Kafka, NATS, and Redis. Subgraphs remain completely stateless - they simply emit events to your existing messaging infrastructure. The Router handles all subscription state, client connections, and data resolution, enabling serverless deployments and dramatic scalability improvements. + +--- + +## Problem & Solution + +### The Problem +Traditional GraphQL subscriptions require subgraphs to maintain long-lived WebSocket connections, implement subscription loops, and track active subscriptions in memory. This stateful architecture prevents serverless deployments, creates tight coupling between graph architecture and runtime environment, and consumes significant resources - up to 3 WebSocket connections per client (client-to-router, router-to-subgraph, plus internal subgraph state). At 10,000 clients, this can mean 30GB+ of memory just for connection overhead. + +### The Solution +Cosmo Streams treats subscriptions as an event-driven problem rather than a connection-driven one. Subgraphs publish events to Kafka, NATS, or Redis when data changes. The Router subscribes to these events, determines affected client subscriptions, deduplicates work, fetches required data via plain HTTP requests, and broadcasts updates to clients. All subscription state lives in the Router where it can be optimized, monitored, and scaled efficiently. 
+ +### Before & After + +| Before Cosmo | With Cosmo Streams | +|--------------|-------------------| +| 3 WebSocket connections per client | 1 client connection, HTTP to subgraphs | +| Subgraphs must be stateful | Subgraphs are completely stateless | +| No serverless deployment option | Full serverless compatibility | +| Subscription logic in every subgraph | Zero subscription code in subgraphs | +| 30GB memory for 10k clients | ~150-200MB for 10k clients | + +--- + +## Key Benefits + +1. **Stateless Subgraphs**: Subgraphs never hold WebSocket connections or subscription state, making them ideal for Lambda, Cloud Run, or any serverless environment +2. **Massive Resource Efficiency**: 10,000 connected clients consume only 150-200MB of memory with near-zero CPU when idle, versus gigabytes with traditional subscriptions +3. **Event-Native Architecture**: Subscriptions integrate naturally with existing Kafka, NATS, or Redis infrastructure rather than introducing proprietary protocols +4. **Centralized Connection Management**: All client connections live in the Router where they can be optimized, deduplicated, and monitored efficiently +5. 
**Zero Subscription Logic in Subgraphs**: No WebSocket servers, no callback protocols, no proprietary APIs - subgraphs just publish events and respond to HTTP requests + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Backend Architect +- **Pain Points**: Current subscription architecture is expensive, stateful, and blocks serverless adoption; scaling is painful and costly +- **Goals**: Reduce infrastructure costs; enable serverless for all services; leverage existing event infrastructure + +### Secondary Personas +- Backend Developers who want to add real-time features without implementing WebSocket servers +- DevOps Engineers managing subscription infrastructure and connection scaling +- Engineering Managers evaluating total cost of ownership for real-time features + +--- + +## Use Cases + +### Use Case 1: Serverless Real-Time Updates +**Scenario**: A company wants to move their GraphQL subgraphs to AWS Lambda but can't because subscriptions require persistent connections +**How it works**: Subgraphs publish entity update events to NATS when data changes. The Router subscribes to relevant topics and fetches current state via HTTP when events arrive. Lambda functions handle only stateless HTTP requests. +**Outcome**: Full serverless deployment with real-time subscription support; pay only for actual compute usage + +### Use Case 2: Long-Running Job Tracking +**Scenario**: Users submit data processing jobs and need to track progress through multiple stages handled by different backend services +**How it works**: A mutation kicks off the job. Each backend service (data validation, processing, export) publishes progress events to Kafka. Clients subscribe to job status. The Router aggregates state from multiple subgraphs on each event. 
+**Outcome**: Users see real-time progress without polling; backend services remain decoupled and independently scalable + +### Use Case 3: High-Scale Notification System +**Scenario**: A platform needs to push updates to 100,000+ concurrent users without breaking the infrastructure budget +**How it works**: Backend services publish notifications to Redis Pub/Sub. The Router handles all client WebSocket connections with efficient epoll/kqueue I/O. Subscription deduplication ensures each unique notification is fetched only once regardless of subscriber count. +**Outcome**: 100x improvement in clients-per-server compared to traditional architecture + +### Use Case 4: Federated Entity Updates +**Scenario**: An Employee entity has fields from multiple subgraphs (HR, Payroll, Projects). Any subgraph update should notify subscribers. +**How it works**: Each subgraph publishes to the same NATS subject when their Employee fields change. The Router receives the event, resolves current field values from all contributing subgraphs via HTTP, and pushes the complete entity to subscribers. +**Outcome**: True federated real-time updates without inter-subgraph coordination + +--- + +## Competitive Positioning + +### Key Differentiators +1. Direct integration with production message brokers (Kafka, NATS, Redis) rather than proprietary protocols +2. Complete elimination of subscription state from subgraphs +3. Efficient epoll/kqueue-based I/O handling tens of thousands of connections +4. Automatic subscription deduplication and resource cleanup +5. 
Support for subscription filtering with dynamic conditions + +### Comparison with Alternatives + +| Aspect | Cosmo Streams | Traditional GraphQL Subscriptions | Custom Event Gateway | +|--------|---------------|----------------------------------|---------------------| +| Subgraph State | Stateless | Stateful (WebSockets) | Varies | +| Serverless Compatible | Yes | No | Requires custom work | +| Message Broker Integration | Native (Kafka, NATS, Redis) | None | DIY | +| Memory per 10k clients | ~150-200MB | ~30GB | Varies | +| Subscription Deduplication | Automatic | Manual | DIY | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We don't use Kafka/NATS/Redis" | Many organizations already have message infrastructure; NATS is lightweight to add if not | +| "We already have WebSocket subscriptions working" | Consider the TCO: Cosmo Streams can reduce infrastructure costs by 90%+ while enabling serverless | +| "Isn't this adding another moving part?" | You're replacing multiple stateful subscription systems with your existing message broker; net simplification | +| "What about subscription authorization?" | Filter subscriptions using the @openfed__subscriptionFilter directive; additional auth options available | + +--- + +## Technical Summary + +### How It Works +Event-Driven Subgraphs define schema directives (@edfs__kafkaSubscribe, @edfs__natsSubscribe, @edfs__redisSubscribe) that map subscription fields to message broker topics. The Router connects to configured brokers and subscribes to relevant topics. When a message arrives, the Router identifies affected client subscriptions, deduplicates fetch requests, resolves entity data via HTTP from subgraphs, and broadcasts results to clients over WebSocket, SSE, or Multipart HTTP. 
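+
+Connecting the Router to the brokers referenced by those directives is purely configuration; subgraphs need no code changes. A minimal sketch of an `events` section with one NATS and one Kafka provider (provider IDs, URLs, and key names are illustrative - consult the Cosmo Streams docs for the authoritative schema):
+
+```yaml
+# Router config: broker providers that EDFS directives reference by ID
+events:
+  providers:
+    nats:
+      - id: my-nats                  # matched by providerId in @edfs__natsSubscribe
+        url: "nats://localhost:4222"
+    kafka:
+      - id: my-kafka
+        brokers:
+          - "localhost:9092"
+```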
+ +### Key Technical Features +- Kafka, NATS, and Redis Pub/Sub provider support +- NATS JetStream for persistent, replayable event streams +- NATS Request/Reply for synchronous event-driven queries +- Publish directives for GraphQL mutations that emit events +- Topic templating with argument interpolation (e.g., `employeeUpdated.{{ args.id }}`) +- Subscription filtering with AND/OR/NOT/IN conditions +- Provider ID configuration for multiple broker instances +- Automatic subscription deduplication +- Epoll/Kqueue I/O for efficient connection handling +- ~40 goroutines for 10k idle clients + +### Integration Points +- Apache Kafka clusters +- NATS and NATS JetStream servers +- Redis Pub/Sub +- Any service that can publish to these message systems +- Existing federated subgraphs (resolved via HTTP) + +### Requirements & Prerequisites +- Cosmo Router with events configuration +- At least one supported message broker (Kafka, NATS, or Redis) +- Event-Driven Subgraph schema definitions (no implementation needed) +- Services publishing events to the configured topics + +--- + +## Proof Points + +### Metrics & Benchmarks +- 10,000 clients connected: ~150-200MB memory, 0% CPU when idle +- 10,000 idle clients require only ~40 goroutines +- Supports multi-core scaling for 10k+ events per second +- 95%+ reduction in memory usage compared to traditional WebSocket subscriptions + +--- + +## Documentation References + +- Primary docs: `/docs/router/cosmo-streams` +- Kafka integration: `/docs/router/cosmo-streams/kafka` +- NATS integration: `/docs/router/cosmo-streams/nats` +- Redis integration: `/docs/router/cosmo-streams/redis` +- Federation concepts: `/docs/federation/event-driven-federated-subscriptions` +- Router configuration: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- Event-driven GraphQL subscriptions +- Kafka GraphQL +- NATS GraphQL + +### Secondary Keywords +- EDFS +- Cosmo Streams +- Redis GraphQL subscriptions +- Serverless 
GraphQL subscriptions +- Federated subscriptions + +### Related Search Terms +- GraphQL event sourcing +- Message broker GraphQL integration +- Stateless GraphQL subscriptions +- Scalable real-time GraphQL +- GraphQL Kafka integration +- GraphQL NATS integration + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/real-time/graphql-subscriptions.md b/capabilities/real-time/graphql-subscriptions.md new file mode 100644 index 00000000..152f43bf --- /dev/null +++ b/capabilities/real-time/graphql-subscriptions.md @@ -0,0 +1,195 @@ +# GraphQL Subscriptions + +Real-time GraphQL updates with multiple protocol support and connection multiplexing. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-realtime-001` | +| **Category** | Real-Time | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-realtime-002` (Cosmo Streams) | + +--- + +## Quick Reference + +### Name +GraphQL Subscriptions + +### Tagline +Real-time updates with zero limitations. + +### Elevator Pitch +Cosmo Router provides out-of-the-box subscription support with multiple protocol options including WebSockets, Server-Sent Events, and Multipart HTTP. Connection multiplexing optimizes resource usage by sharing connections across clients with identical authentication, making real-time features scalable and efficient. + +--- + +## Problem & Solution + +### The Problem +Building real-time features in GraphQL applications requires managing long-lived connections between clients and servers. Teams struggle with choosing the right protocol for their use case, handling connection overhead at scale, and ensuring compatibility between different client types and backend services. Without proper multiplexing, each client subscription opens a separate connection to backend subgraphs, leading to resource exhaustion. 
+ +### The Solution +Cosmo Router acts as a smart subscription gateway that supports multiple real-time protocols out of the box. It automatically multiplexes client connections, routing multiple subscriptions with the same authentication through a single connection to subgraphs. Teams can choose the optimal protocol for each use case without changing their architecture. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Limited to one subscription protocol | Choose from WebSockets, SSE, or Multipart HTTP | +| Each client opens a new backend connection | Multiplexed connections share resources efficiently | +| Manual protocol negotiation logic | Automatic protocol translation between clients and subgraphs | +| Connection overhead limits scale | Handle thousands of concurrent subscriptions | + +--- + +## Key Benefits + +1. **Protocol Flexibility**: Support graphql-ws, SSE, Multipart HTTP, subscriptions-transport-ws, and Absinthe protocols between clients and subgraphs without code changes +2. **Efficient Resource Usage**: Connection multiplexing groups subscriptions with identical headers into shared connections, dramatically reducing memory and connection overhead +3. **Zero Configuration**: Subscriptions work immediately with sensible defaults; customize protocols per-subgraph as needed +4. **Legacy Compatibility**: Support older clients using subscriptions-transport-ws or Absinthe while modernizing backend services +5. 
**Extension Support**: Pass additional metadata like Bearer tokens through the GraphQL extensions field, automatically forwarded to all subgraph requests + +--- + +## Target Audience + +### Primary Persona +- **Role**: Frontend Developer / Full-Stack Developer +- **Pain Points**: Need real-time updates but constrained by client library protocol support; worried about connection overhead in production +- **Goals**: Ship real-time features quickly; ensure they scale with user growth + +### Secondary Personas +- Platform Engineers managing API infrastructure and connection resources +- Backend Developers implementing subgraph subscription resolvers +- DevOps Engineers monitoring WebSocket connection counts and memory usage + +--- + +## Use Cases + +### Use Case 1: Live Dashboard Updates +**Scenario**: A fintech application displays real-time stock prices and portfolio values to thousands of concurrent users +**How it works**: Clients connect via graphql-ws WebSocket protocol. The Router multiplexes all subscriptions for the same stock symbols through shared connections to the pricing subgraph. Header-based authentication ensures proper connection grouping. +**Outcome**: 10,000 connected clients consume the connection resources of hundreds rather than thousands of backend connections + +### Use Case 2: Hybrid Client Support +**Scenario**: An enterprise application has legacy mobile apps using subscriptions-transport-ws and modern web clients using graphql-ws +**How it works**: The Router accepts both protocols from clients while communicating with subgraphs using the modern graphql-ws protocol. Protocol translation is automatic and transparent. 
+**Outcome**: Teams modernize backend services without breaking existing mobile app versions + +### Use Case 3: Resource-Efficient Notifications +**Scenario**: A collaboration platform needs to push notifications but wants to minimize server resources +**How it works**: Clients use Server-Sent Events (SSE) instead of WebSockets for one-way notification streams. SSE uses less memory than WebSocket connections and works better through certain proxies. +**Outcome**: Notification system scales to more concurrent users with the same infrastructure + +--- + +## Competitive Positioning + +### Key Differentiators +1. Widest protocol support including legacy Absinthe for Phoenix/Elixir ecosystems +2. Automatic connection multiplexing based on authentication context +3. Per-subgraph protocol configuration allowing gradual modernization +4. Built-in extension field forwarding for custom metadata + +### Comparison with Alternatives + +| Aspect | Cosmo | Apollo Router | DIY Gateway | +|--------|-------|---------------|-------------| +| Protocol Options | 5 protocols | 2 protocols | Custom only | +| Connection Multiplexing | Automatic | Limited | Manual | +| Legacy Protocol Support | Yes (Absinthe, subscriptions-transport-ws) | No | DIY | +| Configuration | Per-subgraph | Global | Complex | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We only need WebSockets" | Start with graphql-ws and add SSE or Multipart later as needs evolve; the flexibility is built-in | +| "Our backend doesn't support all protocols" | Configure the optimal protocol per-subgraph; the Router handles translation | +| "WebSocket connections are expensive" | Multiplexing dramatically reduces backend connections; consider SSE for unidirectional updates | + +--- + +## Technical Summary + +### How It Works +The Cosmo Router establishes long-lived connections with clients using their preferred protocol (WebSocket variants, SSE, or Multipart HTTP). 
When multiple clients subscribe to the same data with matching authentication headers, the Router groups these into a single upstream connection to the subgraph. The Router handles protocol translation, heartbeats, reconnection, and graceful termination. + +### Key Technical Features +- graphql-ws WebSocket subprotocol (default, recommended) +- Server-Sent Events with GET and POST request support +- Multipart HTTP for chunked subscription responses +- subscriptions-transport-ws for legacy client compatibility +- Absinthe (Phoenix) protocol for Elixir ecosystems +- Extension field forwarding for custom authentication tokens +- Header-based subscription grouping for multiplexing +- Automatic connection cleanup on router config updates + +### Integration Points +- Any GraphQL client library supporting standard subscription protocols +- Subgraphs implementing graphql-ws, SSE, or legacy protocols +- Load balancers and proxies (SSE recommended for HTTP/1.1 environments) +- Authentication systems via header forwarding + +### Requirements & Prerequisites +- Cosmo Router deployment +- Subgraphs with subscription support (protocol configurable via CLI) +- Client library supporting at least one of the supported protocols + +--- + +## Proof Points + +### Metrics & Benchmarks +- Connection multiplexing can reduce backend connections by 90%+ for common subscription patterns +- SSE uses approximately 50% less memory per connection compared to WebSockets +- Supports thousands of concurrent subscriptions per Router instance + +--- + +## Documentation References + +- Primary docs: `/docs/router/subscriptions` +- WebSocket protocols: `/docs/router/subscriptions/websocket-subprotocols` +- Server-Sent Events: `/docs/router/subscriptions/server-sent-events-sse` +- Multipart HTTP: `/docs/router/subscriptions/multipart-http-requests` +- Subgraph configuration: `/docs/cli/subgraph/create`, `/docs/cli/subgraph/update` +- Header forwarding: `/docs/router/proxy-capabilities` + +--- + +## 
Keywords & SEO + +### Primary Keywords +- GraphQL subscriptions +- Real-time GraphQL +- WebSocket GraphQL + +### Secondary Keywords +- graphql-ws +- Server-Sent Events +- SSE GraphQL +- Multipart HTTP subscriptions + +### Related Search Terms +- GraphQL real-time updates +- WebSocket connection pooling +- GraphQL subscription protocols +- Federation subscriptions +- GraphQL push notifications + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/router/config-hot-reload.md b/capabilities/router/config-hot-reload.md new file mode 100644 index 00000000..956d02d6 --- /dev/null +++ b/capabilities/router/config-hot-reload.md @@ -0,0 +1,170 @@ +# Config Hot Reload + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-router-005` | +| **Category** | Router | +| **Status** | GA | +| **Availability** | Free | +| **Related Capabilities** | `cap-router-001`, `cap-router-004` | + +--- + +## Quick Reference + +### Name +Config Hot Reload + +### Tagline +Update router configuration at runtime without downtime. + +### Elevator Pitch +Cosmo Router supports hot-reloading of configuration changes without service interruption. Whether updating from the CDN after schema publishes or watching local configuration files, the router gracefully transitions traffic while maintaining active connections. Zero-downtime deployments become seamless with automatic configuration polling and graceful shutdown handling. + +--- + +## Problem & Solution + +### The Problem +Traditional configuration changes require service restarts, causing connection drops and potential downtime. In high-traffic environments, coordinating configuration rollouts across router fleets is complex. Teams need to update schemas, routing rules, and other configuration without impacting users or triggering Kubernetes pod restarts. 
+ +### The Solution +Cosmo Router automatically polls the CDN for configuration updates (default: every 15 seconds) and gracefully transitions to new configurations. During transitions, both old and new graph instances run simultaneously, ensuring no requests are dropped. Local configuration files can also be watched for changes, and the router responds to SIGHUP signals for on-demand reloads. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Restart required for config changes | Hot-reload without process restart | +| Connection drops during updates | Graceful transition maintains connections | +| Manual coordination of rollouts | Automatic CDN polling across fleet | +| Downtime during schema updates | Zero-downtime configuration updates | + +--- + +## Key Benefits + +1. **Zero-Downtime Updates**: Configuration changes apply without dropping active connections or interrupting service +2. **Graceful Transitions**: Both old and new configurations run simultaneously during the transition period +3. **Automatic CDN Polling**: Schema updates published through the CLI automatically propagate to all routers +4. **File-Based Hot Reload**: Local configuration file changes trigger automatic reloads for development workflows +5. 
**Signal-Based Control**: Send SIGHUP to trigger immediate reload when needed + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: Coordinating configuration updates across router fleets; avoiding downtime during schema changes; managing graceful deployments +- **Goals**: Achieve zero-downtime deployments; automate configuration propagation; maintain service reliability + +### Secondary Personas +- DevOps engineers managing production deployments +- Backend developers iterating on local configurations +- Release engineers coordinating schema rollouts + +--- + +## Use Cases + +### Use Case 1: Zero-Downtime Schema Updates +**Scenario**: A team publishes a new subgraph schema and needs it to propagate to all production routers without downtime +**How it works**: Publish the schema using the CLI; the CDN is updated; routers detect the change on their next poll interval; each router creates a new graph instance, runs both simultaneously, then gracefully retires the old instance +**Outcome**: New schema is live across all routers with zero dropped requests + +### Use Case 2: Development Configuration Iteration +**Scenario**: A developer is iterating on router configuration locally and wants changes to apply immediately +**How it works**: Enable `watch_config` with an appropriate interval; save configuration changes; the router detects the file modification and reloads automatically +**Outcome**: Rapid configuration iteration without manual restarts + +### Use Case 3: Emergency Configuration Rollout +**Scenario**: An urgent configuration change needs to be applied immediately, not waiting for the next poll interval +**How it works**: Update the configuration file and send `kill -HUP <pid>` to the router process; the router immediately processes the change and reloads +**Outcome**: Immediate configuration update on demand + +--- + +## Technical Summary + +### How It Works +The router polls the CDN for configuration updates
at configurable intervals (default: 15 seconds with jitter). When a change is detected, it creates a new graph instance with an optimized query planner. Both instances run simultaneously during the grace period, allowing in-flight requests to complete on the old instance while new requests go to the new one. After the grace period, the old instance is cleaned up. + +### Key Technical Features +- Automatic CDN polling with configurable interval and jitter +- File-based execution config watching with `watch: true` +- SIGHUP signal handling for on-demand reloads +- Configurable grace period for resource cleanup (default: 30s) +- Configurable shutdown delay for server resources (default: 60s) +- Startup delay option to prevent thundering herd in clusters + +### Configuration Options +```yaml +# CDN polling +poll_interval: 10s +poll_jitter: 5s +grace_period: 30s +shutdown_delay: 60s + +# File-based execution config watching +execution_config: + file: + path: "execution-config.json" + watch: true + watch_interval: "5s" + +# Configuration file watching +watch_config: + enabled: true + interval: "10s" + startup_delay: + enabled: false + maximum: "10s" +``` + +### Integration Points +- Cosmo CDN for configuration distribution +- File system for local configuration watching +- POSIX signals (SIGHUP) for manual triggers +- Kubernetes liveness/readiness probes + +### Requirements & Prerequisites +- Network access to CDN for automatic updates +- Appropriate grace period and shutdown delay settings +- Sufficient memory for running dual graph instances during transitions + +### Limitations +- WebSocket and SSE connections close when the old instance shuts down (clients must reconnect) +- Query planner cache is invalidated on configuration swap (temporary latency increase) +- Changes to `watch_config` section are not themselves hot-reloaded +- Environment variable and flag changes require full restart + +--- + +## Documentation References + +- Primary docs: 
`/docs/router/deployment/config-hot-reload` +- Configuration reference: `/docs/router/configuration` +- Development overview: `/docs/router/development` + +--- + +## Keywords & SEO + +### Primary Keywords +- Hot Reload Configuration +- Zero-Downtime Deployment +- Runtime Configuration Update + +### Secondary Keywords +- Graceful Configuration Reload +- Live Configuration Update +- Config Hot Swap + +### Related Search Terms +- GraphQL router hot reload +- Update router without restart +- Zero downtime schema update +- Graceful configuration change diff --git a/capabilities/router/development-mode.md b/capabilities/router/development-mode.md new file mode 100644 index 00000000..0d9526fb --- /dev/null +++ b/capabilities/router/development-mode.md @@ -0,0 +1,156 @@ +# Development Mode + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-router-006` | +| **Category** | Router | +| **Status** | GA | +| **Availability** | Free | +| **Related Capabilities** | `cap-router-001`, `cap-router-004`, `cap-router-005` | + +--- + +## Quick Reference + +### Name +Development Mode + +### Tagline +Development-optimized settings with verbose error output and simplified setup. + +### Elevator Pitch +Development Mode provides a single configuration toggle that optimizes the Cosmo Router for local development. Enable human-readable logging, detailed error propagation with stack traces, and automatic Docker-to-localhost connectivity. Spend less time configuring and more time building. + +--- + +## Problem & Solution + +### The Problem +Setting up a GraphQL router for local development often requires configuring multiple settings: switching from JSON to human-readable logs, enabling verbose error messages, exposing stack traces for debugging, and handling network connectivity between Docker and localhost. Developers waste time on configuration instead of building features. 
+ +### The Solution +Cosmo Router's Development Mode provides a single `dev_mode: true` toggle that configures all development-friendly settings at once. This includes human-readable logging, full error propagation with status codes and stack traces, and automatic Docker-to-localhost fallback for running subgraphs locally. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Multiple config changes for dev setup | Single `dev_mode: true` toggle | +| JSON logs hard to read during debugging | Human-readable sugared log output | +| Errors hidden or sanitized | Full error details with stack traces | +| Docker-to-localhost networking issues | Automatic localhost fallback inside Docker | + +--- + +## Key Benefits + +1. **One-Line Setup**: Enable comprehensive development settings with a single configuration flag +2. **Human-Readable Logs**: Switch from JSON to sugared log output that is easy to read in terminals +3. **Verbose Error Output**: Propagate subgraph status codes, error locations, and stack traces for faster debugging +4. **Docker Connectivity**: Automatic retry via `host.docker.internal` for requests that fail to connect to localhost +5.
**Production Safety**: Clear separation of development and production configurations prevents accidental exposure + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / API Developer +- **Pain Points**: Time wasted on development environment configuration; difficulty debugging federated queries; Docker networking complexity +- **Goals**: Start developing quickly; see detailed error information; seamlessly run subgraphs locally + +### Secondary Personas +- Frontend developers running the router locally for testing +- New team members setting up their development environment +- DevOps engineers creating development environment templates + +--- + +## Use Cases + +### Use Case 1: Quick Local Development Setup +**Scenario**: A developer needs to run the Cosmo Router locally for the first time to test their subgraph changes +**How it works**: Add `dev_mode: true` to the configuration file; start the router; all development-friendly settings are automatically applied +**Outcome**: Ready to develop in seconds with proper logging and error visibility + +### Use Case 2: Debugging Subgraph Errors +**Scenario**: A developer is troubleshooting why a GraphQL operation returns an error from a subgraph +**How it works**: With dev mode enabled, subgraph errors include the HTTP status code in extensions, full error locations, and stack traces when available. The human-readable logs provide additional context. +**Outcome**: Complete error context for rapid debugging without toggling individual settings + +### Use Case 3: Docker with Localhost Subgraphs +**Scenario**: A developer runs the router in Docker but wants to connect to subgraphs running on their host machine +**How it works**: The router automatically detects the Docker environment and retries failed localhost connections using `host.docker.internal`. No additional configuration required.
+ +**Outcome**: Seamless Docker-to-localhost connectivity out of the box + +--- + +## Technical Summary + +### How It Works +When `dev_mode: true` is set, the router applies a preset of development-optimized configurations. This includes human-readable logging, full subgraph error propagation, and enabling Advanced Request Tracing (ART) without additional security configuration. Inside Docker, the router detects localhost connection failures and automatically retries with `host.docker.internal`. + +### Key Technical Features +- Single `dev_mode: true` configuration flag +- Human-readable sugared log output (non-JSON) +- Full subgraph HTTP status code propagation +- Error locations included in responses +- Stack trace propagation via `allowed_extension_fields` +- Automatic Docker localhost fallback (enabled by default) +- Advanced Request Tracing (ART) enabled without security headers + +### Configuration Equivalent +Enabling `dev_mode: true` is equivalent to: +```yaml +json_log: false +subgraph_error_propagation: + propagate_status_codes: true + omit_locations: false + allowed_extension_fields: ["code", "stacktrace"] +``` + +### Integration Points +- Local development environments +- Docker-based development setups +- IDE debugging workflows +- Local subgraph testing + +### Requirements & Prerequisites +- Configuration file with `dev_mode: true` +- Not recommended for production environments + +### Important Notes +- To enable specific production features, first disable dev mode, then configure individual settings +- The Docker localhost fallback can be disabled via `LOCALHOST_FALLBACK_INSIDE_DOCKER=false` +- Configuration file hot-reloading works in dev mode for rapid iteration + +--- + +## Documentation References + +- Primary docs: `/docs/router/development` +- Development mode details: `/docs/router/development/development-mode` +- Configuration reference: `/docs/router/configuration` +- Hot reload: `/docs/router/deployment/config-hot-reload` + +--- + +##
Keywords & SEO + +### Primary Keywords +- Development Mode +- Local Development +- Developer Experience + +### Secondary Keywords +- GraphQL Development Setup +- Debug Configuration +- Local Testing Mode + +### Related Search Terms +- GraphQL router local development +- Debug federated GraphQL +- Docker localhost GraphQL +- Development environment setup diff --git a/capabilities/router/graphql-federation-router.md b/capabilities/router/graphql-federation-router.md new file mode 100644 index 00000000..b91238ff --- /dev/null +++ b/capabilities/router/graphql-federation-router.md @@ -0,0 +1,141 @@ +# GraphQL Federation Router + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-router-001` | +| **Category** | Router | +| **Status** | GA | +| **Availability** | Free (Apache 2.0 Licensed) | +| **Related Capabilities** | `cap-router-002`, `cap-router-004`, `cap-router-005` | + +--- + +## Quick Reference + +### Name +GraphQL Federation Router + +### Tagline +High-performance Go-based router for federated GraphQL at scale. + +### Elevator Pitch +The Cosmo Router is a production-ready, Apache 2.0 licensed GraphQL Federation router built in Go. It intelligently routes requests across your distributed GraphQL services, aggregates responses, and operates independently while maintaining seamless integration with the Cosmo Control Plane for configuration updates. + +--- + +## Problem & Solution + +### The Problem +Organizations adopting GraphQL Federation need a reliable gateway to orchestrate requests across multiple subgraphs. Existing solutions often suffer from performance bottlenecks, lack of optimization for federated queries, or vendor lock-in through proprietary licensing. Teams struggle with complex query planning, high latency in distributed environments, and the operational burden of managing federation at scale. 
+ +### The Solution +Cosmo Router provides a battle-tested, open-source federation gateway powered by graphql-go-tools, a mature and highly-optimized GraphQL engine. It automatically fetches the latest configuration from the CDN, creates highly-optimized query plans that are cached across requests, and seamlessly updates its engine on-the-fly when schema changes occur. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manual query routing logic across services | Automatic intelligent routing based on federation spec | +| Performance degradation at scale | Highly-optimized query planner with request caching | +| Downtime during configuration updates | Hot-reload configuration without service interruption | +| Proprietary licensing constraints | Apache 2.0 open-source freedom | + +--- + +## Key Benefits + +1. **High Performance**: Built in Go with a focus on performance and maintainability, leveraging graphql-go-tools for optimized query execution +2. **Full Federation Support**: Compatible with GraphQL Federation v1 and v2 specifications out of the box +3. **Independent Operation**: Functions autonomously without depending on Control Plane availability, ensuring resilience +4. **Automatic Updates**: Periodically checks the CDN for configuration updates and reconfigures the engine on-the-fly +5.
**Open Source**: Apache 2.0 licensed, eliminating vendor lock-in and enabling customization + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Infrastructure Architect +- **Pain Points**: Need to run a reliable, high-performance federation gateway; concerns about vendor lock-in; require flexibility to customize and extend +- **Goals**: Deploy a production-ready federation layer that scales with the organization; minimize operational overhead; maintain full control over the infrastructure + +### Secondary Personas +- Backend developers implementing federated services +- DevOps engineers managing API gateway infrastructure +- Engineering managers evaluating federation solutions + +--- + +## Use Cases + +### Use Case 1: Production Federation Gateway +**Scenario**: A large e-commerce platform needs to unify multiple GraphQL services (products, inventory, orders, users) into a single federated graph +**How it works**: Deploy the Cosmo Router as the gateway, configure subgraph routing through the Control Plane, and the router automatically handles query planning and response aggregation +**Outcome**: Single unified GraphQL endpoint for clients with optimized query execution across all backend services + +### Use Case 2: Zero-Downtime Schema Updates +**Scenario**: Development teams need to deploy schema changes without interrupting production traffic +**How it works**: Publish schema updates through the CLI; the router automatically detects changes via CDN polling, creates new query plans, and gracefully transitions traffic to the new configuration +**Outcome**: Continuous deployment of schema changes with zero client disruption + +### Use Case 3: Self-Hosted Federation Infrastructure +**Scenario**: An enterprise requires complete control over their federation infrastructure due to compliance requirements +**How it works**: Deploy the router in their own infrastructure, configure it with a static execution config, and integrate with their 
existing observability stack +**Outcome**: Full federation capabilities with complete infrastructure ownership and compliance adherence + +--- + +## Technical Summary + +### How It Works +The Cosmo Router fetches the latest valid router configuration from the CDN and creates a highly-optimized query planner. This query planner is cached across requests for maximum performance. At configurable intervals, it checks the CDN for new updates and reconfigures its engine on-the-fly. The router registers itself with the Control Plane API to enable reporting on the status and health of the router fleet. + +### Key Technical Features +- GraphQL Federation v1 and v2 protocol support +- Highly-optimized query planner with request-level caching +- Automatic configuration polling and hot-reload +- Health check endpoints for Kubernetes deployments +- GraphQL Playground for development and testing +- Configurable logging, metrics, and tracing + +### Integration Points +- Cosmo Control Plane for configuration and monitoring +- CDN for configuration distribution +- OpenTelemetry for observability +- Any GraphQL Federation v1/v2 compatible subgraph + +### Requirements & Prerequisites +- Graph API token from Cosmo Control Plane (or static execution config) +- Network access to subgraphs +- Optional: Control Plane access for centralized management + +--- + +## Documentation References + +- Primary docs: `/docs/router/intro` +- Configuration guide: `/docs/router/configuration` +- Deployment guide: `/docs/deployments-and-hosting/` +- Development setup: `/docs/router/development` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL Federation Router +- GraphQL Gateway +- Federation Gateway + +### Secondary Keywords +- GraphQL API Gateway +- Federated GraphQL +- GraphQL Router + +### Related Search Terms +- Apollo Federation alternative +- Open source GraphQL federation +- Go GraphQL router +- High performance GraphQL gateway diff --git a/capabilities/router/query-batching.md 
b/capabilities/router/query-batching.md new file mode 100644 index 00000000..1b3697cf --- /dev/null +++ b/capabilities/router/query-batching.md @@ -0,0 +1,164 @@ +# Query Batching + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-router-003` | +| **Category** | Router | +| **Status** | GA | +| **Availability** | Free | +| **Related Capabilities** | `cap-router-001`, `cap-router-002` | + +--- + +## Quick Reference + +### Name +Query Batching + +### Tagline +Execute multiple GraphQL operations in a single HTTP request. + +### Elevator Pitch +Query batching allows clients to send multiple GraphQL operations in a single HTTP request, with the router processing them concurrently while maintaining response order. This capability supports legacy batch-based clients while providing configurable limits and full observability through dedicated tracing attributes. + +--- + +## Problem & Solution + +### The Problem +Some client applications and legacy systems are designed to batch multiple GraphQL operations into single HTTP requests for efficiency. When migrating to a new federation gateway, these existing patterns need to continue working. Teams need a way to support batched requests while maintaining control over resource consumption and gaining visibility into batch execution. + +### The Solution +Cosmo Router supports processing batched GraphQL requests where multiple operations are sent as a JSON array. Each operation is processed concurrently with configurable limits on batch size and concurrency. Responses maintain the same order as requests, and comprehensive tracing attributes enable full observability of batch execution. 
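Concretely, a batched request is an ordinary POST whose body is a JSON array of standard GraphQL request objects. A minimal client-side sketch of building such a payload (the endpoint URL and the operations themselves are illustrative, not taken from a real graph):

```python
import json

# Illustrative endpoint; any router with batching enabled accepts the batch
# at its regular GraphQL endpoint.
ROUTER_URL = "http://localhost:3002/graphql"

# A batch is simply a JSON array of standard GraphQL request objects,
# each with "query" and optional "variables"/"operationName" keys.
batch = [
    {"query": "query Products { products { id name } }"},
    {
        "query": "query Reviews($id: ID!) { reviews(productId: $id) { body } }",
        "variables": {"id": "1"},
    },
]

# POST this payload with Content-Type: application/json; the response body
# is a JSON array whose entries correspond positionally to the requests.
payload = json.dumps(batch)
```

The response array always matches the request order, so `response[1]` holds the result of the `Reviews` operation regardless of which operation finished first.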
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Migration blockers from batch-dependent clients | Seamless support for existing batch patterns | +| No control over batch resource consumption | Configurable max entries and concurrency limits | +| Limited visibility into batch execution | Dedicated tracing attributes for batch operations | +| Serial batch processing | Concurrent processing with configurable parallelism | + +--- + +## Key Benefits + +1. **Migration Compatibility**: Support existing clients that rely on batched requests without requiring client-side changes +2. **Configurable Concurrency**: Control how many operations execute in parallel to manage resource usage +3. **Batch Size Limits**: Prevent resource exhaustion by limiting maximum operations per batch +4. **Response Order Preservation**: Responses always match the request order regardless of completion time +5. **Full Observability**: Dedicated tracing attributes identify batched requests and individual operation indices + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / Platform Engineer +- **Pain Points**: Need to support legacy clients using batch requests; require control over batch behavior; need visibility into batch execution +- **Goals**: Enable batch-based clients to work without modification; maintain system stability; understand batch performance + +### Secondary Personas +- DevOps engineers monitoring system resource usage +- Client developers maintaining batch-based applications + +--- + +## Use Cases + +### Use Case 1: Legacy Client Migration +**Scenario**: A mobile application sends batched GraphQL requests and the team is migrating to Cosmo Router +**How it works**: Enable batching in the router configuration with appropriate limits; deploy the router; existing batch requests continue working without client changes +**Outcome**: Seamless migration with no client-side modifications required + +### Use Case 2: 
Resource-Controlled Batch Processing +**Scenario**: A team needs to support batch requests but wants to prevent any single batch from consuming excessive resources +**How it works**: Configure `max_entries_per_batch: 50` and `max_concurrency: 5` to limit batch size and parallel execution +**Outcome**: Batch support with predictable resource consumption and protection against oversized requests + +### Use Case 3: Batch Performance Analysis +**Scenario**: Operations team needs to understand batch request patterns and identify optimization opportunities +**How it works**: Use the dedicated batch tracing attributes (`wg.operation.batching.is_batched`, `wg.operation.batching.operations_count`, `wg.operation.batching.operation_index`) to analyze batch behavior in your observability platform +**Outcome**: Data-driven insights into batch patterns enabling targeted optimization + +--- + +## Technical Summary + +### How It Works +When batch mode is enabled, clients send POST requests with a JSON array of operations. The router processes operations concurrently (up to `max_concurrency` at a time) and assembles responses in the original request order. Each operation is independently planned and executed, with errors isolated to individual array positions. 
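The ordering and error-isolation semantics described above can be modeled in a few lines. This is a simplified illustration of the behavior, not the router's actual Go implementation:

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(operations, execute, max_concurrency=10):
    """Run operations concurrently, returning results in request order
    and isolating failures to their own array position."""

    def run_one(op):
        try:
            return execute(op)
        except Exception as exc:
            # An error replaces only this entry; siblings are unaffected.
            return {"errors": [{"message": str(exc)}]}

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() yields results in input order regardless of completion order,
        # which mirrors the response-order guarantee.
        return list(pool.map(run_one, operations))
```

A failing operation yields an `errors` entry at its index while the other positions still carry their `data`.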
+ +### Key Technical Features +- Concurrent operation processing with configurable parallelism +- Response order guaranteed to match request order +- Independent error handling per operation +- Dedicated tracing attributes for observability +- Rate limiting applies per operation within batches +- Feature flags apply uniformly across batch operations + +### Configuration Options +```yaml +batching: + enabled: true + max_concurrency: 10 + max_entries_per_batch: 100 + omit_extensions: false +``` + +### Integration Points +- OpenTelemetry tracing with batch-specific attributes +- Rate limiting integration (per-operation limits) +- Feature flags (uniform across batch) + +### Requirements & Prerequisites +- Batching disabled by default (must explicitly enable) +- Subscriptions not supported within batches +- HTTP/2 multiplexing recommended as alternative for new implementations + +--- + +## Competitive Positioning + +### Key Differentiators +1. Configurable concurrency for fine-grained resource control +2. Comprehensive batch-specific tracing attributes +3. Clear guidance on when to use batching vs. HTTP/2 multiplexing + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| Why is batching disabled by default? | Cosmo recommends HTTP/2 multiplexing for new implementations due to better load balancing. Batching is provided for compatibility with existing clients. | +| How does batching affect rate limiting? | Rate limits apply per operation, so a batch of 20 operations counts as 20 towards rate limits, preventing circumvention through batching. 
| + +--- + +## Documentation References + +- Primary docs: `/docs/router/query-batching` +- Tracing reference: `/docs/router/open-telemetry` +- Rate limiting: `/docs/router/security/hardening-guide` +- Feature flags: `/docs/concepts/feature-flags` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL Query Batching +- GraphQL Batch Requests +- Multiple Operations Single Request + +### Secondary Keywords +- GraphQL Request Batching +- Batch GraphQL Queries +- GraphQL Concurrent Operations + +### Related Search Terms +- Send multiple GraphQL queries at once +- GraphQL batch endpoint +- Optimize GraphQL network requests +- GraphQL request aggregation diff --git a/capabilities/router/query-planning.md b/capabilities/router/query-planning.md new file mode 100644 index 00000000..80681faf --- /dev/null +++ b/capabilities/router/query-planning.md @@ -0,0 +1,140 @@ +# Query Planning + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-router-002` | +| **Category** | Router | +| **Status** | GA | +| **Availability** | Free | +| **Related Capabilities** | `cap-router-001`, `cap-router-003` | + +--- + +## Quick Reference + +### Name +Query Planning + +### Tagline +Intelligent query plan generation for optimal federated execution. + +### Elevator Pitch +Cosmo Router's query planning engine intelligently breaks down GraphQL operations into optimized execution plans across your federated subgraphs. Visualize query plans in the Studio Playground, batch generate plans for testing, and gain deep insights into how your federated queries are resolved. + +--- + +## Problem & Solution + +### The Problem +In federated GraphQL architectures, a single client query may need to fetch data from multiple subgraphs in a specific order while respecting entity relationships. Without proper query planning, this leads to inefficient execution patterns, excessive network calls, and unpredictable performance. 
Teams lack visibility into how queries are decomposed and executed across their distributed services. + +### The Solution +The Cosmo Router creates highly-optimized query plans that determine the most efficient execution strategy for each operation. Plans are cached across requests to maximize performance. Developers can inspect query plans directly in the Studio Playground using special headers, and batch generate plans offline for testing and CI/CD validation. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Black-box query execution | Transparent query plan visualization | +| Guessing at query efficiency | Clear execution path with timing data | +| Manual testing of query plans | Automated batch plan generation for CI/CD | +| No insight into resolver paths | Visual breakdown of subgraph calls | + +--- + +## Key Benefits + +1. **Execution Transparency**: View the complete query plan showing how operations are decomposed and which subgraphs handle each part +2. **Performance Optimization**: Understand query execution paths to identify optimization opportunities and reduce latency +3. **Plan Caching**: Query plans are cached across requests, eliminating redundant planning overhead +4. **Batch Generation**: Generate query plans for entire operation libraries offline for validation and testing +5. 
**CI/CD Integration**: Fail builds if operations become unplannable after schema changes + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / API Developer +- **Pain Points**: Difficulty understanding how federated queries resolve; need to validate queries before deployment; troubleshooting slow operations +- **Goals**: Gain visibility into query execution; ensure operations are optimal; catch planning issues before production + +### Secondary Personas +- Platform engineers optimizing federation performance +- QA engineers validating schema changes +- Technical architects designing federated schemas + +--- + +## Use Cases + +### Use Case 1: Debugging Slow Queries +**Scenario**: A GraphQL operation is performing slower than expected, and the team needs to understand why +**How it works**: Add the `X-WG-Include-Query-Plan` header to the request; the query plan is returned in the response extensions field showing the complete execution path and subgraph calls +**Outcome**: Clear visibility into which subgraphs are called and in what order, enabling targeted optimization + +### Use Case 2: CI/CD Query Validation +**Scenario**: A team wants to ensure all production queries remain valid and plannable after schema changes +**How it works**: Use the `router query-plan` CLI command to batch generate query plans for all operations; configure `-fail-on-error` to fail the CI build if any query cannot be planned +**Outcome**: Automated prevention of breaking changes reaching production + +### Use Case 3: Schema Change Impact Analysis +**Scenario**: Before merging a schema change, the team wants to understand how it affects query execution +**How it works**: Generate query plans using both the current and proposed execution configs; compare the plans and timing reports to identify changes in execution patterns +**Outcome**: Data-driven decision making for schema evolution with clear understanding of execution impact + +--- + +## Technical Summary 
+ +### How It Works +When a GraphQL operation arrives, the router's query planner analyzes the federated schema and generates an execution plan. This plan specifies which subgraphs to query, in what order, and how to combine the results. Plans are cached using the operation hash, so subsequent identical operations skip the planning phase entirely. + +### Key Technical Features +- Request the query plan via `X-WG-Include-Query-Plan` header +- Skip subgraph execution with `X-WG-Skip-Loader` for plan-only inspection +- Disable tracing for plan requests with `X-WG-Disable-Tracing` +- Batch plan generation with configurable concurrency +- JSON or text output format for query plans +- Detailed timing metrics (parse, normalize, validate, plan times) + +### Integration Points +- Cosmo Studio Playground for visual inspection +- CI/CD pipelines via CLI command +- Execution config from CDN or local file +- Report generation in JSON format + +### Requirements & Prerequisites +- Router version 0.185.0 or later for batch generation +- Execution configuration file (from CDN or `wgc router compose`) +- Operations folder for batch generation + +--- + +## Documentation References + +- Primary docs: `/docs/router/query-plan` +- Batch generation: `/docs/router/query-plan/batch-generate-query-plans` +- Local development tutorial: `/docs/tutorial/mastering-local-development-for-graphql-federation` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL Query Planning +- Federation Query Plan +- Query Plan Visualization + +### Secondary Keywords +- GraphQL Query Optimization +- Federated Query Execution +- Query Plan Generation + +### Related Search Terms +- How GraphQL federation works +- Optimize federated GraphQL queries +- GraphQL query debugging +- Federation execution plan diff --git a/capabilities/router/router-configuration.md b/capabilities/router/router-configuration.md new file mode 100644 index 00000000..e3af347d --- /dev/null +++ 
b/capabilities/router/router-configuration.md @@ -0,0 +1,147 @@ +# Router Configuration + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-router-004` | +| **Category** | Router | +| **Status** | GA | +| **Availability** | Free | +| **Related Capabilities** | `cap-router-001`, `cap-router-005`, `cap-router-006` | + +--- + +## Quick Reference + +### Name +Router Configuration + +### Tagline +YAML-based configuration with environment variable expansion and JSON schema validation. + +### Elevator Pitch +Cosmo Router offers a flexible, developer-friendly configuration system using YAML files with environment variable expansion. JSON schema validation provides IDE auto-completion, documentation, and error detection. Multiple configuration files can be merged for environment-specific overrides, eliminating the need for complex templating solutions. + +--- + +## Problem & Solution + +### The Problem +Configuring API gateways often involves complex configuration files with no validation, leading to deployment failures from typos or invalid values. Managing configuration across environments (dev, staging, production) requires either duplicating entire files or using external templating tools. Storing sensitive values like API tokens in configuration files poses security risks. + +### The Solution +Cosmo Router uses YAML configuration with JSON schema validation that integrates with popular IDEs for auto-completion and inline documentation. Environment variable expansion allows secure handling of secrets. Multiple configuration files can be merged with clear precedence rules, enabling clean separation of base settings and environment-specific overrides. 
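The two mechanisms can be sketched as follows. This is a simplified model for illustration, not the router's implementation: list values being replaced entirely rather than merged follows the router's documented behavior, while deep-merging nested maps is an assumption of this sketch.

```python
import os
import re


def expand_env(value):
    """Replace ${VAR_NAME} references with environment variable values."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)


def merge(base, override):
    """Later config wins; nested maps merge (assumed), lists are replaced entirely."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result
```

With `GRAPH_API_TOKEN` set in the environment, `expand_env('token: "${GRAPH_API_TOKEN}"')` yields the YAML line with the secret injected, and `merge(base, prod_override)` models how a later file in `CONFIG_PATH` overrides an earlier one.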
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Configuration errors discovered at runtime | JSON schema validation catches errors before deployment | +| No IDE support for configuration | Auto-completion, documentation, and deprecation warnings | +| Complex templating for environment differences | Native configuration file merging with clear precedence | +| Secrets stored in configuration files | Environment variable expansion for sensitive values | + +--- + +## Key Benefits + +1. **IDE Integration**: JSON schema provides auto-completion, inline documentation, deprecation notices, and error detection in VSCode and JetBrains IDEs +2. **Environment Variable Expansion**: Reference environment variables directly in YAML using `${VAR_NAME}` syntax to keep secrets out of files +3. **Configuration Merging**: Load multiple configuration files with predictable merge behavior for environment-specific overrides +4. **Validation Before Deployment**: JSON schema catches typos, invalid values, and deprecated options before the router starts +5. 
**Clear Precedence Rules**: Well-defined ordering of environment variables, base config, and override files eliminates configuration confusion + +--- + +## Target Audience + +### Primary Persona +- **Role**: DevOps Engineer / Platform Engineer +- **Pain Points**: Complex configuration management across environments; configuration errors causing deployment failures; managing secrets securely +- **Goals**: Streamlined configuration workflow; catch errors early; maintain separate environment configurations cleanly + +### Secondary Personas +- Backend developers setting up local development +- SREs managing production router deployments +- Security engineers reviewing secret management practices + +--- + +## Use Cases + +### Use Case 1: Multi-Environment Configuration +**Scenario**: A team needs different router configurations for development, staging, and production environments +**How it works**: Create a `base.config.yaml` with shared settings, then environment-specific files (`dev.config.yaml`, `prod.config.yaml`) with overrides. Set `CONFIG_PATH=base.config.yaml,prod/prod.config.yaml` per environment. +**Outcome**: Clean separation of shared and environment-specific configuration with no duplication + +### Use Case 2: Secure Secret Management +**Scenario**: Router configuration requires API tokens and credentials that should not be stored in files +**How it works**: Use environment variable expansion in YAML: `token: "${GRAPH_API_TOKEN}"`. Secrets are injected at runtime from secure secret management systems. +**Outcome**: Configuration files can be safely committed to version control; secrets managed through secure infrastructure + +### Use Case 3: IDE-Assisted Configuration +**Scenario**: A developer is configuring the router and needs to discover available options and correct syntax +**How it works**: Add the JSON schema reference to the config file header. 
The IDE provides auto-completion, shows documentation on hover, and highlights invalid values or deprecated options. +**Outcome**: Faster, error-free configuration with built-in documentation at the developer's fingertips + +--- + +## Technical Summary + +### How It Works +The router reads configuration from a YAML file (default: `config.yaml` in the working directory) or path specified via `CONFIG_PATH`. Multiple files can be specified as comma-separated paths and are merged in order, with later files taking precedence. Environment variables are expanded before validation, and the final configuration is validated against the JSON schema. + +### Key Technical Features +- YAML-based configuration with JSON schema validation +- Environment variable expansion using `${VAR_NAME}` syntax +- Multiple configuration file merging with predictable precedence +- IDE auto-completion and documentation via JSON schema +- Override environment file support via `OVERRIDE_ENV` +- Go duration syntax for intervals and timeouts (e.g., `10s`, `5m`, `1h`) + +### Configuration Merging Rules +- Environment variables are loaded first (lowest precedence) +- YAML configurations override environment variables +- Later files in `CONFIG_PATH` override earlier files +- List values are replaced entirely, not merged +- Empty values in YAML files override non-empty defaults + +### Integration Points +- VSCode with YAML extension +- JetBrains IDEs (built-in support) +- Secret management systems (Vault, AWS Secrets Manager, etc.) 
+- CI/CD pipelines for configuration validation + +### Requirements & Prerequisites +- YAML configuration file(s) +- Optional: IDE with JSON schema support +- Optional: Secret management system for environment variables + +--- + +## Documentation References + +- Primary docs: `/docs/router/configuration` +- Configuration design: `/docs/router/configuration/config-design` +- Hot reload: `/docs/router/deployment/config-hot-reload` + +--- + +## Keywords & SEO + +### Primary Keywords +- Router Configuration +- YAML Configuration +- GraphQL Router Config + +### Secondary Keywords +- Environment Variable Expansion +- Configuration Validation +- JSON Schema Configuration + +### Related Search Terms +- GraphQL gateway configuration +- API router setup +- Federation router settings +- Configuration file best practices diff --git a/capabilities/security/authorization-directives.md b/capabilities/security/authorization-directives.md new file mode 100644 index 00000000..fd6e6d42 --- /dev/null +++ b/capabilities/security/authorization-directives.md @@ -0,0 +1,159 @@ +# Authorization Directives + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-sec-002` | +| **Category** | Security | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-sec-001` (JWT Authentication) | + +--- + +## Quick Reference + +### Name +Authorization Directives + +### Tagline +Field-level access control with @authenticated and @requiresScopes. + +### Elevator Pitch +Cosmo's authorization directives provide declarative, schema-driven access control for your federated GraphQL API. Using @authenticated and @requiresScopes directives, you can enforce authentication and fine-grained permission requirements directly in your schema, ensuring that sensitive data is protected at the field level without writing custom authorization logic. 
+ +--- + +## Problem & Solution + +### The Problem +Traditional authorization approaches require implementing access control logic in resolvers, leading to scattered security code, inconsistent enforcement, and maintenance challenges. In federated architectures, this problem multiplies as each subgraph must independently implement authorization, risking security gaps and policy drift. Teams struggle to understand what data requires authentication and what permissions are needed to access specific fields. + +### The Solution +Cosmo's authorization directives embed access control requirements directly in the GraphQL schema using @authenticated and @requiresScopes. The router evaluates these requirements before executing queries, returning clear authorization errors when requirements are not met. This approach provides a single source of truth for authorization policies, automatic enforcement across the federated graph, and clear documentation of access requirements visible in the schema itself. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Authorization logic scattered across resolvers | Declarative directives in the schema | +| Inconsistent enforcement across subgraphs | Automatic propagation to federated graph | +| No visibility into access requirements | Schema documents authorization requirements | +| Custom code for every protected field | Zero-code authorization with directives | + +--- + +## Key Benefits + +1. **Schema-Driven Security**: Authorization requirements are visible directly in your GraphQL schema, serving as documentation and enforcement simultaneously. + +2. **Automatic Federation**: Directives declared in any subgraph automatically propagate to the federated graph, ensuring consistent enforcement regardless of which subgraph resolves a field. + +3. **Granular Control**: Apply directives at the field, object, interface, enum, or scalar level, with automatic propagation to all relevant field definitions. 
+ +4. **Flexible Scope Logic**: @requiresScopes supports complex AND/OR logic for permissions, enabling sophisticated access control policies like "(admin AND write) OR superuser". + +5. **Graceful Degradation**: Nullable fields return partial data with authorization errors, while non-nullable fields fail the entire query, providing predictable behavior. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / API Developer +- **Pain Points**: Implementing consistent authorization across services; documenting access requirements; preventing unauthorized data access +- **Goals**: Declarative security policies; reduced boilerplate code; clear authorization documentation + +### Secondary Personas +- Security engineers auditing API access controls +- Architects designing secure data access patterns +- Compliance officers reviewing data protection measures + +--- + +## Use Cases + +### Use Case 1: Protecting Sensitive User Data +**Scenario**: A user profile API must ensure that personal information (email, phone, address) is only accessible to authenticated users, while public information (username, avatar) is available to everyone. + +**How it works**: Apply @authenticated to sensitive fields in the User type. Unauthenticated requests receive partial data with public fields, while authenticated requests receive the complete profile. + +**Outcome**: Sensitive data is automatically protected without resolver modifications, and the schema clearly documents which fields require authentication. + +### Use Case 2: Role-Based Access Control +**Scenario**: An HR system has employee records with different access levels: basic info is readable by all employees, salary data requires HR permissions, and performance reviews require management access. 
+ +**How it works**: Use @requiresScopes with different scope requirements: `@requiresScopes(scopes: [["hr:read"]])` for salary fields and `@requiresScopes(scopes: [["management:read"], ["hr:admin"]])` for performance reviews (OR logic: managers OR HR admins). + +**Outcome**: Fine-grained access control based on user roles, with clear error messages indicating required permissions when access is denied. + +### Use Case 3: Multi-Subgraph Authorization Consistency +**Scenario**: A product catalog spans multiple subgraphs (inventory, pricing, reviews), and pricing data should only be visible to authenticated B2B customers. + +**How it works**: Declare @authenticated on the price field in the pricing subgraph. The directive automatically propagates to the federated schema, enforcing authentication regardless of query path or subgraph resolution. + +**Outcome**: Consistent authorization enforcement across the entire federated graph, with subgraph teams maintaining local control over their authorization requirements. + +--- + +## Technical Summary + +### How It Works +Authorization directives are evaluated at the router level before query execution. When a request includes fields with @authenticated or @requiresScopes directives, the router checks the authenticated claims from the JWT token. For @authenticated, any valid authentication satisfies the requirement. For @requiresScopes, the router evaluates the scope requirements using the specified AND/OR logic against the token's scope claims. Failed authorization results in error responses with clear messages indicating the requirement and actual permissions. 
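The HR example above can be written directly in subgraph SDL. The type, field, and scope names below are illustrative, not taken from a real schema; the directive signatures (`@authenticated`, `@requiresScopes(scopes: [[...]])`) follow the federation specification, where inner arrays are ANDed and outer arrays are ORed:

```graphql
type Employee @key(fields: "id") {
  id: ID!
  name: String!                 # public — no directive
  email: String! @authenticated # any valid JWT suffices
  # Requires the "hr:read" scope
  salary: Float @requiresScopes(scopes: [["hr:read"]])
  # OR logic across outer arrays: "management:read" OR "hr:admin"
  review: String @requiresScopes(scopes: [["management:read"], ["hr:admin"]])
}
```

Because `salary` and `review` are nullable, an under-privileged request still receives the rest of the object with those fields set to `null` plus an authorization error, matching the graceful-degradation behavior described above.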
+ +### Key Technical Features +- @authenticated directive for simple authentication checks +- @requiresScopes with nested array syntax for AND/OR logic +- Automatic directive propagation from types to fields +- Interface directive propagation to implementing types +- Cross-subgraph scope combination via matrix multiplication +- Partial data support for nullable fields +- Clear error messages with required vs. actual scopes + +### Integration Points +- JWT Authentication for token validation and claim extraction +- Federation composition for directive propagation +- Custom modules for accessing authentication context +- GraphQL introspection for viewing authorization requirements + +### Requirements & Prerequisites +- Router version 0.60.0 or higher +- Control plane version 0.58.0 or higher +- wgc CLI version 0.39.0 or higher +- JWT Authentication configured for the router + +--- + +## Documentation References + +- @authenticated directive: `/docs/federation/directives/authenticated` +- @requiresScopes directive: `/docs/federation/directives/requiresscopes` +- Authentication setup: `/docs/router/authentication-and-authorization` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL authorization +- Field-level access control +- GraphQL directives security + +### Secondary Keywords +- @authenticated directive +- @requiresScopes directive +- Role-based access control GraphQL + +### Related Search Terms +- GraphQL field authorization +- Federated GraphQL security +- Declarative authorization GraphQL + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/security/config-signing.md b/capabilities/security/config-signing.md new file mode 100644 index 00000000..7194ed3a --- /dev/null +++ b/capabilities/security/config-signing.md @@ -0,0 +1,164 @@ +# Config Signing + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | 
`cap-sec-004` | +| **Category** | Security | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-sec-005` (Security Hardening) | + +--- + +## Quick Reference + +### Name +Config Signing + +### Tagline +HMAC-SHA256 signature verification for tamper-proof configurations. + +### Elevator Pitch +Cosmo's Config Validation & Signing feature protects your router configuration from tampering attacks using cryptographic signatures. An admission webhook validates every composition before deployment, generating an HMAC-SHA256 signature that the router verifies before applying any configuration changes. This prevents attackers from redirecting your traffic to unauthorized servers. + +--- + +## Problem & Solution + +### The Problem +Router configurations contain critical information, including subgraph URLs and execution rules. Whether fetched from a CDN or deployed as files, these configurations could be tampered with by attackers who might modify subgraph URLs to redirect traffic to malicious servers, enabling data theft or service disruption. Without verification, routers have no way to detect if configurations have been modified in transit or at rest. + +### The Solution +Cosmo's Config Signing implements a cryptographic chain of trust. When a composition occurs, the control plane calls your admission webhook with a private URL to fetch the configuration. Your webhook validates the configuration (checking subgraph URLs, policies, etc.) and returns an HMAC-SHA256 signature. This signature is stored with the configuration artifact. When the router fetches or loads a configuration, it independently recomputes the HMAC using the shared signing key and compares it to the stored signature, rejecting any configuration that doesn't match.
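The sign-then-verify chain can be sketched with `openssl`. The signing key and config payload below are placeholders for illustration, not a real Cosmo artifact:

```shell
# Hypothetical signing key and config payload — illustration only
SIGNING_KEY="example-signing-key"
CONFIG='{"subgraphs":[{"url":"https://products.example.com/graphql"}]}'

# Webhook side: BASE64-encoded HMAC-SHA256 over the raw config bytes
SIGNATURE=$(printf '%s' "$CONFIG" | openssl dgst -sha256 -hmac "$SIGNING_KEY" -binary | base64)

# Router side: recompute independently with the shared key and compare
CHECK=$(printf '%s' "$CONFIG" | openssl dgst -sha256 -hmac "$SIGNING_KEY" -binary | base64)
[ "$SIGNATURE" = "$CHECK" ] && echo "config accepted" || echo "config rejected"
```

Changing a single byte of the payload (for example, a rewritten subgraph URL) yields a different signature, so the comparison fails and the configuration is rejected.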
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Configurations trusted implicitly | Cryptographic verification of every configuration | +| No protection against tampering | HMAC-SHA256 signature validation | +| Subgraph URLs could be modified | Admission webhook validates all URLs | +| Silent acceptance of modified configs | Router rejects tampered configurations | + +--- + +## Key Benefits + +1. **Tamper Detection**: Cryptographic signatures detect any modification to the configuration, whether in transit or at rest. + +2. **Custom Validation**: Your admission webhook can implement custom validation rules, such as ensuring all subgraph URLs belong to your organization's domain. + +3. **Defense in Depth**: Combines signature verification with your own business logic validation for comprehensive protection. + +4. **Seamless Integration**: Works with both CDN polling and file-based configuration deployment. + +5. **Audit Trail**: Failed validations are visible in the Studio, providing visibility into configuration issues. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Security Engineer / Platform Engineer +- **Pain Points**: Ensuring configuration integrity; preventing supply chain attacks; maintaining audit trails for compliance +- **Goals**: Cryptographic verification of all configurations; custom validation policies; tamper-evident deployments + +### Secondary Personas +- DevOps engineers managing router deployments +- Compliance officers requiring configuration integrity verification +- Architects designing secure deployment pipelines + +--- + +## Use Cases + +### Use Case 1: Preventing Subgraph URL Manipulation +**Scenario**: An attacker gains access to the CDN or file system and attempts to modify subgraph URLs to redirect traffic to a malicious server for data exfiltration. 
+ +**How it works**: The admission webhook validates that all subgraph URLs belong to the organization's domain (e.g., *.wundergraph.com) and signs the configuration. The router verifies this signature before applying the configuration. + +**Outcome**: Modified configurations are rejected by the router, which continues operating with the last known good configuration or refuses to start if no valid configuration exists. + +### Use Case 2: Secure CI/CD Pipeline Deployment +**Scenario**: A team deploys router configurations via their CI/CD pipeline using `wgc router fetch` and needs to ensure configurations aren't modified during the deployment process. + +**How it works**: The `wgc router fetch` command includes the `--graph-sign-key` parameter to fetch signed configurations. The router validates signatures before applying file-based configurations. + +**Outcome**: End-to-end configuration integrity from composition through deployment, with cryptographic verification at every step. + +### Use Case 3: Webhook Request Authentication +**Scenario**: An organization wants to ensure that webhook calls to their admission server genuinely originate from Cosmo's control plane. + +**How it works**: Configure `--admission-webhook-secret` when creating the federated graph. The control plane includes an HMAC signature in the `X-Cosmo-Signature-256` header, which the admission server verifies before processing. + +**Outcome**: Bidirectional authentication ensuring both the webhook request origin and the configuration integrity are verified. + +--- + +## Technical Summary + +### How It Works +1. When a composition occurs, the control plane calls your admission webhook's `/validate-config` endpoint with `federatedGraphId`, `organizationId`, and a short-lived `privateConfigUrl`. +2. Your webhook fetches the configuration from the private URL, validates it according to your policies, and calculates an HMAC-SHA256 hash using your signing key. +3. 
The webhook returns the BASE64-encoded signature (or an error to block deployment). +4. The control plane stores the signature with the configuration artifact. +5. When the router loads a configuration, it calculates the hash using the same signing key and compares it to the stored signature. +6. If signatures match, the configuration is applied; otherwise, it's rejected. + +### Key Technical Features +- HMAC-SHA256 signature algorithm (industry standard) +- Short-lived private URLs for configuration retrieval (5-minute expiry) +- Admission webhook for custom validation logic +- Webhook request signing via X-Cosmo-Signature-256 header +- Support for both CDN and file-based configuration +- Configuration rejection without service disruption + +### Integration Points +- Admission webhook server (Cloudflare Workers, Fastly, Deno, Bun, Node.js) +- CI/CD pipelines via wgc CLI +- CDN for configuration distribution +- Cosmo Studio for deployment status visibility + +### Requirements & Prerequisites +- Router version 0.74.0 or higher +- Publicly accessible admission webhook server (HTTPS required) +- Signing key shared between admission server and router +- `@wundergraph/cosmo-shared` npm package for configuration parsing + +--- + +## Documentation References + +- Primary docs: `/docs/router/security/config-validation-and-signing` +- CLI reference: `/docs/cli/router/fetch` +- Example implementation: https://github.com/wundergraph/cosmo/tree/main/admission-server +- Hardening guide: `/docs/router/security/hardening-guide` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL configuration signing +- HMAC signature verification +- Tamper-proof configuration + +### Secondary Keywords +- Admission webhook +- Configuration integrity +- Supply chain security GraphQL + +### Related Search Terms +- GraphQL security best practices +- Prevent configuration tampering +- Signed configuration deployment + +--- + +## Version History + +| Date | Version | Changes | 
+|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/security/introspection-control.md b/capabilities/security/introspection-control.md new file mode 100644 index 00000000..00e8cc34 --- /dev/null +++ b/capabilities/security/introspection-control.md @@ -0,0 +1,178 @@ +# Introspection Control + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-sec-006` | +| **Category** | Security | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-sec-005` (Security Hardening), `cap-sec-001` (JWT Authentication) | + +--- + +## Quick Reference + +### Name +Introspection Control + +### Tagline +Disable schema introspection in production environments. + +### Elevator Pitch +Cosmo Router allows you to control GraphQL introspection queries, a powerful feature that exposes your entire API schema. While essential for development and tooling, introspection in production reveals your API structure to potential attackers. Disable introspection to hide your schema or selectively bypass authentication for introspection in secure internal environments. + +--- + +## Problem & Solution + +### The Problem +GraphQL introspection allows clients to query the complete schema, including all types, fields, queries, mutations, and their descriptions. While invaluable during development, this feature in production environments provides attackers with a complete map of your API, revealing potential attack vectors, sensitive data types, and internal naming conventions. Organizations need control over when and how introspection is available. + +### The Solution +Cosmo Router provides granular control over introspection behavior. In production, disable introspection entirely to hide your schema from potential attackers. 
For internal environments, selectively bypass authentication for introspection queries to support tooling while maintaining protection for regular queries. An optional introspection secret adds an additional layer of security. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Schema exposed to all clients | Introspection disabled in production | +| No control over schema visibility | Granular introspection settings | +| All-or-nothing authentication | Optional introspection authentication bypass | +| No introspection-specific secrets | Dedicated secret for secure introspection | + +--- + +## Key Benefits + +1. **Schema Protection**: Hide your complete API schema from unauthorized users and potential attackers. + +2. **Flexible Authentication Bypass**: Allow introspection queries without authentication for internal tooling while requiring authentication for all other queries. + +3. **Dedicated Introspection Secret**: Optionally protect introspection access with a separate secret, independent of your authentication system. + +4. **Development-Production Parity**: Maintain similar configurations between environments with just the introspection toggle changed. + +5. **Tooling Compatibility**: Support schema-aware tools in controlled environments without exposing the schema publicly. 
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / DevOps Engineer +- **Pain Points**: Balancing tooling needs with security; preventing schema exposure; managing different requirements across environments +- **Goals**: Secure production deployments; functional development environments; controlled schema access + +### Secondary Personas +- Security engineers defining API exposure policies +- Backend developers using GraphQL tooling +- Architects designing secure API gateways + +--- + +## Use Cases + +### Use Case 1: Secure Production Deployment +**Scenario**: An organization is deploying their GraphQL API to production and needs to prevent schema discovery by potential attackers. + +**How it works**: Set `introspection.enabled: false` in the production router configuration. All introspection queries return an error, hiding the complete API structure. + +**Outcome**: The API schema is hidden from clients, eliminating schema discovery as an attack vector. + +### Use Case 2: Internal Tooling with Authentication +**Scenario**: A team uses GraphQL tooling (Postman, GraphQL Playground, custom IDEs) in their secure internal network but has authentication enabled on the router. They want tools to work without requiring token setup. + +**How it works**: Enable `authentication.ignore_introspection: true` to bypass JWT validation for introspection queries only. Regular queries still require valid authentication tokens. + +**Outcome**: Tooling works seamlessly in internal environments while all data queries remain protected by authentication. + +### Use Case 3: Secret-Protected Introspection +**Scenario**: A team needs introspection available for specific tooling but wants an additional security layer beyond network access. + +**How it works**: Configure `introspection.secret` with a dedicated value. Tools must include this secret in the Authorization header (without Bearer prefix) to perform introspection. 
+ +**Outcome**: Introspection is available only to tools configured with the secret, adding defense in depth. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router intercepts all incoming GraphQL queries and checks if they are introspection queries (queries on `__schema` or `__type`). Based on configuration, the router either allows, blocks, or applies special authentication handling to these queries. When introspection is disabled, the router returns an error without executing the query. When authentication bypass is enabled for introspection, the router skips JWT validation for introspection queries while still requiring it for all other operations. + +### Key Technical Features + +**Disable Introspection** +```yaml +introspection: + enabled: false +``` + +**Bypass Authentication for Introspection** +```yaml +authentication: + ignore_introspection: true + # other auth settings +``` + +**Introspection Secret** +```yaml +introspection: + secret: 'dedicated_secret_for_introspection' +``` + +**Introspection Query Example** +```bash +curl -X POST http://localhost:3002/graphql \ + --header "Content-Type: application/json" \ + --header "Authorization: dedicated_secret_for_introspection" \ + --data '{"query": "{ __schema { types { name } } }"}' +``` + +### Integration Points +- JWT Authentication (for bypass configuration) +- GraphQL tooling (Postman, Apollo Studio, GraphQL Playground) +- CI/CD pipelines for schema validation +- Schema registries + +### Requirements & Prerequisites +- Router configured with authentication for bypass feature +- Introspection secret if using secret-protected access +- Network security for internal tooling scenarios + +--- + +## Documentation References + +- Primary docs: `/docs/router/security/hardening-guide` (Disable Introspection section) +- Authentication configuration: `/docs/router/authentication-and-authorization` +- Router configuration: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords 
+- GraphQL introspection security +- Disable GraphQL introspection +- GraphQL schema protection + +### Secondary Keywords +- GraphQL production security +- Hide GraphQL schema +- Introspection query control + +### Related Search Terms +- Should I disable GraphQL introspection +- GraphQL introspection attack +- Secure GraphQL schema + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/security/jwt-authentication.md b/capabilities/security/jwt-authentication.md new file mode 100644 index 00000000..d6bb8393 --- /dev/null +++ b/capabilities/security/jwt-authentication.md @@ -0,0 +1,159 @@ +# JWT Authentication + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-sec-001` | +| **Category** | Security | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-sec-002` (Authorization Directives) | + +--- + +## Quick Reference + +### Name +JWT Authentication + +### Tagline +Secure your GraphQL API with industry-standard JWT validation. + +### Elevator Pitch +Cosmo Router provides enterprise-grade JWT authentication using JWKS (JSON Web Key Sets), enabling seamless integration with any OAuth 2.0 or OpenID Connect identity provider. Configure multiple authentication providers, support various signing algorithms, and protect your federated graph with zero custom code. + +--- + +## Problem & Solution + +### The Problem +Organizations deploying GraphQL APIs face the challenge of implementing robust authentication that works across their entire federated architecture. Teams often struggle with integrating multiple identity providers, handling token validation at scale, and ensuring consistent security policies across all subgraphs. Without proper authentication, APIs are vulnerable to unauthorized access and data breaches. 
+ +### The Solution +Cosmo Router's JWT Authentication provides a centralized, configuration-driven approach to securing your GraphQL API. By validating JWTs at the router level before requests reach subgraphs, it ensures consistent security enforcement. The router automatically fetches and caches JWKS from your identity providers, validates tokens against configured algorithms, and makes authenticated claims available throughout the request pipeline. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Each subgraph implements its own authentication logic | Single point of authentication at the router | +| Complex integration code for each identity provider | Declarative YAML configuration for multiple providers | +| Inconsistent security policies across services | Uniform authentication enforcement across all subgraphs | +| Manual key rotation handling | Automatic JWKS refresh with configurable intervals | + +--- + +## Key Benefits + +1. **Multi-Provider Support**: Configure multiple JWKS endpoints to support various identity providers simultaneously, with automatic fallback and priority ordering. + +2. **Automatic Key Management**: JWKS keys are automatically fetched and refreshed at configurable intervals, with intelligent on-demand refresh for unknown Key IDs during key rotation. + +3. **Flexible Token Sources**: Extract tokens from multiple header sources with custom prefixes, supporting various authentication schemes beyond standard Bearer tokens. + +4. **Algorithm Whitelisting**: Specify allowed JWT algorithms per JWKS endpoint to prevent algorithm confusion attacks and ensure cryptographic security. + +5. **Symmetric Key Support**: In addition to asymmetric JWKS endpoints, symmetric algorithms (like HS256) are supported through secure secret configuration.
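A multi-provider setup like the one described above might look roughly like this in router YAML. The key names here are an assumption sketched from the behavior described in this section, not authoritative; consult `/docs/router/configuration#authentication` for the exact schema:

```yaml
# Sketch only — key names approximate the shape described above
authentication:
  jwt:
    header_name: Authorization          # where the router looks for the token
    header_value_prefix: Bearer
    jwks:
      - url: "https://idp-a.example.com/.well-known/jwks.json"
        refresh_interval: 1m
        algorithms: ["RS256", "ES256"]  # per-endpoint algorithm whitelist
      - url: "https://idp-b.example.com/.well-known/jwks.json"
        refresh_interval: 1m
```

Providers are tried in order until one validates the token, which is what enables the multi-tenant fallback described in Use Case 1.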
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / Security Engineer +- **Pain Points**: Implementing consistent authentication across distributed services; managing multiple identity providers; ensuring secure key rotation +- **Goals**: Centralized security enforcement; reduced authentication code; compliance with security standards + +### Secondary Personas +- Backend developers building authenticated GraphQL services +- DevOps engineers managing API gateway security +- Architects designing secure federated architectures + +--- + +## Use Cases + +### Use Case 1: Multi-Tenant SaaS Authentication +**Scenario**: A SaaS platform needs to authenticate users from multiple enterprise customers, each with their own identity provider (Okta, Auth0, Azure AD). + +**How it works**: Configure multiple JWKS endpoints in the router configuration, one for each customer's identity provider. The router tries each provider in order until authentication succeeds, extracting claims from the validated token. + +**Outcome**: Seamless multi-tenant authentication without custom code, supporting customer-specific identity providers while maintaining a unified API. + +### Use Case 2: Secure Internal API Gateway +**Scenario**: An organization wants to enforce authentication on all GraphQL requests while allowing introspection queries for internal tooling without tokens. + +**How it works**: Enable JWT authentication with `ignore_introspection: true` and optionally set an introspection secret. Regular queries require valid JWTs, while introspection queries from trusted internal tools bypass authentication. + +**Outcome**: Strong authentication for production traffic with developer-friendly introspection access for internal tooling. + +### Use Case 3: Zero-Downtime Key Rotation +**Scenario**: The security team rotates JWT signing keys monthly, but tokens signed with old keys remain valid during the transition period. 
+ +**How it works**: Enable `refresh_unknown_kid` with rate limiting. When the router encounters a token with an unknown Key ID, it automatically fetches updated JWKS, supporting seamless key rotation without service interruption. + +**Outcome**: Smooth key rotation with no manual intervention, maintaining security while ensuring zero downtime. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router intercepts incoming GraphQL requests and extracts JWT tokens from configured header sources. It validates tokens against cached JWKS from configured endpoints, checking signatures, algorithms, and optional audience claims. Valid tokens have their claims decoded and made available to the request pipeline, including custom modules and authorization directives. Invalid tokens result in 403 Forbidden responses, while missing tokens (when authentication is not required) allow anonymous access. + +### Key Technical Features +- Multiple JWKS endpoint configuration with per-endpoint algorithm whitelisting +- Configurable token extraction from headers with custom prefixes +- Automatic JWKS refresh with configurable intervals +- On-demand refresh for unknown Key IDs with rate limiting +- Symmetric algorithm support for HS256/HS384/HS512 +- Optional audience validation per JWKS endpoint +- Introspection bypass with optional secret authentication + +### Integration Points +- Any OAuth 2.0 / OpenID Connect identity provider +- Custom modules for accessing authenticated claims +- Authorization directives (@authenticated, @requiresScopes) +- Request tracing and observability + +### Requirements & Prerequisites +- Router version 0.60.0 or higher for authorization directive support +- HTTPS endpoints for JWKS (recommended for security) +- Valid JWT tokens from configured identity providers + +--- + +## Documentation References + +- Primary docs: `/docs/router/authentication-and-authorization` +- Authorization directives: `/docs/federation/directives/authenticated` +- Scopes 
directive: `/docs/federation/directives/requiresscopes` +- Configuration reference: `/docs/router/configuration#authentication` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL JWT authentication +- JWKS authentication +- GraphQL API security + +### Secondary Keywords +- OAuth 2.0 GraphQL +- OpenID Connect GraphQL +- Token validation + +### Related Search Terms +- How to secure GraphQL API +- GraphQL authentication best practices +- JWT validation in GraphQL federation + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/security/security-hardening.md b/capabilities/security/security-hardening.md new file mode 100644 index 00000000..18cfa543 --- /dev/null +++ b/capabilities/security/security-hardening.md @@ -0,0 +1,217 @@ +# Security Hardening + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-sec-005` | +| **Category** | Security | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-sec-003` (TLS/HTTPS), `cap-sec-004` (Config Signing), `cap-sec-006` (Introspection Control), `cap-sec-007` (Subgraph Error Propagation) | + +--- + +## Quick Reference + +### Name +Security Hardening + +### Tagline +Best practices for production-ready GraphQL deployments. + +### Elevator Pitch +Cosmo's Security Hardening Guide provides comprehensive best practices for deploying GraphQL in production environments. From disabling introspection and development mode to enabling rate limiting and TLS, these recommendations help you minimize attack surface, protect sensitive data, and maintain a resilient API infrastructure. + +--- + +## Problem & Solution + +### The Problem +GraphQL APIs deployed with default configurations often expose unnecessary attack vectors. 
Development-friendly features like introspection, verbose error messages, and open CORS policies become security liabilities in production. Teams may not be aware of all the configuration options available to harden their deployments, leaving APIs vulnerable to abuse, data leaks, and denial-of-service attacks. + +### The Solution +Cosmo's Security Hardening Guide provides a systematic checklist of security configurations for production deployments. Each recommendation addresses a specific attack vector with clear configuration examples. By following these guidelines, teams can significantly reduce their attack surface while maintaining the functionality needed for their use cases. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Default configuration with all features enabled | Minimal attack surface with only required features | +| Schema exposed via introspection | Introspection disabled in production | +| Open CORS allowing any origin | Restricted CORS to trusted domains | +| All query types allowed | Unused operations (mutations/subscriptions) disabled | + +--- + +## Key Benefits + +1. **Reduced Attack Surface**: Disable features not needed in production like introspection, file uploads, and development mode. + +2. **Rate Limiting Protection**: GCRA-based rate limiting protects subgraphs from overload and prevents abuse. + +3. **Persistent Operations Security**: Block non-persisted queries to allow only pre-approved operations. + +4. **CORS Hardening**: Restrict allowed origins, methods, and headers to trusted sources. + +5. **TLS and HTTP/2**: Enable encrypted communication with automatic HTTP/2 performance benefits. 
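Taken together, these hardening settings amount to a short router configuration. The following consolidated sketch combines the individual snippets detailed in the Technical Summary of this document; paths, origins, and the Redis URL are placeholders for your own values:

```yaml
# Illustrative consolidated production config (each setting is
# documented individually in the Technical Summary)
log_level: "error"
dev_mode: false
introspection:
  enabled: false
file_upload:
  enabled: false
cors:
  allow_methods: ["POST", "GET"]
  allow_origins: ["mydomain.com"]
  allow_credentials: true
security:
  block_non_persisted_operations:
    enabled: true
rate_limit:
  enabled: true
  storage:
    urls:
      - redis://localhost:6379
  simple_strategy:
    rate: 100
    burst: 200
    period: 1s
tls:
  server:
    enabled: true
    key_file: ../your/key.pem
    cert_file: ../your/cert.pem
```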
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: DevOps Engineer / Platform Engineer +- **Pain Points**: Ensuring production security; understanding all configuration options; balancing security with functionality +- **Goals**: Hardened production deployments; compliance with security policies; reduced incident risk + +### Secondary Personas +- Security engineers conducting deployment reviews +- Architects establishing security baselines +- Operations teams responding to security incidents + +--- + +## Use Cases + +### Use Case 1: Production Deployment Checklist +**Scenario**: A team is preparing to deploy their Cosmo Router to production and needs to ensure all security configurations are properly set. + +**How it works**: Follow the hardening guide checklist: disable introspection, disable dev_mode, enable TLS, configure CORS restrictions, enable rate limiting, consider config signing, and review persistent operations. + +**Outcome**: A production-ready deployment with minimized attack surface and appropriate security controls. + +### Use Case 2: Protecting Against API Abuse +**Scenario**: An API is experiencing abuse from automated requests causing performance degradation. + +**How it works**: Enable Redis-backed rate limiting with GCRA algorithm. Configure appropriate rate, burst, and period values. Consider enabling persistent operations to block arbitrary queries. + +**Outcome**: Abuse is mitigated through rate limiting, and persistent operations ensure only approved queries execute. + +### Use Case 3: Minimizing Data Exposure +**Scenario**: A financial services API must minimize the risk of data exposure through the GraphQL endpoint. + +**How it works**: Disable introspection to hide the schema, enable config signing to prevent tampering, configure restrictive CORS, use TLS for encryption, and set log level to ERROR to reduce logging of sensitive data. 
+ +**Outcome**: Multiple layers of protection reducing the risk of data exposure through various attack vectors. + +--- + +## Technical Summary + +### How It Works +The Security Hardening Guide provides configuration-level controls across multiple security domains. Each setting is applied via the router's YAML configuration file, with sensible defaults that can be overridden for production environments. The router enforces these settings at runtime, rejecting requests that violate configured policies. + +### Key Technical Features + +**Introspection Control** +```yaml +introspection: + enabled: false +``` + +**Development Mode** +```yaml +dev_mode: false +``` + +**File Upload Control** +```yaml +file_upload: + enabled: false +``` + +**Rate Limiting** +```yaml +rate_limit: + enabled: true + storage: + urls: + - redis://localhost:6379 + simple_strategy: + rate: 100 + burst: 200 + period: 1s +``` + +**CORS Restrictions** +```yaml +cors: + allow_methods: ["POST", "GET"] + allow_origins: ["mydomain.com"] + allow_credentials: true +``` + +**Persistent Operations** +```yaml +security: + block_non_persisted_operations: + enabled: true + block_subscriptions: + enabled: true + block_mutations: + enabled: true +``` + +**TLS Configuration** +```yaml +tls: + server: + enabled: true + key_file: ../your/key.pem + cert_file: ../your/cert.pem +``` + +**Log Level** +```yaml +log_level: "error" +``` + +### Integration Points +- Redis for rate limiting storage +- Certificate management for TLS +- Config signing infrastructure +- Persistent operations registry + +### Requirements & Prerequisites +- Redis instance for rate limiting +- TLS certificates for HTTPS +- Admission webhook for config signing +- Persistent operations uploaded via wgc CLI + +--- + +## Documentation References + +- Primary docs: `/docs/router/security/hardening-guide` +- TLS configuration: `/docs/router/security/tls` +- Config signing: `/docs/router/security/config-validation-and-signing` +- Persistent operations: 
`/docs/router/persisted-queries/persisted-operations` +- Rate limiting: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL security hardening +- Production GraphQL security +- GraphQL best practices + +### Secondary Keywords +- GraphQL rate limiting +- Disable GraphQL introspection +- GraphQL CORS configuration + +### Related Search Terms +- Secure GraphQL deployment +- GraphQL security checklist +- Production GraphQL configuration + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/security/subgraph-error-propagation.md b/capabilities/security/subgraph-error-propagation.md new file mode 100644 index 00000000..82e57117 --- /dev/null +++ b/capabilities/security/subgraph-error-propagation.md @@ -0,0 +1,214 @@ +# Subgraph Error Propagation + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-sec-007` | +| **Category** | Security | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-sec-005` (Security Hardening) | + +--- + +## Quick Reference + +### Name +Subgraph Error Propagation + +### Tagline +Control error exposure with sensitive data masking. + +### Elevator Pitch +Cosmo Router provides fine-grained control over how subgraph errors are exposed to clients. Choose between wrapped mode that encapsulates errors generically or pass-through mode that forwards errors directly. Configure exactly which error fields and extensions are visible to prevent leaking sensitive internal information while still providing meaningful error messages for debugging. + +--- + +## Problem & Solution + +### The Problem +Subgraph error messages often contain sensitive internal information: stack traces, database errors, internal service names, infrastructure details, and debugging information. 
Exposing these details to clients creates security risks, leaking information that attackers can use to understand your architecture and identify vulnerabilities. However, completely hiding errors makes debugging difficult and provides a poor user experience. + +### The Solution +Cosmo Router's Subgraph Error Propagation feature provides configurable control over error exposure. Wrapped mode encapsulates subgraph errors in generic messages while preserving details for debugging. Pass-through mode forwards errors directly with configurable field filtering. Both modes support fine-grained control over which extension fields are exposed, allowing you to balance security with debugging needs. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Raw subgraph errors exposed to clients | Configurable error wrapping and filtering | +| Sensitive stack traces leaked | Only approved fields exposed | +| No control over error extensions | Whitelist specific extension fields | +| Internal service names visible | Optional service name attachment | + +--- + +## Key Benefits + +1. **Configurable Error Modes**: Choose wrapped mode for maximum protection or pass-through mode for transparency, each with extensive configuration options. + +2. **Extension Field Filtering**: Whitelist specific extension fields like "code" while blocking sensitive debugging information. + +3. **Service Name Control**: Optionally include or exclude subgraph service names in errors for client-side routing or debugging. + +4. **Location Stripping**: Remove subgraph-specific line/column locations that are irrelevant to clients and may reveal internal structure. + +5. **Status Code Propagation**: Optionally include HTTP status codes for client-side error handling logic.
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Backend Developer / API Developer +- **Pain Points**: Balancing debugging information with security; preventing information leakage; providing useful error messages +- **Goals**: Secure error responses; meaningful client error messages; efficient debugging in production + +### Secondary Personas +- Security engineers reviewing API error exposure +- Frontend developers consuming error responses +- DevOps engineers monitoring production errors + +--- + +## Use Cases + +### Use Case 1: Maximum Security Error Handling +**Scenario**: A financial services API must prevent any internal information from leaking through error messages. + +**How it works**: Configure wrapped mode with `omit_extensions: true`. All subgraph errors are wrapped in generic "Failed to fetch from Subgraph" messages with no additional details exposed to clients. + +**Outcome**: Zero information leakage through error responses, with details available only in server logs. + +### Use Case 2: Structured Error Handling with Codes +**Scenario**: A client application implements error handling logic based on error codes, requiring the "code" extension field while hiding all other internal details. + +**How it works**: Configure wrapped mode with `allowed_extension_fields: ["code"]` and `omit_extensions: false`. Only the "code" field from extensions is forwarded to clients. + +**Outcome**: Client applications receive structured error codes for handling logic while sensitive extensions are filtered out. + +### Use Case 3: Full Transparency for Internal APIs +**Scenario**: An internal API used only by trusted backend services needs complete error information for debugging. + +**How it works**: Configure pass-through mode with `allow_all_extension_fields: true` and `attach_service_name: true`. All error details are forwarded including service names. 
+ +**Outcome**: Complete error transparency for internal debugging without the overhead of parsing wrapped errors. + +--- + +## Technical Summary + +### How It Works +When a subgraph returns an error, the router processes it according to the configured propagation mode. In wrapped mode, the original error is encapsulated in a generic message with details placed in the `extensions.errors` array. In pass-through mode, errors are forwarded directly to the client. Both modes apply field filtering based on configuration, stripping unauthorized fields before the response is sent. + +### Key Technical Features + +**Wrapped Mode (Default)** +```yaml +subgraph_error_propagation: + mode: wrapped + allowed_extension_fields: + - "code" +``` + +**Wrapped Mode with Maximum Security** +```yaml +subgraph_error_propagation: + mode: wrapped + omit_extensions: true +``` + +**Wrapped Mode with Extended Information** +```yaml +subgraph_error_propagation: + mode: wrapped + omit_extensions: false + propagate_status_codes: true + attach_service_name: true + allowed_extension_fields: + - "code" +``` + +**Pass-Through Mode with Filtering** +```yaml +subgraph_error_propagation: + mode: pass-through + attach_service_name: true + allow_all_extension_fields: false + allowed_extension_fields: + - "code" + omit_locations: true + allowed_fields: + - "userId" +``` + +**Example Wrapped Error Response** +```json +{ + "errors": [ + { + "message": "Failed to fetch from Subgraph 'employees'.", + "extensions": { + "serviceName": "employees", + "statusCode": 200, + "errors": [ + { + "message": "error resolving field", + "path": ["employees", 0, "name"], + "extensions": { + "code": "ERROR_CODE" + } + } + ] + } + } + ] +} +``` + +### Integration Points +- Client-side error handling logic +- Logging and monitoring systems +- Error tracking services (Sentry, Datadog) +- Development mode configuration + +### Requirements & Prerequisites +- No specific version requirements (available in all modern router 
versions) +- Understanding of your error handling requirements +- Coordination with frontend teams on error format + +--- + +## Documentation References + +- Primary docs: `/docs/router/subgraph-error-propagation` +- Router configuration: `/docs/router/configuration#subgraph-error-propagation` +- Security hardening: `/docs/router/security/hardening-guide` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL error handling +- Subgraph error propagation +- GraphQL error security + +### Secondary Keywords +- GraphQL error masking +- Error extension filtering +- GraphQL error responses + +### Related Search Terms +- Hide GraphQL error details +- GraphQL error best practices +- Secure GraphQL error handling + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/security/tls-https.md b/capabilities/security/tls-https.md new file mode 100644 index 00000000..5c313072 --- /dev/null +++ b/capabilities/security/tls-https.md @@ -0,0 +1,158 @@ +# TLS/HTTPS + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-sec-003` | +| **Category** | Security | +| **Status** | GA | +| **Availability** | Free, Pro, Enterprise | +| **Related Capabilities** | `cap-sec-005` (Security Hardening) | + +--- + +## Quick Reference + +### Name +TLS/HTTPS + +### Tagline +Encrypted communication with TLS and mutual authentication. + +### Elevator Pitch +Cosmo Router supports TLS and mTLS (mutual TLS) for secure, encrypted communication between clients, the router, and your infrastructure components. Enable HTTPS for client connections, automatic encryption for subgraph communication, and bidirectional certificate verification for zero-trust environments. 
+ +--- + +## Problem & Solution + +### The Problem +GraphQL APIs transmitting sensitive data over unencrypted connections are vulnerable to eavesdropping, man-in-the-middle attacks, and data interception. In modern architectures with load balancers, service meshes, and distributed subgraphs, ensuring end-to-end encryption can be complex. Additionally, many environments require mutual authentication where both client and server verify each other's identity. + +### The Solution +Cosmo Router's TLS support provides simple configuration-driven encryption for all communication channels. Enable TLS for client connections with standard certificate files, automatically secure subgraph connections via HTTPS URLs, and implement mutual TLS for zero-trust architectures. When TLS is enabled, HTTP/2 is automatically available for improved performance. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Unencrypted traffic vulnerable to interception | TLS encryption for all connections | +| Manual HTTP/2 configuration | Automatic HTTP/2 upgrade with TLS | +| Complex mTLS setup | Simple configuration for mutual authentication | +| Inconsistent encryption policies | Centralized TLS configuration at the router | + +--- + +## Key Benefits + +1. **Simple Configuration**: Enable TLS with just a certificate and key file path in YAML configuration, no code changes required. + +2. **Automatic HTTP/2**: When TLS is enabled, requests are automatically upgraded to HTTP/2 for multiplexing, header compression, and improved performance. + +3. **Mutual TLS Support**: Configure client authentication to verify client certificates, enabling zero-trust security where both parties authenticate each other. + +4. **Comprehensive Protocol Support**: Supports TLS 1.0 through 1.3, with a wide range of cipher suites following Go's secure defaults. + +5. **Subgraph Encryption**: Automatic TLS for subgraph connections when using HTTPS URLs, ensuring end-to-end encryption. 
+ +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / DevOps Engineer +- **Pain Points**: Securing communication between infrastructure components; implementing zero-trust security; enabling HTTP/2 performance benefits +- **Goals**: End-to-end encryption; mutual authentication; compliance with security requirements + +### Secondary Personas +- Security engineers implementing encryption policies +- Architects designing secure network topologies +- Compliance officers ensuring data-in-transit encryption + +--- + +## Use Cases + +### Use Case 1: Secure Load Balancer Communication +**Scenario**: An organization requires encrypted communication between their load balancer and the Cosmo Router to protect traffic within their infrastructure. + +**How it works**: Configure TLS on the router with server certificate and key files. The load balancer connects via HTTPS, and all traffic between the load balancer and router is encrypted. + +**Outcome**: Secure internal communication preventing eavesdropping on traffic between infrastructure components. + +### Use Case 2: Zero-Trust Client Authentication +**Scenario**: A financial services company requires mutual TLS where both clients and the server authenticate each other before establishing connections. + +**How it works**: Enable TLS server configuration with client authentication. Configure `client_auth.required: true` and provide the CA certificate for validating client certificates. Clients must present valid certificates to connect. + +**Outcome**: Bidirectional authentication ensuring only authorized clients can connect to the GraphQL API. + +### Use Case 3: HTTP/2 Performance Optimization +**Scenario**: A high-traffic API needs the performance benefits of HTTP/2, including multiplexing and header compression. + +**How it works**: Enable TLS on the router. HTTP/2 is automatically available when TLS is configured, and clients supporting HTTP/2 are automatically upgraded. 
+ +**Outcome**: Improved API performance through HTTP/2 features, with all connections encrypted. + +--- + +## Technical Summary + +### How It Works +The Cosmo Router TLS implementation uses Go's standard TLS library. When TLS is enabled, the router listens for HTTPS connections using the provided certificate and key files. For subgraph connections, HTTPS is automatically used when subgraph URLs specify the https:// protocol. Client authentication (mTLS) adds an additional handshake step where the client presents its certificate, which the router validates against the configured CA certificate. + +### Key Technical Features +- TLS server configuration with PEM certificate and key files +- Support for TLS 1.0, 1.1, 1.2, and 1.3 +- Comprehensive cipher suite support (ECDHE, GCM, CBC, ChaCha20) +- Optional client authentication for mTLS +- Required mode for enforcing client certificates +- Automatic HTTP/2 upgrade with TLS + +### Integration Points +- Load balancers and reverse proxies +- Client applications with TLS support +- Subgraphs via HTTPS URLs +- Certificate management systems + +### Requirements & Prerequisites +- Router version 0.71.0 or higher +- Valid TLS certificate and private key in PEM format +- For mTLS: CA certificate for client validation +- Client certificates signed by trusted CA (for mTLS) + +--- + +## Documentation References + +- Primary docs: `/docs/router/security/tls` +- Security hardening guide: `/docs/router/security/hardening-guide` +- Router configuration: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL TLS +- GraphQL HTTPS +- mTLS GraphQL + +### Secondary Keywords +- Mutual TLS GraphQL +- HTTP/2 GraphQL +- GraphQL encryption + +### Related Search Terms +- Secure GraphQL API +- GraphQL certificate authentication +- Enable HTTPS GraphQL server + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2025-01-14 | 1.0 | Initial capability documentation | diff 
--git a/capabilities/template.md b/capabilities/template.md new file mode 100644 index 00000000..4f598fd5 --- /dev/null +++ b/capabilities/template.md @@ -0,0 +1,289 @@ +# Capability Documentation Template + +Use this template to document each capability in the Cosmo platform. The goal is to provide enough information for marketing, sales, and product teams to create landing pages, pitch decks, battle cards, and other go-to-market materials. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-xxx` (unique identifier) | +| **Category** | (e.g., Federation, Observability, Security, Developer Experience, Operations) | +| **Status** | GA / Beta / Coming Soon | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | (list of related cap-xxx IDs) | + +--- + +## Quick Reference + +### Name + + +### Tagline + + +### Elevator Pitch + + +--- + +## Problem & Solution + +### The Problem + + +### The Solution + + +### Before & After + + +| Before Cosmo | With Cosmo | +|--------------|------------| +| | | + +--- + +## Key Benefits + + + +1. **Benefit Name**: Brief description +2. **Benefit Name**: Brief description +3. **Benefit Name**: Brief description + +--- + +## Target Audience + +### Primary Persona + +- **Role**: (e.g., Platform Engineer, API Developer, Engineering Manager) +- **Pain Points**: What specific challenges do they face? +- **Goals**: What are they trying to achieve? + +### Secondary Personas + + +--- + +## Use Cases + +### Use Case 1: [Name] +**Scenario**: Describe a real-world situation +**How it works**: Step-by-step of how the capability is used +**Outcome**: The result/benefit achieved + +### Use Case 2: [Name] +**Scenario**: +**How it works**: +**Outcome**: + +### Use Case 3: [Name] +**Scenario**: +**How it works**: +**Outcome**: + +--- + +## Competitive Positioning + +### Key Differentiators + +1. +2. +3. 
+ +### Comparison with Alternatives + + +| Aspect | Cosmo | Alternative A | Alternative B | +|--------|-------|---------------|---------------| +| | | | | + +### Common Objections & Responses + + +| Objection | Response | +|-----------|----------| +| | | + +--- + +## Technical Summary + + + +### How It Works + + +### Architecture Diagram + +![Architecture](/images/capabilities/cap-xxx-architecture.png) + +### Key Technical Features +- Feature 1 +- Feature 2 +- Feature 3 + +### Integration Points + +- +- + +### Requirements & Prerequisites + +- +- + +--- + +## Proof Points + +### Metrics & Benchmarks + +- +- + +### Customer Quotes + +> "Quote here" — Customer Name, Title, Company + +### Case Studies + +- + +--- + +## Content Assets + + + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | ✅ Exists / 🔲 Needed | | +| Blog Post | | | +| Video Demo | | | +| Pitch Deck Slide | | | +| One-Pager | | | +| Battle Card | | | + +--- + +## Documentation References + + +- Primary docs: `/docs/path/to/main-doc` +- Configuration guide: `/docs/path/to/config` +- Tutorial: `/docs/path/to/tutorial` +- API Reference: `/docs/path/to/api` + +--- + +## Keywords & SEO + + + +### Primary Keywords +- +- + +### Secondary Keywords +- +- + +### Related Search Terms + +- +- + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| YYYY-MM-DD | 1.0 | Initial capability documentation | + +--- + +# Example: Completed Capability + +Below is an example of a completed capability document for reference. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-001` | +| **Category** | Observability | +| **Status** | GA | +| **Availability** | Pro, Enterprise | +| **Related Capabilities** | `cap-012`, `cap-015` | + +## Quick Reference + +### Name +Distributed Tracing + +### Tagline +Debug federation issues in minutes, not hours. 
+ +### Elevator Pitch +Distributed Tracing provides end-to-end visibility into every GraphQL request as it flows through your federated graph. Instantly identify slow subgraphs, pinpoint errors, and understand the complete request lifecycle—all from a single dashboard. + +## Problem & Solution + +### The Problem +When a GraphQL query fails or performs slowly in a federated architecture, developers waste hours trying to identify which subgraph is responsible. With requests potentially touching dozens of services, traditional logging and monitoring tools lack the context needed to correlate events across service boundaries. + +### The Solution +Cosmo's Distributed Tracing automatically instruments your entire federated graph, capturing detailed timing and context for every operation. Each request gets a unique trace ID that follows it through the Router and into each subgraph, making it trivial to identify exactly where issues occur. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Hours spent correlating logs across services | Single trace view shows complete request path | +| Guessing which subgraph caused latency | Precise timing breakdown per subgraph | +| No visibility into resolver-level performance | Field-level execution insights | + +## Key Benefits + +1. **Reduce MTTR by 80%**: Pinpoint the exact subgraph and resolver causing issues +2. **Proactive Performance Optimization**: Identify slow paths before users complain +3. **Zero-Code Instrumentation**: Works automatically with the Cosmo Router +4. 
**OpenTelemetry Compatible**: Export traces to your existing observability stack + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: On-call debugging is painful; lack of visibility into federated requests +- **Goals**: Reduce incident response time; maintain SLAs + +### Secondary Personas +- Backend developers debugging performance issues +- Engineering managers tracking system health + +## Use Cases + +### Use Case 1: Production Incident Response +**Scenario**: A critical checkout API starts returning errors intermittently +**How it works**: Filter traces by error status, see the exact subgraph returning errors, view the error message and stack trace +**Outcome**: Root cause identified in 5 minutes instead of 2 hours + +### Use Case 2: Performance Optimization +**Scenario**: Users report slow page loads on the product catalog +**How it works**: Analyze traces for the product query, identify the inventory subgraph adding 800ms latency, drill into specific resolver +**Outcome**: Targeted optimization reduces latency by 60% + +## Documentation References + +- Primary docs: `/docs/studio/tracing` +- Configuration guide: `/docs/router/observability/tracing` +- Tutorial: `/docs/tutorial/observability-setup` diff --git a/capabilities/traffic-management/circuit-breaker.md b/capabilities/traffic-management/circuit-breaker.md new file mode 100644 index 00000000..9f61f4f8 --- /dev/null +++ b/capabilities/traffic-management/circuit-breaker.md @@ -0,0 +1,238 @@ +# Circuit Breaker + +Fault tolerance with automatic circuit state management for federated GraphQL. 
+ +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-traffic-004` | +| **Category** | Traffic Management | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-traffic-001`, `cap-traffic-002`, `cap-traffic-003` | + +--- + +## Quick Reference + +### Name +Circuit Breaker + +### Tagline +Prevent cascading failures with automatic circuit protection. + +### Elevator Pitch +Cosmo's Circuit Breaker protects your federated GraphQL API from cascading failures. When a subgraph starts failing, the circuit breaker automatically stops sending requests to it, giving it time to recover. Once healthy, traffic is gradually restored. This pattern keeps your router responsive during partial outages and prevents a single failing service from bringing down your entire API. + +--- + +## Problem & Solution + +### The Problem +In federated architectures, a single failing subgraph can cascade failures across your entire system. When a service becomes slow or unresponsive, requests pile up, resources are exhausted, and eventually the entire router becomes unresponsive. Without circuit breakers, your system has no automatic protection against this cascading failure pattern. + +### The Solution +Cosmo's Circuit Breaker implements the proven circuit breaker pattern, automatically detecting when subgraphs are failing and stopping traffic to them. Using a time-based sliding window, the circuit breaker tracks error rates and opens when thresholds are exceeded. After a configurable sleep window, it cautiously tests if the service has recovered before fully restoring traffic. 
+ +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Failing subgraph brings down entire API | Failed service is automatically isolated | +| Requests pile up waiting for slow services | Circuit opens instantly, freeing resources | +| Manual intervention needed to stop traffic | Automatic protection triggers on threshold | +| Hard to know when service has recovered | Gradual recovery testing with half-open state | + +--- + +## Key Benefits + +1. **Automatic Failure Detection**: Time-based sliding window with configurable thresholds detects when subgraphs are failing. + +2. **Instant Protection**: When a circuit opens, requests are immediately rejected without waiting for timeouts, keeping your router responsive. + +3. **Graceful Recovery**: The half-open state cautiously tests if services have recovered before fully restoring traffic. + +4. **Per-Subgraph Configuration**: Different services can have different circuit breaker settings based on their reliability characteristics. + +5. **Observable State**: Metrics and monitoring provide visibility into circuit breaker state for operational awareness. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: Cascading failures during outages; manual intervention to stop traffic to failing services; lack of automatic recovery +- **Goals**: Automatic failure isolation; fast recovery from partial outages; maintain overall system stability + +### Secondary Personas +- DevOps engineers managing production stability +- Engineering managers responsible for system reliability + +--- + +## Use Cases + +### Use Case 1: Protecting Against Cascading Failures +**Scenario**: A subgraph's database becomes overloaded, causing the service to respond slowly or not at all. +**How it works**: As requests fail, the circuit breaker's sliding window tracks the error rate. 
When the error threshold is exceeded (e.g., 50% errors over 60 seconds), the circuit opens. All subsequent requests to that subgraph are immediately rejected, freeing router resources. +**Outcome**: The failing subgraph is isolated; other subgraphs continue working; the router remains responsive. + +### Use Case 2: Automatic Recovery After Outage +**Scenario**: After a circuit opens due to failures, the underlying issue is resolved and the subgraph is healthy again. +**How it works**: After the sleep window expires (e.g., 30 seconds), the circuit enters half-open state. A limited number of test requests are allowed through. If enough succeed (e.g., 3 out of 5), the circuit closes and normal traffic resumes. +**Outcome**: Traffic is automatically restored without manual intervention; gradual recovery prevents overwhelming the just-recovered service. + +### Use Case 3: Critical Service Protection +**Scenario**: A payment processing subgraph is critical and must be protected from retry storms during incidents. +**How it works**: Configure a subgraph-specific circuit breaker with conservative settings: low request threshold, quick sleep window, and strict success requirements. This ensures fast protection with careful recovery. +**Outcome**: The payment service is protected aggressively, with careful recovery to prevent re-triggering issues. + +### Use Case 4: Disabling Circuit Breaker for Testing +**Scenario**: During development or testing, you want to see all failures rather than having them circuit-broken. +**How it works**: Disable the circuit breaker for specific subgraphs using `enabled: false` in the subgraph-specific configuration. +**Outcome**: Test subgraphs show all failures for debugging while production subgraphs remain protected. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Time-based sliding window with configurable buckets for accurate error rate tracking +2. 
Three-state model (closed, open, half-open) with configurable transitions +3. Per-URL circuit breaker grouping with subgraph-specific overrides +4. Integrated metrics for operational visibility + +### Comparison with Alternatives + +| Aspect | Cosmo | Service Mesh | Custom Implementation | +|--------|-------|--------------|----------------------| +| Configuration | Simple YAML | Complex CRDs | Code in each service | +| Sliding window | Time-based buckets | Varies | Must implement | +| Half-open testing | Configurable attempts | Usually fixed | Must implement | +| Metrics integration | Built-in | Separate | Must implement | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Circuit breakers can hide real problems" | Metrics and monitoring make circuit state visible; alerts can be set on open circuits | +| "We need different settings per service" | Per-subgraph configuration allows complete customization | +| "What about services sharing URLs?" | Subgraph-specific config creates dedicated circuit breakers even for shared URLs | + +--- + +## Technical Summary + +### How It Works +The circuit breaker uses a time-based sliding window divided into buckets to track request outcomes. When the error rate exceeds the configured threshold and minimum request count is met, the circuit opens. All requests are rejected until the sleep window expires. Then, a half-open state allows limited test requests. If enough succeed, the circuit closes; if any fail, it reopens. 
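+
+The subgraph-specific overrides and the testing opt-out described above might look like the sketch below. This is illustrative only: the `payments` and `test-service` subgraph names are hypothetical, the option names are carried over from the global configuration example later in this document, and the exact nesting should be verified against the configuration reference.
+
+```yaml
+traffic_shaping:
+  subgraphs:
+    payments:
+      circuit_breaker:
+        enabled: true
+        request_threshold: 10        # open on fewer requests than the default
+        error_threshold_percentage: 30
+        sleep_window: 15s            # re-test recovery quickly
+        half_open_attempts: 3
+        required_successful: 3       # every half-open request must succeed
+    test-service:
+      circuit_breaker:
+        enabled: false               # surface every failure while testing
+```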
+ +### Circuit States +- **Closed**: Normal operation; requests pass through; errors are tracked +- **Open**: Protection mode; all requests immediately rejected +- **Half-Open**: Recovery testing; limited requests allowed to test service health + +### Key Technical Features +- Time-based sliding window with configurable buckets +- Error rate and request count thresholds +- Configurable sleep window before recovery testing +- Half-open state with configurable test attempts and success requirements +- Execution timeout for circuit breaker error tracking +- Per-subgraph configuration overrides + +### What Counts as a Failure +- Network-level failures (connection refused, DNS errors, TLS failures) +- Transport errors (broken connections, read/write timeouts) +- Execution timeouts (configurable circuit breaker-specific timeout) + +### What Does NOT Count as a Failure +- HTTP error status codes (4xx, 5xx) if a response is received +- Request cancellations or client-side timeouts + +### Configuration Example +```yaml +traffic_shaping: + all: + circuit_breaker: + enabled: true + request_threshold: 20 + error_threshold_percentage: 50 + sleep_window: 30s + half_open_attempts: 5 + required_successful: 3 + rolling_duration: 60s + num_buckets: 10 +``` + +### Integration Points +- Router configuration (config.yaml) +- Retry mechanism (retries stop when circuit opens) +- Metrics and monitoring (circuit breaker state and events) + +### Requirements & Prerequisites +- Cosmo Router deployed +- Access to router configuration +- Rolling duration must be evenly divisible by number of buckets + +--- + +## Proof Points + +### Metrics & Benchmarks +- Circuits open within the configured rolling window when thresholds are exceeded +- Half-open state provides controlled recovery testing +- Per-subgraph circuits isolate failures to specific services + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed 
| | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/traffic-shaping/circuit-breaker` +- Traffic shaping overview: `/docs/router/traffic-shaping` +- Configuration reference: `/docs/router/configuration#circuit-breaker` +- Metrics: `/docs/router/metrics-and-monitoring#circuit-breaker-specific-metrics` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL circuit breaker +- Federation fault tolerance +- API circuit breaker pattern + +### Secondary Keywords +- Cascading failure prevention +- Half-open circuit breaker +- Sliding window error rate + +### Related Search Terms +- How to configure GraphQL circuit breaker +- Preventing cascading failures in GraphQL +- Federation subgraph failure isolation +- Circuit breaker pattern implementation + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2026-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/traffic-management/retry-mechanism.md b/capabilities/traffic-management/retry-mechanism.md new file mode 100644 index 00000000..d76fd423 --- /dev/null +++ b/capabilities/traffic-management/retry-mechanism.md @@ -0,0 +1,216 @@ +# Retry Mechanism + +Intelligent retry policies with exponential backoff for federated GraphQL. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-traffic-002` | +| **Category** | Traffic Management | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-traffic-001`, `cap-traffic-003`, `cap-traffic-004` | + +--- + +## Quick Reference + +### Name +Retry Mechanism + +### Tagline +Recover from transient failures automatically. + +### Elevator Pitch +The Cosmo Router's Retry Mechanism automatically retries failed GraphQL queries using intelligent exponential backoff with jitter. 
Configure retry conditions using expressions, set maximum attempts and durations, and let the router handle transient network failures and temporary service unavailability—all without changing your subgraph code. + +--- + +## Problem & Solution + +### The Problem +Transient network failures, temporary service unavailability, and brief outages are facts of life in distributed systems. Without automatic retries, these temporary issues result in failed requests that frustrate users. Implementing retry logic in every service leads to inconsistent behavior, duplicated code, and the risk of retry storms that make problems worse. + +### The Solution +Cosmo's Retry Mechanism provides centralized, configurable retry logic at the router level. The router automatically retries failed query operations using the proven "Backoff and Jitter" algorithm, preventing retry storms while maximizing recovery from transient failures. Expression-based conditions let you precisely control when retries should occur. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Retry logic scattered across services | Centralized configuration at the router | +| Inconsistent retry behavior | Uniform retry policy with per-subgraph overrides | +| Risk of retry storms | Exponential backoff with jitter prevents thundering herd | +| Hard to tune retry conditions | Expression-based conditions for precise control | + +--- + +## Key Benefits + +1. **Automatic Failure Recovery**: Transient failures are automatically retried without user-visible errors, improving perceived reliability. + +2. **Backoff with Jitter**: The proven "Backoff and Jitter" algorithm (as recommended by AWS) prevents retry storms and distributes retry load over time. + +3. **Expression-Based Conditions**: Use powerful expressions to control exactly when retries occur—by status code, error type, or custom conditions. + +4. 
**Safe by Default**: Only queries are retried (mutations are not), preventing accidental duplicate operations on non-idempotent endpoints. + +5. **Configurable Limits**: Set maximum attempts, intervals, and durations to bound retry behavior and prevent runaway retries. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: Transient failures causing user-visible errors; implementing consistent retry logic across services; preventing retry storms +- **Goals**: Maximize availability; reduce user-facing errors; maintain consistent retry behavior + +### Secondary Personas +- Backend developers who want reliability without implementing retry logic +- DevOps engineers tuning system behavior during incidents + +--- + +## Use Cases + +### Use Case 1: Recovering from Network Blips +**Scenario**: A subgraph experiences brief network connectivity issues that cause occasional request failures. +**How it works**: With default retry configuration, the router automatically detects connection errors and retries the query using exponential backoff. Helper functions like `IsConnectionError()` and `IsTimeout()` identify retryable conditions. +**Outcome**: Users don't see errors from transient network issues; the retry succeeds transparently. + +### Use Case 2: Handling Subgraph Restarts +**Scenario**: During a deployment, a subgraph briefly returns 503 Service Unavailable while new pods start. +**How it works**: The default expression `IsRetryableStatusCode()` includes 503, so the router automatically retries these requests. With up to 5 attempts over 10 seconds, most requests succeed once new pods are ready. +**Outcome**: Deployments don't cause user-visible errors; requests are automatically routed to healthy instances. + +### Use Case 3: Custom Retry for Rate-Limited APIs +**Scenario**: A subgraph wraps an external API that occasionally returns 429 Too Many Requests with a Retry-After header. 
+**How it works**: Configure a custom expression that includes `statusCode == 429`. When the subgraph returns 429, the router respects the Retry-After header (if present) or uses the configured backoff interval. +**Outcome**: Rate-limited requests are automatically retried after the appropriate delay, maximizing throughput while respecting API limits. + +### Use Case 4: Excluding Slow Business Logic from Retries +**Scenario**: A subgraph performs complex calculations that can legitimately take a long time, and you don't want to retry these even on timeout. +**How it works**: Configure a custom expression like `!IsHttpReadTimeout() && IsTimeout()` to exclude HTTP read timeouts from retry conditions while still retrying on connection-level timeouts. +**Outcome**: Genuine slow operations aren't wastefully retried, but connection issues still trigger appropriate retries. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Expression-based retry conditions provide sophisticated control without code changes +2. GraphQL-aware: only retries queries, not mutations +3. Respects Retry-After headers for 429 responses when enabled +4. Integrated with circuit breakers for comprehensive failure handling + +### Comparison with Alternatives + +| Aspect | Cosmo | Generic HTTP Retry | Service Mesh | +|--------|-------|-------------------|--------------| +| GraphQL-aware | Yes (query vs mutation) | No | No | +| Condition expressions | Yes | Usually no | Limited | +| Backoff algorithm | Jitter built-in | Varies | Varies | +| 429 Retry-After support | Yes | Varies | Varies | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "Retries can cause duplicate requests" | Mutations are never retried; only idempotent queries are automatically retried | +| "We need custom retry logic per service" | Per-subgraph configuration allows different retry settings for each service | +| "How do we debug retry behavior?" 
| Enable debug mode to see retry attempts in logs | + +--- + +## Technical Summary + +### How It Works +When a GraphQL query fails due to a retryable condition (network error, specific status codes), the router automatically retries using the configured backoff algorithm. The default algorithm is "Backoff and Jitter", which increases the delay between retries and adds random jitter to prevent thundering herd effects. An expression is evaluated for each failure to determine if a retry should be attempted. + +### Key Technical Features +- Backoff and Jitter algorithm (AWS-recommended pattern) +- Configurable max attempts, interval, and max duration +- Default retries on 500, 502, 503, 504 and connection errors +- Expression-based retry conditions with helper functions +- Respects Retry-After header for 429 responses (when enabled) +- Automatically retries on "unexpected EOF" errors + +### Helper Functions +- `IsRetryableStatusCode()`: Returns true for 500, 502, 503, 504 +- `IsConnectionError()`: Connection refused, reset, DNS, TLS failures +- `IsTimeout()`: Any timeout error (HTTP, network, deadline exceeded) +- `IsHttpReadTimeout()`: Specifically HTTP read timeouts +- `IsConnectionRefused()`: ECONNREFUSED errors +- `IsConnectionReset()`: ECONNRESET errors + +### Integration Points +- Router configuration (config.yaml) +- Circuit breakers (retries stop when circuit opens) +- Debug logging for retry visibility + +### Requirements & Prerequisites +- Cosmo Router deployed +- Access to router configuration + +--- + +## Proof Points + +### Metrics & Benchmarks +- Default configuration recovers from most transient failures (5 attempts, 10s max) +- Backoff with jitter prevents retry storms during outages +- Debug mode provides full visibility into retry attempts + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager
| Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/traffic-shaping/retry` +- Traffic shaping overview: `/docs/router/traffic-shaping` +- Configuration reference: `/docs/router/configuration` +- Template expressions: `/docs/router/configuration/template-expressions` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL retry mechanism +- Federation retry configuration +- API retry with backoff + +### Secondary Keywords +- Exponential backoff jitter +- Automatic failure recovery +- Retry expression conditions + +### Related Search Terms +- How to configure GraphQL retries +- Backoff and jitter algorithm +- Federation transient failure handling +- GraphQL 503 retry + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2026-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/traffic-management/timeout-configuration.md b/capabilities/traffic-management/timeout-configuration.md new file mode 100644 index 00000000..5e3dea35 --- /dev/null +++ b/capabilities/traffic-management/timeout-configuration.md @@ -0,0 +1,220 @@ +# Timeout Configuration + +Request and per-subgraph timeout management for federated GraphQL. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-traffic-003` | +| **Category** | Traffic Management | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-traffic-001`, `cap-traffic-002`, `cap-traffic-004` | + +--- + +## Quick Reference + +### Name +Timeout Configuration + +### Tagline +Fine-grained timeout control for every subgraph. + +### Elevator Pitch +Cosmo's Timeout Configuration gives you precise control over how long the router waits for subgraph responses at every stage of the request lifecycle. Set global defaults for all subgraphs, then override specific services that need different treatment. 
From connection dial to TLS handshake to full request completion, every timeout is configurable. + +--- + +## Problem & Solution + +### The Problem +In federated GraphQL architectures, different subgraphs have different performance characteristics. A product catalog might respond in milliseconds, while an inventory check calls a slow legacy system. Without granular timeout control, teams either set timeouts too high (wasting resources on hung connections) or too low (failing legitimate slow requests). One-size-fits-all timeout settings don't work for diverse subgraph ecosystems. + +### The Solution +Cosmo's Timeout Configuration provides multiple timeout controls at different stages of the request lifecycle. Set global defaults that apply to all subgraphs, then override specific subgraphs that need different treatment. Configure everything from connection dial timeouts to full request timeouts, ensuring each subgraph gets the time it needs—no more, no less. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Single timeout for all services | Multiple timeout types for different stages | +| Same timeout for fast and slow services | Per-subgraph timeout overrides | +| Can't distinguish dial vs. request timeout | Separate dial, TLS, request, and response timeouts | +| Hung connections waste resources | Keep-alive management reclaims idle connections | + +--- + +## Key Benefits + +1. **Multi-Stage Timeout Control**: Configure separate timeouts for connection dial, TLS handshake, response headers, and full request completion. + +2. **Per-Subgraph Overrides**: Set sensible global defaults, then customize timeouts for specific subgraphs that have different performance characteristics. + +3. **Resource Efficiency**: Keep-alive idle timeout automatically closes idle connections, preventing resource exhaustion. + +4. 
**Slow Service Accommodation**: Increase request timeout for known slow services without affecting the timeout budget for fast services. + +5. **Zero-Downtime Configuration**: All timeout settings are in the router's YAML config, changeable without code deployments. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: Balancing tight timeouts for responsiveness with generous timeouts for slow services; managing connection resources; debugging timeout issues +- **Goals**: Optimize timeout settings per service; maintain responsiveness; prevent resource exhaustion + +### Secondary Personas +- Backend developers integrating slow legacy systems as subgraphs +- DevOps engineers optimizing connection pool behavior + +--- + +## Use Cases + +### Use Case 1: Accommodating Legacy Backend Integration +**Scenario**: A subgraph wraps a legacy system that can take up to 60 seconds for complex queries, but the default request timeout is 10 seconds. +**How it works**: Add a subgraph-specific override with `request_timeout: 60s` for just that subgraph while keeping the default tight for other services. +**Outcome**: The legacy subgraph gets the time it needs without affecting timeout behavior for other services. + +### Use Case 2: Optimizing Connection Establishment +**Scenario**: Subgraphs are deployed in a different region, and connection establishment occasionally takes longer than expected. +**How it works**: Increase `dial_timeout` to give connections more time to establish across the network, while keeping the TLS handshake timeout tight. +**Outcome**: Cross-region connections succeed reliably without overly generous timeouts on other stages. + +### Use Case 3: Managing Connection Pool Resources +**Scenario**: Under variable load, many idle connections accumulate and consume resources. 
+**How it works**: Configure `keep_alive_idle_timeout` to automatically close connections that have been idle for a specified duration, freeing up resources. +**Outcome**: Connection resources are automatically reclaimed during low-traffic periods. + +### Use Case 4: Detecting Slow Response Headers +**Scenario**: A subgraph occasionally hangs after accepting the connection but before sending response headers. +**How it works**: Set `response_header_timeout` to detect when a subgraph accepts the request but fails to start responding within a reasonable time. +**Outcome**: Hung requests are detected early, freeing up router resources and triggering potential retries. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Multiple timeout types for different request lifecycle stages +2. Per-subgraph configuration overrides for heterogeneous service landscapes +3. Integrated with retry and circuit breaker for comprehensive traffic management +4. Keep-alive management for connection resource optimization + +### Comparison with Alternatives + +| Aspect | Cosmo | Generic Proxy | Service Mesh | +|--------|-------|---------------|--------------| +| Timeout granularity | 7 different types | Usually 1-2 | Varies | +| Per-service override | Yes | Often manual | Yes | +| Keep-alive management | Built-in | Varies | Varies | +| GraphQL integration | Native | No | No | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We don't need so many timeout types" | Use the defaults; configure only what you need. The granularity is there when you need it. | +| "How do we know what values to set?" 
| Start with defaults, use observability to identify timeout issues, then tune specific values | +| "Per-subgraph config seems complex" | Most use cases only need global defaults; per-subgraph is for exceptions | + +--- + +## Technical Summary + +### How It Works +Timeout configuration is specified in the router's YAML config file under `traffic_shaping`. The `all` section sets defaults for all subgraph requests. The `subgraphs` section allows overriding any timeout for specific subgraphs. The router enforces these timeouts at runtime, failing requests that exceed their configured limits. + +### Key Technical Features +- `request_timeout`: Maximum total time for the complete request lifecycle +- `dial_timeout`: Maximum time to establish a connection +- `tls_handshake_timeout`: Maximum time for TLS negotiation +- `response_header_timeout`: Maximum time to receive response headers +- `expect_continue_timeout`: Time to wait for 100-continue response +- `keep_alive_idle_timeout`: Time before closing idle connections +- `keep_alive_probe_interval`: Interval between keep-alive probes + +### Configuration Example +```yaml +traffic_shaping: + all: + request_timeout: 60s + dial_timeout: 30s + tls_handshake_timeout: 10s + subgraphs: + legacy-service: + request_timeout: 120s # Override for slow service +``` + +### Integration Points +- Router configuration (config.yaml) +- Retry mechanism (timeouts can trigger retries) +- Circuit breaker (timeout failures affect circuit state) + +### Requirements & Prerequisites +- Cosmo Router deployed +- Access to router configuration + +--- + +## Proof Points + +### Metrics & Benchmarks +- Granular timeouts prevent resource waste on hung connections +- Per-subgraph configuration accommodates diverse service characteristics +- Keep-alive management optimizes connection pool resource usage + +--- + +## Content Assets + +| Asset Type | Status | Link | +|------------|--------|------| +| Landing Page | Needed | | +| Blog Post | Needed | | +| 
Video Demo | Needed | | +| Pitch Deck Slide | Needed | | +| One-Pager | Needed | | +| Battle Card | Needed | | + +--- + +## Documentation References + +- Primary docs: `/docs/router/traffic-shaping/timeout` +- Traffic shaping overview: `/docs/router/traffic-shaping` +- Configuration reference: `/docs/router/configuration` + +--- + +## Keywords & SEO + +### Primary Keywords +- GraphQL timeout configuration +- Federation subgraph timeout +- API request timeout + +### Secondary Keywords +- Connection timeout settings +- Keep-alive configuration +- TLS handshake timeout + +### Related Search Terms +- How to configure GraphQL timeouts +- Federation slow service timeout +- Router connection management +- Subgraph request timeout + +--- + +## Version History + +| Date | Version | Changes | +|------|---------|---------| +| 2026-01-14 | 1.0 | Initial capability documentation | diff --git a/capabilities/traffic-management/traffic-shaping.md b/capabilities/traffic-management/traffic-shaping.md new file mode 100644 index 00000000..d68e03da --- /dev/null +++ b/capabilities/traffic-management/traffic-shaping.md @@ -0,0 +1,202 @@ +# Traffic Shaping + +Comprehensive traffic control for federated GraphQL APIs. + +--- + +## Metadata + +| Field | Value | +|-------|-------| +| **Capability ID** | `cap-traffic-001` | +| **Category** | Traffic Management | +| **Status** | GA | +| **Availability** | Free / Pro / Enterprise | +| **Related Capabilities** | `cap-traffic-002`, `cap-traffic-003`, `cap-traffic-004` | + +--- + +## Quick Reference + +### Name +Traffic Shaping + +### Tagline +Take control of router traffic for maximum reliability. + +### Elevator Pitch +Traffic Shaping provides comprehensive control over how the Cosmo Router manages traffic between clients and the router, and between the router and subgraphs. Configure retries, timeouts, and circuit breakers to build resilient federated GraphQL APIs that gracefully handle failures and maintain performance under load. 
+ +--- + +## Problem & Solution + +### The Problem +In distributed federated GraphQL architectures, network failures, slow services, and cascading outages are inevitable. Without proper traffic management, a single failing subgraph can bring down your entire API. Teams struggle to configure appropriate timeouts, implement intelligent retry logic, and protect their systems from cascading failures—leading to poor user experiences and increased operational burden. + +### The Solution +Cosmo's Traffic Shaping provides a unified configuration layer for managing all aspects of traffic between your router and subgraphs. Set global defaults that apply to all subgraphs, then override settings for specific services that need different treatment. The router handles retry logic with exponential backoff, enforces timeouts at multiple levels, and can automatically circuit-break failing services. + +### Before & After + +| Before Cosmo | With Cosmo | +|--------------|------------| +| Manual retry logic in each service | Centralized retry configuration with intelligent backoff | +| Inconsistent timeout settings across services | Unified timeout management with per-subgraph overrides | +| Cascading failures when one service goes down | Automatic circuit breaking isolates failing services | +| Complex traffic management code scattered everywhere | Single YAML configuration for all traffic rules | + +--- + +## Key Benefits + +1. **Unified Traffic Control**: Configure retries, timeouts, and circuit breakers from a single configuration file with consistent semantics across all subgraphs. + +2. **Increased Reliability**: Intelligent retry mechanisms with exponential backoff and jitter automatically recover from transient failures without overwhelming downstream services. + +3. **Failure Isolation**: Circuit breakers prevent cascading failures by automatically stopping requests to unhealthy subgraphs, allowing them time to recover. + +4. 
**Granular Control**: Apply default rules globally while overriding specific settings for individual subgraphs that need different treatment. + +5. **Zero Code Changes**: All traffic shaping is configured at the router level—no changes required to your subgraph implementations. + +--- + +## Target Audience + +### Primary Persona +- **Role**: Platform Engineer / SRE +- **Pain Points**: Managing reliability across multiple subgraphs; preventing cascading failures; tuning retry and timeout settings without code changes +- **Goals**: Build resilient federated APIs; reduce incident frequency; maintain SLAs during partial outages + +### Secondary Personas +- Backend developers building subgraph services who need consistent reliability patterns +- DevOps engineers responsible for production stability +- Engineering managers tracking system reliability metrics + +--- + +## Use Cases + +### Use Case 1: Global Reliability Configuration +**Scenario**: A platform team needs to establish baseline reliability settings for all subgraphs in their federated graph. +**How it works**: Configure traffic shaping in the router's config.yaml with an `all` section that sets default retry attempts, timeouts, and circuit breaker thresholds for every subgraph. +**Outcome**: All subgraphs automatically benefit from intelligent retry logic and failure protection without any per-service configuration. + +### Use Case 2: Slow Service Accommodation +**Scenario**: One subgraph calls a legacy backend that occasionally takes 30+ seconds to respond, but the default timeout is 10 seconds. +**How it works**: Add a subgraph-specific override in the `subgraphs` section with an increased `request_timeout` value for just that service. +**Outcome**: The slow service gets the time it needs while other subgraphs maintain tight timeout budgets. 
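+
+Use Cases 1 and 2 above can be combined into a single configuration along these lines. The values and the `legacy-service` name are illustrative; consult the retry, timeout, and circuit breaker documentation pages for the complete option set.
+
+```yaml
+traffic_shaping:
+  all:                        # defaults applied to every subgraph
+    request_timeout: 10s
+    retry:
+      enabled: true
+      algorithm: backoff_jitter
+      max_attempts: 5
+      max_duration: 10s
+    circuit_breaker:
+      enabled: true
+      error_threshold_percentage: 50
+  subgraphs:
+    legacy-service:
+      request_timeout: 60s    # slow legacy backend gets more time
+```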
+ +### Use Case 3: High-Traffic Service Protection +**Scenario**: A critical subgraph handles payment processing and must be protected from retry storms during incidents. +**How it works**: Configure a subgraph-specific circuit breaker with conservative thresholds—lower request threshold, quick sleep window, and strict success requirements for recovery. +**Outcome**: The payment service is protected from being overwhelmed during failures, and recovers gracefully when the underlying issue is resolved. + +--- + +## Competitive Positioning + +### Key Differentiators +1. Unified configuration for all traffic management aspects (retries, timeouts, circuit breakers) in a single YAML file +2. Expression-based retry conditions allow sophisticated logic without code changes +3. Per-subgraph overrides provide granular control while maintaining sensible defaults + +### Comparison with Alternatives + +| Aspect | Cosmo | Service Mesh | Custom Implementation | +|--------|-------|--------------|----------------------| +| Configuration complexity | Simple YAML | Complex CRDs | Code in each service | +| Retry logic | Built-in with expressions | Varies | Must implement | +| Circuit breakers | Integrated | Separate config | Must implement | +| GraphQL-aware | Yes | No | Varies | + +### Common Objections & Responses + +| Objection | Response | +|-----------|----------| +| "We already have a service mesh" | Cosmo's traffic shaping is GraphQL-aware and works at the query level, complementing mesh capabilities | +| "Our services handle their own retries" | Centralizing retry logic at the router provides consistency and reduces duplicate code across services | +| "Configuration seems complex" | Start with sensible defaults in the `all` section; only add overrides when needed | + +--- + +## Technical Summary + +### How It Works +Traffic shaping is configured in the router's YAML configuration file. The `all` section defines defaults applied to every subgraph request. 
+The `subgraphs` section allows per-service overrides. At runtime, the router applies these rules to every request flowing between the router and subgraphs, handling retries, enforcing timeouts, and managing circuit breaker state automatically.
+
+### Key Technical Features
+- Exponential backoff with jitter for retries
+- Multiple timeout types (request, dial, TLS handshake, response header)
+- Time-based sliding window circuit breakers
+- Expression-based retry conditions
+- Per-subgraph configuration overrides
+
+### Integration Points
+- Router configuration (config.yaml)
+- Observability stack (metrics for circuit breaker state)
+- Subgraph services (transparent—no changes required)
+
+### Requirements & Prerequisites
+- Cosmo Router deployed
+- Access to router configuration
+
+---
+
+## Proof Points
+
+### Metrics & Benchmarks
+- Retry mechanism can recover from transient failures without client-visible errors
+- Circuit breakers can prevent request pile-up during subgraph outages
+- Configurable at runtime without router restart (via config reload)
+
+---
+
+## Content Assets
+
+| Asset Type | Status | Link |
+|------------|--------|------|
+| Landing Page | Needed | |
+| Blog Post | Needed | |
+| Video Demo | Needed | |
+| Pitch Deck Slide | Needed | |
+| One-Pager | Needed | |
+| Battle Card | Needed | |
+
+---
+
+## Documentation References
+
+- Primary docs: `/docs/router/traffic-shaping`
+- Retry configuration: `/docs/router/traffic-shaping/retry`
+- Timeout configuration: `/docs/router/traffic-shaping/timeout`
+- Circuit breaker: `/docs/router/traffic-shaping/circuit-breaker`
+
+---
+
+## Keywords & SEO
+
+### Primary Keywords
+- GraphQL traffic shaping
+- Federation reliability
+- API traffic management
+
+### Secondary Keywords
+- GraphQL retries
+- Subgraph timeouts
+- Circuit breaker pattern
+
+### Related Search Terms
+- How to configure GraphQL API retries
+- Federation timeout configuration
+- Preventing cascading failures in GraphQL
+- GraphQL router reliability
+
+---
+
+## Version History
+
+| Date | Version | Changes |
+|------|---------|---------|
+| 2026-01-14 | 1.0 | Initial capability documentation |