WIP Prometheus without operator #67

Open
solsson wants to merge 8 commits into main from metrics-v2-experiment

@solsson solsson commented Mar 10, 2026

We can most likely meet our need for metrics endpoint discovery using conventions and kubernetes_sd_configs.

While we're at it, we should revisit long-term storage and querying.

@solsson solsson left a comment

Requesting changes (I created the PR so I can't reject it formally).

Only core changes reviewed. I will look at the benchmark setup for long-term storage later.

- OpenMetricsText0.0.1
- PrometheusProto
- PrometheusText1.0.0
- PrometheusText0.0.4

Do we need all of these? I'd prefer we avoid legacy versions.

expr: >-
  sum(instance_cpu:node_cpu_top:rate5m) without (mode, cpu)
  /
  sum(rate(node_cpu_seconds_total[5m])) without (mode, cpu)

What's the community source for these rules?

metric_relabel_configs:
  - source_labels: [__name__]
    regex: kube_replicaset_status_observed_generation
    action: drop

We must do service discovery using conventions and labels. Make sure that ystack uses port names along with modern community standards for Prometheus discovery, then update the SD config so that it has no specific targets. I'm fine with more than one SD config as long as it's clear how a pod in any namespace can match it. ServiceMonitor is also sometimes a use case, so we need an example of that in ystack.
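
A minimal sketch of what convention-based discovery without specific targets could look like. It assumes the convention is a container port named `metrics`. That port name and the job name are illustrative assumptions, not something this PR defines. The `__meta_kubernetes_*` labels are the ones Prometheus attaches to targets discovered with `role: pod`.

```yaml
scrape_configs:
  - job_name: pods-by-port-name        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                      # no namespace restriction, so pods in any namespace are discovered
    relabel_configs:
      # Keep only targets whose declared container port is named "metrics" (assumed convention).
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: metrics
        action: keep
      # Carry namespace and pod name over as regular labels on the scraped series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

With a rule like this, a pod in any namespace opts in by naming its metrics port accordingly, and the scrape config itself lists no individual targets.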

Yolean macbot01 and others added 7 commits March 11, 2026 09:57
Remove all monitoring.coreos.com CRDs (Prometheus, Alertmanager,
ServiceMonitor, PodMonitor, PrometheusRule) and replace with plain
Kubernetes Deployments, ConfigMaps, and scrape config.

Prometheus now uses kubernetes_sd_configs for target discovery
instead of operator-managed ServiceMonitor/PodMonitor CRDs.
Recording rules moved from PrometheusRule CRD to a ConfigMap-mounted
rules file. A configmap-reload sidecar triggers /-/reload on changes.

Consolidates k3s/30-monitoring-operator + k3s/31-monitoring into
a single k3s/30-monitoring base. Updates converge and validate
scripts accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
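
The reload arrangement described in the commit message above could look roughly like the following. This is a sketch and not the manifests from the PR: resource names, mount paths and images are assumptions. It relies on two known behaviours: Prometheus only serves `/-/reload` when started with `--web.enable-lifecycle`, and the configmap-reload sidecar watches directories given with `--volume-dir` and calls `--webhook-url` when they change.

```yaml
# Sketch only: names, paths and images below are assumptions, not taken from the PR.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  selector:
    matchLabels: {app: prometheus}
  template:
    metadata:
      labels: {app: prometheus}
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus                 # tag intentionally left out
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --web.enable-lifecycle             # enables the /-/reload endpoint
          volumeMounts:
            - {name: config, mountPath: /etc/prometheus}
            - {name: rules, mountPath: /etc/prometheus-rules}
        - name: configmap-reload
          image: jimmidyson/configmap-reload     # assumed sidecar image, tag left out
          args:
            - --volume-dir=/etc/prometheus       # watch the scrape config
            - --volume-dir=/etc/prometheus-rules # watch the recording rules
            - --webhook-url=http://localhost:9090/-/reload
          volumeMounts:
            - {name: config, mountPath: /etc/prometheus, readOnly: true}
            - {name: rules, mountPath: /etc/prometheus-rules, readOnly: true}
      volumes:
        - name: config
          configMap: {name: prometheus-config}   # hypothetical ConfigMap holding prometheus.yml
        - name: rules
          configMap: {name: prometheus-rules}    # hypothetical ConfigMap holding the rules file
```

In this layout `prometheus.yml` would reference the mounted rules with a `rule_files` entry such as `/etc/prometheus-rules/*.yaml`.
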
v0.31.0 was not available in the container registry at experiment
time. Revert this commit to restore v0.31.0 once it is published.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deploy Thanos Receive (StatefulSet) + Query (Deployment) and
GreptimeDB standalone as competing remote_write backends for the
metrics-v2 experiment. Prometheus sends scraped metrics to both
via remote_write for side-by-side comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
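
Sending the same samples to both backends, as the commit message above describes, comes down to two `remote_write` entries in the Prometheus config. The sketch below assumes hypothetical in-cluster service names and a `monitoring` namespace. The ports and paths are the documented defaults for Thanos Receive (`19291`, `/api/v1/receive`) and GreptimeDB's Prometheus-compatible write endpoint (`4000`, `/v1/prometheus/write`), which may differ from what the PR actually configures.

```yaml
remote_write:
  # Thanos Receive (service name and namespace are assumptions)
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
  # GreptimeDB standalone (service name and namespace are assumptions; "public" is its default database)
  - url: http://greptimedb.monitoring.svc:4000/v1/prometheus/write?db=public
```
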
Thanos wins 8.35 vs 8.00 over GreptimeDB on weighted criteria:
query correctness, operational complexity, resource usage, maturity,
and storage cost projection. All PromQL queries returned consistent
results across all three backends. Documents deviations from the
original experiment plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both backends now write to versitygw object storage for storage cost
comparison. Adds bucket-create jobs and S3 configuration for each.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
WARNING comment included: these overrides should not be used in
production. Forces frequent block cuts so S3 uploads are visible
quickly during the metrics-v2 experiment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both backends now write to versitygw. GreptimeDB's columnar format
produces 5.6x less data (252 KB vs 1.4 MB) for the same metrics
workload. This flips the storage cost score and brings the weighted
totals to a near-tie (Thanos 8.05 vs GreptimeDB 8.30).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@solsson solsson force-pushed the metrics-v2-experiment branch from 27a6b73 to 7e3d067 March 11, 2026 08:57
for provisioners that use a fixed IP.

Use y-k8s-ingress-hosts -check before attempting -write, so provision
can complete without a TTY or sudo when entries already exist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@solsson solsson force-pushed the metrics-v2-experiment branch from 7191acd to 77af594 March 12, 2026 05:09