@lewismc lewismc commented Dec 12, 2025

PR for NUTCH-3134. This PR introduces a new class, LatencyTracker, which tracks latency metrics. The implementation wraps the TDigest data structure to collect latency samples and emits Hadoop counters for count, sum, and percentile values (p50, p95, p99). For now this is limited to the Fetcher, Parser, and Indexer jobs, but it could certainly be extended to other jobs in the future.
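To make the idea concrete, here is a minimal sketch of a tracker that collects samples and produces count, sum, and percentile values. It is not the LatencyTracker from this PR: the class name `LatencySketch` and its methods are hypothetical, it keeps every sample and computes exact nearest-rank percentiles instead of using a TDigest, and it returns a plain map rather than writing Hadoop counters. The counter names follow the pattern reported later in this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for a latency tracker (assumption, not the PR's code).
 * Stores all samples and computes exact nearest-rank percentiles, whereas the
 * PR approximates percentiles with a TDigest in bounded memory.
 */
class LatencySketch {
  private final String prefix;
  private final List<Long> samplesMs = new ArrayList<>();
  private long sumMs = 0;

  LatencySketch(String prefix) {
    this.prefix = prefix;
  }

  /** Records one latency sample in milliseconds. */
  void record(long latencyMs) {
    samplesMs.add(latencyMs);
    sumMs += latencyMs;
  }

  /** Nearest-rank percentile over the recorded samples; 0 if there are none. */
  long percentile(double p) {
    if (samplesMs.isEmpty()) {
      return 0;
    }
    List<Long> sorted = new ArrayList<>(samplesMs);
    Collections.sort(sorted);
    int rank = (int) Math.ceil(p / 100.0 * sorted.size());
    return sorted.get(Math.max(rank, 1) - 1);
  }

  /** Builds the values that would be emitted as counters, keyed by name. */
  Map<String, Long> counters() {
    Map<String, Long> out = new LinkedHashMap<>();
    out.put(prefix + "_latency_count_total", (long) samplesMs.size());
    out.put(prefix + "_latency_sum_ms", sumMs);
    out.put(prefix + "_latency_p50_ms", percentile(50));
    out.put(prefix + "_latency_p95_ms", percentile(95));
    out.put(prefix + "_latency_p99_ms", percentile(99));
    return out;
  }

  public static void main(String[] args) {
    LatencySketch tracker = new LatencySketch("fetch");
    for (long ms : new long[] {12, 9, 30, 15, 11}) {
      tracker.record(ms);
    }
    tracker.counters().forEach((name, value) -> System.out.println(name + "=" + value));
  }
}
```

Keeping every sample is fine for a sketch but grows with the number of records, which is presumably why a digest structure is the better fit inside a long-running job.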

One note for reviewers: please sanity check that

  1. latency start and stop boundaries are accurate.
  2. counters are emitted at the correct times.
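For reviewers thinking about the first point, one common way to keep start and stop boundaries tight is to take the start time immediately before the operation and record the elapsed time in a `finally` block. The sketch below only illustrates that pattern and is not the code in this PR: `TimedCall`, `timed`, and the sink parameter are hypothetical names. It also records failed operations, which is a design choice the PR may make differently.

```java
import java.util.concurrent.Callable;
import java.util.function.LongConsumer;

/** Hypothetical helper (assumption, not from the PR) that times one operation. */
class TimedCall {

  /** Runs the task and reports elapsed milliseconds to the sink, even if the task throws. */
  static <T> T timed(Callable<T> task, LongConsumer sinkMs) throws Exception {
    long start = System.nanoTime(); // monotonic clock, unaffected by wall-clock changes
    try {
      return task.call();
    } finally {
      sinkMs.accept((System.nanoTime() - start) / 1_000_000L);
    }
  }

  public static void main(String[] args) throws Exception {
    long[] recorded = new long[1];
    String result = timed(() -> {
      Thread.sleep(20); // stands in for a fetch, parse, or index operation
      return "done";
    }, ms -> recorded[0] = ms);
    System.out.println(result + " took about " + recorded[0] + " ms");
  }
}
```

With this shape, the second point reduces to calling the counter-emitting code once, after the last sample has been recorded.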

Thanks for any review. Local testing looks good. My next step will be to share my WIP Nutch observability solution via user@.

@lewismc lewismc self-assigned this Dec 12, 2025

@sebastian-nagel sebastian-nagel left a comment

Looks good! And works.

Counters from a small local test crawl (content served from a local Apache httpd):

fetch_latency_count_total=27
fetch_latency_p50_ms=13
fetch_latency_p95_ms=28
fetch_latency_p99_ms=61
fetch_latency_sum_ms=421

@lewismc lewismc merged commit ca2591e into apache:master Dec 18, 2025
6 checks passed
@lewismc lewismc deleted the NUTCH-3134 branch December 18, 2025 03:28