Skip to content

"impossible" results with graylog-plugin-metrics/graphite #7

@rbroemeling

Description

@rbroemeling

We have just started using graylog-plugin-metrics to relay metrics to our graphite instance, and we're finding extremely curious results, like this (the times on this are at GMT-0600, so 10:00 on the graph is 16:00 GMT):

request_time_vs_upstream_response_time

The graph is built from graylog absorption of nginx access.log entries, which have request_time and upstream_response_time fields. Since the upstream response time must be completely contained within the request time, request_time should be strictly greater than upstream_response_time (it is not possible for a request to have a greater request_time than an upstream_response_time).

Note that it is explicitly possible for a request to not have an upstream_response_time (if it is nginx serving a static file), but we've explicitly adjusted for that -- the graylog stream that we are dealing with restricts events to only those that include an upstream_response_time.

I've validated that within graylog itself, the reality holds and that upstream_response_time is always less than request_time:

From 16:07 to 16:08, mean request_time is 193.43 and mean upstream_response_time is 175.78.
From 16:08 to 16:09, mean request_time is 190.33 and mean upstream_response_time is 163.96.
From 16:09 to 16:10, mean request_time is 219.43 and mean upstream_response_time is 183.53.
From 16:10 to 16:11, mean request_time is 215.08 and mean upstream_response_time is 186.41.

... thus, the data in graylog looks good. The request_time and upstream_response_time fields are sent (as histograms, via graylog-plugin-metrics) to graphite, and result in the following graphite data, however:

$ whisper-fetch.py --from=1458230820 --until=1458231060 request_time/mean.wsp 
1458230880  192.900000
1458230940  152.620000
1458231000  221.990000
1458231060  206.160000

$ whisper-fetch.py --from=1458230820 --until=1458231060 upstream_response_time/mean.wsp 
1458230880  171.720000
1458230940  223.010000
1458231000  202.560000
1458231060  191.670000

As you can see, the data for 1458230940 (16:09 GMT) shows a significantly higher mean upstream_response_time than request_time. This should not be possible.

As the data in graylog appears correct, I can only assume that the metrics plugin is applying a transform to the data as it builds the necessary histogram, and that this is leading to incorrect output being sent to graphite.

Please let me know if there is further debugging that I can do on this front to help aid in diagnosing this issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions