Verify statistics math in nodestats.py #3

@xaur

Node stats that we post in DJ are computed by the nodestats.py program.

Below are the important steps, samples of real data, and some points of concern I would like to verify, but feel free to check everything else too.

The high-level algorithm is:

  • Run an SQL query against dcr.farm. The result is a list of (timestamp, node count) pairs for each user agent. In other words, a month of dcr.farm's node crawling data is grouped by day and by user agent.
  • For each user agent, calculate the mean of its daily node counts.
  • Arrange user agents and their stats in groups.
  • Calculate the node % share for each group.

To demonstrate how grouping works, I will use data for Nov 2020 and the group of user agents called "dcrd v1.6 dev builds", which includes these 5 UAs:

  • /dcrwire:0.4.0/dcrd:1.6.0(pre)/
  • /dcrwire:0.4.0/dcrd:1.6.0(rc1)/
  • /dcrwire:0.4.0/dcrd:1.6.0(rc2)/
  • /dcrwire:0.4.0/dcrd:1.6.0(rc3)/
  • /dcrwire:0.4.0/dcrd:1.6.0/

The code is about 60 lines long and starts at line 67.

1. We start by getting the data from dcr.farm with the following SQL query on line 53:

https://charts.dcr.farm/api/datasources/proxy/1/query?db=decred&q=
SELECT count(distinct("addr")) FROM "peers"
 WHERE time >= {start_ms}ms and time < {end_ms}ms
 GROUP BY time(1d), "useragent_tag" fill(none)

Here I wonder whether the query is correct, especially the grouping into 1-day windows.

I could not find the docs describing these functions, but something close is here.
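One thing that can be checked independently of the query language is the time window itself. The sample output below shows daily buckets stamped at 00:00:00Z, so a window that starts and ends at UTC midnight should cover whole buckets only. Here is a minimal sketch of computing such bounds for {start_ms} and {end_ms}. The helper name month_window_ms is hypothetical, and how nodestats.py actually derives these values is not shown in this issue.

```python
from datetime import datetime, timezone
from typing import Tuple

MS_PER_DAY = 86_400_000


def month_window_ms(year: int, month: int) -> Tuple[int, int]:
    """Return (start_ms, end_ms) for a calendar month in UTC; end is exclusive.

    Hypothetical helper, not taken from nodestats.py.
    """
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # First day of the following month; December rolls over to January.
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


start_ms, end_ms = month_window_ms(2020, 11)
days = (end_ms - start_ms) // MS_PER_DAY
print(start_ms, end_ms, days)
```

With bounds like these, `time >= start AND time < end` spans exactly 30 daily buckets for Nov 2020. If the real bounds are computed in local time instead, the first and last buckets could be partial.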

That query returns data like the following (showing 2 of the 5 UAs):

{
  "results": [
    {
      "series": [
        {
          "columns": [ "time", "count" ],
          "name": "peers",
          "tags": {
            "useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(pre)/"
          },
          "values": [
            [
              "2020-11-01T00:00:00Z",
              2
            ], [
              "2020-11-02T00:00:00Z",
              4
            ], [
              "2020-11-03T00:00:00Z",
              2
            ], [
              "2020-11-04T00:00:00Z",
              6
            ], // ...
          ]
        }, {
          "columns": [ "time", "count" ],
          "name": "peers",
          "tags": {
            "useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(rc1)/"
          },
          "values": [
            [
              "2020-11-01T00:00:00Z",
              5
            ], [
              "2020-11-02T00:00:00Z",
              6
            ], [
              "2020-11-03T00:00:00Z",
              5
            ], [
              "2020-11-04T00:00:00Z",
              2
            ], // ...
          ]
        }, // ...
      ],
      "statement_id": 0
    }
  ]
}
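To make the shape of this response easier to inspect, here is a small sketch that turns it into a per-UA table keyed by date. The dict below is a trimmed copy of the sample above (2 UAs, first 4 days each), and counts_by_day is a hypothetical helper, not part of nodestats.py.

```python
from datetime import date, datetime

# Trimmed copy of the sample response above: 2 UAs, first 4 days each.
dcrfarm_data = {
    "results": [{
        "series": [
            {
                "columns": ["time", "count"],
                "name": "peers",
                "tags": {"useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(pre)/"},
                "values": [
                    ["2020-11-01T00:00:00Z", 2],
                    ["2020-11-02T00:00:00Z", 4],
                    ["2020-11-03T00:00:00Z", 2],
                    ["2020-11-04T00:00:00Z", 6],
                ],
            },
            {
                "columns": ["time", "count"],
                "name": "peers",
                "tags": {"useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(rc1)/"},
                "values": [
                    ["2020-11-01T00:00:00Z", 5],
                    ["2020-11-02T00:00:00Z", 6],
                    ["2020-11-03T00:00:00Z", 5],
                    ["2020-11-04T00:00:00Z", 2],
                ],
            },
        ],
        "statement_id": 0,
    }]
}


def counts_by_day(data):
    """Return {user agent: {date: distinct address count}} (hypothetical helper)."""
    table = {}
    for series in data["results"][0]["series"]:
        ua = series["tags"]["useragent_tag"]
        table[ua] = {
            datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").date(): count
            for ts, count in series["values"]
        }
    return table


table = counts_by_day(dcrfarm_data)
for ua, days in table.items():
    print(ua, len(days), "days,", sum(days.values()), "addresses in total")
```

Keying by date rather than by position makes it easy to see which days are present for each UA, which matters later when series have different lengths.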

2. For each user agent we extract the daily counts of distinct addrs and use those to calculate a mean for each UA. We also accumulate the sum of all means:

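import operator
import statistics

get_count = operator.itemgetter(1)
daily_mean_sum = 0  # running sum of per-UA means; must be initialized before the loop

for series in dcrfarm_data["results"][0]["series"]:
    ua = series["tags"]["useragent_tag"]
    # get daily counts of distinct addrs for each UA
    daily_counts = list(map(get_count, series["values"]))
    # mean over the days present in the response for this UA
    daily_mean = statistics.mean(daily_counts)
    daily_mean_sum += daily_mean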

Below are the values for the UAs in our group:

  • /dcrwire:0.4.0/dcrd:1.6.0(pre)/

    • daily_counts: [2, 4, 2, 6, 9, 5, 2, 4, 7, 5, 1, 2, 1, 5, 5, 5, 6, 7, 1, 10, 14, 3, 2, 3, 5, 4, 7, 6, 10]
    • daily_mean: 4.93
  • /dcrwire:0.4.0/dcrd:1.6.0(rc1)/

    • daily_counts: [5, 6, 5, 2, 2, 3, 2, 1, 5, 4, 2, 2, 3, 6, 4, 6, 2, 2, 3, 3, 2, 5, 4, 4, 2, 1, 1, 1, 5]
    • daily_mean: 3.21
  • /dcrwire:0.4.0/dcrd:1.6.0(rc2)/

    • daily_counts: [6, 5, 3, 8, 11, 8, 5, 2, 6, 2, 7, 14, 9, 11, 8, 7, 10, 7, 12, 6, 4, 9, 3, 5, 4, 4]
    • daily_mean: 6.77
  • /dcrwire:0.4.0/dcrd:1.6.0(rc3)/

    • daily_counts: [1, 3, 7, 6, 3, 4, 10, 6, 2, 4, 7, 8, 8, 9]
    • daily_mean: 5.57
  • /dcrwire:0.4.0/dcrd:1.6.0/

    • daily_counts: [3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 6, 7, 6, 9, 6, 7, 7, 8, 2, 5, 6, 6, 10, 8, 2, 4]
    • daily_mean: 4.42

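And the daily_mean_sum (accumulated over all user agents in the response, not just this group) is 196.94.

(It looks like rc3 is missing about half of the month's data points: it has only 14 daily values.)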
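The per-UA means can be re-derived directly from the daily_counts listed above. The sketch below recomputes them and also prints how many daily values each UA has. The data is copied from this issue; the dict layout is just for illustration.

```python
import statistics

# daily_counts copied from the listing above (Nov 2020).
daily_counts = {
    "/dcrwire:0.4.0/dcrd:1.6.0(pre)/": [
        2, 4, 2, 6, 9, 5, 2, 4, 7, 5, 1, 2, 1, 5, 5, 5, 6, 7, 1, 10,
        14, 3, 2, 3, 5, 4, 7, 6, 10],
    "/dcrwire:0.4.0/dcrd:1.6.0(rc1)/": [
        5, 6, 5, 2, 2, 3, 2, 1, 5, 4, 2, 2, 3, 6, 4, 6, 2, 2, 3, 3,
        2, 5, 4, 4, 2, 1, 1, 1, 5],
    "/dcrwire:0.4.0/dcrd:1.6.0(rc2)/": [
        6, 5, 3, 8, 11, 8, 5, 2, 6, 2, 7, 14, 9, 11, 8, 7, 10, 7, 12, 6,
        4, 9, 3, 5, 4, 4],
    "/dcrwire:0.4.0/dcrd:1.6.0(rc3)/": [
        1, 3, 7, 6, 3, 4, 10, 6, 2, 4, 7, 8, 8, 9],
    "/dcrwire:0.4.0/dcrd:1.6.0/": [
        3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 6, 7, 6, 9, 6, 7, 7, 8, 2, 5,
        6, 6, 10, 8, 2, 4],
}

# Number of daily values and mean for each UA.
n_days = {ua: len(counts) for ua, counts in daily_counts.items()}
means = {ua: statistics.mean(counts) for ua, counts in daily_counts.items()}

for ua in daily_counts:
    print(f"{ua:34} days={n_days[ua]:2} mean={means[ua]:.2f}")

# Sum of the five means, i.e. the group's contribution to daily_mean_sum.
group_mean_sum = sum(means.values())
print(f"group mean sum: {group_mean_sum:.2f}")
```

The means match the values above, and the series lengths are 29, 29, 26, 14 and 26, so none of the five UAs has a value for all 30 days of November.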

My concerns here are:

  • Is it too lossy to take a mean over daily numbers that are already aggregated?
  • Should a mean of daily values be called a "daily mean" or a "monthly mean"?
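One concrete effect worth looking at: the query uses fill(none), so days without data are absent from a series, and statistics.mean divides by the number of days present rather than by the length of the month. A minimal sketch with the rc3 numbers from above shows how much the denominator matters. Whether missing days should count as zero is an assumption: it is reasonable only if the crawler ran on those days and simply saw no such nodes.

```python
# rc3 daily_counts copied from the listing above: 14 values in a 30-day month.
rc3_counts = [1, 3, 7, 6, 3, 4, 10, 6, 2, 4, 7, 8, 8, 9]
DAYS_IN_MONTH = 30  # November 2020

# What statistics.mean computes: average over the days that have data.
mean_observed = sum(rc3_counts) / len(rc3_counts)

# Alternative (assumption): treat every missing day as a count of 0.
mean_calendar = sum(rc3_counts) / DAYS_IN_MONTH

print(f"mean over observed days: {mean_observed:.2f}")
print(f"mean over calendar days: {mean_calendar:.2f}")
```

For rc3 this gives about 5.57 versus 2.60. The first reads as "average node count on days when rc3 was seen", the second as "average node count per day of the month", and the resulting shares differ accordingly.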

3. We use the computed daily_mean values and daily_mean_sum to derive daily ratios:

for us in ua_stats:
    ua, daily_mean = us  # unpack
    # share of this UA's mean in the sum of all UA means
    daily_ratio = daily_mean / daily_mean_sum
    us.append(daily_ratio)

This gives us the following values:

/dcrwire:0.4.0/dcrd:1.6.0(pre)/ : 0.0250
/dcrwire:0.4.0/dcrd:1.6.0(rc1)/ : 0.0163
/dcrwire:0.4.0/dcrd:1.6.0(rc2)/ : 0.0344
/dcrwire:0.4.0/dcrd:1.6.0(rc3)/ : 0.0283
/dcrwire:0.4.0/dcrd:1.6.0/      : 0.0225

These are meant to be the shares of individual UAs, expressed as fractions (0.0250 is 2.50%). They are not used yet but may be in the future, so they are worth verifying.

Here I wonder whether this is a correct procedure for obtaining an "average monthly percentage" for each user agent.
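For comparison, there are at least two ways to define such a share, and they do not agree in general. The code in step 3 takes a ratio of means. An alternative is to compute each day's share first and then average those shares. The sketch below uses made-up counts for two hypothetical user agents (ua_a and ua_b are not real data) purely to show the difference, and it assumes both series cover the same days.

```python
from statistics import mean

# Hypothetical daily counts for two UAs over the same two days (not real data).
counts = {"ua_a": [10, 10], "ua_b": [10, 30]}

# Method from step 3: each UA's mean divided by the sum of all UA means.
means = {ua: mean(c) for ua, c in counts.items()}
mean_sum = sum(means.values())
ratio_of_means = {ua: m / mean_sum for ua, m in means.items()}

# Alternative: each day's share of that day's total, averaged over days.
day_totals = [sum(day) for day in zip(*counts.values())]
mean_of_ratios = {
    ua: mean([n / total for n, total in zip(c, day_totals)])
    for ua, c in counts.items()
}

for ua in counts:
    print(f"{ua}: ratio of means={ratio_of_means[ua]:.4f}, "
          f"mean of daily ratios={mean_of_ratios[ua]:.4f}")
```

Here ua_a gets 0.3333 under the first definition and 0.3750 under the second. Both sets of shares sum to 1, so neither is obviously wrong; they weight days differently, and which one deserves the name "average monthly percentage" is the open question.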

4. We calculate similar ratios for groups (with the help of an auxiliary lookup structure):

get_mean = operator.itemgetter(1)

for gname, gstats in grouped_stats.items():
    # get daily means of the group
    gdaily_means = list(map(get_mean, gstats))
    gdaily_mean_sum = sum(gdaily_means)
    gdaily_ratio = gdaily_mean_sum / daily_mean_sum
    tracked_ratio += gdaily_ratio
    group_stats.append(GroupStats(gname, gdaily_mean_sum, gdaily_ratio, gstats))

The loop does one iteration for each group. tracked_ratio is accumulated so that we can later calculate its complement, untracked_ratio, and use that for "Other X%".

One iteration for our v1.6 group gives:

gdaily_means: [4.93, 3.21, 6.77, 5.57, 4.42]
gdaily_mean_sum: 24.9
gdaily_ratio: 24.9 / 196.94 = 0.1264

That last value is what we round and print as the final result: "13% dcrd v1.6 dev builds".

Same concern here: is it fair to obtain a group's share this way?
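As an arithmetic check of this last step, the sketch below reproduces the group numbers from the rounded values quoted above. It also illustrates that, with this method, a group's ratio is simply the sum of its members' individual ratios from step 3, because both share the same denominator.

```python
# Rounded values quoted in this issue.
gdaily_means = [4.93, 3.21, 6.77, 5.57, 4.42]
daily_mean_sum = 196.94  # sum of means over all UAs in the response

gdaily_mean_sum = sum(gdaily_means)
gdaily_ratio = gdaily_mean_sum / daily_mean_sum

# Individual UA ratios as computed in step 3.
ua_ratios = [m / daily_mean_sum for m in gdaily_means]

label = f"{gdaily_ratio:.0%} dcrd v1.6 dev builds"
print(f"gdaily_mean_sum = {gdaily_mean_sum:.2f}")
print(f"gdaily_ratio    = {gdaily_ratio:.4f}")
print(f"sum(ua_ratios)  = {sum(ua_ratios):.4f}")
print(label)
```

So the arithmetic is internally consistent. Whether the result is a fair share depends on the earlier questions: which days each UA's mean is taken over, and which definition of share is intended.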
