Verify statistics math in nodestats.py #3

Node stats that we post in DJ are computed by the [nodestats.py](https://github.com/decredcommunity/network-stats/blob/master/nodes/nodestats.py) program.

Below are the important steps, samples of real data, and some points of concern I would like to verify, but feel free to check everything else too.

The high-level algorithm is:

- run an SQL query against dcr.farm. The result has a list of pairs (timestamp, node count) for each user agent. In other words, a month of dcr.farm's node crawling data is grouped by day and by user agent.
- for each user agent, calculate a mean of daily node counts
- arrange user agents and their stats in groups
- calculate node % share for each group

To demonstrate how grouping works I will use data for Nov 2020 and the group of user agents called "dcrd v1.6 dev builds" which includes these 5 UAs:

- /dcrwire:0.4.0/dcrd:1.6.0(pre)/
- /dcrwire:0.4.0/dcrd:1.6.0(rc1)/
- /dcrwire:0.4.0/dcrd:1.6.0(rc2)/
- /dcrwire:0.4.0/dcrd:1.6.0(rc3)/
- /dcrwire:0.4.0/dcrd:1.6.0/

The code is about 60 lines long and starts at [line 67](https://github.com/decredcommunity/network-stats/blob/master/nodes/nodestats.py#L67).

1\. we start by getting the data from dcr.farm with the following SQL query on [line 53](https://github.com/decredcommunity/network-stats/blob/master/nodes/nodestats.py#L53):

```
https://charts.dcr.farm/api/datasources/proxy/1/query?db=decred&q=
SELECT count(distinct("addr")) FROM "peers"
 WHERE time >= {start_ms}ms and time < {end_ms}ms
 GROUP BY time(1d), "useragent_tag" fill(none)
```

Here I wonder whether the query is correct, especially the grouping into 1-day windows.

I could not find the docs describing these functions, but something close is [here](https://grafana.com/docs/grafana/latest/variables/variable-types/add-interval-variable/).
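For reference, the `{start_ms}` and `{end_ms}` placeholders can be derived from a calendar month roughly as follows. This is a minimal sketch, not the code from nodestats.py: the helper name `month_bounds_ms` is hypothetical, and it assumes the month is bounded at UTC midnight, which is consistent with the `T00:00:00Z` timestamps in the sample response.

```python
from datetime import datetime, timezone


def month_bounds_ms(year, month):
    """Return (start_ms, end_ms) for a calendar month in UTC.

    start is inclusive and end is exclusive, matching
    `time >= {start_ms}ms and time < {end_ms}ms` in the query.
    """
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # First day of the following month (roll over to January after December).
    if month == 12:
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


start_ms, end_ms = month_bounds_ms(2020, 11)
days_in_window = (end_ms - start_ms) // (24 * 60 * 60 * 1000)
print(start_ms, end_ms, days_in_window)
```

With these bounds the window for Nov 2020 covers 30 whole days, so a UA that was seen every day should have 30 daily values. A shorter `values` list means some 1-day windows produced no rows for that UA.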

That query gives us data like (example for 2 UAs out of 5):

```javascript
{
  "results": [
    {
      "series": [
        {
          "columns": [ "time", "count" ],
          "name": "peers",
          "tags": {
            "useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(pre)/"
          },
          "values": [
            [
              "2020-11-01T00:00:00Z",
              2
            ], [
              "2020-11-02T00:00:00Z",
              4
            ], [
              "2020-11-03T00:00:00Z",
              2
            ], [
              "2020-11-04T00:00:00Z",
              6
            ], // ...
          ]
        }, {
          "columns": [ "time", "count" ],
          "name": "peers",
          "tags": {
            "useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(rc1)/"
          },
          "values": [
            [
              "2020-11-01T00:00:00Z",
              5
            ], [
              "2020-11-02T00:00:00Z",
              6
            ], [
              "2020-11-03T00:00:00Z",
              5
            ], [
              "2020-11-04T00:00:00Z",
              2
            ], // ...
          ]
        }, // ...
      ],
      "statement_id": 0
    }
  ]
}
```

2\. for each user agent we extract daily counts of distinct addrs, and use those to calculate a mean for each UA. We also accumulate the sum of all means:

```python
get_count = operator.itemgetter(1)

for series in dcrfarm_data["results"][0]["series"]:
    ua = series["tags"]["useragent_tag"]
    # get daily counts of distinct addrs for each UA
    daily_counts = list(map(get_count, series["values"]))
    daily_mean = statistics.mean(daily_counts)
    daily_mean_sum += daily_mean
```
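To make this step easy to run in isolation, here is a self-contained sketch of the same loop. The input is only the first four days of two UAs copied from the sample response, so the means differ from the monthly ones. The initial value of `daily_mean_sum` and the `[ua, daily_mean]` shape of `ua_stats` are my assumptions, inferred from how step 3 unpacks them.

```python
import operator
import statistics

# First four days for two of the five UAs, copied from the sample response.
# The real series are longer, so these means are NOT the monthly means.
dcrfarm_data = {
    "results": [{
        "series": [
            {"tags": {"useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(pre)/"},
             "values": [["2020-11-01T00:00:00Z", 2], ["2020-11-02T00:00:00Z", 4],
                        ["2020-11-03T00:00:00Z", 2], ["2020-11-04T00:00:00Z", 6]]},
            {"tags": {"useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(rc1)/"},
             "values": [["2020-11-01T00:00:00Z", 5], ["2020-11-02T00:00:00Z", 6],
                        ["2020-11-03T00:00:00Z", 5], ["2020-11-04T00:00:00Z", 2]]},
        ],
    }],
}

get_count = operator.itemgetter(1)

ua_stats = []         # assumed shape: one [ua, daily_mean] list per user agent
daily_mean_sum = 0.0  # assumed to start at zero before the loop

for series in dcrfarm_data["results"][0]["series"]:
    ua = series["tags"]["useragent_tag"]
    # second element of each [timestamp, count] pair
    daily_counts = list(map(get_count, series["values"]))
    daily_mean = statistics.mean(daily_counts)
    daily_mean_sum += daily_mean
    ua_stats.append([ua, daily_mean])

print(ua_stats, daily_mean_sum)
```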

Below are the values for the UAs in our group:

- /dcrwire:0.4.0/dcrd:1.6.0(pre)/

  - daily_counts: [2, 4, 2, 6, 9, 5, 2, 4, 7, 5, 1, 2, 1, 5, 5, 5, 6, 7, 1, 10, 14, 3, 2, 3, 5, 4, 7, 6, 10]
  - daily_mean: 4.93

- /dcrwire:0.4.0/dcrd:1.6.0(rc1)/

  - daily_counts: [5, 6, 5, 2, 2, 3, 2, 1, 5, 4, 2, 2, 3, 6, 4, 6, 2, 2, 3, 3, 2, 5, 4, 4, 2, 1, 1, 1, 5]
  - daily_mean: 3.21

- /dcrwire:0.4.0/dcrd:1.6.0(rc2)/

  - daily_counts: [6, 5, 3, 8, 11, 8, 5, 2, 6, 2, 7, 14, 9, 11, 8, 7, 10, 7, 12, 6, 4, 9, 3, 5, 4, 4]
  - daily_mean: 6.77

- /dcrwire:0.4.0/dcrd:1.6.0(rc3)/

  - daily_counts: [1, 3, 7, 6, 3, 4, 10, 6, 2, 4, 7, 8, 8, 9]
  - daily_mean: 5.57

- /dcrwire:0.4.0/dcrd:1.6.0/

  - daily_counts: [3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 6, 7, 6, 9, 6, 7, 7, 8, 2, 5, 6, 6, 10, 8, 2, 4]
  - daily_mean: 4.42

And the `daily_mean_sum`, accumulated over all user agents in the response (not only these 5), is 196.94.

(It looks like rc3 is missing about half of the month's datapoints: it has only 14 daily values.)
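One way to see how much the missing datapoints matter is to compare the mean over observed days with a mean over all 30 days of November. The sketch below uses the daily counts listed above. Treating a missing day as zero nodes is an assumption (a gap could also be a crawler outage), and `DAYS_IN_MONTH`, `summary` and the other names are mine, not from nodestats.py.

```python
import statistics

DAYS_IN_MONTH = 30  # November 2020

# Daily counts copied from the list above.
daily_counts = {
    "/dcrwire:0.4.0/dcrd:1.6.0(pre)/": [2, 4, 2, 6, 9, 5, 2, 4, 7, 5, 1, 2, 1, 5, 5, 5, 6, 7, 1, 10, 14, 3, 2, 3, 5, 4, 7, 6, 10],
    "/dcrwire:0.4.0/dcrd:1.6.0(rc1)/": [5, 6, 5, 2, 2, 3, 2, 1, 5, 4, 2, 2, 3, 6, 4, 6, 2, 2, 3, 3, 2, 5, 4, 4, 2, 1, 1, 1, 5],
    "/dcrwire:0.4.0/dcrd:1.6.0(rc2)/": [6, 5, 3, 8, 11, 8, 5, 2, 6, 2, 7, 14, 9, 11, 8, 7, 10, 7, 12, 6, 4, 9, 3, 5, 4, 4],
    "/dcrwire:0.4.0/dcrd:1.6.0(rc3)/": [1, 3, 7, 6, 3, 4, 10, 6, 2, 4, 7, 8, 8, 9],
    "/dcrwire:0.4.0/dcrd:1.6.0/": [3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 6, 7, 6, 9, 6, 7, 7, 8, 2, 5, 6, 6, 10, 8, 2, 4],
}

summary = {}
for ua, counts in daily_counts.items():
    # What step 2 computes: mean over the days that have a datapoint.
    observed_mean = statistics.mean(counts)
    # Assumption: days absent from the response had zero nodes.
    zero_filled_mean = sum(counts) / DAYS_IN_MONTH
    summary[ua] = (len(counts), observed_mean, zero_filled_mean)
    print(f"{ua:32} days={len(counts):2} "
          f"observed={observed_mean:.2f} zero-filled={zero_filled_mean:.2f}")
```

Under that assumption rc3 drops from 5.57 to 2.60, while UAs with 29 datapoints barely change. So UAs that were present for only part of the month are the ones most affected by averaging over observed days only.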

My concerns here are:

- is it too destructive to take a mean over daily numbers that are already aggregated?
- should we call a mean of daily values a "daily mean" or a "monthly mean"?

3\. we use the computed `daily_mean` values and `daily_mean_sum` to derive daily ratios:

```python
for us in ua_stats:
    ua, daily_mean = us # unpack
    daily_ratio = daily_mean / daily_mean_sum
    us.append(daily_ratio)
```

This gives us the following values:

```
/dcrwire:0.4.0/dcrd:1.6.0(pre)/ : 0.0250
/dcrwire:0.4.0/dcrd:1.6.0(rc1)/ : 0.0163
/dcrwire:0.4.0/dcrd:1.6.0(rc2)/ : 0.0344
/dcrwire:0.4.0/dcrd:1.6.0(rc3)/ : 0.0283
/dcrwire:0.4.0/dcrd:1.6.0/      : 0.0225
```

These are supposed to be the shares of individual UAs, expressed as fractions (0.0250 is 2.5%). They are not used yet but may be used in the future, so they are worth verifying.
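The five values can be reproduced from the daily counts in step 2. This is only an arithmetic check, not the nodestats.py code: `DAILY_MEAN_SUM` is the rounded 196.94 quoted in step 2 rather than a value recomputed from all UAs, and the variable names are mine.

```python
import statistics

DAILY_MEAN_SUM = 196.94  # sum of means over all UAs, as quoted in step 2 (rounded)

# Daily counts copied from step 2.
daily_counts = {
    "/dcrwire:0.4.0/dcrd:1.6.0(pre)/": [2, 4, 2, 6, 9, 5, 2, 4, 7, 5, 1, 2, 1, 5, 5, 5, 6, 7, 1, 10, 14, 3, 2, 3, 5, 4, 7, 6, 10],
    "/dcrwire:0.4.0/dcrd:1.6.0(rc1)/": [5, 6, 5, 2, 2, 3, 2, 1, 5, 4, 2, 2, 3, 6, 4, 6, 2, 2, 3, 3, 2, 5, 4, 4, 2, 1, 1, 1, 5],
    "/dcrwire:0.4.0/dcrd:1.6.0(rc2)/": [6, 5, 3, 8, 11, 8, 5, 2, 6, 2, 7, 14, 9, 11, 8, 7, 10, 7, 12, 6, 4, 9, 3, 5, 4, 4],
    "/dcrwire:0.4.0/dcrd:1.6.0(rc3)/": [1, 3, 7, 6, 3, 4, 10, 6, 2, 4, 7, 8, 8, 9],
    "/dcrwire:0.4.0/dcrd:1.6.0/": [3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 6, 7, 6, 9, 6, 7, 7, 8, 2, 5, 6, 6, 10, 8, 2, 4],
}

# Same formula as in the loop above: unrounded mean divided by the sum of means.
daily_ratios = {
    ua: statistics.mean(counts) / DAILY_MEAN_SUM
    for ua, counts in daily_counts.items()
}

for ua, ratio in daily_ratios.items():
    print(f"{ua:32}: {ratio:.4f}")
```

One detail to keep in mind when checking by hand: dividing the rounded mean 4.42 by 196.94 gives 0.0224, not 0.0225, so the listed ratios only match when the unrounded means are used.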

Here I wonder whether this procedure is a correct way to obtain the "average monthly percentage" for each user agent.
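To make the question concrete: "average monthly percentage" could mean at least two things, and they do not always agree. The sketch below compares the ratio of means (what step 3 does) with the mean of daily ratios. The data is a hypothetical toy example with two made-up user agents, not dcr.farm data, and all names are mine.

```python
import statistics

# Hypothetical toy data: node counts per day for two made-up user agents.
toy_counts = {
    "ua-a": [1, 9],
    "ua-b": [1, 1],
}
n_days = 2

# Method 1 (as in step 3): mean per UA, divided by the sum of all means.
means = {ua: statistics.mean(counts) for ua, counts in toy_counts.items()}
mean_sum = sum(means.values())
ratio_of_means = {ua: m / mean_sum for ua, m in means.items()}

# Method 2: compute each day's share first, then average the daily shares.
daily_totals = [sum(counts[d] for counts in toy_counts.values())
                for d in range(n_days)]
mean_of_ratios = {
    ua: statistics.mean([counts[d] / daily_totals[d] for d in range(n_days)])
    for ua, counts in toy_counts.items()
}

for ua in toy_counts:
    print(f"{ua}: ratio of means={ratio_of_means[ua]:.4f} "
          f"mean of daily ratios={mean_of_ratios[ua]:.4f}")
```

Here "ua-a" gets 0.8333 with method 1 and 0.7000 with method 2. When every UA has a value for every day, method 1 equals the UA's share of total node-days, so days with more nodes weigh more. Method 2 weighs every day equally. Which one is "correct" depends on what the published number is meant to say.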

4\. we calculate similar ratios for _groups_ (with the help of an auxiliary lookup structure):

```python
get_mean = operator.itemgetter(1)

for gname, gstats in grouped_stats.items():
    # get daily means of the group
    gdaily_means = list(map(get_mean, gstats))
    gdaily_mean_sum = sum(gdaily_means)
    gdaily_ratio = gdaily_mean_sum / daily_mean_sum
    tracked_ratio += gdaily_ratio
    group_stats.append(GroupStats(gname, gdaily_mean_sum, gdaily_ratio, gstats))
```

The loop does one iteration per group. `tracked_ratio` is accumulated so that the complement, `untracked_ratio`, can be calculated later and used for "Other X%".

One iteration for our v1.6 group gives:

```
gdaily_means: [4.93, 3.21, 6.77, 5.57, 4.42]
gdaily_mean_sum: 24.9
gdaily_ratio: 24.9 / 196.94 = 0.1264
```

That last value is what we round and print as the final result: "13% dcrd v1.6 dev builds".
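The arithmetic of this iteration can be checked with a small standalone sketch. It feeds the loop the rounded means from step 2 and the quoted `daily_mean_sum` of 196.94. The `GroupStats` field names and the `:.0%` formatting are my assumptions; only the positional order of the fields is taken from the loop above.

```python
import operator
from collections import namedtuple

# Field names are assumptions; only the positional order comes from step 4.
GroupStats = namedtuple("GroupStats", "name daily_mean_sum daily_ratio ua_stats")

daily_mean_sum = 196.94  # sum of means over all UAs, as quoted in step 2

# Rounded means from step 2 for the five UAs in the group.
grouped_stats = {
    "dcrd v1.6 dev builds": [
        ["/dcrwire:0.4.0/dcrd:1.6.0(pre)/", 4.93],
        ["/dcrwire:0.4.0/dcrd:1.6.0(rc1)/", 3.21],
        ["/dcrwire:0.4.0/dcrd:1.6.0(rc2)/", 6.77],
        ["/dcrwire:0.4.0/dcrd:1.6.0(rc3)/", 5.57],
        ["/dcrwire:0.4.0/dcrd:1.6.0/", 4.42],
    ],
}

get_mean = operator.itemgetter(1)
group_stats = []
tracked_ratio = 0.0

for gname, gstats in grouped_stats.items():
    gdaily_mean_sum = sum(map(get_mean, gstats))
    gdaily_ratio = gdaily_mean_sum / daily_mean_sum
    tracked_ratio += gdaily_ratio
    group_stats.append(GroupStats(gname, gdaily_mean_sum, gdaily_ratio, gstats))

for gs in group_stats:
    # Round to a whole percent for display.
    print(f"{gs.daily_ratio:.0%} {gs.name}")
```

Since only one group is included here, `tracked_ratio` equals that group's ratio. In the real run it is the sum over all groups.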

Same concern here: is it fair to obtain a group's share this way?
