Node stats that we post in DJ are computed by the nodestats.py program.
Below are the important steps, samples of real data, and some points of concern I would like to verify, but feel free to check everything else too.
The high-level algorithm is:
- run an SQL query against dcr.farm. The result has a list of pairs (timestamp, node count) for each user agent. In other words, a month of dcr.farm's node crawling data is grouped by day and by user agent.
- for each user agent, calculate a mean of daily node counts
- arrange user agents and their stats in groups
- calculate node % share for each group
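The four steps above can be sketched end to end. This is a minimal illustration with made-up daily counts and a hypothetical "all v1.6 UAs" grouping rule, not the actual nodestats.py code:

```python
import statistics

# Made-up daily node counts per user agent (stand-ins for dcr.farm data)
daily_counts_by_ua = {
    "/dcrwire:0.4.0/dcrd:1.6.0/": [3, 5, 4],
    "/dcrwire:0.4.0/dcrd:1.5.2/": [10, 12, 11],
}

# step 2: one mean of daily counts per user agent, plus the sum of all means
means = {ua: statistics.mean(c) for ua, c in daily_counts_by_ua.items()}
total = sum(means.values())

# steps 3-4: group the UAs and compute the group's share of the total
group = [ua for ua in means if "1.6" in ua]  # hypothetical grouping rule
group_share = sum(means[ua] for ua in group) / total
print(f"{group_share:.0%}")  # 27%
```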
To demonstrate how grouping works I will use data for Nov 2020 and the group of user agents called "dcrd v1.6 dev builds" which includes these 5 UAs:
- /dcrwire:0.4.0/dcrd:1.6.0(pre)/
- /dcrwire:0.4.0/dcrd:1.6.0(rc1)/
- /dcrwire:0.4.0/dcrd:1.6.0(rc2)/
- /dcrwire:0.4.0/dcrd:1.6.0(rc3)/
- /dcrwire:0.4.0/dcrd:1.6.0/
The code is about 60 lines long and starts at line 67.
1. we start by getting the data from dcr.farm with the following SQL query on line 53:
https://charts.dcr.farm/api/datasources/proxy/1/query?db=decred&q=
SELECT count(distinct("addr")) FROM "peers"
WHERE time >= {start_ms}ms and time < {end_ms}ms
GROUP BY time(1d), "useragent_tag" fill(none)
Here I wonder if the query is correct, especially the grouping into 1-day windows.
I could not find docs describing these functions, but something close is here.
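For reference, the full request URL can be assembled like this (a sketch only; the actual request code in nodestats.py may differ, and the Nov 2020 epoch timestamps are my own assumption):

```python
from urllib.parse import urlencode

start_ms = 1604188800000  # 2020-11-01 00:00:00 UTC (assumed example range)
end_ms = 1606780800000    # 2020-12-01 00:00:00 UTC

query = (
    'SELECT count(distinct("addr")) FROM "peers"'
    f" WHERE time >= {start_ms}ms and time < {end_ms}ms"
    ' GROUP BY time(1d), "useragent_tag" fill(none)'
)
url = ("https://charts.dcr.farm/api/datasources/proxy/1/query?"
       + urlencode({"db": "decred", "q": query}))

# dcrfarm_data = json.load(urllib.request.urlopen(url))  # then fetch JSON
```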
That query gives us data like (example for 2 UAs out of 5):
{
"results": [
{
"series": [
{
"columns": [ "time", "count" ],
"name": "peers",
"tags": {
"useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(pre)/"
},
"values": [
[
"2020-11-01T00:00:00Z",
2
], [
"2020-11-02T00:00:00Z",
4
], [
"2020-11-03T00:00:00Z",
2
], [
"2020-11-04T00:00:00Z",
6
], // ...
]
}, {
"columns": [ "time", "count" ],
"name": "peers",
"tags": {
"useragent_tag": "/dcrwire:0.4.0/dcrd:1.6.0(rc1)/"
},
"values": [
[
"2020-11-01T00:00:00Z",
5
], [
"2020-11-02T00:00:00Z",
6
], [
"2020-11-03T00:00:00Z",
5
], [
"2020-11-04T00:00:00Z",
2
], // ...
]
}, // ...
],
"statement_id": 0
}
]
}
2. for each user agent we extract daily counts of distinct addrs, and use those to calculate a mean for each UA. We also accumulate the sum of all means:
get_count = operator.itemgetter(1)
for series in dcrfarm_data["results"][0]["series"]:
    ua = series["tags"]["useragent_tag"]
    # get daily counts of distinct addrs for each UA
    daily_counts = list(map(get_count, series["values"]))
    daily_mean = statistics.mean(daily_counts)
    daily_mean_sum += daily_mean
Below are the values for the UAs in our group:
- /dcrwire:0.4.0/dcrd:1.6.0(pre)/
  - daily_counts: [2, 4, 2, 6, 9, 5, 2, 4, 7, 5, 1, 2, 1, 5, 5, 5, 6, 7, 1, 10, 14, 3, 2, 3, 5, 4, 7, 6, 10]
  - daily_mean: 4.93
- /dcrwire:0.4.0/dcrd:1.6.0(rc1)/
  - daily_counts: [5, 6, 5, 2, 2, 3, 2, 1, 5, 4, 2, 2, 3, 6, 4, 6, 2, 2, 3, 3, 2, 5, 4, 4, 2, 1, 1, 1, 5]
  - daily_mean: 3.21
- /dcrwire:0.4.0/dcrd:1.6.0(rc2)/
  - daily_counts: [6, 5, 3, 8, 11, 8, 5, 2, 6, 2, 7, 14, 9, 11, 8, 7, 10, 7, 12, 6, 4, 9, 3, 5, 4, 4]
  - daily_mean: 6.77
- /dcrwire:0.4.0/dcrd:1.6.0(rc3)/
  - daily_counts: [1, 3, 7, 6, 3, 4, 10, 6, 2, 4, 7, 8, 8, 9]
  - daily_mean: 5.57
- /dcrwire:0.4.0/dcrd:1.6.0/
  - daily_counts: [3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 6, 7, 6, 9, 6, 7, 7, 8, 2, 5, 6, 6, 10, 8, 2, 4]
  - daily_mean: 4.42
And the daily_mean_sum (accumulated over all user agents, not just this group) is 196.94.
(Looks like rc3 is missing about half of the month's datapoints.)
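The (pre) numbers above can be checked directly with the standard library:

```python
import statistics

# daily_counts for /dcrwire:0.4.0/dcrd:1.6.0(pre)/ in Nov 2020, from the query output
counts = [2, 4, 2, 6, 9, 5, 2, 4, 7, 5, 1, 2, 1, 5, 5, 5, 6, 7,
          1, 10, 14, 3, 2, 3, 5, 4, 7, 6, 10]
mean = statistics.mean(counts)
print(len(counts), round(mean, 2))  # 29 4.93
```

Note there are only 29 datapoints, not 30: the day with no (pre) peers was dropped by fill(none).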
Here my concerns are:
- is it too destructive to do a mean aggregation over already aggregated daily numbers?
- do we call a mean of daily values a "daily mean" or "monthly mean"?
3. we use the computed daily_mean values and daily_mean_sum to derive daily ratios:
for us in ua_stats:
    ua, daily_mean = us  # unpack
    daily_ratio = daily_mean / daily_mean_sum
    us.append(daily_ratio)
This gives us the following values:
/dcrwire:0.4.0/dcrd:1.6.0(pre)/ : 0.0250
/dcrwire:0.4.0/dcrd:1.6.0(rc1)/ : 0.0163
/dcrwire:0.4.0/dcrd:1.6.0(rc2)/ : 0.0344
/dcrwire:0.4.0/dcrd:1.6.0(rc3)/ : 0.0283
/dcrwire:0.4.0/dcrd:1.6.0/ : 0.0225
These are supposed to be the % shares of individual UAs. They are not used yet but may be in the future, so they are worth verifying.
Here I wonder if this procedure is correct for obtaining an "average monthly percentage" for each user agent.
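As a sanity check on step 3, the (pre) ratio follows from the numbers already computed above:

```python
daily_mean = 4.93          # /dcrwire:0.4.0/dcrd:1.6.0(pre)/
daily_mean_sum = 196.94    # sum over all user agents, not just this group
daily_ratio = daily_mean / daily_mean_sum
print(round(daily_ratio, 4))  # 0.025
```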
4. we calculate similar ratios for groups (with the help of an auxiliary lookup structure):
get_mean = operator.itemgetter(1)
for gname, gstats in grouped_stats.items():
    # get daily means of the group
    gdaily_means = list(map(get_mean, gstats))
    gdaily_mean_sum = sum(gdaily_means)
    gdaily_ratio = gdaily_mean_sum / daily_mean_sum
    tracked_ratio += gdaily_ratio
    group_stats.append(GroupStats(gname, gdaily_mean_sum, gdaily_ratio, gstats))
The loop does one iteration for each group. tracked_ratio is accumulated to later calculate the complement untracked_ratio, which is used for the "Other X%" entry.
One iteration for our v1.6 group gives:
gdaily_means: [4.93, 3.21, 6.77, 5.57, 4.42]
gdaily_mean_sum: 24.9
gdaily_ratio: 24.9 / 196.94 = 0.1264
That last value is what we round and print as the final result: "13% dcrd v1.6 dev builds".
Same concern here: is it fair to obtain a group's share this way?
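For completeness, the v1.6 group iteration reduces to this small calculation:

```python
# daily means of the five "dcrd v1.6 dev builds" UAs, from step 2
gdaily_means = [4.93, 3.21, 6.77, 5.57, 4.42]
gdaily_mean_sum = sum(gdaily_means)
daily_mean_sum = 196.94  # sum over all user agents
gdaily_ratio = gdaily_mean_sum / daily_mean_sum
print(round(gdaily_mean_sum, 2), round(gdaily_ratio * 100))  # 24.9 13
```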