I investigated the crawler with an eye toward improving throughput; these were my findings:

Looking at the logs for crawler time spent, it appears that the vast majority of crawler time goes to jobs that error:
```kusto
traces
| extend meta = parse_json(customDimensions)
| extend splits = split(message, '@')
| extend typ = tostring(splits[0])
| extend purl = split(tostring(split(splits[1], ' ')[0]), '/')
| extend type = tostring(purl[1])
| extend provider = tostring(purl[2])
| extend namespace = tostring(purl[3])
| extend package = tostring(purl[4])
| extend version = tostring(purl[5])
| extend timeInMinutes = toint(meta["time"])/1000/60
| summarize TimeSpent=sum(timeInMinutes)/2 by tostring(meta.outcome)
```
If we examine that time by package, we find that just a couple of packages account for ~5/6 of it:
```kusto
traces
| extend meta = parse_json(customDimensions)
| extend splits = split(message, '@')
| extend typ = tostring(splits[0])
| extend purl = split(tostring(split(splits[1], ' ')[0]), '/')
| extend type = tostring(purl[1])
| extend provider = tostring(purl[2])
| extend namespace = tostring(purl[3])
| extend package = tostring(purl[4])
| extend version = tostring(purl[5])
| extend timeInMinutes = toint(meta["time"])/1000/60
| summarize TimeSpent=sum(timeInMinutes)/2 by namespace, package
| order by TimeSpent
```
At the moment those packages are all large git repositories: aws/aws-sdk-java, sap/sapmachine, and chromium/chromium. Each presents a different error.
- sap/sapmachine
  - 7 compute days spent/day
  - Unknown error out of scancode, seemingly truncated by obsolete messages out of typecode.
  - Bump pygments vendor to 2.19.2 aboutcode-org/typecode#47 should fix the obsolete messages, potentially revealing the underlying issue once it is consumed by scancode and then ClearlyDefined.
- chromium/chromium
  - 10 compute days spent/day
  - Unknown error out of scancode, not surfaced because of an attempt to read the large JSON output that was just produced (a string over ~512 MB exceeds V8's maximum string length and fails in JavaScript).
  - Handle large json failures #666 logs the error before attempting to parse and also catches parsing errors so we don't throw results away (see the sketch after this list).
- aws/aws-sdk-java
  - 261 compute days spent/day
  - The same large JSON problem that chromium has (the JSON output is so large because these are huge repos and each file gets a hash). Possibly a separate underlying issue.
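To make that failure mode concrete, here is a minimal sketch of the direction #666 takes, assuming a Node.js crawler that reads scancode output from disk; `readScancodeResult` and the threshold constant are illustrative names, not the crawler's actual code:

```javascript
const fs = require('fs');

// V8 refuses to create strings beyond roughly this size, so reading the
// whole scancode JSON into a single string throws for the largest repos.
const MAX_JS_STRING_BYTES = 512 * 1024 * 1024;

function readScancodeResult(resultFile) {
  // Log the size up front so the real failure is visible even if parsing fails.
  const { size } = fs.statSync(resultFile);
  if (size > MAX_JS_STRING_BYTES) {
    console.error(`scancode output ${resultFile} is ${size} bytes, too large to read as one string`);
    return null;
  }
  try {
    return JSON.parse(fs.readFileSync(resultFile, 'utf8'));
  } catch (error) {
    // Catch parse errors instead of losing the run entirely.
    console.error(`failed to parse scancode output ${resultFile}: ${error.message}`);
    return null;
  }
}
```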
The time spent on these errors seems to vary from day to day, but if we can resolve them, a successful result would prevent retries on these very expensive repos.
Another suggestion I have is that ClearlyDefined might start putting limits on particularly expensive scancode runs on a per-package basis. Some of these projects take 13-63 hours for an individual run and have had as many as 130 runs executing in a single day. Limiting certain large repositories to a handful of runs per day may enable smaller packages to receive more timely consideration; a rough sketch of that idea follows.
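A minimal sketch of that cap, assuming an in-memory counter keyed by package coordinates and day; the threshold and all names here are illustrative, and a real deployment would need the counts shared across crawler instances:

```javascript
// Illustrative per-package daily cap; MAX_RUNS_PER_DAY is an assumed value.
const MAX_RUNS_PER_DAY = 5;
const runCounts = new Map();

function shouldScan(coordinates) {
  const day = new Date().toISOString().slice(0, 10); // e.g. '2024-01-31'
  const key = `${coordinates}:${day}`;
  const count = runCounts.get(key) || 0;
  if (count >= MAX_RUNS_PER_DAY) return false; // defer, leaving capacity for smaller packages
  runCounts.set(key, count + 1);
  return true;
}

// Expensive repos stop being scheduled once the cap is hit.
if (shouldScan('git/github/chromium/chromium')) {
  // enqueue the scancode run as usual
}
```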