
Crawler throughput investigation findings #667

@ryanbrandenburg

I investigated the crawler with an eye toward improving throughput. The following are my findings:

Looking at the logs for crawler time spent, it appears that the vast majority of crawler time is spent on jobs that error:

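// Time spent (in minutes), summed by job outcome, using the 'time' custom dimension (milliseconds)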
traces
| extend meta = parse_json(customDimensions)
| extend splits = split(message, '@')
| extend typ = tostring(splits[0])
| extend purl = split(tostring(split(splits[1], ' ')[0]), '/')
| extend type = tostring(purl[1])
| extend provider = tostring(purl[2])
| extend namespace = tostring(purl[3])
| extend package = tostring(purl[4])
| extend version = tostring(purl[5])
| extend timeInMinutes = toint(meta["time"])/1000/60
| summarize TimeSpent=sum(timeInMinutes)/2 by tostring(meta.outcome)

If we break that down by package, we find that just a couple of packages account for roughly five sixths of that time:

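// The same time calculation broken down by package (namespace and name), largest first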
traces
| extend meta = parse_json(customDimensions)
| extend splits = split(message, '@')
| extend typ = tostring(splits[0])
| extend purl = split(tostring(split(splits[1], ' ')[0]), '/')
| extend type = tostring(purl[1])
| extend provider = tostring(purl[2])
| extend namespace = tostring(purl[3])
| extend package = tostring(purl[4])
| extend version = tostring(purl[5])
| extend timeInMinutes = toint(meta["time"])/1000/60
| summarize TimeSpent=sum(timeInMinutes)/2 by namespace, package
| order by TimeSpent

At the moment those packages are all large git repositories: aws/aws-sdk-java, sap/sapmachine, and chromium/chromium. Each of them presents a different error:

  • sap/sapmachine
    • 7 compute days spent/day
    • Unknown error out of scancode; the output appears to be truncated by obsolete messages out of typecode.
  • chromium/chromium
    • 10 compute days spent/day
    • Unknown error out of scancode; the error is never returned because of an attempt to read the very large JSON object that was just produced (a >512 MB string will fail in JavaScript; see the sketch after this list).
  • aws/aws-sdk-java
    • 261 compute days spent/day
    • Same large-JSON problem that chromium has (the JSON output is so large because these are huge repos and each file gets a hash). There may also be a separate underlying issue.

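For context, the ">512 MB string" failure mode is consistent with V8's maximum string length, which Node.js exposes as buffer.constants.MAX_STRING_LENGTH: reading the whole scancode result into a single JavaScript string (and then calling JSON.parse on it) fails once the output crosses that limit. Below is a minimal sketch of a size guard, assuming a hypothetical readScancodeOutput helper rather than the crawler's actual code path:

```typescript
import { constants } from 'node:buffer';
import { promises as fs } from 'node:fs';

// Hypothetical helper, not the crawler's real code path: fail fast with a clear,
// reportable message when a scancode result file is too large to hold as one
// JavaScript string, instead of letting readFile/JSON.parse throw an opaque error.
// (Comparing bytes to UTF-16 code units is approximate, but close enough for
// ASCII-heavy JSON.)
async function readScancodeOutput(path: string): Promise<unknown> {
  const { size } = await fs.stat(path);
  if (size >= constants.MAX_STRING_LENGTH) {
    throw new Error(
      `scancode output ${path} is ${size} bytes, exceeding the engine's maximum ` +
        `string length (${constants.MAX_STRING_LENGTH}); stream or truncate it instead`
    );
  }
  return JSON.parse(await fs.readFile(path, 'utf8'));
}
```

A longer-term fix would be to stream the result with a streaming JSON parser so the full document never has to exist as a single string, but even a guard like this would turn a silent, endlessly retried failure into an actionable error.
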
The time spent on these errors varies from day to day, but if we can resolve them, a successful result would prevent further retries of these very expensive repos.

Another suggestion I have is that ClearlyDefined might start putting limits on particularly expensive scancode runs on a per-package basis. Some of these projects take 13-63 hours for an individual run and have as many as 130 runs executing in a single day. Limiting certain large repositories to a handful of runs per day may let smaller packages receive more timely attention; a sketch of what such a limit could look like follows below.
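
To make the per-package limit concrete, here is a minimal sketch of a daily cap; the DailyRunLimiter class, the coordinate string, and the limit of 5 are all hypothetical and not existing crawler behavior:

```typescript
// Hypothetical per-package daily cap on expensive scans. In a real deployment the
// counts would need to live in shared storage (the crawler runs many instances),
// not in process memory as in this sketch.
class DailyRunLimiter {
  private counts = new Map<string, { day: string; runs: number }>();

  constructor(private maxRunsPerDay: number) {}

  // Returns true if another run for this package should be allowed today.
  tryAcquire(coordinates: string): boolean {
    const today = new Date().toISOString().slice(0, 10); // e.g. "2024-01-31"
    const entry = this.counts.get(coordinates);
    if (!entry || entry.day !== today) {
      this.counts.set(coordinates, { day: today, runs: 1 });
      return true;
    }
    if (entry.runs >= this.maxRunsPerDay) return false;
    entry.runs += 1;
    return true;
  }
}

// Usage: allow a known-expensive repo only a handful of runs per day.
const limiter = new DailyRunLimiter(5);
if (!limiter.tryAcquire('git/github/aws/aws-sdk-java')) {
  // Requeue with a delay (or skip) instead of starting another multi-hour scan.
}
```

Deferring rather than dropping these requests would keep coverage for the big repos while freeing compute for the long tail of smaller packages.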
