This repository was archived by the owner on Oct 9, 2018. It is now read-only.

Use ujson for faster job execution #70

@indygreg

Various MR jobs are using the built-in json or simplejson packages for deserializing json payloads. Switching to ujson gives a significant speed-up.
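One low-risk way to make the switch is an import with a fallback, so jobs still run where ujson is not installed. This is only a sketch: the `fast_json` alias and the `parse_payload` helper are hypothetical names, not part of any existing job.

```python
import json

try:
    # ujson is a third-party package; assumed to be installed where speed matters.
    import ujson as fast_json
except ImportError:
    # Fall back to the standard library so the job still runs without ujson.
    fast_json = json


def parse_payload(raw):
    """Deserialize one JSON payload with the fastest available parser."""
    return fast_json.loads(raw)
```

Both modules expose `loads()`, so call sites that only deserialize should not need further changes.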

I have a single day of the Firefox update hotfix payloads cached locally. There are 627,404 records that lz4 decompress to 11,655,091,683 bytes. I have a dead-simple MR script that deserializes each JSON payload, extracts a single value from it, and combines the results.
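The shape of such a job is roughly the following. This is an illustrative sketch, not the actual script: the function names, the `updateChannel` field, and the sample records are all assumptions made for the example.

```python
import json
from collections import Counter


def map_record(raw_payload, field="updateChannel"):
    """Deserialize one payload and pull out a single value.

    The field name is hypothetical; the real job extracts whatever
    value it is interested in.
    """
    payload = json.loads(raw_payload)
    return payload.get(field, "unknown")


def combine(values):
    """Combine the extracted values by counting occurrences."""
    return Counter(values)


# Made-up sample records standing in for the real payloads.
records = [
    '{"updateChannel": "release"}',
    '{"updateChannel": "beta"}',
    '{"updateChannel": "release"}',
]
counts = combine(map_record(r) for r in records)
```

With so little work per record, the `json.loads()` call dominates the per-record Python cost, which is why the choice of JSON library shows up so clearly in the timings.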

Here is the performance of that job with 8 concurrent processes on 4+4HT cores with various JSON implementations.

built-in json

real 0m54.250s
user 6m26.780s
sys 0m9.834s

real 0m53.691s
user 6m22.161s
sys 0m9.710s

real 0m52.698s
user 6m14.038s
sys 0m9.596s

simplejson

real 0m34.825s
user 4m7.692s
sys 0m7.125s

real 0m34.218s
user 4m3.766s
sys 0m7.055s

real 0m34.830s
user 4m4.105s
sys 0m7.043s

ujson

real 0m26.212s
user 3m6.775s
sys 0m5.789s

real 0m27.636s
user 3m16.358s
sys 0m6.077s

real 0m28.094s
user 3m18.188s
sys 0m6.227s

Averages

Average CPU time (user + sys) across the three runs:

json: 391s
simplejson: 252s
ujson: 200s
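These averages can be reproduced from the `time` output above by summing user and sys for each run. A small sketch (the `to_seconds` helper and the `runs` layout are just for illustration):

```python
def to_seconds(stamp):
    """Convert a `time`-style stamp such as '6m26.780s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# (user, sys) pairs copied from the three runs of each implementation.
runs = {
    "json": [("6m26.780s", "0m9.834s"), ("6m22.161s", "0m9.710s"), ("6m14.038s", "0m9.596s")],
    "simplejson": [("4m7.692s", "0m7.125s"), ("4m3.766s", "0m7.055s"), ("4m4.105s", "0m7.043s")],
    "ujson": [("3m6.775s", "0m5.789s"), ("3m16.358s", "0m6.077s"), ("3m18.188s", "0m6.227s")],
}

# Average CPU time per implementation: mean of (user + sys) over the runs.
averages = {
    name: sum(to_seconds(user) + to_seconds(sys_) for user, sys_ in pairs) / len(pairs)
    for name, pairs in runs.items()
}

for name, avg in averages.items():
    print("%s: %.0fs" % (name, avg))
```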

lzma --decompress --stdout on this data set takes about 83s of CPU time.

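As the data demonstrates, ujson is significantly faster than the alternatives: it uses roughly 21% less CPU time than simplejson and about 49% less than the built-in json. Switching will thus make Telemetry jobs faster and more efficient.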

My data should not need validation: any Google search on "Python json benchmark" will tell you others have reached the same conclusion that ujson is the bomb.
