This project provides several JSON-related tools implemented with Jackson. Its goal is to be usable with extremely large JSON streams, so all processing is done in a streaming fashion.
I tried several tools implemented in Python (python -m json.tool, 'jsongrep'), but they consumed a lot of memory when I fed them a JSON stream of a gigabyte or so, which made them unusable for that purpose. So I implemented similar tools in Java, based on Jackson 2. They are streaming, need little memory, and can deal with huge streams of JSON.
All tools support a -help argument for an overview of all supported options.
The executable jars are packaged in a zip, which can be downloaded here.
This zip also contains executable scripts that call them with java -jar. These work in a Unix or OS X environment, and the zip can be unzipped somewhere in your path. The following command installs everything in the current directory:
curl -o json.zip https://repo1.maven.org/maven2/org/meeuw/mihxil-json/1.0/mihxil-json-1.0-all.zip ; unzip -o json.zip ; rm json.zip
To use it as a library, please refer to Maven Central.
Implemented in mihxil-json-formatter
This is actually just a thin layer around Jackson’s JsonParser and JsonGenerator.
Usage
jsonformat [<infile>] [<outfile>]
infile: defaults to stdin (can be set explicitly to stdin with '-'). Can
be a file name, but can also be a remote URL
outfile: defaults to stdout
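The rules for interpreting infile can be sketched in a few lines of plain Java. This is only an illustration of the behaviour described above, not the code used by jsonformat: the class name, the method names, and the "anything with a scheme:// prefix is a URL" test are my assumptions.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

// Illustrative only: these names are hypothetical and not part of mihxil-json.
final class InputArgumentSketch {

    enum Kind { STDIN, REMOTE_URL, FILE }

    // Decide how an <infile> argument should be interpreted.
    static Kind classify(String arg) {
        if (arg == null || arg.equals("-")) {
            return Kind.STDIN;       // absent or '-' means standard input
        }
        if (arg.matches("[a-zA-Z][a-zA-Z0-9+.-]*://.*")) {
            return Kind.REMOTE_URL;  // assumption: a scheme:// prefix marks a URL
        }
        return Kind.FILE;            // everything else is treated as a file name
    }

    // Open the argument as a stream; the data is only read when the caller consumes it.
    static InputStream open(String arg) throws IOException {
        switch (classify(arg)) {
            case STDIN:
                return System.in;
            case REMOTE_URL:
                return URI.create(arg).toURL().openStream();
            default:
                return new FileInputStream(arg);
        }
    }

    public static void main(String[] args) {
        String arg = args.length > 0 ? args[0] : null;
        System.out.println(classify(arg));
    }
}
```

Because all three cases end up as an InputStream, the rest of a streaming tool does not need to know where the JSON comes from.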
For a file of nearly one GB:
michiel@belono:/tmp$ time jsonformat alldocs.json alldocs.formatted.json
real 0m27.783s
user 0m19.880s
sys 0m5.686s
michiel@belono:/tmp$ ls -lah alldocs.*
-rw-rw-r-- 1 michiel wheel 1.3G Feb 22 18:17 alldocs.formatted.json
-rw-rw-r-- 1 michiel wheel 928M Feb 22 14:19 alldocs.json
Implemented in mihxil-json-grep
This is a streaming 'jsongrep', and it works a bit like grep. It can, for example, be used to produce one-line abstracts of records, which can then easily be processed further with a normal grep, awk or similar.
The 'grep' (and 'sed') implementation is basically configured using PathMatcher and its extensions. Separately, there is a Parser that can convert a string to (a set of) PathMatcher(s). The provided command line tools depend on that, but all functionality is also available by constructing Java objects in code.
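To give an idea of what matching a path expression against a path involves, here is a small self-contained sketch. It is explicitly not the library's PathMatcher or Parser: the class and its methods are hypothetical, and it only covers literal segments plus the '*' and '[*]' wildcards used in the examples in this document. It assumes that '*' matches any single key or array index and that '[*]' matches any array index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; this is NOT the library's PathMatcher.
final class PathGlobSketch {

    // A segment is either a key ("arr", "*") or a bracketed index ("[1]", "[*]").
    private static final Pattern SEGMENT = Pattern.compile("\\[[^\\]]*\\]|[^.\\[\\]]+");

    // Split "y.arr[1].e" into [y, arr, [1], e].
    static List<String> segments(String path) {
        List<String> result = new ArrayList<>();
        Matcher m = SEGMENT.matcher(path);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // Compare expression and path segment by segment.
    static boolean matches(String expression, String path) {
        List<String> wanted = segments(expression);
        List<String> actual = segments(path);
        if (wanted.size() != actual.size()) {
            return false;
        }
        for (int i = 0; i < wanted.size(); i++) {
            String want = wanted.get(i);
            String have = actual.get(i);
            if (want.equals("*")) {
                continue;                     // assumed: any single key or index
            }
            if (want.equals("[*]")) {
                if (!have.startsWith("[")) {
                    return false;             // '[*]' only matches an array index
                }
                continue;
            }
            if (!want.equals(have)) {
                return false;                 // literal segments must be equal
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("y.arr[*].e", "y.arr[1].e"));
        System.out.println(matches("y.arr[*].e", "y.arr[0].d"));
    }
}
```

The sketch ignores everything else the real tools support, such as value matching, regular expressions, '...' and ignoring array indices, and it does not handle keys that contain dots or brackets.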
Example
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[1].e
y.arr[1].e=z
This just demonstrates a simple path match. It returns the matched path together with the associated value.
It can also accept a second, optional parameter, which is a file or a URL:
$ jsongrep y.arr[*].*~[xz] test.json
y.arr[1].e=z
Generally, the available options are documented in the tools themselves too:
$ jsongrep --help
jsongrep - 1.0 - See https://github.com/mihxil/json
usage: jsongrep [OPTIONS] [|-]
-?,--help print this message
-d,--debug Debug
-i,--ignoreArrays Ignore arrays (no need to match those)
-m,--max Max number of records
-o,--output Output format, one of [PATHANDVALUE, PATHANDFULLVALUE, KEYANDVALUE,
KEYANDFULLVALUE, PATH, KEY, VALUE, FULLVALUE]
-r,--record Record pattern (default to no matching at all). On match, a record
separator will be outputted.
-rs,--recordsep Record separator
-s,--sep Separator (defaults to newline)
-sf,--sortfields Sort the fields of a found 'record', according to the order of the
matchers.
-v,--version Print version
It is possible to specify more than one match:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[1].e,a
a=b
y.arr[1].e=z
You can use wildcards in the path:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[*].e
y.arr[1].e=z
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.*[*].d
y.arr[0].d=y
This is useful for array indices. But you can also choose to completely ignore array indices in matching, which may simplify things:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep -ignoreArrays y.arr.e
y.arr[1].e=z
Regex matching on keys is also possible, which can e.g. be used to output different keys at the same level more easily:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z', 'f': 'g'}]}}" | jsongrep -output PATHANDFULLVALUE -ignoreArrays '*.arr./d|e/'
y.arr[0].d=y
y.arr[1].e=z
which is equivalent to:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z', 'f': 'g'}]}}" | jsongrep -output PATHANDFULLVALUE -ignoreArrays '*.arr.d,*.arr.e'
y.arr[0].d=y
y.arr[1].e=z
If a matcher does not match a simple value but an object or an array, it will be reported like this:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr,y
y.arr=[...]
y={...}
Unless you specify a different output format:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep -output PATHANDFULLVALUE y.arr,y
y.arr=[{"d":"y"},{"e":"z"}]
y={"c":"x","arr":[{"d":"y"},{"e":"z"}]}
It is possible to output less:
$ jsongrep -output VALUE y.arr[*].*~[xz] test.json
z
$ jsongrep -output KEY y.arr[*].*~[xz] test.json
e
$ jsongrep -output PATH y.arr[*].*~[xz] test.json
y.arr[1].e
$ jsongrep -output KEYANDVALUE y.arr[*].*~[xz] test.json
e=z
Another example, on a CouchDB database (find documents where a certain field has a certain value):
$ jsongrep rows.*.doc.workflow=FOR_REPUBLICATION,rows.*.doc.mid http://couchdbhost/database/_all_docs?include_docs=true |
grep -A 1 workflow
It is also possible to match on value rather than on path alone:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[*].*=z
y.arr[1].e=z
That can also be done using regular expressions:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep y.arr[*].*~[xz]
y.arr[1].e=z
You can match directly inside the tree ('...' means 'an arbitrary path'):
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep '...e'
y.arr[1].e=z
Matching can be implemented with a JavaScript function as well:
$ echo "{a:'b', y: {c:'x', arr:[{d:'y'}, {e:'z'}]}}" | jsongrep -output KEYANDFULLVALUE '...arr[*] function(doc) { return doc.d == "y"; }'
[0]={"d":"y"}
jsongrep supports the '-sep', '-recordsep' and '-record' parameters. They are intended, for example, to generate one-line abstracts of a bunch of JSON records. E.g. create a file with 3 fields per line, separated by a tab. The 3 fields are 3 different keys from an array of JSON objects.
$ jsongrep -output VALUE -sep " " -record '*' '*.mid,*.publishDate,*.lastModified' es.all.json | sort > es.txt
The -record parameter defines what constitutes the start of a new record. If it matches, a 'recordsep' is output (this defaults to a newline). Normally a newline is output between matches, but when you use -record you probably don’t want that. In this example, the -sep argument causes a tab character to be output between matches.
Normally, when using this 'record' functionality, the output record is implicitly sorted in the order of the matchers. So in this case first the 'mid', then 'publishDate', then 'lastModified', independent of the order in which they appeared in the JSON document. With the '-sortfields' parameter you can disable this behaviour and simply output the fields in their original order.
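The implicit sorting of a record's fields can be pictured like this. It is a minimal sketch in plain Java, not jsongrep's actual code: the class and method are hypothetical, and the field values ("m1", "p1", "l1") are placeholders rather than real data.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration of "sort the fields of a record like the matchers".
final class RecordLineSketch {

    // Emit the found values in matcher order, joined by the separator.
    // Fields that were not found in this record are skipped.
    static String toLine(List<String> matcherOrder, Map<String, String> found, String sep) {
        StringJoiner line = new StringJoiner(sep);
        for (String field : matcherOrder) {
            String value = found.get(field);
            if (value != null) {
                line.add(value);
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // The order in which the fields happened to appear in the JSON document.
        Map<String, String> found = new LinkedHashMap<>();
        found.put("lastModified", "l1");
        found.put("mid", "m1");
        found.put("publishDate", "p1");

        // The order of the matchers on the command line decides the output order.
        List<String> matcherOrder = List.of("mid", "publishDate", "lastModified");
        System.out.println(toLine(matcherOrder, found, "\t"));
    }
}
```

Since every output line then has its fields in the same position, the result can be piped straight into sort, cut or awk.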
A variant of 'jsongrep' is 'jsonsed'. This will just output the incoming json, but it will apply the replacements (which are possible in jsongrep too).
$ echo '{ "items" : [ { "a" : "abc def"}, { "a" : "xyz qwv"}]} ' | jsonsed -ignoreArrays -format 'items.a~abc\s*(.*)~def'
{
"items" : [ {
"a" : "def"
}, {
"a" : "xyz qwv"
} ]
}
Note: The syntax for replacement currently is <path>~<value>~<replacement>. This makes it hard to have a literal ~ in the value. The parser may be changed to be more like sed itself: <path>~<ANY><value><ANY><replacement> or so (where <ANY> is a character you can choose, like / or |).
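The limitation with a literal ~ is easy to see in a small sketch of such a parser. This is not the parser used by jsonsed: the class is hypothetical, and it assumes that the value part is a Java regular expression and that every match of it is replaced.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of parsing <path>~<value>~<replacement>; not jsonsed's parser.
final class ReplacementSyntaxSketch {

    final String path;
    final Pattern value;
    final String replacement;

    private ReplacementSyntaxSketch(String path, Pattern value, String replacement) {
        this.path = path;
        this.value = value;
        this.replacement = replacement;
    }

    // Split on the first two '~' characters only; the rest belongs to the replacement.
    static ReplacementSyntaxSketch parse(String expression) {
        String[] parts = expression.split("~", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected <path>~<value>~<replacement>: " + expression);
        }
        return new ReplacementSyntaxSketch(parts[0], Pattern.compile(parts[1]), parts[2]);
    }

    // Assumption: the value is a regex and each match is replaced.
    // Note that '$' and '\' in the replacement have a special meaning for replaceAll.
    String apply(String input) {
        return value.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        ReplacementSyntaxSketch sed = parse("items.a~abc\\s*(.*)~def");
        System.out.println(sed.apply("abc def"));
        System.out.println(sed.apply("xyz qwv"));
    }
}
```

With this approach an expression like items.a~a~b~c is always read as value 'a' and replacement 'b~c', so the value itself can never contain a ~. A freely chosen delimiter, as sed has, would avoid that.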
Implemented in mihxil-es. It contains a tool to download an entire Elasticsearch database.
This is unfinished. The idea is to have a tool that provides something similar to XInclude, but for JSON.