Parsing JSON with Unix tools


Question

I'm trying to parse JSON returned from a curl request, like so:

curl 'http://twitter.com/users/username.json' |
    sed -e 's/[{}]/''/g' | 
    awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}'

The above splits the JSON into fields, for example:

% ...
"geo_enabled":false
"friends_count":245
"profile_text_color":"000000"
"status":"in_reply_to_screen_name":null
"source":"web"
"truncated":false
"text":"My status"
"favorited":false
% ...

How do I print a specific field (denoted by the -v k=text)?

2018/08/30

Accepted Answer

There are a number of tools specifically designed for manipulating JSON from the command line, and they will be a lot easier and more reliable than doing it with awk. One such tool is jq:

curl -s 'https://api.github.com/users/lambda' | jq -r '.name'
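
jq handles nested objects and arrays with the same filter syntax, so drilling into deeper structure is just a longer path. A self-contained sketch (the input here is made up and piped in with echo, so no live API is needed):

echo '{"user": {"repos": [{"name": "a"}, {"name": "b"}]}}' | jq -r '.user.repos[0].name'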

You can also do this with tools that are likely already installed on your system, like Python with its json module, avoiding any extra dependency while still getting a proper JSON parser. The following examples assume you want UTF-8, which the original JSON should be encoded in and which most modern terminals use as well:

Python 3:

curl -s 'https://api.github.com/users/lambda' | \
    python3 -c "import sys, json; print(json.load(sys.stdin)['name'])"

Python 2:

export PYTHONIOENCODING=utf8
curl -s 'https://api.github.com/users/lambda' | \
    python2 -c "import sys, json; print json.load(sys.stdin)['name']"
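
The same pattern extends to nested fields; for example, with a made-up document piped in via echo (the keys here are hypothetical, not from the GitHub API):

echo '{"owner": {"login": "lambda"}}' | \
    python3 -c "import sys, json; print(json.load(sys.stdin)['owner']['login'])"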

Historical notes

This answer originally recommended jsawk, which should still work, but it is a little more cumbersome to use than jq and depends on a standalone JavaScript interpreter being installed, which is less common than a Python interpreter, so the answers above are probably preferable:

curl -s 'https://api.github.com/users/lambda' | jsawk -a 'return this.name'

This answer also originally used the Twitter API from the question, but that API no longer works, making the examples hard to copy and test, and the new Twitter API requires keys, so I've switched to the GitHub API, which can be used easily without keys. The first answer for the original question would have been:

curl 'http://twitter.com/users/username.json' | jq -r '.text'
2019/12/12

To quickly extract the values for a particular key, I personally like to use grep -o, which returns only the regex's match. For example, to get the "text" field from tweets:

grep -Po '"text":.*?[^\\]",' tweets.json

This regex is more robust than you might think; for example, it deals fine with strings that have embedded commas and escaped quotes inside them. I think with a little more work you could make one that is actually guaranteed to extract the value, if it's atomic. (If the value has nesting, then of course a regex can't do it.)
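
For example, fed a made-up record whose text value contains both an escaped quote and an embedded comma, the pattern still matches the whole value:

$ echo '{"text":"say \"hi\", ok","favorited":false}' | grep -Po '"text":.*?[^\\]",'
"text":"say \"hi\", ok",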

And to further clean up (albeit keeping the string's original escaping), you can pipe through something like: | perl -pe 's/"text"://; s/^"//; s/",$//'. (I did this for this analysis.)
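
Put together, extraction plus cleanup looks like this (tweets.json as in the grep example above):

grep -Po '"text":.*?[^\\]",' tweets.json | perl -pe 's/"text"://; s/^"//; s/",$//'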

To all the haters who insist you should use a real JSON parser -- yes, that is essential for correctness, but

  1. To do a really quick analysis, like counting values to check on data cleaning bugs or get a general feel for the data, banging out something on the command line is faster. Opening an editor to write a script is distracting.
  2. grep -o is orders of magnitude faster than the Python standard json library, at least when doing this for tweets (which are ~2 KB each). I'm not sure if this is just because json is slow (I should compare to yajl sometime); but in principle, a regex should be faster, since it's finite-state and much more optimizable, whereas a parser has to support recursion and, in this case, spends lots of CPU building trees for structures you don't care about. (If someone wrote a finite state transducer that did proper (depth-limited) JSON parsing, that would be fantastic! In the meantime we have grep -o.) A rough way to run the comparison yourself is sketched after this list.
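
The sketch (it assumes tweets.json holds one JSON object per line; adjust for your data):

# regex extraction
time grep -Po '"text":.*?[^\\]",' tweets.json > /dev/null
# full parse with the standard json module
time python3 -c '
import sys, json
for line in sys.stdin:
    print(json.loads(line)["text"])
' < tweets.json > /dev/null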

To write maintainable code, I always use a real parsing library. I haven't tried jsawk, but if it works well, that would address point #1.

One last, wackier solution: I wrote a script that uses Python's json module to extract the keys you want into tab-separated columns; then I pipe it through a wrapper around awk that allows named access to columns: the json2tsv and tsvawk scripts. For this example it would be:

json2tsv id text < tweets.json | tsvawk '{print "tweet " $id " is: " $text}'

This approach doesn't address #2, is less efficient than a single Python script, and is a little brittle: it forces normalization of newlines and tabs in string values to play nice with awk's field/record-delimited view of the world. But it does let you stay on the command line, with more correctness than grep -o.
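
If you don't have those scripts around, jq's @tsv filter gives a roughly similar effect, minus tsvawk's named columns; a sketch, again assuming tweets.json holds one JSON object per line:

jq -r '[.id, .text] | @tsv' < tweets.json | awk -F'\t' '{print "tweet " $1 " is: " $2}'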

2014/03/27

Given that some of the recommendations here (especially in the comments) suggested using Python, I was disappointed not to find an example.

So, here's a one-liner to get a single value from some JSON data. It assumes you are piping the data in (from somewhere), so it should be useful in a scripting context; the print() form works under both Python 2 and 3.

echo '{"hostname":"test","domainname":"example.com"}' | python -c 'import json,sys;obj=json.load(sys.stdin);print obj["hostname"]'
2019/02/26

Following MartinR and Boecko's lead:

$ curl -s 'http://twitter.com/users/username.json' | python -mjson.tool

That will give you extremely grep-friendly output. Very convenient:

$ curl -s 'http://twitter.com/users/username.json' | python -mjson.tool | grep my_key
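
To see why the output is so grep-friendly, here is json.tool on a small inline document; every key lands on its own indented line:

$ echo '{"a": {"my_key": 1}, "b": 2}' | python -mjson.tool
{
    "a": {
        "my_key": 1
    },
    "b": 2
}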
2016/06/15

You could just download the jq binary for your platform and run it (after chmod +x jq):

$ curl 'https://twitter.com/users/username.json' | ./jq -r '.name'

It extracts "name" attribute from the json object.

The jq homepage says it is like sed for JSON data.
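
In that sed-like spirit, jq can rewrite documents as well as query them; for example, with the |= update operator (available in reasonably recent jq versions):

$ echo '{"name": "username"}' | ./jq '.name |= ascii_upcase'
{
  "name": "USERNAME"
}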

2013/05/30

Using Node.js

If the system has Node.js installed, it's possible to use the -p (print) and -e (evaluate) script flags with JSON.parse to pull out any value that is needed.

A simple example using the JSON string { "foo": "bar" } and pulling out the value of "foo":

$ node -pe 'JSON.parse(process.argv[1]).foo' '{ "foo": "bar" }'
bar

Because we have access to cat and other utilities, we can use this for files:

$ node -pe 'JSON.parse(process.argv[1]).foo' "$(cat foobar.json)"
bar

Or any other source, such as a URL that returns JSON:

$ node -pe 'JSON.parse(process.argv[1]).name' "$(curl -s https://api.github.com/users/trevorsenior)"
Trevor Senior
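
One caveat: "$(...)" passes the whole document on the command line, which has a size limit. For large inputs it is safer to read from stdin; a sketch using the same GitHub endpoint as the other answers here:

$ curl -s 'https://api.github.com/users/lambda' | node -e '
    process.stdin.setEncoding("utf8");
    let body = "";
    process.stdin.on("data", chunk => body += chunk);
    process.stdin.on("end", () => console.log(JSON.parse(body).name));
'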
2013/11/26

Source: https://stackoverflow.com/questions/1955505
Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow