The idea is exploratory data analysis: the output is rendered as a histogram to get an idea of the relative frequency of records.
The example data set looks like this:
{"blockNumber":"1941895","blockHash":"0x53464299a83cecc3e4d930b617c9518b8f74139265423d8110a919f5180bec79","hash":"0x0abe75e40a954d4d355e25e4498f3580e7d029769897d4187c323080a0be0fdd","from":"0x4586ffaf28e08b1613dd96ced9b57d52e8ad9d72","to":"0x91337a300e0361bddb2e377dd4e88ccb7796663d","gas":"21000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"1","value":"0x22c06103f88111000","timestamp":"2016-07-24 20:47:25 UTC"}
{"blockNumber":"1941645","blockHash":"0x78804d09bb4e7126f53133e33e3548e0f04a691c01661ab9b719c3811e54355e","hash":"0x22c2b6490900b21d67ca56066e127fa57c0af973b5d166ca1a4bf52fcb6cf81c","from":"0x81bbf9f19ffe8368efe7611ccf5dcbdb4618b645","to":"0xb01a7866a244dbb600a7bbd170d43d4221838868","gas":"90000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"0","value":"0x4563918244f40000","timestamp":"2016-07-24 19:57:50 UTC"}
{"blockNumber":"1941910","blockHash":"0xc7ba89fc0110a033c4bd03be4505014761141b956c228bc51ec49c15a4508ce4","hash":"0x8570106b0385caf729a17593326db1afe0d75e3f8c6daef25cd4a0499a873a6f","from":"0x91337a300e0361bddb2e377dd4e88ccb7796663d","to":"0x9fde2180b544b7690c35bdc66182eb843ac38030","gas":"90000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"6356","value":"0x41e92b66341ef0000","timestamp":"2016-07-24 20:50:12 UTC"}
{"blockNumber":"1941919","blockHash":"0x4785c1b1a678cf7058e1fed3fc1c7d33c4326c2fb309f5fc75688f23d496b61c","hash":"0x8adfe7fc3cf0eb34bb56c59fa3dc4fdd3ec3f3514c0100fef800f065219b7707","from":"0x69ca903e87329fd63a3c7b2d3efde6a9bf3c3d45","to":"0xbfc39b6f805a9e40e77291aff27aee3c96915bdd","gas":"40000","gasUsed":"29130","gasPrice":"30000000000","input":"","logs":[{"address":"0xbfc39b6f805a9e40e77291aff27aee3c96915bdd","topics":["0x23919512b2162ddc59b67a65e3b03c419d4105366f7d4a632f5d3c3bee9b1cff"],"data":"AAAAAAAAAAAAAAAAwNMyg48U70L83hzyUYxCfdtnZyk="}],"nonce":"20","value":"0x1d2eb2accbaf90800","timestamp":"2016-07-24 20:52:08 UTC"}
{"blockNumber":"1941922","blockHash":"0xd46dbf526f6d7c9197e841c8a4d7b2f4abdac4a62860cffabb943a46d07a86d4","hash":"0x8b0fe2b7727664a14406e7377732caed94315b026b37577e2d9d258253067553","from":"0x0b2c5cba2dc240e867f7721412c20e6016596d26","to":"0x9c83fe12c7575ea7350019e04253d3620957851f","gas":"21000","gasUsed":"21000","gasPrice":"21000000000","input":"","logs":[],"nonce":"2","value":"0x7ce66c50e2840000","timestamp":"2016-07-24 20:52:51 UTC"}
{"blockNumber":"1941688","blockHash":"0x86bb1e90d0fa7be11d3f196057976383bb73cbd1596992e868155a576b5ddfb9","hash":"0x244b29b60c696f4ab07c36342344fe6116890f8056b4abc9f734f7a197c93341","from":"0x006cdc135b4e3a89d3ac1027ec3de609b8fff500","to":"0x58ae42a38d6b33a1e31492b60465fa80da595755","gas":"50000","gasUsed":"50000","gasPrice":"20000000000","input":"","logs":[],"nonce":"47","value":"0xc7140013deaf40","error":"invalid jump destination (PUSH1) 2","timestamp":"2016-07-24 20:06:38 UTC"}
{"blockNumber":"1941794","blockHash":"0x41ee74e34cbf9ef4116febea958dbc260e2da3a6bf6f601bfaeb2cd9ab944a29","hash":"0xf2b5b8fb173e371cbb427625b0339f6023f8b4ec3701b7a5c691fa9cef9daf63","from":"0x3c0cbb196e3847d40cb4d77d7dd3b386222998d9","to":"0x2ba24c66cbff0bda0e3053ea07325479b3ed1393","gas":"121000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"14","value":"0x24406420d09ce7440000","timestamp":"2016-07-24 20:28:11 UTC"}
{"blockNumber":"1941716","blockHash":"0x75e1602cad967a781f4a2ea9e19c97405fe1acaa8b9ad333fb7288d98f7b49e3","hash":"0xf8f2a397b0f7bb1ff212b6bcc57e4a56ce3e27eb9f5839fef3e193c0252fab26","from":"0xa0480c6f402b036e33e46f993d9c7b93913e7461","to":"0xb2ea1f1f997365d1036dd6f00c51b361e9a3f351","gas":"121000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"1","value":"0xde0b6b3a7640000","timestamp":"2016-07-24 20:12:17 UTC"}
{"blockNumber":"1941794","blockHash":"0x41ee74e34cbf9ef4116febea958dbc260e2da3a6bf6f601bfaeb2cd9ab944a29","hash":"0xf275b8fb173e371cbb427625b0339f6023f8b4ec3701b7a5c691fa9cef9daf63","from":"0x3c0cbb196e3847d40cb4d77d7dd3b386222998d9","to":"0x2ba24c66cbff0bda0e3053ea07325479b3ed1393","gas":"121000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"14","value":"0x24406420d09ce7440000","timestamp":"2016-07-24 20:28:11 UTC"}
{"blockNumber":"1941794","blockHash":"0x41ee74e34cbf9ef4116febea958dbc260e2da3a6bf6f601bfaeb2cd9ab944a29","hash":"0xf285b8fb173e371cbb427625b0339f6023f8b4ec3701b7a5c691fa9cef9daf63","from":"0x3c0cbb196e3847d40cb4d77d7dd3b386222998d9","to":"0x2ba24c66cbff0bda0e3053ea07325479b3ed1393","gas":"121000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"14","value":"0x24406420d09ce7440000","timestamp":"2016-07-24 20:28:11 UTC"}
{"blockNumber":"1941895","blockHash":"0x53464299a83cecc3e4d930b617c9518b8f74139265423d8110a919f5180bec79","hash":"0x0abg75e40a954d4d355e25e4498f3580e7d029769897d4187c323080a0be0fdd","from":"0x4586ffaf28e08b1613dd96ced9b57d52e8ad9d72","to":"0x91337a300e0361bddb2e377dd4e88ccb7796663d","gas":"21000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"1","value":"0x22c06103f88111000","timestamp":"2016-07-24 20:47:25 UTC"}
The code is here:
#data processing
import re
import pprint
#sorting
import operator
from collections import Counter
#visualization rendering
import matplotlib.pyplot as plt
import numpy as np
from operator import itemgetter
# read in the data
data = open('toy.json', 'r')
new_dict = {}
for line in data:
    identifier = re.search('(\"hash\"\:\s?\"(\w+)\")', line)
    if identifier:
        found_identifier = identifier.group(2)
        # print(found)
    gas = re.search('(\"gas\"\:\s?\"(\w+)\")', line)
    if gas:
        found_gas = gas.group(2)
        # print(found)
    new_dict.update({found_identifier:found_gas})
# make it into tuple
sorted_x = sorted(new_dict.items(), key=operator.itemgetter(1))
# prepare for counter
# flat_list = [x[1] for x in sorted_x if int(x[1]) > 1]
flat_list = [x[1] for x in sorted_x]
# count 'em up
flat_list = Counter(flat_list)
# prune threshold
to_remove = set()
for key, value in flat_list.viewitems():
    if value < 2:
        to_remove.add(key)
for key in to_remove:
    del flat_list[key]
pprint.pprint(flat_list)
# visualization
c = Counter(flat_list).items()
c.sort(key=itemgetter(1))
labels, values = zip(*c)
indexes = np.arange(len(labels))
width = 1
plt.bar(indexes, values, width)
plt.xticks(indexes + width * 0.5, labels)
plt.show()
1 Answer
Python has a built-in json module, so you should use that instead of parsing the lines yourself; there is no need for regex here. You can just iterate over the lines of the file (each of which seems to contain a valid JSON object) and build a list from that:
import json
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
def at_least(c, threshold):
    """Return a Counter of values which are at_least (>=) threshold"""
    return Counter(el for el in c.elements() if c[el] >= threshold)

def draw_hist(labels, values):
    indices = np.arange(len(labels))
    width = 1
    plt.bar(indices, values, width)
    plt.xticks(indices + width * 0.5, labels)
    plt.show()

if __name__ == "__main__":
    with open("toy.json") as f:
        data = [json.loads(line) for line in f]

    gas = {line["hash"]: int(line["gas"]) for line in data}
    gas_hist_values = at_least(Counter(gas.values()), 2)
    labels, values = zip(*reversed(gas_hist_values.most_common()))
    draw_hist(labels, values)
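For clarity, here is a quick check of what at_least does on a small, made-up Counter (the values are hypothetical and only for illustration; it assumes the at_least function defined above):
from collections import Counter

# Keep only elements whose count meets the threshold, preserving multiplicities.
c = Counter("aabbbc")       # Counter({'b': 3, 'a': 2, 'c': 1})
print(at_least(c, 2))       # Counter({'b': 3, 'a': 2}) -- 'c' occurs only once and is dropped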
Note that if a hash appears twice, this code, just like your original code, overwrites it with the value that appears last.
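If you actually wanted every record counted rather than one value per hash, a minimal variation (a sketch, assuming the same data list and at_least function as above) would skip the intermediate dict entirely:
# Count every record's gas value directly, without deduplicating by hash.
gas_counts = Counter(int(line["gas"]) for line in data)
gas_hist_values = at_least(gas_counts, 2)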
I took out a lot of unnecessary sorting and calls to Counter. Instead, I used the method Counter.most_common, which works like dict.items, except that it returns the items in order of decreasing counts. Then you just need to reverse them to get them in increasing order.
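To see why the reversal is needed, a small made-up example (the counts here are hypothetical):
from collections import Counter

c = Counter({"21000": 5, "121000": 4, "90000": 3})
print(c.most_common())                    # [('21000', 5), ('121000', 4), ('90000', 3)] -- decreasing counts
labels, values = zip(*reversed(c.most_common()))
print(labels, values)                     # ('90000', '121000', '21000') (3, 4, 5) -- increasing counts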
I also made your code more reusable by encapsulating some of the functionality in functions and putting the actual script code behind an if __name__ == "__main__" guard.
I also gave your variables clearer names.
Alternatively, since you are already using numpy and matplotlib, it might be worth taking a look at pandas. It has a read_json function, which needs only a little preparation: you have to read the whole file into a single string, wrap [] around the whole thing and add a , after each dictionary:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
with open("toy.json") as f:
df = pd.read_json("[" + ",".join(f) + "]", orient='records')
values, labels = np.histogram(df["gas"])
above_threshold = np.where(values >= 2)
plt.bar(labels[above_threshold], values[above_threshold], 1000)
While this is much easier, the x-axis now shows the actual numerical values instead of distinct labels. But for that you could reuse your plotting as well. Note that the bars are not sorted by count either; however, numerical values are usually easier to understand in increasing order than in order of increasing counts...
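If you do prefer the categorical, count-sorted bars from the first version, value_counts gets you there too; this is a sketch that assumes the df from the snippet above and the draw_hist function defined earlier:
counts = df["gas"].value_counts()     # Series mapping each gas value to its number of occurrences
counts = counts[counts >= 2]          # apply the same threshold of 2
counts = counts.sort_values()         # increasing counts, matching the Counter-based version
draw_hist([str(g) for g in counts.index], counts.values)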
- Why not json.loads line by line instead of the whole file at once? – 301_Moved_Permanently, Sep 15, 2017 at 9:34
- @MathiasEttinger This is indeed easier to understand, will fix the description as well... – Graipher, Sep 15, 2017 at 9:38
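Following up on that comment thread, a line-by-line variant (an untested sketch, not part of the original answer) could build the DataFrame without joining the file into one string:
import json
import pandas as pd

# Parse each line separately and let pandas build the DataFrame from the resulting dicts.
with open("toy.json") as f:
    df = pd.DataFrame([json.loads(line) for line in f])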