I wrote some Python code that runs slowly. Since I am new to Python, I am not sure I am doing everything right. My question is: what can I do to make it faster? About the problem: I have 25 *.json files, each about 80 MB. Each file contains JSON strings, one per line. I need to make a histogram based on the data.
In this part I want to create a list of all the dictionaries (one dictionary represents one JSON object):
d = []  # filename is a list of file names
for x in filename:
    d.extend(map(json.loads, open(x)))
Then I want to create the list u:
u = []
for x in d:
    s = x['key_1']  # s is a string from which I extract the useful value
    t1 = 60*int(s[11:13]) + int(s[14:16])  # t1 is the useful value
    u.append(t1)
Now I create the histogram:
plt.hist(u, bins = (max(u) - min(u)))
plt.show()
Any thoughts and suggestions are appreciated. Thank you!
4 Answers
You might be able to save some run time by using a couple of generator expressions and a list comprehension. For example:
def read_json_file(name):
    with open(name, "r") as f:
        return json.load(f)

def compute(s):
    return 60 * int(s[11:13]) + int(s[14:16])

d = (read_json_file(n) for n in filename)
u = [compute(x['key_1']) for x in d]
plt.hist(u, bins=(max(u) - min(u)))
plt.show()
This should save on memory, since anything that isn't needed is discarded.
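To illustrate the memory difference between a materialized list and a generator expression, here is a minimal sketch (the numbers are just a demonstration, not the OP's data):

```python
import sys

# A list materializes every element up front; a generator expression
# only stores its iteration state, regardless of how many items it yields.
numbers_list = [n * n for n in range(1_000_000)]
numbers_gen = (n * n for n in range(1_000_000))

print(sys.getsizeof(numbers_list))  # several megabytes
print(sys.getsizeof(numbers_gen))   # a few hundred bytes at most
```

This is why chaining generators keeps peak memory low: each JSON object can be consumed and discarded before the next one is produced.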
Edit: it's difficult to discern from the information available, but I think the OP's JSON files contain multiple JSON objects (one per line), so calling json.load(f) on the whole file won't work. If that is the case, then this code should fix the problem:
def read_json_file(name):
    "Return an iterable of objects loaded from the json file 'name'"
    with open(name, "r") as f:
        for s in f:
            yield json.loads(s)

def compute(s):
    return 60 * int(s[11:13]) + int(s[14:16])

# d is a generator yielding an iterable at each iteration
d = (read_json_file(n) for n in filename)
# j is the flattened version of d
j = (obj for iterable in d for obj in iterable)
u = [compute(x['key_1']) for x in j]
plt.hist(u, bins=(max(u) - min(u)))
plt.show()
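As an aside, the flattening step can also be written with the standard library's itertools.chain.from_iterable, which is equivalent to the nested generator expression and stays lazy; a small sketch with invented data:

```python
from itertools import chain

# Equivalent to: (obj for iterable in nested for obj in iterable)
nested = ([1, 2], [3], [4, 5])
flat = list(chain.from_iterable(nested))
print(flat)  # [1, 2, 3, 4, 5]
```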
- your code gives error: ValueError: Extra data: line 2 column 1 - line 131944 column 1 (char 907 - 96281070) – capoluca, Feb 10, 2012 at 6:05
- I created a couple of test json files and ran this code in Python 2.7 and it worked fine for me. The error appears to be in reading the json file, but without seeing your actual code and the content of your json files, it's very difficult for me to diagnose the problem. – srgerg, Feb 10, 2012 at 6:16
- @srgerg it returns the error from the read_json_file function – capoluca, Feb 10, 2012 at 6:24
- here is my code: def read_json_file(name): with open(name,'r') as f: return json.loads(f.read()) def compute_time(s): return 60 * int(s[11:13]) + int(s[14:16]) d = (read_json_file(n) for n in filename) u = list(map(compute_time, (x['time'] for x in d))) – capoluca, Feb 10, 2012 at 6:26
- There is a great piece about generators and doing this exact kind of work available here: dabeaz.com/generators – Darb, Feb 10, 2012 at 8:32
Python uses a surprisingly large amount of memory when reading files, often 3-4 times the actual file size. You never close each file after you open it, so all of that memory is still in use later in the program.
Try changing the flow of your program to
- Open a file
- Compute a histogram for that file
- Close the file
- Merge it with a "global" histogram
- Repeat until there are no files left.
Something like
u = []
for f in filenames:
with open(f) as file:
# process individual file contents
contents = file.read()
data = json.loads(contents)
for obj in data:
s = obj['key_1']
t1 = 60 * int(s[11:13]) + int(s[14:16])
u.append(t1)
# make the global histogram
plt.hist(u, bins = (max(u) - min(u)))
plt.show()
with open(...) as ... automatically closes the file when you're done, including in cases where the file can't be read or other errors occur.
Using mind-reading (and a stray comment from the OP), I can tell that this:
60*int(s[11:13]) + int(s[14:16])
is a string-formatted time. Because the OP is no longer registered - and didn't leave us with sample data - and none of the existing answers have picked up on the time element, let's invent a datetime format that would match:
0123456789012345678 # index
yyyy-mm-ddThh:mm:ss
IOW, ISO8601.
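Under that assumed format, the slicing arithmetic matches what real date parsing gives; a sketch using the standard library (the sample string is invented, following the assumed ISO 8601 shape):

```python
from datetime import datetime

s = "2012-02-10T06:30:00"  # invented sample in the assumed format

# Slicing approach from the question: minutes since midnight.
t_slice = 60 * int(s[11:13]) + int(s[14:16])

# Same value via actual parsing, which also validates the input.
dt = datetime.fromisoformat(s)
t_parsed = 60 * dt.hour + dt.minute

print(t_slice, t_parsed)  # 390 390
```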
Just use Pandas. As a trivial example,
import io
import pandas as pd

with io.StringIO('''
[
    {"key1": "2012-02-01T05:30"},
    {"key1": "2012-02-02T06:30"},
    {"key1": "2012-02-03T07:30"}
]
''') as f:
    df = pd.read_json(
        f,
        convert_dates=['key1'],
    )
print(df)
                 key1
0 2012-02-01 05:30:00
1 2012-02-02 06:30:00
2 2012-02-03 07:30:00
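From there, the minutes-of-day value the question computes by slicing falls out of the .dt accessor, vectorized over the whole column; a sketch continuing from parsed datetimes like the ones above (column name key1 as in the example, sample dates invented):

```python
import pandas as pd

# Parsed datetimes, as pd.read_json with convert_dates would produce.
df = pd.DataFrame(
    {"key1": pd.to_datetime(["2012-02-01T05:30", "2012-02-02T06:30"])}
)

# Minutes since midnight for every row at once, no string slicing needed.
u = df["key1"].dt.hour * 60 + df["key1"].dt.minute
print(u.tolist())  # [330, 390]
```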
I'd use this, as it avoids loading and keeping all of the JSON data in memory at once:
u = []
for name in filename:
    d = json.load(open(name, "r"))
    for x in d:
        s = x['key_1']
        t1 = 60*int(s[11:13]) + int(s[14:16])
        u.append(t1)
    d = None
plt.hist(u, bins=(max(u) - min(u)))
plt.show()
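If the files actually hold one JSON object per line (the ValueError reported in the comments on the first answer suggests they do), json.load on the whole file will fail; a line-by-line variant, sketched here against an in-memory stand-in for a file with invented data:

```python
import io
import json

# Stand-in for one of the OP's files: one JSON object per line.
f = io.StringIO('{"key_1": "2012-02-10T06:30:00"}\n'
                '{"key_1": "2012-02-10T07:45:00"}\n')

u = []
for line in f:
    x = json.loads(line)  # parse one object at a time
    s = x['key_1']
    u.append(60 * int(s[11:13]) + int(s[14:16]))

print(u)  # [390, 465]
```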
- Don't you want json.load(open(name, "r")) instead of json.loads, since the latter takes a string as its argument? – srgerg, Feb 10, 2012 at 4:15
- @srgerg Yup. :/ – Dan D., Feb 10, 2012 at 4:17