
I wrote some Python code that runs slowly. Because I am new to Python, I am not sure I am doing everything right. My question is: what can I do to make it faster? About the problem: I have 25 *.json files, each about 80 MB. Each file just contains JSON strings, one per line. I need to build a histogram based on the data.

In this part I create a list of all dictionaries (each dictionary represents one JSON object):

d = []  # filename is a list of file names
for x in filename:
    d.extend(map(json.loads, open(x)))

Then I create the list u:

u = []
for x in d:
    s = x['key_1']  # s is a string from which I extract the useful value
    t1 = 60*int(s[11:13]) + int(s[14:16])  # t1 is the useful value
    u.append(t1)

Now I create the histogram:

plt.hist(u, bins = (max(u) - min(u)))
plt.show()

Any thoughts and suggestions are appreciated. Thank you!

asked Feb 10, 2012 at 3:51
  • More problematic than the title, for a data-centric question, is the lack of data. We can infer it from the program (key_1), but it will only ever be a guess. Commented Dec 1, 2024 at 13:32

4 Answers


You might be able to save some run time by using a couple of generator expressions and a list comprehension. For example:

def read_json_file(name):
    with open(name, "r") as f:
        return json.load(f)

def compute(s):
    return 60 * int(s[11:13]) + int(s[14:16])

d = (read_json_file(n) for n in filename)
u = [compute(x['key_1']) for x in d]
plt.hist(u, bins = (max(u) - min(u)))
plt.show()

This should save on memory, since anything that isn't needed is discarded.

Edit: it's difficult to discern from the information available, but I think the OP's json files contain multiple json objects, so calling json.load(f) won't work. If that is the case, then this code should fix the problem:

def read_json_file(name):
    """Return an iterable of objects loaded from the json file 'name'."""
    with open(name, "r") as f:
        for s in f:
            yield json.loads(s)

def compute(s):
    return 60 * int(s[11:13]) + int(s[14:16])

# d is a generator yielding an iterable at each iteration
d = (read_json_file(n) for n in filename)
# j is the flattened version of d
j = (obj for iterable in d for obj in iterable)
u = [compute(x['key_1']) for x in j]
plt.hist(u, bins = (max(u) - min(u)))
plt.show()
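
If the nested generator expression used for flattening looks opaque, itertools.chain.from_iterable does the same job. A minimal equivalent sketch, assuming the same filename list and 'key_1' layout as above:

import itertools
import json
import matplotlib.pyplot as plt

def read_json_file(name):
    """Yield one object per line of the json file 'name'."""
    with open(name, "r") as f:
        for line in f:
            yield json.loads(line)

def compute(s):
    return 60 * int(s[11:13]) + int(s[14:16])

# chain.from_iterable flattens the per-file generators into one stream of objects
objects = itertools.chain.from_iterable(read_json_file(n) for n in filename)
u = [compute(x['key_1']) for x in objects]
plt.hist(u, bins = (max(u) - min(u)))
plt.show()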
answered Feb 10, 2012 at 4:30
  • your code gives the error: ValueError: Extra data: line 2 column 1 - line 131944 column 1 (char 907 - 96281070) Commented Feb 10, 2012 at 6:05
  • I created a couple of test json files and ran this code in Python 2.7 and it worked fine for me. The error appears to be in reading the json file, but without seeing your actual code and the content of your json files, it's very difficult for me to diagnose the problem. Commented Feb 10, 2012 at 6:16
  • @srgerg it returns the error from the read_json_file function Commented Feb 10, 2012 at 6:24
  • here is my code: def read_json_file(name): with open(name,'r') as f: return json.loads(f.read()) def compute_time(s): return 60 * int(s[11:13]) + int(s[14:16]) d = (read_json_file(n) for n in filename) u = list(map(compute_time, (x['time'] for x in d))) Commented Feb 10, 2012 at 6:26
  • There is a great piece about generators and doing this exact kind of work available here: dabeaz.com/generators Commented Feb 10, 2012 at 8:32

Python uses a surprisingly large amount of memory when reading files, often 3-4 times the actual file size. You never close the files after you open them, so all of that memory is still in use later in the program.

Try changing the flow of your program to

  1. Open a file
  2. Compute a histogram for that file
  3. Close the file
  4. Merge it with a "global" histogram
  5. Repeat until there are no files left.

Something like

u = []
for f in filenames:
    with open(f) as file:
        # process individual file contents
        contents = file.read()
        data = json.loads(contents)
        for obj in data:
            s = obj['key_1']
            t1 = 60 * int(s[11:13]) + int(s[14:16])
            u.append(t1)

# make the global histogram
plt.hist(u, bins = (max(u) - min(u)))
plt.show()

The with open(...) as ... form automatically closes the file when you're done, including when the file can't be read or some other error occurs.
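
The sample code above still collects every value before calling plt.hist. If memory is the real concern, the per-file merge described in the numbered list can be sketched roughly as follows, assuming the extracted values are minutes of the day (0-1439, as implied by 60*hh + mm) and that each line of a file is one JSON object:

import json
import numpy as np
import matplotlib.pyplot as plt

bins = np.arange(0, 24 * 60 + 1)        # one bin per minute of the day (assumption)
counts = np.zeros(len(bins) - 1, dtype=int)

for name in filenames:
    with open(name) as file:
        values = [60 * int(obj['key_1'][11:13]) + int(obj['key_1'][14:16])
                  for obj in map(json.loads, file)]
    file_counts, _ = np.histogram(values, bins=bins)
    counts += file_counts               # merge this file's histogram into the global one

plt.bar(bins[:-1], counts, width=1)
plt.show()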

answered Feb 10, 2012 at 4:11

Using mind-reading (and a stray comment from the OP), I can tell that this:

60*int(s[11:13]) + int(s[14:16])

is a string-formatted time. Because the OP is no longer registered - and didn't leave us with sample data - and none of the existing answers have picked up on the time element, let's invent a datetime format that would match:

0123456789012345678 # index
yyyy-mm-ddThh:mm:ss

IOW, ISO8601.
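
A quick check that the question's slice indices line up with that layout (the timestamp below is made up):

s = "2012-02-10T03:51:00"    # made-up timestamp in the assumed layout
assert s[11:13] == "03" and s[14:16] == "51"
print(60 * int(s[11:13]) + int(s[14:16]))    # 231 minutes past midnight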

Just use Pandas. As a trivial example,

import io
import pandas as pd

with io.StringIO('''
[
    {"key1": "2012-02-01T05:30"},
    {"key1": "2012-02-02T06:30"},
    {"key1": "2012-02-03T07:30"}
]
''') as f:
    df = pd.read_json(
        f,
        convert_dates=['key1'],
    )
print(df)

                 key1
0 2012-02-01 05:30:00
1 2012-02-02 06:30:00
2 2012-02-03 07:30:00
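
To go from the parsed datetimes to the histogram in the question, the minutes-of-day value can be derived with the .dt accessor. A small sketch under the same invented format (the key1 name and the one-bin-per-minute choice are carried over as assumptions):

import matplotlib.pyplot as plt

# minutes since midnight, the same value the question computes from string slices
minutes = df['key1'].dt.hour * 60 + df['key1'].dt.minute
plt.hist(minutes, bins=max(minutes) - min(minutes))
plt.show()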
answered Dec 1, 2024 at 13:51

I'd use this, as it avoids loading and keeping all of the JSON data in memory:

u = []
for name in filename:
    d = json.load(open(name, "r"))
    for x in d:
        s = x['key_1']
        t1 = 60*int(s[11:13]) + int(s[14:16])
        u.append(t1)
    d = None
plt.hist(u, bins = (max(u) - min(u)))
plt.show()
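
json.load above expects each file to be a single JSON document. If the files actually hold one JSON object per line, as the question's map(json.loads, open(x)) and the "Extra data" error in the comments suggest, a minimal with-based sketch of the same loop would be:

import json
import matplotlib.pyplot as plt

u = []
for name in filename:
    with open(name, "r") as f:      # the file is closed as soon as the block ends
        for line in f:              # one JSON object per line (assumption)
            x = json.loads(line)
            s = x['key_1']
            u.append(60*int(s[11:13]) + int(s[14:16]))

plt.hist(u, bins=(max(u) - min(u)))
plt.show()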
answered Feb 10, 2012 at 4:11
  • Don't you want json.load(open(name, "r")) instead of loads, since the latter takes a string as its argument? Commented Feb 10, 2012 at 4:15
  • @srgerg Yup. :/ Commented Feb 10, 2012 at 4:17
