I have the following fully working code that
- imports JSON files,
- parses the tweets contained in the JSONs,
- records them in a table in a data frame.

Considering that I currently analyze about 1,400 JSON files (roughly 1.5 GB) per run, the code takes quite some time to execute. Please suggest whether there is a plausible way to optimize the code to increase its speed. Thanks!
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

tweets = []
for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if file.endswith('.json'):
            print(file)
            for line in open(file):
                try:
                    tweet = json.loads(line)
                    tweets.append(tweet)
                except:
                    continue
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]
df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})
df
1 Answer
Just a few quick considerations:

- You have `import os` twice.
- You are not using `matplotlib` and `numpy`, so those imports can go.
- The line `tweet = tweets[0]` is useless.
- You're not closing the files you open; you should use the `with` keyword.

Two optimizations:

- I'd remove the `print(file)`. This is probably the single best optimization you can do.
- You're already looping over the tweets once, so why loop another five times?
What about something like this (not tested!):
from collections import defaultdict

# Collect one list per key in a single pass over the tweets.
elements_keys = ['id_str', 'text', 'lang', 'geo', 'place']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if file.endswith('.json'):
            with open(file, 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        for key in elements_keys:
                            elements[key].append(tweet[key])
                    except:
                        continue

df = pd.DataFrame({'Ids': pd.Index(elements['id_str']),
                   'Text': pd.Index(elements['text']),
                   'Lang': pd.Index(elements['lang']),
                   'Geo': pd.Index(elements['geo']),
                   'Place': pd.Index(elements['place'])})
df
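Going one step further (also untested): you could collect each tweet as a small dict and let pandas assemble the frame in one call, which drops the `defaultdict` bookkeeping entirely. Using `tweet.get(key)` keeps the rows aligned even when a tweet lacks one of the keys:

import json
import os
import pandas as pd

keys = ['id_str', 'text', 'lang', 'geo', 'place']
records = []
for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if file.endswith('.json'):
            with open(file, 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                    except ValueError:  # skip lines that are not valid JSON
                        continue
                    # Missing keys become None instead of desyncing the columns.
                    records.append({key: tweet.get(key) for key in keys})

df = pd.DataFrame(records, columns=keys)
df = df.rename(columns={'id_str': 'Ids', 'text': 'Text', 'lang': 'Lang',
                        'geo': 'Geo', 'place': 'Place'})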
answered Mar 30, 2017 at 7:03
- ChatterOne, Graipher, thank you for your support and helpful replies. The code suggested above reduced the processing time from 2 hours to 15 minutes; this is a significant achievement. – kiton, Mar 30, 2017 at 15:12
- I have an additional question in relation to the code above. From the perspective of AWS EC2 instance types, does the code require a compute-optimized or a memory-optimized instance? Thank you! – kiton, Apr 2, 2017 at 20:12
- @kiton The processing part is not really a big deal and is quite "simple" in terms of computing. On the other hand, you're going to store all of your results in memory, so memory would be more important here. But keep in mind that we're talking about 3 or 4 GiB; I doubt you'll see much of a difference between e.g. an X1 and a T2. – ChatterOne, Apr 2, 2017 at 21:11
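If memory ever does become the constraint, one way to cap it is to flush parsed rows to disk in fixed-size chunks instead of keeping every tweet in RAM. A minimal sketch of that idea (untested; the directory, output file name, and chunk size are placeholder assumptions):

import json
import os
import pandas as pd

CHUNK_SIZE = 100000  # rows held in memory at once; tune to the instance
columns = ['id_str', 'text', 'lang', 'geo', 'place']

def flush(rows, path, first):
    # Append one chunk to the CSV; write the header only for the first chunk.
    pd.DataFrame(rows, columns=columns).to_csv(
        path, mode='w' if first else 'a', header=first, index=False)

rows, first = [], True
for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if not file.endswith('.json'):
            continue
        # Join the directory so files outside the working directory open too.
        with open(os.path.join(dirs, file)) as input_file:
            for line in input_file:
                try:
                    tweet = json.loads(line)
                except ValueError:
                    continue
                rows.append([tweet.get(key) for key in columns])
                if len(rows) >= CHUNK_SIZE:
                    flush(rows, 'tweets.csv', first)
                    rows, first = [], False
if rows:
    flush(rows, 'tweets.csv', first)

With this layout, peak memory is roughly one chunk of rows plus pandas overhead, so a general-purpose instance should be sufficient.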