I have the following fully working code that
- imports JSON files,
- parses the tweets contained in the JSONs,
- records them in a table in a data frame.

Considering that I currently analyze about 1,400 JSON files (roughly 1.5 GB) per run, the code takes quite some time to execute. Please suggest whether there is a plausible way to optimize the code to increase its speed. Thanks!
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

tweets = []
for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if file.endswith('.json'):
            print(file)
            for line in open(file):
                try:
                    tweet = json.loads(line)
                    tweets.append(tweet)
                except:
                    continue
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]
df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})
df
1 Answer
Just a few quick considerations:

- You have `import os` twice.
- You are not using `matplotlib` and `numpy`, so those imports can go.
- The line `tweet = tweets[0]` is useless.
- You're not closing the files you open; you should use the `with` keyword.

Two optimizations:

- I'd remove the `print(file)`. This is probably the single best optimization you can do.
- You're already looping over the tweets once, so why loop another five times?
What about something like this (not tested!):
from collections import defaultdict

# Collect one list per key in a single pass over the tweets.
elements_keys = ['id_str', 'text', 'lang', 'geo', 'place']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if file.endswith('.json'):
            with open(file, 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        for key in elements_keys:
                            elements[key].append(tweet[key])
                    except:
                        continue

df = pd.DataFrame({'Ids': pd.Index(elements['id_str']),
                   'Text': pd.Index(elements['text']),
                   'Lang': pd.Index(elements['lang']),
                   'Geo': pd.Index(elements['geo']),
                   'Place': pd.Index(elements['place'])})
df
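Going one step further (also untested): you could collect each tweet as a small dict and let pandas assemble the frame in one call, which drops the `defaultdict` bookkeeping entirely. Using `tweet.get(key)` keeps the rows aligned even when a tweet lacks one of the keys:

import json
import os
import pandas as pd

keys = ['id_str', 'text', 'lang', 'geo', 'place']
records = []
for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if file.endswith('.json'):
            with open(file, 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                    except ValueError:  # skip lines that are not valid JSON
                        continue
                    # Missing keys become None instead of desyncing the columns.
                    records.append({key: tweet.get(key) for key in keys})

df = pd.DataFrame(records, columns=keys)
df = df.rename(columns={'id_str': 'Ids', 'text': 'Text', 'lang': 'Lang',
                        'geo': 'Geo', 'place': 'Place'})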
answered Mar 30, 2017 at 7:03
- ChatterOne, Graipher, thank you for your support and helpful replies. The code suggested above reduced the processing time from 2 hours to 15 minutes; this is a significant achievement. – kiton, Mar 30, 2017 at 15:12
- I have an additional question in relation to the code above. From the perspective of AWS EC2 instance types, does the code require a compute-optimized or a memory-optimized instance? Thank you! – kiton, Apr 2, 2017 at 20:12
- @kiton The processing part is not really a big deal and is quite "simple" in terms of computing. On the other hand, you're going to store all of your results in memory, so memory would be more important here. But keep in mind that we're talking about 3 or 4 GiB; I doubt you'll see much of a difference between e.g. an X1 and a T2. – ChatterOne, Apr 2, 2017 at 21:11
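If memory ever does become the constraint, one way to cap it is to flush parsed rows to disk in fixed-size chunks instead of keeping every tweet in RAM. A minimal sketch of that idea (untested; the directory, output file name, and chunk size are placeholder assumptions):

import json
import os
import pandas as pd

CHUNK_SIZE = 100000  # rows held in memory at once; tune to the instance
columns = ['id_str', 'text', 'lang', 'geo', 'place']

def flush(rows, path, first):
    # Append one chunk to the CSV; write the header only for the first chunk.
    pd.DataFrame(rows, columns=columns).to_csv(
        path, mode='w' if first else 'a', header=first, index=False)

rows, first = [], True
for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if not file.endswith('.json'):
            continue
        # Join the directory so files outside the working directory open too.
        with open(os.path.join(dirs, file)) as input_file:
            for line in input_file:
                try:
                    tweet = json.loads(line)
                except ValueError:
                    continue
                rows.append([tweet.get(key) for key in columns])
                if len(rows) >= CHUNK_SIZE:
                    flush(rows, 'tweets.csv', first)
                    rows, first = [], False
if rows:
    flush(rows, 'tweets.csv', first)

With this layout, peak memory is roughly one chunk of rows plus pandas overhead, so a general-purpose instance should be sufficient.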