I'm trying to optimize the main loop of this code, and to pick up any "best practices" insights I can for the rest of it. The script reads in one large file full of tweets (50 MB to 1 GB), uses pandas to work with the data, and matplotlib to generate 2D graphs.
Currently it does not scale well and consumes massive amounts of RAM. To help save on cost/VPS resources, I would like to refine this code (:
Example import file:
{"created_at":"Mon Jan 25 21:41:03 +0000 2016","id":691737570879918080,"id_str":"691737570879918080","text":"Suspect Named in Antarctica \"Billy\" Case #fakeheadlinebot #learntocode #makeatwitterbot #javascript","source":"\u003ca href=\"http:\/\/javascriptiseasy.com\" rel=\"nofollow\"\u003eJavaScript is Easy\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":4382400263,"id_str":"4382400263","name":"JavaScript is Easy","screen_name":"javascriptisez","location":"Your Console","url":"http:\/\/javascriptiseasy.com","description":"Get learning!","protected":false,"verified":false,"followers_count":158,"friends_count":68,"listed_count":140,"favourites_count":11,"statuses_count":37545,"created_at":"Sat Dec 05 11:18:00 +0000 2015","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"FFCC4D","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/673099606348070912\/xNxp4zOt_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/673099606348070912\/xNxp4zOt_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/4382400263\/1449314370","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"
fakeheadlinebot","indices":[41,57]},{"text":"learntocode","indices":[58,70]},{"text":"makeatwitterbot","indices":[71,87]},{"text":"javascript","indices":[88,99]}],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1453758063417"}
{"created_at":"Mon Jan 25 21:41:04 +0000 2016","id":691737575044677633,"id_str":"691737575044677633","text":"#jobs #Canada # #Senior Software Engineer - Ruby on Rails: #BC-Richmond, Employer: Move Canada or Top Producer... https:\/\/t.co\/BLD8AYjHA7","source":"\u003ca href=\"http:\/\/twitterfeed.com\" rel=\"nofollow\"\u003etwitterfeed\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":4394450596,"id_str":"4394450596","name":"Finance Jobs","screen_name":"Finance_Jobs_","location":"Weil am Rhein","url":"http:\/\/jobsalibaba.com","description":"#Finance #Jobs #career","protected":false,"verified":false,"followers_count":891,"friends_count":851,"listed_count":154,"favourites_count":0,"statuses_count":7428,"created_at":"Sun Dec 06 13:40:55 +0000 2015","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"de","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/673501770673479680\/BztZ7L5a_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/673501770673479680\/BztZ7L5a_normal.png","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"jobs","indices":[0,5]},{"text":"Canada","indices":[6,13]},{"text":
"Senior","indices":[16,23]},{"text":"BC","indices":[59,62]}],"urls":[{"url":"https:\/\/t.co\/BLD8AYjHA7","expanded_url":"http:\/\/bit.ly\/1VlO2eV","display_url":"bit.ly\/1VlO2eV","indices":[114,137]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1453758064410"}
...
Imports, Config, and Static Variables:
#!/usr/bin/python
import re # Regular Expression
import sys
import json
import traceback
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from matplotlib import rcParams
## Current Date Time
current_datetime = datetime.now()
# Path to image output directory
input_directory = '/var/www/html/content/data/'
output_directory = '/var/www/html/content/graphs/'
# Set matplot settings
rcParams.update({'figure.autolayout': True})
Main Loop:
tweets_data = []
with open(tweets_data_path) as f:
    for i, line in enumerate(f):
        try:
            ## Skip "newline" entries
            if i % 2 == 1:
                continue
            ## Load tweets into array
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except Exception as e:
            print e
            continue

## Total # of tweets captured
print "decoded tweets: ", len(tweets_data)
Playing with Loaded Data:
## New Panda DataFrame
tweets = pd.DataFrame()
## Populate/map DataFrame with data
## tweet.get('text', None) ~= tweet['text'] ?? None
tweets['text'] = map(lambda tweet: tweet.get('text', None), tweets_data)
tweets['lang'] = map(lambda tweet: tweet.get('lang', None), tweets_data)
tweets['country'] = map(lambda tweet: None if tweet.get('place', None) is None else tweet.get('place', {}).get('country'), tweets_data)
## Chart for top 5 languages
tweets_by_lang = tweets['lang'].value_counts()
fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Languages', fontsize=15)
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Top 5 Languages', fontsize=15, fontweight='bold')
tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
fig.savefig(output_directory + 'top-5-languages-' + str(current_datetime) + '.png')
## Show all of our grids ;)
##plt.show()
1 Answer
If you don't need to know the exact number of loaded tweets in your Main Loop part (and can omit the print call there), you can use a generator instead of a list. That way, the program loads and processes each line of the file just in time instead of allocating a huge block of memory to store a list of all items.
def load_tweets_data():
    with open(tweets_data_path) as f:
        for line in f:
            if line.strip():  # if it is not a blank line
                try:
                    yield json.loads(line)
                except Exception as e:
                    print e
Note that I also eliminated your approach of reading only every second line, which is far too inflexible. I replaced it with a simple test for whether the line contains any non-whitespace characters.
You then also have to modify the # Populate/map DataFrame with data part of your Playing with Loaded Data section, because each generator item can be consumed only once. That means you have to perform all analyses once per item instead of one analysis at a time over all items. It could look like this:
# Populate/map DataFrame with data
rows = []
for tweet in load_tweets_data():
    rows.append({
        'text': tweet.get('text', None),
        'lang': tweet.get('lang', None),
        'country': None if tweet.get('place', None) is None else tweet['place'].get('country'),
    })
tweets = pd.DataFrame(rows)
Alternatively to the country lookup in the snippet above, you could also use this (thanks to @oliverpool):
try:
    country = tweet['place']['country']
except (KeyError, TypeError):  # 'place' may be missing or null
    country = None
That's all you need to change to use a generator instead of a huge list.
Alternatively, you could have placed the code that populates the DataFrame directly in the loop you use to read the file.
Oh, and please use a single # to start comments instead of ##.
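A further option, not part of the original answer: newer pandas versions can stream a newline-delimited JSON file themselves. The sketch below assumes pandas >= 0.21 (which added `chunksize` support to `read_json` with `lines=True`); the sample data and the temporary file stand in for the real tweet dump.

```python
import json
import tempfile

import pandas as pd

# Tiny stand-in for the tweet dump: one JSON object per line.
sample = [
    {"text": "hello", "lang": "en", "place": None},
    {"text": "bonjour", "lang": "fr", "place": {"country": "France"}},
    {"text": "hi again", "lang": "en", "place": None},
]
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
    for tweet in sample:
        f.write(json.dumps(tweet) + '\n')
    path = f.name

# chunksize turns read_json into an iterator of DataFrames, so only
# one chunk is held in memory at a time.
lang_counts = pd.Series(dtype='int64')
for chunk in pd.read_json(path, lines=True, chunksize=2):
    lang_counts = lang_counts.add(chunk['lang'].value_counts(), fill_value=0)

print(lang_counts.sort_values(ascending=False))
```

The aggregation (here a running `value_counts` sum) happens per chunk, so peak memory is bounded by the chunk size rather than the file size.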
- \$\begingroup\$ I like this answer a lot. It constrains me to a single iteration, but provides an amazing performance boost. @SuperBiasedMan's solution in the original question's comments is a good/easy approach to save some space, but this wins out. Curious: why do you prefer a single # over ##? \$\endgroup\$ – Daniel Brown, Jan 27, 2016 at 19:10
- \$\begingroup\$ Because the BDFL (Benevolent Dictator For Life, a.k.a. Guido van Rossum, the creator of Python) advises it like that in PEP 0008 -- Style Guide for Python Code: Comments. :-) \$\endgroup\$ – Byte Commander, Jan 27, 2016 at 19:50
- \$\begingroup\$ I'm sure your last line can be improved: tweets['country'] = None if tweet.get('place', None) is None else tweet.get('place', {}).get('country'). You could do try: tweets['country'] = tweet['place']['country'] except: tweets['country'] = None (easier to ask for forgiveness than permission ;-) \$\endgroup\$ – oliverpool, Jan 27, 2016 at 20:51
- \$\begingroup\$ @oliverpool I added yours as an alternative, but you should only catch KeyErrors if you're only expecting KeyErrors. Overly general except statements are discouraged: such a block would e.g. also catch KeyboardInterrupt, which is not desired. \$\endgroup\$ – Byte Commander, Jan 27, 2016 at 21:28
- \$\begingroup\$ @ByteCommander You are totally right (I just didn't take enough time to write it out completely :-) \$\endgroup\$ – oliverpool, Jan 27, 2016 at 23:01
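To make the exception-specificity point from the comments above concrete, here is a small sketch (not from the original thread; get_country is a hypothetical helper name). Note that in the sample data `place` can also be explicitly null, in which case the lookup raises TypeError rather than KeyError, so both are worth catching while still staying specific:

```python
def get_country(tweet):
    # 'place' may be absent (KeyError) or explicitly null in the JSON,
    # in which case tweet['place'] is None and indexing it raises TypeError.
    # Catching only these two keeps e.g. KeyboardInterrupt uncaught.
    try:
        return tweet['place']['country']
    except (KeyError, TypeError):
        return None

print(get_country({'place': {'country': 'Canada'}}))  # Canada
print(get_country({'place': None}))                   # None
print(get_country({}))                                # None
```

A bare `except:` would give the same result here, but it would also silently swallow exceptions you never meant to handle.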
- \$\begingroup\$ What are the pros/cons of using a generator instead of my current approach? Any suggestions on a good place to read up on them? \$\endgroup\$
- \$\begingroup\$ You can iterate over a generator with a for loop, just like over any list or file or string etc. What distinguishes a generator from a list is that you can't access a generator's items through any kind of index, only sequentially, one after the other, and only once. But while a list stores all its items in memory and each item gets initialized when you create the list, a generator is lazy and creates each element just in time when you want to access it. Therefore, it needs much less memory. \$\endgroup\$
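The laziness described above can be observed directly with sys.getsizeof (a sketch; exact byte counts vary by Python version):

```python
import sys

# A list comprehension materializes every element up front; the
# equivalent generator expression stores only its paused frame.
squares_list = [n * n for n in range(100000)]
squares_gen = (n * n for n in range(100000))

print(sys.getsizeof(squares_list))  # hundreds of kilobytes
print(sys.getsizeof(squares_gen))   # ~100 bytes, independent of the range

# Items come out sequentially and only once:
print(next(squares_gen), next(squares_gen))  # 0 1
```

The generator's size stays constant no matter how large the range is, which is exactly why streaming the tweet file through one keeps RAM usage flat.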