I'm trying to optimize the main loop of this code, and to pick up any "best practices" insights I can for the rest of it. The script reads in one large file full of tweets (50 MB to 1 GB), uses pandas to work with the data, and matplotlib to generate 2D graphs.
Currently it does not scale well and consumes massive amounts of RAM. To help save on cost/VPS resources, I would like to refine this code (:
Example import file:
{"created_at":"Mon Jan 25 21:41:03 +0000 2016","id":691737570879918080,"id_str":"691737570879918080","text":"Suspect Named in Antarctica \"Billy\" Case #fakeheadlinebot #learntocode #makeatwitterbot #javascript","source":"\u003ca href=\"http:\/\/javascriptiseasy.com\" rel=\"nofollow\"\u003eJavaScript is Easy\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":4382400263,"id_str":"4382400263","name":"JavaScript is Easy","screen_name":"javascriptisez","location":"Your Console","url":"http:\/\/javascriptiseasy.com","description":"Get learning!","protected":false,"verified":false,"followers_count":158,"friends_count":68,"listed_count":140,"favourites_count":11,"statuses_count":37545,"created_at":"Sat Dec 05 11:18:00 +0000 2015","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"FFCC4D","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/673099606348070912\/xNxp4zOt_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/673099606348070912\/xNxp4zOt_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/4382400263\/1449314370","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"
fakeheadlinebot","indices":[41,57]},{"text":"learntocode","indices":[58,70]},{"text":"makeatwitterbot","indices":[71,87]},{"text":"javascript","indices":[88,99]}],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1453758063417"}
{"created_at":"Mon Jan 25 21:41:04 +0000 2016","id":691737575044677633,"id_str":"691737575044677633","text":"#jobs #Canada # #Senior Software Engineer - Ruby on Rails: #BC-Richmond, Employer: Move Canada or Top Producer... https:\/\/t.co\/BLD8AYjHA7","source":"\u003ca href=\"http:\/\/twitterfeed.com\" rel=\"nofollow\"\u003etwitterfeed\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":4394450596,"id_str":"4394450596","name":"Finance Jobs","screen_name":"Finance_Jobs_","location":"Weil am Rhein","url":"http:\/\/jobsalibaba.com","description":"#Finance #Jobs #career","protected":false,"verified":false,"followers_count":891,"friends_count":851,"listed_count":154,"favourites_count":0,"statuses_count":7428,"created_at":"Sun Dec 06 13:40:55 +0000 2015","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"de","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/673501770673479680\/BztZ7L5a_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/673501770673479680\/BztZ7L5a_normal.png","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"jobs","indices":[0,5]},{"text":"Canada","indices":[6,13]},{"text":
"Senior","indices":[16,23]},{"text":"BC","indices":[59,62]}],"urls":[{"url":"https:\/\/t.co\/BLD8AYjHA7","expanded_url":"http:\/\/bit.ly\/1VlO2eV","display_url":"bit.ly\/1VlO2eV","indices":[114,137]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1453758064410"}
...
Imports, Config, and Static Variables:
#!/usr/bin/python
import re # Regular Expression
import sys
import json
import traceback
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from matplotlib import rcParams
## Current Date Time
current_datetime = datetime.now()
# Path to image output directory
input_directory = '/var/www/html/content/data/'
output_directory = '/var/www/html/content/graphs/'
# Set matplot settings
rcParams.update({'figure.autolayout': True})
Main Loop:
tweets_data = []
with open(tweets_data_path) as f:
    for i, line in enumerate(f):
        try:
            ## Skip "newline" entries
            if i % 2 == 1:
                continue
            ## Load tweets into array
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except Exception as e:
            print e
            continue

## Total # of tweets captured
print "decoded tweets: ", len(tweets_data)
Playing with Loaded Data:
## New Panda DataFrame
tweets = pd.DataFrame()
## Populate/map DataFrame with data
## tweet.get('text', None) ~= tweet['text'] ?? None
tweets['text'] = map(lambda tweet: tweet.get('text', None), tweets_data)
tweets['lang'] = map(lambda tweet: tweet.get('lang', None), tweets_data)
tweets['country'] = map(lambda tweet: None if tweet.get('place', None) is None else tweet.get('place', {}).get('country'), tweets_data)
## Chart for top 5 languages
tweets_by_lang = tweets['lang'].value_counts()
fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Languages', fontsize=15)
ax.set_ylabel('Number of tweets', fontsize=15)
ax.set_title('Top 5 Languages', fontsize=15, fontweight='bold')
tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
fig.savefig(output_directory + 'top-5-languages-' + str(current_datetime) + '.png')
## Show all of our grids ;)
##plt.show()
1 Answer
If you don't need to know the exact number of loaded tweets in your Main Loop part (and can omit the print call there), you can use a generator instead of a list. That way, the program loads and processes each line of the file just in time instead of allocating a huge block of memory to store a list of all items.
def load_tweets_data():
    with open(tweets_data_path) as f:
        for line in f:
            if line.strip():  # if it is not a blank line
                try:
                    yield json.loads(line)
                except Exception as e:
                    print e
Note that I also eliminated your approach of reading only every second line, which is far too inflexible. I replaced it with a simple test for whether the line contains any non-whitespace characters.
You then also have to modify the # Populate/map DataFrame with data part of your Playing with Loaded Data section, because each generator item can be consumed only once. That means you have to perform all analyses once per item instead of one analysis at a time over all items. It could look like this:
# Populate/map DataFrame with data
rows = []
for tweet in load_tweets_data():
    rows.append({
        'text': tweet.get('text', None),
        'lang': tweet.get('lang', None),
        'country': None if tweet.get('place', None) is None else tweet['place'].get('country'),
    })
tweets = pd.DataFrame(rows)
Alternatively to the country lookup in the snippet above, you could also use this (thanks to @oliverpool):
try:
    country = tweet['place']['country']
except (KeyError, TypeError):  # 'place' may be missing or null
    country = None
That's all you need to change to use a generator instead of a huge list.
Alternatively, you could have placed the code that populates the DataFrame directly in the loop you use to read the file.
Oh, and please use a single # to start comments instead of ##.
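A further option, not part of the original answer: newer pandas versions can stream a newline-delimited JSON file themselves. The sketch below assumes pandas >= 0.21 (which added `chunksize` support to `read_json` with `lines=True`); the sample data and the temporary file stand in for the real tweet dump.

```python
import json
import tempfile

import pandas as pd

# Tiny stand-in for the tweet dump: one JSON object per line.
sample = [
    {"text": "hello", "lang": "en", "place": None},
    {"text": "bonjour", "lang": "fr", "place": {"country": "France"}},
    {"text": "hi again", "lang": "en", "place": None},
]
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
    for tweet in sample:
        f.write(json.dumps(tweet) + '\n')
    path = f.name

# chunksize turns read_json into an iterator of DataFrames, so only
# one chunk is held in memory at a time.
lang_counts = pd.Series(dtype='int64')
for chunk in pd.read_json(path, lines=True, chunksize=2):
    lang_counts = lang_counts.add(chunk['lang'].value_counts(), fill_value=0)

print(lang_counts.sort_values(ascending=False))
```

The aggregation (here a running `value_counts` sum) happens per chunk, so peak memory is bounded by the chunk size rather than the file size.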
- \$\begingroup\$ I like this answer a lot. It constrains me to a single iteration, but provides an amazing performance boost. @SuperBiasedMan's solution in the original question's comments is a good/easy approach to save some space, but this wins out. Curious: why do you prefer a single # over ##? \$\endgroup\$ – Daniel Brown, Jan 27, 2016 at 19:10
- \$\begingroup\$ Because the BDFL (Benevolent Dictator For Life, a.k.a. Guido van Rossum, the creator of Python) advises it like that in PEP 0008 -- Style Guide for Python Code: Comments. :-) \$\endgroup\$ – Byte Commander, Jan 27, 2016 at 19:50
- \$\begingroup\$ I'm sure your last line can be improved: tweets['country'] = None if tweet.get('place', None) is None else tweet.get('place', {}).get('country'). You could do try: tweets['country'] = tweet['place']['country'] except: tweets['country'] = None (easier to ask for forgiveness than permission ;-) \$\endgroup\$ – oliverpool, Jan 27, 2016 at 20:51
- \$\begingroup\$ @oliverpool I added yours as an alternative, but you should only catch KeyErrors if you're only expecting KeyErrors. Overly general except statements are discouraged: such a block would e.g. also catch KeyboardInterrupt, which is not desired. \$\endgroup\$ – Byte Commander, Jan 27, 2016 at 21:28
- \$\begingroup\$ @ByteCommander You are totally right (I just didn't take enough time to write it out completely :-) \$\endgroup\$ – oliverpool, Jan 27, 2016 at 23:01
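To make the exception-specificity point from the comments above concrete, here is a small sketch (not from the original thread; get_country is a hypothetical helper name). Note that in the sample data `place` can also be explicitly null, in which case the lookup raises TypeError rather than KeyError, so both are worth catching while still staying specific:

```python
def get_country(tweet):
    # 'place' may be absent (KeyError) or explicitly null in the JSON,
    # in which case tweet['place'] is None and indexing it raises TypeError.
    # Catching only these two keeps e.g. KeyboardInterrupt uncaught.
    try:
        return tweet['place']['country']
    except (KeyError, TypeError):
        return None

print(get_country({'place': {'country': 'Canada'}}))  # Canada
print(get_country({'place': None}))                   # None
print(get_country({}))                                # None
```

A bare `except:` would give the same result here, but it would also silently swallow exceptions you never meant to handle.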
- \$\begingroup\$ What are the pros/cons of using a generator instead of my current approach? Any suggestions on a good place to read up on them? \$\endgroup\$
- \$\begingroup\$ You can iterate over a generator with a for loop, just like over any list or file or string etc. What distinguishes a generator from a list is that you can't access a generator's items through any kind of index, only sequentially, one after the other, and only once. But while a list stores all its items in memory and each item gets initialized when you create the list, a generator is lazy and creates each element just in time when you want to access it. Therefore, it needs much less memory. \$\endgroup\$
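The laziness described above can be observed directly with sys.getsizeof (a sketch; exact byte counts vary by Python version):

```python
import sys

# A list comprehension materializes every element up front; the
# equivalent generator expression stores only its paused frame.
squares_list = [n * n for n in range(100000)]
squares_gen = (n * n for n in range(100000))

print(sys.getsizeof(squares_list))  # hundreds of kilobytes
print(sys.getsizeof(squares_gen))   # ~100 bytes, independent of the range

# Items come out sequentially and only once:
print(next(squares_gen), next(squares_gen))  # 0 1
```

The generator's size stays constant no matter how large the range is, which is exactly why streaming the tweet file through one keeps RAM usage flat.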