If you don't need to know the exact number of loaded tweets in your Main Loop part (and can omit the print call there), you can use a generator instead of a list. That way, the program loads and processes each line of the file just in time instead of allocating a huge block of memory to hold a list of all items.
def load_tweets_data():
    with open(tweets_data_path) as f:
        for line in f:
            if line.strip():  # skip blank lines
                try:
                    yield json.loads(line)
                except ValueError as e:  # invalid JSON on this line
                    print(e)
Note that I also eliminated your approach of reading only every other line, which is far too inflexible. I replaced it with a simple test of whether the line contains any non-whitespace characters.
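As a self-contained illustration of this skipping behavior (using an in-memory file via io.StringIO instead of your tweets_data_path, and taking the file object as a parameter so it runs standalone), the generator lazily skips blank lines and reports invalid JSON without aborting:

```python
import io
import json

def load_tweets_data(f):
    """Yield one parsed tweet per non-blank line, skipping invalid JSON."""
    for line in f:
        if line.strip():  # skip blank lines
            try:
                yield json.loads(line)
            except ValueError as e:  # json.JSONDecodeError subclasses ValueError
                print(e)

raw = io.StringIO('{"text": "hi", "lang": "en"}\n\n{not json}\n{"text": "yo"}\n')
print([t["text"] for t in load_tweets_data(raw)])  # ['hi', 'yo']
```

Nothing is parsed until the list comprehension starts pulling items from the generator, which is exactly why the whole file never needs to sit in memory at once.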
You then have to modify your Playing with Loaded Data: # Populate/map DataFrame with data part as well, because you can consume each generator item only once. That means you have to perform all analyses once per item instead of iterating over all items once per analysis. It could look like this:
# Populate/map DataFrame with data
for tweet in load_tweets_data():
    tweets['text'] = tweet.get('text')
    tweets['lang'] = tweet.get('lang')
    tweets['country'] = None if tweet.get('place') is None else tweet['place'].get('country')
As an alternative to the last line of the snippet above, you could also use this (thanks to @oliverpool):
try:
    tweets['country'] = tweet['place']['country']
except (KeyError, TypeError):  # 'place' is missing, or present but None
    tweets['country'] = None
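Both styles extract the same value; a quick check on minimal hand-made tweet dicts (hypothetical data, not real API output) shows they agree on the three cases that matter — place present, place explicitly null, and place absent:

```python
def country_via_get(tweet):
    # Chained .get: short-circuits to None when 'place' is missing or None
    place = tweet.get('place')
    return None if place is None else place.get('country')

def country_via_except(tweet):
    # EAFP style: a missing key raises KeyError, a None place raises TypeError
    try:
        return tweet['place']['country']
    except (KeyError, TypeError):
        return None

samples = [{'place': {'country': 'France'}}, {'place': None}, {}]
print([country_via_get(t) for t in samples])     # ['France', None, None]
print([country_via_except(t) for t in samples])  # ['France', None, None]
```

Note that catching only KeyError would not be enough: the Twitter API delivers "place": null for many tweets, and subscripting None raises TypeError, not KeyError.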
That's all you need to change to use generators instead of a huge list.
Alternatively, you could have placed the code to populate the DataFrame directly in the loop you use to read the file.
Oh, and please use a single # to start comments instead of ##.