If you don't need to know the exact number of loaded tweets in your Main Loop part (and can omit the print call there), you can use a generator instead of a list. That way, the program loads and processes each line of the file just in time instead of allocating a huge block of memory to hold a list of all items.
def load_tweets_data():
    with open(tweets_data_path) as f:
        for line in f:
            if line.strip():  # skip blank lines
                try:
                    yield json.loads(line)
                except ValueError as e:  # invalid JSON on this line
                    print(e)
Note that I also eliminated your approach of reading only every other line, which is far too inflexible. I replaced it with a simple test of whether the line contains any non-whitespace characters.
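As a self-contained illustration of this skipping behavior (using an in-memory file via io.StringIO instead of your tweets_data_path, and taking the file object as a parameter so it runs standalone), the generator lazily skips blank lines and reports invalid JSON without aborting:

```python
import io
import json

def load_tweets_data(f):
    """Yield one parsed tweet per non-blank line, skipping invalid JSON."""
    for line in f:
        if line.strip():  # skip blank lines
            try:
                yield json.loads(line)
            except ValueError as e:  # json.JSONDecodeError subclasses ValueError
                print(e)

raw = io.StringIO('{"text": "hi", "lang": "en"}\n\n{not json}\n{"text": "yo"}\n')
print([t["text"] for t in load_tweets_data(raw)])  # ['hi', 'yo']
```

Nothing is parsed until the list comprehension starts pulling items from the generator, which is exactly why the whole file never needs to sit in memory at once.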
You then have to modify your Playing with Loaded Data: # Populate/map DataFrame with data part as well, because you can consume each generator item only once. That means you have to perform all analyses once per item instead of iterating over all items once per analysis. It could look like this:
# Populate/map DataFrame with data
for tweet in load_tweets_data():
    tweets['text'] = tweet.get('text')
    tweets['lang'] = tweet.get('lang')
    tweets['country'] = None if tweet.get('place') is None else tweet['place'].get('country')
As an alternative to the last line of the snippet above, you could also use this (thanks to @oliverpool):
try:
    tweets['country'] = tweet['place']['country']
except (KeyError, TypeError):  # 'place' is missing, or present but None
    tweets['country'] = None
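Both styles extract the same value; a quick check on minimal hand-made tweet dicts (hypothetical data, not real API output) shows they agree on the three cases that matter — place present, place explicitly null, and place absent:

```python
def country_via_get(tweet):
    # Chained .get: short-circuits to None when 'place' is missing or None
    place = tweet.get('place')
    return None if place is None else place.get('country')

def country_via_except(tweet):
    # EAFP style: a missing key raises KeyError, a None place raises TypeError
    try:
        return tweet['place']['country']
    except (KeyError, TypeError):
        return None

samples = [{'place': {'country': 'France'}}, {'place': None}, {}]
print([country_via_get(t) for t in samples])     # ['France', None, None]
print([country_via_except(t) for t in samples])  # ['France', None, None]
```

Note that catching only KeyError would not be enough: the Twitter API delivers "place": null for many tweets, and subscripting None raises TypeError, not KeyError.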
That's all you need to change to use generators instead of a huge list.
Alternatively, you could have placed the code to populate the DataFrame directly in the loop you use to read the file.
Oh, and please use a single # to start comments instead of ##.