Models for analysing Twitter data (*which based on tweepy package)

Question 1

I have three modules here models.py, functionals.py, and adjustment.py. These three may be used to perform analysis of Twitter data through tweepy objects.

The Tweets class is a simple one which takes 'tweet' objects and take it as a 'list', which is necessary (not yet implemented as necessary) as an argument of Authors class.

The Authors class is used to perform analysis of tweets by their authors/users. For example, we can plot the number of Followers of each user appearing in the data (the plot is by hbar_plot function).

Here is an example usage :

import numpy
from models import Authors
from models import Tweets
import matplotlib.pyplot as plt
stats = numpy.load('testfile.npy')
tweets = Tweets(stats)
Model = Authors(tweets)
fig, ax = plt.subplots(1, 1)
Model.hbar_plot(ax, measurement = 'Followers', incolor_measurement = 'Following', height = 0.5, color = (1,0,0,1))
fig, ax = plt.subplots(1, 1)
Model.hbar_plot(ax, measurement = 'Sample Tweets', incolor_measurement = 'Followers', height = 0.5, color = (1,0,0,1))
plt.show()

The example above plots the barplots, with bar colors representing the incolor_measurement variable.

The testfile.npy is available in https://github.com/anbarief/statistweepy/blob/master/examples/testfile.npy . This file consists of about 200 'tweet' objects of tweepy, collected a few days ago.

My question is, how to make better organization of the classes and their methods, also the functions here? How to write and organize it better?

The modules :

adjustment.py:

import copy
class Adjustment(object):
 def __init__(self, title = "an adjustment.", **kwargs):
 self.title = title
 if 'freq_lim' in kwargs:
 self.freq_lim = kwargs['freq_lim']
 else:
 self.freq_lim = None
 if 'char_lim' in kwargs:
 self.char_lim = kwargs['char_lim']
 else:
 self.char_lim = None
 if 'exclude' in kwargs:
 self.exclude = kwargs['exclude']
 else:
 self.exclude = None

functionals.py:

def filter_unique(tweet_stats_list, output = 'status'):
 stats = tweet_stats_list
 unique = []
 for tweet in stats:
 try:
 if not (tweet.retweeted_status in unique):
 if output == 'status':
 unique.append(tweet.retweeted_status)
 elif output == 'text':
 unique.append(tweet.retweeted_status.text)
 except:
 if not (tweet in unique):
 if output == 'status':
 unique.append(tweet)
 elif output == 'text':
 unique.append(tweet.text)
 return unique
def split_texts(texts, adjustment = None):
 split_list = []
 for text in texts:
 split = text.split()
 split_list.extend(split)
 return split_list

models.py:

import copy
import numpy
import matplotlib.pyplot as plt
from matplotlib import colors as mplcolors
from . import adjustment as adjust
from . import functionals as func
class Tweets(list):
 """
 Tweets model.
 """
 def __init__(self, *args, filter_by_unique = False, **kwargs):
 if filter_by_unique:
 tweets = func.filter_unique(args[0])
 else:
 tweets = args[0]
 list.__init__(self, tweets, **kwargs)
 @property
 def sorted_by_time(self):
 return sorted(self, key = lambda x: x.created_at)
 @property
 def oldest(self):
 return self.sorted_by_time[0]
 @property
 def newest(self):
 return self.sorted_by_time[-1]
class Authors(object):
 """
 Authors model.
 """
 def __init__(self, tweets):
 self.tweets = tweets
 self.authors = {author.name : author for author in list(set([tweet.author for tweet in self.tweets]))}
 self.username = {author.screen_name : author for author in list(set([tweet.author for tweet in self.tweets]))}
 self.followers_count = {author: self.authors[author].followers_count for author in self.authors}
 self.following_count = {author: self.authors[author].friends_count for author in self.authors}
 self.totaltweets = {author: self.authors[author].statuses_count for author in self.authors}
 self.tweets_by_author = {author: [tweet for tweet in self.tweets if tweet.author.name == author] for author in self.authors}
 self.tweets_count = {author: len(self.tweets_by_author[author]) for author in self.tweets_by_author}
 def hbar_plot(self, ax, measurement = 'Followers', color = (0,0,1,1), incolor_measurement = None, height = 1, textsize = 7, **kwargs):
 measurements = {'Followers': self.followers_count, 'Following' : self.following_count, 'Total Tweets' : self.totaltweets, 'Sample Tweets' : self.tweets_count}
 sorted_authors = sorted(measurements[measurement], key = lambda x : measurements[measurement][x])
 if type(color) == str:
 color = mplcolors.hex2color(mplcolors.cnames[color]) 
 colors = len(self.authors)*[color]
 if incolor_measurement != None:
 minor_max = max(measurements[incolor_measurement].values())
 transparency = [measurements[incolor_measurement][author]/minor_max for author in sorted_authors]
 colors = [(color[0], color[1], color[2], trans) for trans in transparency]
 var = [i+height for i in range(len(self.authors))]
 ax.barh([i-height/2 for i in var], [measurements[measurement][author] for author in sorted_authors], height = height, color = colors, **kwargs)
 ax.set_yticks(var)
 ax.set_yticklabels(sorted_authors, rotation = 'horizontal', size = textsize)
 if incolor_measurement != None:
 ax.set_xlabel(measurement + ' (color : '+incolor_measurement+')')
 else :
 ax.set_xlabel(measurement)
 plt.tight_layout()

Result of example usage :

enter image description here

Question 2

You’ve presented 5 units (classes, functions) in 3 different files. I understand that you have a few more in your repository but I think this is over-splitting. You should at least merge adjustment.py into models.py and radial_bar.py into functional.py (and, btw, this kind of file is usually named utils.py).

Out of the 5 units presented here, 2 are unused: split_texts and Adjustment. So a quick word about them:

You can drop the adjustment parameter of split_texts as it is unused;
You can use itertools.chain.from_iterable to flatten nested iterables;
You can simplify optional arguments by dropping **kwargs in favor of default values.

All in all, these two can be condensed to:

import itertools
class Adjustment:
 def __init__(self, title="An adjustment.", freq_lim=None, char_lim=None, exclude=None):
 self.title = title
 self.freq_lim = freq_lim
 self.char_lim = char_lim
 self.exclude = exclude
def split_texts(texts):
 return list(itertools.chain.from_iterable(map(str.split, texts)))

But the Adjustment class would be even more succint and easily extensible if it were one of the new dataclasses:

from dataclasses import dataclass
@dataclass(frozen=True)
class Adjustment:
 title: str = "An adjustment."
 freq_lim: int = None
 char_lim: int = None
 exclude: str = None

filter_unique doesn't feel like it produce any usable output. For starter, you are comparing apples and oranges when output is 'text' because you store the .text attribute of your objects in unique but you are only checking if the object itself is in unique, not its text. Second, you return unique which may contain tweets, statuses or text; but your usage in the Tweet class suggest that you want to return tweets every time.

To improve things, we could:

check unicity using a set whose contain check is done in \$\mathcal{O}(1)\$ instead of \$\mathcal{O}(n)\$ for lists;
extract the desired attribute out of the object to test for unicity but return the actual tweet;
extract attributes step by step instead of having 4 explicit cases;
turn the function into a generator, since it will be fed into the list constructor anyway.

Proposed improvements:

from contextlib import suppress
def filter_unique(tweet_stats_list, output='status'):
 uniques = set()
 for tweet in tweet_stats_list:
 tweet_attr = tweet
 with suppress(AttributeError):
 tweet_attr = tweet_attr.retweeted_status
 if output == 'text':
 tweet_attr = tweet_attr.text
 if tweet_attr not in uniques:
 uniques.add(tweet_attr)
 yield tweet

The Tweet class feels somewhat fine but I don't understand how it improves value over a simple list based on the provided model (I didn't read the other models on your repository, so it may be more obvious there). Other than that the docstring doesn't add any value: this is the class Tweet in the module models, sure this is some kind of """Tweet model"""...

I also don't understand why you allow for variable number of arguments (using *args) but only ever use args[0]: you’re not even guaranteed that there is at least 1 element in args. Better use an explicit argument here.

Lastly, you should use super() instead of explicitly calling the parent class. It doesn't matter much in such case but it's a good habit to get into if you ever use multiple inheritance (or just in case you decide to add a layer of abstraction between Tweet and list).

Proposed improvements:

import operator
class Tweets(list):
 def __init__(self, tweets, filter_by_unique=False, **kwargs):
 if filter_by_unique:
 tweets = filter_unique(tweets)
 super().__init__(tweets, **kwargs)
 @property
 def sorted_by_time(self):
 return sorted(self, key=operator.attrgetter('created_at'))
 @property
 def oldest(self):
 return min(self, key=operator.attrgetter('created_at'))
 @property
 def newest(self):
 return max(self, key=operator.attrgetter('created_at'))

The Author class feels very messy. You extract a handful of information out of your tweets stored as attributes, but half of them end up unused.

Since you're only interested in counting some stats about your author, you should store only those. And to help you along the way (especially counting the number of tweets of a particular author), you can:

Sort the array of tweets by author names before processing it;
Group the array by author names to gather together the tweets of a single author.

Note that I don't know tweepy enough to know if there is a more "unique" identification method than the name of the author. (I do imagine that several author can share a single name, so there must be some.)

Proposed improvements:

import operator
class Authors(object):
 def __init__(self, tweets):
 unicity_key = operator.attrgetter('author.name')
 tweets = sorted(tweets, key=unicity_key)
 self.followers_count = {}
 self.following_count = {}
 self.total_tweets = {}
 self.tweets_count = {}
 for _, author_tweets in itertools.groupby(tweets, key=unicity_key):
 author_tweets = list(author_tweets)
 author = author_tweets[0].author
 self.followers_count[author.name] = author.followers_count
 self.following_count[author.name] = author.friends_count
 self.total_tweets[author.name] = author.statuses_count
 self.tweets_count[author.name] = len(author_tweets)
 def hbar_plot(self, ax, measurement='Followers', color=(0,0,1,1), incolor_measurement=None, height=1, textsize=7, **kwargs):
 measurements = {
 'Followers': self.followers_count,
 'Following': self.following_count,
 'Total Tweets': self.total_tweets,
 'Sample Tweets': self.tweets_count,
 }
 author_measurement = measurements[measurement]
 sorted_authors = sorted(author_measurement, key=author_measurement.__getitem__)
 if isinstance(color, str):
 color = mplcolors.hex2color(mplcolors.cnames[color])
 if incolor_measurement is not None:
 color_measurement = measurements[incolor_measurement]
 minor_max = max(color_measurement.values())
 colors = [(*color[:3], color_measurement[author] / minor_max) for author in sorted_authors]
 measurement = '{} (color: {})'.format(measurement, incolor_measurement)
 else:
 colors = [color] * len(author_measurement)
 ticks, values = zip(*((i + height, author_measurement[author]) for i, author in enumerate(sorted_authors)))
 ax.barh([i - height / 2 for i in ticks], values, height=height, color=colors, **kwargs)
 ax.set_yticks(ticks)
 ax.set_yticklabels(sorted_authors, rotation='horizontal', size=textsize)
 ax.set_xlabel(measurement)
 plt.tight_layout()

Question 3

Thanks for the review. The Adjustment class may be updated with more inputs for more possible adjustments.

Question 4

@Arief Possibly. But you don't extract anything else from kwargs and you don't store it either; so there is no need in having it around.

Question 5

Whyfilter_unique is not useful..? the argument output=... is for preference whether the output is in form of Status with full features, or only the text. Each Status object has unique ID, two with same ID could be mixed in without notice.. the filtering is on ID by default.

Question 6

@Arief In your original version, if output is text, you check for status but store text in unique; thus no filtering taking place. Now I don't know the inner details of your objects but if statuses have a unique ID, you should store and check for that ID instead.

Question 7

@Arief Because 1) we reuse the expression twice, so better give it a name; 2) namming a lambda defeat the purpose and is better done by defining a function; 3) instead of defining a function here and reinventing the wheel, it's better to use what's already in the language; 4) lambdas are slow anyway and the operator module is implemented in C.

score 2 · Accepted Answer · 2018-07-18 00:40:10Z

You’ve presented 5 units (classes, functions) in 3 different files. I understand that you have a few more in your repository but I think this is over-splitting. You should at least merge adjustment.py into models.py and radial_bar.py into functional.py (and, btw, this kind of file is usually named utils.py).

Out of the 5 units presented here, 2 are unused: split_texts and Adjustment. So a quick word about them:

You can drop the adjustment parameter of split_texts as it is unused;
You can use itertools.chain.from_iterable to flatten nested iterables;
You can simplify optional arguments by dropping **kwargs in favor of default values.

All in all, these two can be condensed to:

import itertools
class Adjustment:
 def __init__(self, title="An adjustment.", freq_lim=None, char_lim=None, exclude=None):
 self.title = title
 self.freq_lim = freq_lim
 self.char_lim = char_lim
 self.exclude = exclude
def split_texts(texts):
 return list(itertools.chain.from_iterable(map(str.split, texts)))

But the Adjustment class would be even more succint and easily extensible if it were one of the new dataclasses:

from dataclasses import dataclass
@dataclass(frozen=True)
class Adjustment:
 title: str = "An adjustment."
 freq_lim: int = None
 char_lim: int = None
 exclude: str = None

filter_unique doesn't feel like it produce any usable output. For starter, you are comparing apples and oranges when output is 'text' because you store the .text attribute of your objects in unique but you are only checking if the object itself is in unique, not its text. Second, you return unique which may contain tweets, statuses or text; but your usage in the Tweet class suggest that you want to return tweets every time.

To improve things, we could:

check unicity using a set whose contain check is done in \$\mathcal{O}(1)\$ instead of \$\mathcal{O}(n)\$ for lists;
extract the desired attribute out of the object to test for unicity but return the actual tweet;
extract attributes step by step instead of having 4 explicit cases;
turn the function into a generator, since it will be fed into the list constructor anyway.

Proposed improvements:

from contextlib import suppress
def filter_unique(tweet_stats_list, output='status'):
 uniques = set()
 for tweet in tweet_stats_list:
 tweet_attr = tweet
 with suppress(AttributeError):
 tweet_attr = tweet_attr.retweeted_status
 if output == 'text':
 tweet_attr = tweet_attr.text
 if tweet_attr not in uniques:
 uniques.add(tweet_attr)
 yield tweet

The Tweet class feels somewhat fine but I don't understand how it improves value over a simple list based on the provided model (I didn't read the other models on your repository, so it may be more obvious there). Other than that the docstring doesn't add any value: this is the class Tweet in the module models, sure this is some kind of """Tweet model"""...

I also don't understand why you allow for variable number of arguments (using *args) but only ever use args[0]: you’re not even guaranteed that there is at least 1 element in args. Better use an explicit argument here.

Lastly, you should use super() instead of explicitly calling the parent class. It doesn't matter much in such case but it's a good habit to get into if you ever use multiple inheritance (or just in case you decide to add a layer of abstraction between Tweet and list).

Proposed improvements:

import operator
class Tweets(list):
 def __init__(self, tweets, filter_by_unique=False, **kwargs):
 if filter_by_unique:
 tweets = filter_unique(tweets)
 super().__init__(tweets, **kwargs)
 @property
 def sorted_by_time(self):
 return sorted(self, key=operator.attrgetter('created_at'))
 @property
 def oldest(self):
 return min(self, key=operator.attrgetter('created_at'))
 @property
 def newest(self):
 return max(self, key=operator.attrgetter('created_at'))

The Author class feels very messy. You extract a handful of information out of your tweets stored as attributes, but half of them end up unused.

Since you're only interested in counting some stats about your author, you should store only those. And to help you along the way (especially counting the number of tweets of a particular author), you can:

Sort the array of tweets by author names before processing it;
Group the array by author names to gather together the tweets of a single author.

Note that I don't know tweepy enough to know if there is a more "unique" identification method than the name of the author. (I do imagine that several author can share a single name, so there must be some.)

Proposed improvements:

import operator
class Authors(object):
 def __init__(self, tweets):
 unicity_key = operator.attrgetter('author.name')
 tweets = sorted(tweets, key=unicity_key)
 self.followers_count = {}
 self.following_count = {}
 self.total_tweets = {}
 self.tweets_count = {}
 for _, author_tweets in itertools.groupby(tweets, key=unicity_key):
 author_tweets = list(author_tweets)
 author = author_tweets[0].author
 self.followers_count[author.name] = author.followers_count
 self.following_count[author.name] = author.friends_count
 self.total_tweets[author.name] = author.statuses_count
 self.tweets_count[author.name] = len(author_tweets)
 def hbar_plot(self, ax, measurement='Followers', color=(0,0,1,1), incolor_measurement=None, height=1, textsize=7, **kwargs):
 measurements = {
 'Followers': self.followers_count,
 'Following': self.following_count,
 'Total Tweets': self.total_tweets,
 'Sample Tweets': self.tweets_count,
 }
 author_measurement = measurements[measurement]
 sorted_authors = sorted(author_measurement, key=author_measurement.__getitem__)
 if isinstance(color, str):
 color = mplcolors.hex2color(mplcolors.cnames[color])
 if incolor_measurement is not None:
 color_measurement = measurements[incolor_measurement]
 minor_max = max(color_measurement.values())
 colors = [(*color[:3], color_measurement[author] / minor_max) for author in sorted_authors]
 measurement = '{} (color: {})'.format(measurement, incolor_measurement)
 else:
 colors = [color] * len(author_measurement)
 ticks, values = zip(*((i + height, author_measurement[author]) for i, author in enumerate(sorted_authors)))
 ax.barh([i - height / 2 for i in ticks], values, height=height, color=colors, **kwargs)
 ax.set_yticks(ticks)
 ax.set_yticklabels(sorted_authors, rotation='horizontal', size=textsize)
 ax.set_xlabel(measurement)
 plt.tight_layout()

Thanks for the review. The Adjustment class may be updated with more inputs for more possible adjustments.
@Arief Possibly. But you don't extract anything else from kwargs and you don't store it either; so there is no need in having it around.
Whyfilter_unique is not useful..? the argument output=... is for preference whether the output is in form of Status with full features, or only the text. Each Status object has unique ID, two with same ID could be mixed in without notice.. the filtering is on ID by default.
@Arief In your original version, if output is text, you check for status but store text in unique; thus no filtering taking place. Now I don't know the inner details of your objects but if statuses have a unique ID, you should store and check for that ID instead.
@Arief Because 1) we reuse the expression twice, so better give it a name; 2) namming a lambda defeat the purpose and is better done by defining a function; 3) instead of defining a function here and reinventing the wheel, it's better to use what's already in the language; 4) lambdas are slow anyway and the operator module is implemented in C.

Stack Exchange Network

Models for analysing Twitter data (*which based on tweepy package)

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Models for analysing Twitter data (*which based on tweepy package)

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions