I'm looking for a way to optimize this piece of code in Python.
all_users
holds all the Twitter users and friends_ids
is an array of IDs.
g = Graph(directed=True)
all_users = users.find({}, { 'id_str' : 1, 'friends_ids' : 1}) # This is from MongoDb
names = []
edges = []
print 'Processing first 1000 batch'
for idx, user in enumerate(all_users):
if idx % 1000 == 0 and idx != 0:
print 'Processing ' + str(idx) + ' batch'
names.append(str(user['id_str']))
for friend_id in user['friends_ids']:
edges.append((user['id_str'], str(friend_id)))
if not str(friend_id) in names:
names.append(str(friend_id))
print len(names)
print len(edges)
all_users
has 4k records but each of the user has at least 10k of friends_ids
. In order to create a graph I need a list of nodes which I have in names
and list of edges edges
.
The format of edges
would be [(node1, node2), (node2, node3), ...]
, which means that node1
connects to node2
and node2
connect to node3
. I'm not sure why but this script takes almost full 10 min to run. So, I'm looking for a way to optimize this.
2 Answers 2
Try this minor variation over your script, that will probably have a dramatic effect on performance:
names = set()
edges = []
for idx, user in enumerate(all_users):
if idx and not idx % 1000:
print('Processing {} batch',format(idx))
user_id = str(user['id_str'])
names.add(user_id)
for friend_id in user['friends_ids']:
friend_id = str(friend_id)
edges.append((user_id, friend_id))
names.add(friend_id)
The main modification is making names
a set, not a list, Your script is checking membership of every friend_id
in names
, an operation that takes time linear in the size of names
. In contrast, membership in a set is done in constant time.
If you need names
to be a list, simply do:
names = list(names)
after the processing is completed.
Here's your problem:
if not str(friend_id) in names:
names.append(str(friend_id))
names
is a list
. Lookup in a list is linear time, and str
comparison is more expensive than int comparison to boot. Instead, use a different structure:
names = set()
...
names.add(friend_id)
set
lookup is constant time, and sticking with integers means that we can save time on the hashing too. With a few million lookups, that's going to make a big impact.