5
\$\begingroup\$

I'm looking for a way to optimize this piece of code in Python.

all_users holds all the Twitter users and friends_ids is an array of IDs.

g = Graph(directed=True)
all_users = users.find({}, { 'id_str' : 1, 'friends_ids' : 1}) # This is from MongoDb
names = []
edges = []
print 'Processing first 1000 batch'
for idx, user in enumerate(all_users):
 if idx % 1000 == 0 and idx != 0:
 print 'Processing ' + str(idx) + ' batch'
 names.append(str(user['id_str']))
 for friend_id in user['friends_ids']:
 edges.append((user['id_str'], str(friend_id)))
 if not str(friend_id) in names:
 names.append(str(friend_id))
print len(names)
print len(edges)

all_users has 4k records but each of the user has at least 10k of friends_ids. In order to create a graph I need a list of nodes which I have in names and list of edges edges.

The format of edges would be [(node1, node2), (node2, node3), ...], which means that node1 connects to node2 and node2 connect to node3. I'm not sure why but this script takes almost full 10 min to run. So, I'm looking for a way to optimize this.

200_success
145k22 gold badges190 silver badges478 bronze badges
asked Sep 9, 2015 at 23:39
\$\endgroup\$

2 Answers 2

5
\$\begingroup\$

Try this minor variation over your script, that will probably have a dramatic effect on performance:

names = set()
edges = []
for idx, user in enumerate(all_users):
 if idx and not idx % 1000:
 print('Processing {} batch',format(idx))
 user_id = str(user['id_str'])
 names.add(user_id)
 for friend_id in user['friends_ids']:
 friend_id = str(friend_id)
 edges.append((user_id, friend_id))
 names.add(friend_id)

The main modification is making names a set, not a list, Your script is checking membership of every friend_id in names, an operation that takes time linear in the size of names. In contrast, membership in a set is done in constant time.

If you need names to be a list, simply do:

names = list(names)

after the processing is completed.

Barry
18.5k1 gold badge40 silver badges92 bronze badges
answered Sep 10, 2015 at 1:17
\$\endgroup\$
4
\$\begingroup\$

Here's your problem:

if not str(friend_id) in names:
 names.append(str(friend_id))

names is a list. Lookup in a list is linear time, and str comparison is more expensive than int comparison to boot. Instead, use a different structure:

names = set()
...
names.add(friend_id)

set lookup is constant time, and sticking with integers means that we can save time on the hashing too. With a few million lookups, that's going to make a big impact.

answered Sep 10, 2015 at 1:18
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.