Creating nodes and edges

Question 1

I'm looking for a way to optimize this piece of code in Python.

all_users holds all the Twitter users and friends_ids is an array of IDs.

g = Graph(directed=True)
all_users = users.find({}, { 'id_str' : 1, 'friends_ids' : 1}) # This is from MongoDb
names = []
edges = []
print 'Processing first 1000 batch'
for idx, user in enumerate(all_users):
 if idx % 1000 == 0 and idx != 0:
 print 'Processing ' + str(idx) + ' batch'
 names.append(str(user['id_str']))
 for friend_id in user['friends_ids']:
 edges.append((user['id_str'], str(friend_id)))
 if not str(friend_id) in names:
 names.append(str(friend_id))
print len(names)
print len(edges)

all_users has 4k records but each of the user has at least 10k of friends_ids. In order to create a graph I need a list of nodes which I have in names and list of edges edges.

The format of edges would be [(node1, node2), (node2, node3), ...], which means that node1 connects to node2 and node2 connect to node3. I'm not sure why but this script takes almost full 10 min to run. So, I'm looking for a way to optimize this.

Question 2

Try this minor variation over your script, that will probably have a dramatic effect on performance:

names = set()
edges = []
for idx, user in enumerate(all_users):
 if idx and not idx % 1000:
 print('Processing {} batch',format(idx))
 user_id = str(user['id_str'])
 names.add(user_id)
 for friend_id in user['friends_ids']:
 friend_id = str(friend_id)
 edges.append((user_id, friend_id))
 names.add(friend_id)

The main modification is making names a set, not a list, Your script is checking membership of every friend_id in names, an operation that takes time linear in the size of names. In contrast, membership in a set is done in constant time.

If you need names to be a list, simply do:

names = list(names)

after the processing is completed.

Question 3

Here's your problem:

if not str(friend_id) in names:
 names.append(str(friend_id))

names is a list. Lookup in a list is linear time, and str comparison is more expensive than int comparison to boot. Instead, use a different structure:

names = set()
...
names.add(friend_id)

set lookup is constant time, and sticking with integers means that we can save time on the hashing too. With a few million lookups, that's going to make a big impact.

Jaime Jaime 6,62721 silver badges30 bronze badges · Accepted Answer · 2015-09-10 01:17:46Z

Try this minor variation over your script, that will probably have a dramatic effect on performance:

names = set()
edges = []
for idx, user in enumerate(all_users):
 if idx and not idx % 1000:
 print('Processing {} batch',format(idx))
 user_id = str(user['id_str'])
 names.add(user_id)
 for friend_id in user['friends_ids']:
 friend_id = str(friend_id)
 edges.append((user_id, friend_id))
 names.add(friend_id)

The main modification is making names a set, not a list, Your script is checking membership of every friend_id in names, an operation that takes time linear in the size of names. In contrast, membership in a set is done in constant time.

If you need names to be a list, simply do:

names = list(names)

after the processing is completed.

Stack Exchange Network

Creating nodes and edges

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Creating nodes and edges

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions