Distribution problem optimization

Question 1

The situation is this: I have a bunch of servers which are synced so that they all have the same data. Unfortunately, a disaster happened and the servers are all out of sync. My task is to re-sync all of the servers so that they all have the same sets of data. That is, I need to ensure that each server has a copy of every data set.

The first line of input denotes the number of servers. Following that will be one line of input for each server with a space-separated list of data set ID's. i.e, x y z. Data set id's are positive integers.

I must output a list of instructions to most optimally re-sync the servers in the format of: < data set ID>< FROM>< TO>

#input = {
#1:[1, 3, 4],
#2:[1, 2, 3],
#3:[1, 3],
#4:[1, 4, 2]}
numCenters = int(raw_input('Number of data centers: '))
input = {}
print("Input data set information as: x y z")
# Grab dataset ID information from stdin
for x in range(1, numCenters+1):
 dataSet = raw_input("Data set %s: " % (x))
 input[x] = sorted(map(int, dataSet.split()))
#Map datasets (the numbers / dataset ID) to data centers (the individual lists) that they belong to
#New dictionary for the map
alldatasets = {}
for k,v in input.iteritems():
 for dataset in v: #
 if dataset not in alldatasets:
 alldatasets[dataset] = [] # Make a dictionary with the key as the dataset ID,
 alldatasets[dataset].append(k) # and the value as a list of which datacenters have that value.
allsets = list(alldatasets.keys())
print("One Possible Correct Output:\n")
for Id,datacenter in input.iteritems():
 for sets in allsets: #go through every datacenter, and compare the datasets it has to a list of all datasets.
 if sets not in datacenter:
 print("%s %s %s" % (sets, alldatasets[sets][0], Id))
print("done")

Input:

Output:

One possible correct output:

I am looking to improve the code that I have written. Either optimizing its run time, making it look better, or any other comments are welcome. I am moreso looking for optimization feedback, though.

Question 2

Your code is full of variables named somethingSet — but they aren't actually sets! Why not?

This solution, which takes advantage of Python's built-in set operations, is shorter. Just being able to write have_nots = all_centers - haves is worth it.

from collections import defaultdict
num_centers = int(raw_input('Number of data centers: '))
print("Input data set information as: x y z")
all_centers = set(xrange(1, num_centers + 1))
centers_with_data = defaultdict(set)
# Grab dataset ID information from stdin
for center in range(1, num_centers + 1):
 for data_set in map(int, raw_input("Data set %s: " % (center)).split()):
 centers_with_data[data_set].add(center)
print "One possible solution:\n"
for data_set, haves in centers_with_data.iteritems():
 have_nots = all_centers - haves
 donor = next(iter(haves)) # Pick an arbitrary source
 for acceptor in have_nots:
 print "%d %d %d" % (data_set, donor, acceptor)

Question 3

I noticed that the output for input (1,3,5,4,7),(1,3),(2) for 3 servers is different than the output in my original program. Why? There are the same amount of steps so why is it different?

Question 4

As you say, the solution is not unique. Python doesn't guarantee any order when iterating through hashes and sets.

Question 5

I'm not a Python expert but I have similar issues in other technologies. You can try to improve your iterations with some high order functions like map/filter/reduce etc.

For example, here:

for Id,datacenter in input.iteritems():
 for sets in allsets: #go through every datacenter, and compare the datasets it has to a list of all datasets.
 if sets not in datacenter:
 print("%s %s %s" % (sets, alldatasets[sets][0], Id))

you can replace the for with if with a filter.

200_success 200_success 145k22 gold badges190 silver badges478 bronze badges · Accepted Answer · 2014-12-12 00:59:58Z

Your code is full of variables named somethingSet — but they aren't actually sets! Why not?

This solution, which takes advantage of Python's built-in set operations, is shorter. Just being able to write have_nots = all_centers - haves is worth it.

from collections import defaultdict
num_centers = int(raw_input('Number of data centers: '))
print("Input data set information as: x y z")
all_centers = set(xrange(1, num_centers + 1))
centers_with_data = defaultdict(set)
# Grab dataset ID information from stdin
for center in range(1, num_centers + 1):
 for data_set in map(int, raw_input("Data set %s: " % (center)).split()):
 centers_with_data[data_set].add(center)
print "One possible solution:\n"
for data_set, haves in centers_with_data.iteritems():
 have_nots = all_centers - haves
 donor = next(iter(haves)) # Pick an arbitrary source
 for acceptor in have_nots:
 print "%d %d %d" % (data_set, donor, acceptor)

I noticed that the output for input (1,3,5,4,7),(1,3),(2) for 3 servers is different than the output in my original program. Why? There are the same amount of steps so why is it different?
As you say, the solution is not unique. Python doesn't guarantee any order when iterating through hashes and sets.

Stack Exchange Network

Distribution problem optimization

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Distribution problem optimization

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions