I currently have a dictionary (Duplicate_combos) whose keys are unique identifying numbers and whose values are two-element lists: a company code and then either 'Yes' or 'No' (both currently stored as strings). I am essentially trying to find every group of entries where the company code is equal and the second element is 'No' for both.
So if this was my dictionary:
{1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: ['124', 'No']}
I would only want to return 1234 and 1235. The code below is what I currently have, and I really need to optimize it: it works on a small data set, but I will need to run it on a much larger one (43,000 lines), and in early testing it has been taking 45+ minutes with no sign of finishing.
def open_file():
    in_file = open("./Data.csv", "r")
    blank = in_file.readline()
    titles = in_file.readline()
    titles = titles.strip()
    titles = titles.split(',')
    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11
    total_lines = {}
    for line in in_file:
        line = line.strip()
        line = line.split(',')
        cost_center.append(line[0])
        cost_center_name.append(line[1])
        management_site.append(line[15])
        sub_function.append(line[19])
        LER.append(line[41])
        Company_name.append(line[3])
        Business_group.append(line[7])
        Value_center.append(line[9])
        Performance_center.append(line[10])
        Profit_center.append(line[11])
        # create a dictionary of all the lines with the key being the unique cost center number (cost_center list)
        total_lines[line[0]] = line[1:]
    return (cost_center, cost_center_name, management_site, sub_function, LER, Company_name, Business_group, total_lines, titles, Value_center, Performance_center, Profit_center)
def find_duplicates(Duplicate_combos):
    Real_duplicates = []
    archive_duplicates = []
    # loop through the dictionary of duplicate combos by the keys
    for key in Duplicate_combos:
        code = Duplicate_combos[key][0]
        for key2 in Duplicate_combos:
            # if the two keys are equal to each other, it means you are comparing the key to itself, which we don't want to do so we continue
            if key == key2:
                continue
            # if the company codes are the same and they are BOTH NOT going to be consolidated, we have found a real duplicate
            elif Duplicate_combos[key2][0] == code and Duplicate_combos[key2][1] == 'No' and Duplicate_combos[key][1] == 'No':
                # make sure that we haven't already dealt with this key before
                if key not in archive_duplicates:
                    Real_duplicates.append(key)
                    archive_duplicates.append(key)
                if key2 not in archive_duplicates:
                    Real_duplicates.append(key2)
                    archive_duplicates.append(key2)
                continue
    return (Real_duplicates)
3 Answers
- It's easier to read code that tuple unpacks the values in the for loop, using dict.items():

    for key1, (code1, option1) in Duplicate_combos.items():

- archive_duplicates is a duplicate of Real_duplicates. There's no need for it.
- It doesn't seem like the output needs to be ordered, and so you can just make Real_duplicates a set. This means it won't have duplicates, and you don't have to loop through it twice each time you want to add a value (each not in check on a list is an O(n) scan). This alone speeds up your program from \$O(n^3)\$ to \$O(n^2)\$.
- Your variable names are quite poor, and don't adhere to PEP8. I have changed them to somewhat generic names, but it'd be better if you replace, say, items with what it actually is.
def find_duplicates(items):
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)
You don't need to loop over Duplicate_combos twice. Instead, build a new dictionary that groups the keys by company code, adding a key only if its option is 'No'. After building the new dictionary, you can iterate over its values and return the keys from every group whose length is two or more.
def find_duplicates(items):
    by_code = {}
    for key, (code, option) in items.items():
        if option == 'No':
            by_code.setdefault(code, []).append(key)
    return [
        key
        for keys in by_code.values()
        if len(keys) >= 2
        for key in keys
    ]
This now runs in \$O(n)\$ time rather than \$O(n^3)\$ time.
>>> find_duplicates({
...     101: ['1', 'No'],  102: ['1', 'No'],
...     103: ['1', 'Yes'], 104: ['1', 'No'],
...     201: ['2', 'No'],  202: ['2', 'No'],
...     301: ['3', 'No'],  401: ['4', 'No'],
... })
[101, 102, 104, 201, 202]
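For comparison, the same grouping can also be written with collections.defaultdict, which drops the explicit setdefault call; a minimal sketch, equivalent to the version above:

from collections import defaultdict

def find_duplicates(items):
    # group the 'No' keys by company code
    by_code = defaultdict(list)
    for key, (code, option) in items.items():
        if option == 'No':
            by_code[code].append(key)
    # keep only the groups that actually contain duplicates
    return [key for keys in by_code.values() if len(keys) >= 2 for key in keys]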
- so this would output all of the keys that have the duplicates, not just one? I was iterating twice in order to compare each element to all the others so I would get all of the keys that share the duplicate values. – Ben Naylor, Jun 3, 2019 at 20:34
- @BenNaylor Yes, this would do that. Please see the update with the example showing this. – Jun 3, 2019 at 20:38
- Thank you so much, this really really helps! – Ben Naylor, Jun 4, 2019 at 12:20
When reading your data, you open a file but never .close() it. You should get into the habit of using the with keyword to avoid this issue.

You should also benefit from the csv module to read this file, as it will remove boilerplate and handle special cases for you:
import csv

def open_file(filename='./Data.csv'):
    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11
    total_lines = {}
    with open(filename) as in_file:
        next(in_file)  # skip blank line
        reader = csv.reader(in_file, delimiter=',')
        titles = next(reader)  # header row, returned below
        for line in reader:
            cost_center.append(line[0])
            cost_center_name.append(line[1])
            management_site.append(line[15])
            sub_function.append(line[19])
            LER.append(line[41])
            Company_name.append(line[3])
            Business_group.append(line[7])
            Value_center.append(line[9])
            Performance_center.append(line[10])
            Profit_center.append(line[11])
            # create a dictionary of all the lines with the key being the unique cost center number
            total_lines[line[0]] = line[1:]
    return cost_center, cost_center_name, management_site, sub_function, LER, Company_name, Business_group, total_lines, titles, Value_center, Performance_center, Profit_center
- I'd personally use something like columns = zip(*reader) and then define each value once: cost_center = columns[0]. This would make total_lines a bit more finicky tho. – Jun 4, 2019 at 10:39
- @Peilonrayz When I read LER.append(line[41]) and there are only 10 columns of interest, I'm not sure this is really worth it. – 301_Moved_Permanently, Jun 4, 2019 at 12:46
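For illustration, a minimal sketch of the transposing idea from the first comment, assuming the same Data.csv layout as above; note that in Python 3 zip returns an iterator, so it has to be wrapped in list() before indexing, and the rows must still be kept around to build total_lines:

import csv

with open("./Data.csv") as in_file:
    next(in_file)                 # skip blank line
    reader = csv.reader(in_file)
    titles = next(reader)         # header row
    rows = list(reader)

columns = list(zip(*rows))        # transpose: one tuple per column
cost_center = columns[0]
cost_center_name = columns[1]
# total_lines is the finicky part: it still needs the original rows
total_lines = {row[0]: row[1:] for row in rows}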
Doing

def get_dupes(df):
    if sum(df.loc[1] == 'No') < 2:
        return None
    else:
        return list(df.loc[:, df.loc[1] == 'No'].columns)

df.groupby(axis=1, by=df.loc[0]).apply(get_dupes)

Got me

0
124            None
123    [1234, 1235]
dtype: object
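This assumes the dictionary has already been loaded into a pandas DataFrame named df; a minimal sketch of that setup (the construction is my assumption, it is not shown in the answer):

import pandas as pd

# columns are the unique IDs; row 0 holds the company code and
# row 1 the 'Yes'/'No' flag, matching df.loc[0] and df.loc[1] above
Duplicate_combos = {1234: ['123', 'No'], 1235: ['123', 'No'],
                    1236: ['123', 'Yes'], 1237: ['124', 'No']}
df = pd.DataFrame(Duplicate_combos)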
Your question wasn't quite clear on what you want the output to be if there are multiple company codes with duplicate values (e.g. if the input is {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: ['124', 'No'], 1238: ['124', 'No']}, do you want [1234, 1235, 1237, 1238] or [[1234, 1235], [1237, 1238]]?), so you can modify this code accordingly.
- You could just take a look at how the current code behaves to understand what output is expected... – Vogel612, Jun 4, 2019 at 10:05
- You have presented an alternative solution, but haven't reviewed the code. Please edit to show what aspects of the question code prompted you to write this version, and in what ways it's an improvement over the original. It may be worth (re-)reading How to Answer. – Toby Speight, Jun 4, 2019 at 10:07
- Where does Duplicate_combos come from? The right performance fix would likely involve putting that data into a more appropriate data structure for this task.