I currently have a dictionary (Duplicate_combos) whose keys are unique identifying numbers and whose values are two-element lists: a company code and then either 'Yes' or 'No' (both currently stored as strings). I am essentially trying to find every group of entries where the company code is equal and the second element is 'No' for both.
So if this was my dictionary:
{1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: ['124', 'No']}
I would only want to return 1234 and 1235. The code below is what I currently have, and I really need to optimize it: it works on a small data set, but I will need to run it on a much larger one (43,000 lines), and in early testing it has been taking 45+ minutes with no sign of finishing.
def open_file():
    in_file = open("./Data.csv", "r")
    blank = in_file.readline()
    titles = in_file.readline()
    titles = titles.strip()
    titles = titles.split(',')
    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11
    total_lines = {}
    for line in in_file:
        line = line.strip()
        line = line.split(',')
        cost_center.append(line[0])
        cost_center_name.append(line[1])
        management_site.append(line[15])
        sub_function.append(line[19])
        LER.append(line[41])
        Company_name.append(line[3])
        Business_group.append(line[7])
        Value_center.append(line[9])
        Performance_center.append(line[10])
        Profit_center.append(line[11])
        # create a dictionary of all the lines with the key being the unique cost center number (cost_center list)
        total_lines[line[0]] = line[1:]
    return (cost_center, cost_center_name, management_site, sub_function, LER, Company_name, Business_group, total_lines, titles, Value_center, Performance_center, Profit_center)
def find_duplicates(Duplicate_combos):
    Real_duplicates = []
    archive_duplicates = []
    # loop through the dictionary of duplicate combos by the keys
    for key in Duplicate_combos:
        code = Duplicate_combos[key][0]
        for key2 in Duplicate_combos:
            # if the two keys are equal to each other, it means you are comparing the key to itself, which we don't want to do so we continue
            if key == key2:
                continue
            # if the company codes are the same and they are BOTH NOT going to be consolidated, we have found a real duplicate
            elif Duplicate_combos[key2][0] == code and Duplicate_combos[key2][1] == 'No' and Duplicate_combos[key][1] == 'No':
                # make sure that we haven't already dealt with this key before
                if key not in archive_duplicates:
                    Real_duplicates.append(key)
                    archive_duplicates.append(key)
                if key2 not in archive_duplicates:
                    Real_duplicates.append(key2)
                    archive_duplicates.append(key2)
                continue
    return (Real_duplicates)
3 Answers
- It's easier to read code that tuple unpacks the values in the for loop, using dict.items():

    for key1, (code1, option1) in Duplicate_combos.items():

- archive_duplicates is a duplicate of Real_duplicates. There's no need for it.
- It doesn't seem like the output needs to be ordered, and so you can just make Real_duplicates a set. This means it won't have duplicates, and you don't have to loop through it twice each time you want to add a value (each not in check on a list is an O(n) scan). This alone speeds up your program from \$O(n^3)\$ to \$O(n^2)\$.
- Your variable names are quite poor, and don't adhere to PEP8. I have changed them to somewhat generic names, but it'd be better if you replace, say, items with what it actually is.
def find_duplicates(items):
    duplicates = set()
    for key1, (code1, option1) in items.items():
        for key2, (code2, option2) in items.items():
            if key1 == key2:
                continue
            elif code1 == code2 and option1 == option2 == 'No':
                duplicates.add(key1)
                duplicates.add(key2)
    return list(duplicates)
You don't need to loop over Duplicate_combos twice. Instead, build a new dictionary that groups the keys by company code, adding a key only if its option is 'No'. After building the new dictionary, you can iterate over its values and return the keys from every group whose length is two or more.
def find_duplicates(items):
    by_code = {}
    for key, (code, option) in items.items():
        if option == 'No':
            by_code.setdefault(code, []).append(key)
    return [
        key
        for keys in by_code.values()
        if len(keys) >= 2
        for key in keys
    ]
This now runs in \$O(n)\$ time rather than \$O(n^3)\$ time.
>>> find_duplicates({
...     101: ['1', 'No'],  102: ['1', 'No'],
...     103: ['1', 'Yes'], 104: ['1', 'No'],
...     201: ['2', 'No'],  202: ['2', 'No'],
...     301: ['3', 'No'],  401: ['4', 'No'],
... })
[101, 102, 104, 201, 202]
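For comparison, the same grouping can also be written with collections.defaultdict, which drops the explicit setdefault call; a minimal sketch, equivalent to the version above:

from collections import defaultdict

def find_duplicates(items):
    # group the 'No' keys by company code
    by_code = defaultdict(list)
    for key, (code, option) in items.items():
        if option == 'No':
            by_code[code].append(key)
    # keep only the groups that actually contain duplicates
    return [key for keys in by_code.values() if len(keys) >= 2 for key in keys]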
- so this would output all of the keys that have the duplicates, not just one? I was iterating twice in order to compare each element to all the others so I would get all of the keys that share the duplicate values. – Ben Naylor, Jun 3, 2019 at 20:34
- @BenNaylor Yes, this would do that. Please see the update with the example showing this. – Jun 3, 2019 at 20:38
- Thank you so much, this really really helps! – Ben Naylor, Jun 4, 2019 at 12:20
When reading your data, you open a file but never .close() it. You should get into the habit of using the with keyword to avoid this issue.

You should also benefit from the csv module to read this file, as it will remove boilerplate and handle special cases for you:
import csv

def open_file(filename='./Data.csv'):
    cost_center = []         # 0
    cost_center_name = []    # 1
    management_site = []     # 15
    sub_function = []        # 19
    LER = []                 # 41
    Company_name = []        # 3
    Business_group = []      # 7
    Value_center = []        # 9
    Performance_center = []  # 10
    Profit_center = []       # 11
    total_lines = {}
    with open(filename) as in_file:
        next(in_file)  # skip blank line
        reader = csv.reader(in_file, delimiter=',')
        titles = next(reader)  # header row, returned below
        for line in reader:
            cost_center.append(line[0])
            cost_center_name.append(line[1])
            management_site.append(line[15])
            sub_function.append(line[19])
            LER.append(line[41])
            Company_name.append(line[3])
            Business_group.append(line[7])
            Value_center.append(line[9])
            Performance_center.append(line[10])
            Profit_center.append(line[11])
            # create a dictionary of all the lines with the key being the unique cost center number
            total_lines[line[0]] = line[1:]
    return cost_center, cost_center_name, management_site, sub_function, LER, Company_name, Business_group, total_lines, titles, Value_center, Performance_center, Profit_center
- I'd personally use something like columns = zip(*reader) and then define each value once: cost_center = columns[0]. This would make total_lines a bit more finicky tho. – Jun 4, 2019 at 10:39
- @Peilonrayz When I read LER.append(line[41]) and there are only 10 columns of interest, I'm not sure this is really worth it. – 301_Moved_Permanently, Jun 4, 2019 at 12:46
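For illustration, a minimal sketch of the transposing idea from the first comment, assuming the same Data.csv layout as above; note that in Python 3 zip returns an iterator, so it has to be wrapped in list() before indexing, and the rows must still be kept around to build total_lines:

import csv

with open("./Data.csv") as in_file:
    next(in_file)                 # skip blank line
    reader = csv.reader(in_file)
    titles = next(reader)         # header row
    rows = list(reader)

columns = list(zip(*rows))        # transpose: one tuple per column
cost_center = columns[0]
cost_center_name = columns[1]
# total_lines is the finicky part: it still needs the original rows
total_lines = {row[0]: row[1:] for row in rows}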
Doing

def get_dupes(df):
    if sum(df.loc[1] == 'No') < 2:
        return None
    else:
        return list(df.loc[:, df.loc[1] == 'No'].columns)

df.groupby(axis=1, by=df.loc[0]).apply(get_dupes)

Got me

0
124            None
123    [1234, 1235]
dtype: object
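This assumes the dictionary has already been loaded into a pandas DataFrame named df; a minimal sketch of that setup (the construction is my assumption, it is not shown in the answer):

import pandas as pd

# columns are the unique IDs; row 0 holds the company code and
# row 1 the 'Yes'/'No' flag, matching df.loc[0] and df.loc[1] above
Duplicate_combos = {1234: ['123', 'No'], 1235: ['123', 'No'],
                    1236: ['123', 'Yes'], 1237: ['124', 'No']}
df = pd.DataFrame(Duplicate_combos)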
Your question wasn't quite clear on what you want the output to be if there are multiple company codes with duplicate values (e.g. if the input is {1234: ['123', 'No'], 1235: ['123', 'No'], 1236: ['123', 'Yes'], 1237: ['124', 'No'], 1238: ['124', 'No']}, do you want [1234, 1235, 1237, 1238] or [[1234, 1235], [1237, 1238]]?), so you can modify this code accordingly.
- You could just take a look at how the current code behaves to understand what output is expected... – Vogel612, Jun 4, 2019 at 10:05
- You have presented an alternative solution, but haven't reviewed the code. Please edit to show what aspects of the question code prompted you to write this version, and in what ways it's an improvement over the original. It may be worth (re-)reading How to Answer. – Toby Speight, Jun 4, 2019 at 10:07
- Where does Duplicate_combos come from? The right performance fix would likely involve putting that data into a more appropriate data structure for this task.