I am running through a file and inserting elements one by one. The counties all contain specific county codes which are duplicated many times throughout the file. I am looking for a way to assign these codes to a specific county while ignoring duplicate county codes.
I have two versions I wrote below with runtimes:
get_codes()

def get_codes(data):
    code_info = {}
    for row in data:
        county = row["county"]
        # int(float(...)) truncates values such as "27.7" to 27
        code = int(float(row["code"]))
        if code > 100000:
            code = code/100
        if county not in code_info:
            code_info[county] = []
        code_info[county].append(code)
    # remove duplicates once per county, after all rows are read
    for county in code_info:
        code_info[county] = list(set(code_info[county]))
    return code_info
get_codes2()

def get_codes2(data):
    code_info = {}
    for row in data:
        county = row["county"]
        code = int(float(row["code"]))
        if code > 100000:
            code = code/100
        if county in code_info:
            # skip codes already recorded for this county
            if code not in code_info[county]:
                code_info[county].append(code)
        else:
            code_info[county] = []
            code_info[county].append(code)
    return code_info

import csv
import time

county_data = csv.DictReader(open("county_file.txt"))
start = time.time()
county_codes = get_codes(county_data)
end = time.time()
print "run time: " + str(end-start)

county_data = csv.DictReader(open("county_file.txt"))
start = time.time()
county_codes = get_codes2(county_data)
end = time.time()
print "run time: " + str(end-start)

Also, it's probably obvious from this, but county codes that are greater than 100000 can have trailing zeroes accidentally added, so I'm removing them by dividing by 100. As another note, the int(float()) conversion is intentional. Sometimes the county codes are values such as "27.7" and need to be converted to "27"; other times they are just basic ints.
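The two cleanup rules described above could be pulled into one small helper. This is only a sketch: the name `normalize_code` is hypothetical, and it is written for Python 3, where `//` gives the integer division that `/` performs on ints in the Python 2 code above.

```python
def normalize_code(raw):
    """Convert a raw county-code string to an int, applying both cleanup rules."""
    # int(float(...)) truncates values such as "27.7" to 27.
    code = int(float(raw))
    # Codes above 100000 are assumed to carry two accidental trailing zeroes.
    if code > 100000:
        code //= 100
    return code
```

Keeping the conversion in one place means both versions of the function would share exactly the same rules.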
The runtimes on my system:
- get_codes: 9 seconds
- get_codes2: 14 seconds
How can I improve this further for better performance?
1 Answer
I think you're pretty close to optimal here. I made a few tweaks that avoid some of the conditionals by using sets. I don't actually know if it'll run faster; you'll need to benchmark it, and it likely depends on how many dupes there are per county.

def get_codes3(data):
    from collections import defaultdict
    code_info = defaultdict(set)
    for row in data:
        county = row["county"]
        # remove float() if you can get away with it - are there
        # actually values like '1.5' in the data?
        code = int(float(row["code"]))
        if code > 100000:
            code = code/100
        code_info[county].add(code)
    # if you can get away with sets instead of lists in your return
    # value, you're good
    return code_info
    # otherwise:
    # return dict([(county, list(info)) for county, info in code_info.iteritems()])

- You don't need float() unless the data is actually like "123.45" - it's not clear if that's the case
- sets can work in most places that lists work, so you might not need to convert to a list
- It might be worth it to write a script that does just the x = x/100 if x > 100000 part and writes that out to a new file
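The last suggestion, a one-off cleanup script, might look roughly like this. It is a sketch under assumptions: `clean_codes` is a hypothetical name, the input is taken to be a CSV with "county" and "code" columns as in the question, and the code targets Python 3 (hence `//`).

```python
import csv
import io

def clean_codes(src, dst):
    """Copy CSV rows from src to dst with the "code" column normalized."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        code = int(float(row["code"]))
        if code > 100000:
            code //= 100
        row["code"] = code
        writer.writerow(row)

# Usage with in-memory files; a real script would pass open file objects.
raw = io.StringIO("county,code\nAdams,27.7\nAdams,2700100\n")
cleaned = io.StringIO()
clean_codes(raw, cleaned)
```

Once the file is cleaned, the grouping function would only need `int(row["code"])`, at the cost of rerunning the script whenever the source file changes.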
- I ran a few benchmarks; the time difference was negligible (usually within a few hundredths of a second). To respond to a few of your questions: 1. Yes, I need int(float()) there. There are some values like "123.45" that need to be converted to "123", but most are already integers. You are definitely right though, I should have mentioned this in my original post. 2. Yeah, I – Mike J, Jan 25, 2012 at 19:07
- 2. You may be correct on the set issue, but I need to access the data structure by index later on, so I'd prefer to convert it to a list. 3. That would be the ideal solution, but the data in the file may change while the format issue won't. Your solution looks a lot more elegant than mine though! – Mike J, Jan 25, 2012 at 19:14
- @MikeJ, if you ran AdamKG's exact code for your benchmark, you should try it again with the import moved outside of the function. As I understand it, imports are actually pretty expensive. Also, try dropping the square brackets for the alternate return. There's not much reason to generate a list first. – Winston Ewert, Jan 25, 2012 at 19:31
- Additionally, you could write code = code / 100 as code /= 100 if you're going for brevity. It doesn't do much for performance, but it's quicker to read. – Elmer, Jan 26, 2012 at 15:02
defaultdict, which might save some time with the "if county in code_info" checks.
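The suggestions in the comments above (a module-level import, a generator expression instead of an intermediate list, and in-place division) could be combined roughly as follows. This is a sketch, not a benchmarked result: `get_codes4` is a hypothetical name, and the code assumes Python 3, where `.items()` replaces `.iteritems()` and `//=` is needed because `/=` on an int would produce a float.

```python
from collections import defaultdict  # imported once, at module level

def get_codes4(data):
    code_info = defaultdict(set)
    for row in data:
        code = int(float(row["code"]))
        if code > 100000:
            code //= 100  # integer division keeps the code an int
        code_info[row["county"]].add(code)
    # Generator expression: no intermediate list of tuples is built.
    return dict((county, list(codes)) for county, codes in code_info.items())

# Small in-memory sample standing in for csv.DictReader rows.
rows = [
    {"county": "Adams", "code": "27.7"},
    {"county": "Adams", "code": "27"},
    {"county": "Brown", "code": "2700100"},
]
result = get_codes4(rows)
```

The values come back as lists so they can be indexed later, as requested in the comments; the order of codes within each list is not guaranteed because it comes from a set.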