Parsing files with county codes

Question 1

I am running through a file and inserting elements one by one. The counties all contain specific county codes which are duplicated many times throughout the file. I am looking for a way to assign these codes to a specific county while ignoring duplicate county codes.

I have two versions I wrote below with runtimes:

get_codes()

def get_codes(data):
    code_info = {}
    for row in data:
        county = row["county"]
        code = int(float(row["code"]))
        if code > 100000:
            code = code/100
        if county not in code_info:
            code_info[county] = []
        code_info[county].append(code)
    for county in code_info:
        code_info[county] = list(set(code_info[county]))
    return code_info

get_codes2()

def get_codes2(data):
    code_info = {}
    for row in data:
        county = row["county"]
        code = int(float(row["code"]))
        if code > 100000:
            code = code/100
        if county in code_info:
            if code not in code_info[county]:
                code_info[county].append(code)
        else:
            code_info[county] = []
            code_info[county].append(code)
    return code_info

import csv
import time

# time get_codes
county_data = csv.DictReader(open("county_file.txt"))
start = time.time()
county_codes = get_codes(county_data)
end = time.time()
print "run time: " + str(end-start)

# time get_codes2 on a fresh reader
county_data = csv.DictReader(open("county_file.txt"))
start = time.time()
county_codes = get_codes2(county_data)
end = time.time()
print "run time: " + str(end-start)

Also, it's probably obvious from this, but county codes that are greater than 100000 can have trailing zeroes accidentally added, so I'm removing them by dividing by 100. As another note, the int(float()) conversion is intentional. Sometimes the county codes are values such as "27.7" and need to be converted to "27", other times they are just basic ints.
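
The normalization described above can be pulled into a small helper so it is easy to check in isolation. This is only a sketch in Python 3 syntax (the code in this thread is Python 2); the name `normalize_code` is made up for illustration, and `//` is used so the result stays an integer.

```python
def normalize_code(raw):
    """Turn a raw code string into an int.

    Drops any fractional part, then divides values above 100000 by 100,
    as described in the question.
    """
    code = int(float(raw))   # "27.7" and "27" both become 27
    if code > 100000:
        code //= 100         # floor division keeps the result an int
    return code
```

For example, `normalize_code("27.7")` and `normalize_code("27")` give the same integer, which is what makes the later de-duplication work.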

The runtimes on my system:

get_codes: 9 seconds
get_codes2: 14 seconds

How can I improve this further for better performance?

Question 2

Python 2.5 and later provide collections.defaultdict, which might save some time by removing the "if county in code_info" checks.
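
To illustrate the suggestion: `collections.defaultdict` creates the missing value on first access, so the explicit membership test goes away. A minimal sketch with made-up rows (Python 3 syntax; the sample data is hypothetical):

```python
from collections import defaultdict

# Hypothetical rows standing in for what csv.DictReader would yield.
rows = [
    {"county": "Adams", "code": "27.7"},
    {"county": "Adams", "code": "27"},
    {"county": "Brown", "code": "31"},
]

code_info = defaultdict(list)
for row in rows:
    # No "if county in code_info" check: a missing key gets an empty list.
    code_info[row["county"]].append(int(float(row["code"])))
```

Note that a list still collects duplicates here; removing them is a separate step.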

Question 3

I think you're pretty close to optimal here. I made a few tweaks, to avoid some conditionals by using sets instead. I don't actually know if it'll run faster; you'll need to benchmark it, and it likely depends on the breakdown of how many dupes per county there are.

def get_codes3(data):
    from collections import defaultdict
    code_info = defaultdict(set)
    for row in data:
        county = row["county"]
        # remove float() if you can get away with it - are there
        # actually values like '1.5' in the data?
        code = int(float(row["code"]))
        if code > 100000:
            code = code/100
        code_info[county].add(code)
    # if you can get away with sets instead of lists in your return
    # value, you're good
    return code_info
    # otherwise:
    # return dict([(county, list(info)) for county, info in code_info.iteritems()])

1. You don't need float() unless the data is actually like "123.45" - it's not clear if that's the case.
2. Sets can work in most places that lists work, so you might not need to convert to a list.
3. It might be worth it to write a script that does just the x = x/100 if x > 100000 part and writes that out to a new file.
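
The last suggestion, a one-off cleaning pass, could be sketched as follows. The function name `clean_codes` and the sample data are assumptions for illustration, the code is Python 3 syntax, and in-memory buffers stand in for the real input and output files.

```python
import csv
import io

def clean_codes(src, dst):
    """Copy CSV rows from src to dst with the "code" column normalized."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        code = int(float(row["code"]))
        if code > 100000:
            code //= 100
        row["code"] = code
        writer.writerow(row)

# In-memory stand-ins for the real input and output files.
src = io.StringIO("county,code\nAdams,27.7\nBrown,3100100\n")
dst = io.StringIO()
clean_codes(src, dst)
cleaned = dst.getvalue()
```

This only pays off if the cleaned file can be reused across runs.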

Question 4

I ran a few benchmarks; the time difference was negligible (usually within a few hundredths of a second). To respond to a few of your questions: 1. Yes, I need int(float()) there. There are some values like "123.45" that need to be converted to "123", but most are already integers. You are definitely right though, I should have mentioned this in my original post. 2. Yeah, I

Question 5

2. You may be correct on the set issue, but I need to access the data structure by index later on, so I'd prefer to convert it to a list. 3. That would be the ideal solution, but the data in the file may change while the format issue won't. Your solution looks a lot more elegant than mine though!
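
For context on the indexing point: a set has no positional access, so converting once at the end is the usual compromise. A small sketch with made-up values (Python 3 syntax); `sorted` is used here only so the order is predictable, which the thread does not ask for.

```python
codes = {31, 27, 45}

try:
    codes[0]                 # sets are unordered and cannot be indexed
    indexable = True
except TypeError:
    indexable = False

code_list = sorted(codes)    # a list, in ascending order
first = code_list[0]         # positional access now works
```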

Question 6

@MikeJ, if you ran AdamKG's exact code for your benchmark, you should try it again with the import moved outside of the function. As I understand it, imports are actually pretty expensive. Also, try dropping the square brackets for the alternate return. There's not much reason to generate a list first.
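
Both suggestions applied to `get_codes3` might look roughly like this. It is a sketch, not a benchmarked result: the name `get_codes4` is invented here, and it is written in Python 3 syntax (`//` and `.items()` in place of the thread's `/` and `.iteritems()`). Whether either change affects the timing is something to measure.

```python
from collections import defaultdict  # imported once, at module level

def get_codes4(data):
    code_info = defaultdict(set)
    for row in data:
        code = int(float(row["code"]))
        if code > 100000:
            code //= 100
        code_info[row["county"]].add(code)
    # Generator expression: no intermediate list of pairs is built.
    return dict((county, list(codes)) for county, codes in code_info.items())

# Hypothetical rows standing in for csv.DictReader output.
sample = [
    {"county": "Adams", "code": "27.7"},
    {"county": "Adams", "code": "27"},
    {"county": "Adams", "code": "3100100"},
    {"county": "Brown", "code": "45"},
]
result = get_codes4(sample)
```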

Question 7

Additionally, you could write code = code / 100 as code /= 100 if you're going for brevity. It doesn't do much for performance, but it's quicker to read.
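
One caveat worth checking if this code is ever run under Python 3: there `/` and `/=` are true division and produce a float, while `//=` keeps the integer result that the Python 2 code in this thread gets from `/` on ints. A quick illustration:

```python
code = 2700100
code //= 100        # floor division, stays an int

as_float = 2700100
as_float /= 100     # true division in Python 3, becomes a float
```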

AdamKG · Answer 1 · 2012-01-25 18:49:47Z

