7
\$\begingroup\$

My input file is a CSV of varying size, up to around 10GB. The file has several fields, however I'm only interested in the third column, a date-time field. The date-times are in UTC timezone, and they are not ordered.

Example of values in column:

2017年08月03日T10:22:31.000Z
2017年08月03日T10:22:32.000Z
2017年08月03日T10:22:37.000Z
2017年08月03日T10:22:40.000Z
...

My desired output is a CSV which counts the number of date-times by hour, which have been converted to a specified user timezone.

example output file:

2017年08月01日 05:00:00,230
2017年08月01日 06:00:00,3340
2017年08月01日 07:00:00,4362
2017年08月01日 08:00:00,1747
2017年08月01日 09:00:00,5676
2017年08月01日 10:00:00,6955
...

Below is the working code I have written:

dates = {}
with open(myInputFile) as file: 
 reader = csv.reader(file)
 for row in reader:
 row_date = datetime.datetime.strptime(row[2],"%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=tz.gettz('UTC'))
 row_date = row_date.astimezone(tz.gettz(newTimezone)).replace(tzinfo=None)
 row_date = row_date.strftime("%Y-%m-%d %H:00:00") #Strips minute and below info
 if row_date in dates: 
 dates[row_date] += 1
 else: #If date not in dictionary add entry
 dates[row_date] = 1
rows = zip([k for k in sorted(dates)],[dates[k] for k in sorted(dates)]) #changes dict to date ordered zip
with open('WaypointCount.csv'),'w', newline='') as output: #saves to csv
 wr = csv.writer(output)
 for row in rows:
 wr.writerow(row)

Basically I use the csv module to read each entry line by line.

I convert the string into a datetime, and set it's timezone to UTC.

I then convert the datetime to the new Timezone. The reason I do that at this step rather than later is because some timezones are offset from UTC by 30 minutes, which would mess up my hourly grouping.

I then convert it back into a string, stripping away the minute, second and microsecond information, and add a count to my dictionary.

Once I have looped through every row I convert my dictionary into two lists ordered by the Date key, zip them together, and write to a new csv.

I feel like I've probably converted between too many datatypes. How can I improve this code so it follows best practices, and runs optimally?

301_Moved_Permanently
29.4k3 gold badges49 silver badges98 bronze badges
asked Sep 11, 2017 at 15:01
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

Syntax

Your code does not have valid Python syntax.
The first closing parentheses in this line should not be there:

with open('WaypointCount.csv'),'w', newline='') as output: #saves to csv

Improvements

Use if __name__ == '__main__'

If you run script code, enclose it within an if block checking for __main__:

if __name__ == '__main__':
 <your_code_here>

LBYL vs. default dicts

The check statement

if row_date in dates: 
 dates[row_date] += 1
else: #If date not in dictionary add entry
 dates[row_date] = 1

can be optimzed using a collections.defaultdict:

from collections import defaultdict
dates = defaultdict(int)
dates[row_date] += 1

The defaultdict will automatically initialize a non-existant key's value with 0, by calling the provided factory int without parameters iff the requested key does not exist.

Dict key and value sorting

The line

rows = zip([k for k in sorted(dates)],[dates[k] for k in sorted(dates)])

seems quite cumbersome to me.
You iterate and sort the dict twice and then zip the keys and values.
I suggest changing that to:

with open('WaypointCount.csv','w', newline='') as output: #saves to csv
 wr = csv.writer(output)
 for row in sorted(dates.items()):
 wr.writerow(row)

Use fitting datatypes

In the line

row_date = row_date.strftime("%Y-%m-%d %H:00:00") #Strips minute and below info

You already convert the datetime back to a str although you later will sort by it. While this should have the same beaviour iff the datetime string is in ISO format, you may want to store the actual datetime value instead of the string representation and later convert it into the desired output format.

with open('WaypointCount.csv','w', newline='') as output: #saves to csv
 wr = csv.writer(output)
 for timestamp, occurences in sorted(dict.items()):
 timestamp_str = timestamp.strftime("%Y-%m-%d %H:00:00")
 wr.writerow((timestamp_str, occurences))

PEP8

Last, but not least, consider PEP8.
Espacially your variable naming and line length.

answered Sep 13, 2017 at 16:26
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.