Count occurrences of date-time in large CSV

Question 1

My input file is a CSV of varying size, up to around 10GB. The file has several fields, however I'm only interested in the third column, a date-time field. The date-times are in UTC timezone, and they are not ordered.

Example of values in column:

2017年08月03日T10:22:31.000Z
2017年08月03日T10:22:32.000Z
2017年08月03日T10:22:37.000Z
2017年08月03日T10:22:40.000Z
...

My desired output is a CSV which counts the number of date-times by hour, which have been converted to a specified user timezone.

example output file:

2017年08月01日 05:00:00,230
2017年08月01日 06:00:00,3340
2017年08月01日 07:00:00,4362
2017年08月01日 08:00:00,1747
2017年08月01日 09:00:00,5676
2017年08月01日 10:00:00,6955
...

Below is the working code I have written:

dates = {}
with open(myInputFile) as file: 
 reader = csv.reader(file)
 for row in reader:
 row_date = datetime.datetime.strptime(row[2],"%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=tz.gettz('UTC'))
 row_date = row_date.astimezone(tz.gettz(newTimezone)).replace(tzinfo=None)
 row_date = row_date.strftime("%Y-%m-%d %H:00:00") #Strips minute and below info
 if row_date in dates: 
 dates[row_date] += 1
 else: #If date not in dictionary add entry
 dates[row_date] = 1
rows = zip([k for k in sorted(dates)],[dates[k] for k in sorted(dates)]) #changes dict to date ordered zip
with open('WaypointCount.csv'),'w', newline='') as output: #saves to csv
 wr = csv.writer(output)
 for row in rows:
 wr.writerow(row)

Basically I use the csv module to read each entry line by line.

I convert the string into a datetime, and set it's timezone to UTC.

I then convert the datetime to the new Timezone. The reason I do that at this step rather than later is because some timezones are offset from UTC by 30 minutes, which would mess up my hourly grouping.

I then convert it back into a string, stripping away the minute, second and microsecond information, and add a count to my dictionary.

Once I have looped through every row I convert my dictionary into two lists ordered by the Date key, zip them together, and write to a new csv.

I feel like I've probably converted between too many datatypes. How can I improve this code so it follows best practices, and runs optimally?

Question 2

Syntax

Your code does not have valid Python syntax.
The first closing parentheses in this line should not be there:

with open('WaypointCount.csv'),'w', newline='') as output: #saves to csv

Improvements

Use `if name == 'main'`

If you run script code, enclose it within an if block checking for __main__:

if __name__ == '__main__':
 <your_code_here>

LBYL vs. default dicts

The check statement

if row_date in dates: 
 dates[row_date] += 1
else: #If date not in dictionary add entry
 dates[row_date] = 1

can be optimzed using a collections.defaultdict:

from collections import defaultdict
dates = defaultdict(int)
dates[row_date] += 1

The defaultdict will automatically initialize a non-existant key's value with 0, by calling the provided factory int without parameters iff the requested key does not exist.

Dict key and value sorting

The line

rows = zip([k for k in sorted(dates)],[dates[k] for k in sorted(dates)])

seems quite cumbersome to me.
You iterate and sort the dict twice and then zip the keys and values.
I suggest changing that to:

with open('WaypointCount.csv','w', newline='') as output: #saves to csv
 wr = csv.writer(output)
 for row in sorted(dates.items()):
 wr.writerow(row)

Use fitting datatypes

In the line

row_date = row_date.strftime("%Y-%m-%d %H:00:00") #Strips minute and below info

You already convert the datetime back to a str although you later will sort by it. While this should have the same beaviour iff the datetime string is in ISO format, you may want to store the actual datetime value instead of the string representation and later convert it into the desired output format.

with open('WaypointCount.csv','w', newline='') as output: #saves to csv
 wr = csv.writer(output)
 for timestamp, occurences in sorted(dict.items()):
 timestamp_str = timestamp.strftime("%Y-%m-%d %H:00:00")
 wr.writerow((timestamp_str, occurences))

PEP8

Last, but not least, consider PEP8.
Espacially your variable naming and line length.

Richard Neumann Richard Neumann 6,42122 silver badges41 bronze badges · Accepted Answer · 2017-09-13 16:26:36Z

Syntax

Your code does not have valid Python syntax.
The first closing parentheses in this line should not be there:

with open('WaypointCount.csv'),'w', newline='') as output: #saves to csv

Improvements

Use `if name == 'main'`

If you run script code, enclose it within an if block checking for __main__:

if __name__ == '__main__':
 <your_code_here>

LBYL vs. default dicts

The check statement

if row_date in dates: 
 dates[row_date] += 1
else: #If date not in dictionary add entry
 dates[row_date] = 1

can be optimzed using a collections.defaultdict:

from collections import defaultdict
dates = defaultdict(int)
dates[row_date] += 1

The defaultdict will automatically initialize a non-existant key's value with 0, by calling the provided factory int without parameters iff the requested key does not exist.

Dict key and value sorting

The line

rows = zip([k for k in sorted(dates)],[dates[k] for k in sorted(dates)])

seems quite cumbersome to me.
You iterate and sort the dict twice and then zip the keys and values.
I suggest changing that to:

with open('WaypointCount.csv','w', newline='') as output: #saves to csv
 wr = csv.writer(output)
 for row in sorted(dates.items()):
 wr.writerow(row)

Use fitting datatypes

In the line

row_date = row_date.strftime("%Y-%m-%d %H:00:00") #Strips minute and below info

You already convert the datetime back to a str although you later will sort by it. While this should have the same beaviour iff the datetime string is in ISO format, you may want to store the actual datetime value instead of the string representation and later convert it into the desired output format.

with open('WaypointCount.csv','w', newline='') as output: #saves to csv
 wr = csv.writer(output)
 for timestamp, occurences in sorted(dict.items()):
 timestamp_str = timestamp.strftime("%Y-%m-%d %H:00:00")
 wr.writerow((timestamp_str, occurences))

PEP8

Last, but not least, consider PEP8.
Espacially your variable naming and line length.

Stack Exchange Network

Count occurrences of date-time in large CSV

1 Answer 1

Syntax

Improvements

Use `if name == 'main'`

LBYL vs. default dicts

Dict key and value sorting

Use fitting datatypes

PEP8

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Count occurrences of date-time in large CSV

1 Answer 1

Syntax

Improvements

Use if __name__ == '__main__'

LBYL vs. default dicts

Dict key and value sorting

Use fitting datatypes

PEP8

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Use `if name == 'main'`