Parse org-mode clocktable CSVs into time sheets

Question 1

For my student job, I have been logging work times with the org-mode in emacs for quite some time. Now since I can only work from remote, I figured it would be nice to automatically use the entries from the .org files into readily-formatted entries. I am doing this because the job requires me to write an Excel sheet with the hours worked each month in a specific format:

For each day, there is one entry
Each entry features the date, start and ending time, pause in minutes, and hours worked.

With the help of the Emacs package org-clock-csv I was able to generate CSV output containing start and end times including dates. I wrote a Python script to parse these into the desired format, and I feel there is a lot of room for improvement.

The input looks like this (testoutput.csv):

organization,,,2020年04月03日 10:49,2020年04月03日 13:19,,,
some stuff,,,2020年04月03日 10:39,2020年04月03日 10:49,,,
more stuff,,,2020年04月02日 12:25,2020年04月02日 12:25,,,
some stuff,,,2020年04月02日 09:43,2020年04月02日 09:47,,,
other stuff,,,2020年04月02日 09:35,2020年04月02日 09:43,,,
organization,,,2020年03月27日 14:00,2020年03月27日 14:28,,,
Orga,,,2020年03月27日 09:10,2020年03月27日 09:42,,,
Orga,,,2020年03月23日 09:13,2020年03月23日 09:25,,,
Orga,,,2020年03月22日 09:56,2020年03月22日 10:03,,,

There are several things the code needs to do: summarize entries for each day, parse the times and dates, and compute total time worked as well as pause times. Pause times are the result of the difference of (latest end - earliest start) and the actual total time worked.

The result should look like this (testoutput_parsed.csv, actual output of my script):

date,start,stop,pause (minutes),total (hours)
02.04.,09:35,12:25,158,00:12
03.04.,10:39,13:19,0,02:40
22.03.,09:56,10:03,0,00:07
23.03.,09:13,09:25,0,00:12
27.03.,09:10,14:28,258,01:00

As far as I can tell, the output is correct. However, I am looking for comments on code quality in terms of structure, complying with conventions and such.

Here is the actual code:

import datetime
from operator import itemgetter
import csv
def read_timestamps_from_csv(csv_filename, delim=','):
 with open(csv_filename, 'r') as file:
 times_list = []
 for line in file:
 # skip header
 if 'task' in line:
 continue
 try:
 start_str, stop_str = line.split(delim)[3:5]
 start_time = datetime.datetime.strptime(start_str, '%Y-%m-%d %H:%M')
 stop_time = datetime.datetime.strptime(stop_str, '%Y-%m-%d %H:%M')
 times_list.append([start_time, stop_time])
 except:
 print(f'unable to parse this line: {line}')
 return times_list
def summarize_timestamps(timestamp_pairs):
 summary_stamps = []
 for stamp_pair in timestamp_pairs:
 # check if date is already in summary_stamps
 if date_is_present(stamp_pair[0], summary_stamps):
 # if so, add time and change end time
 date_idx = get_date_index(stamp_pair[0], summary_stamps)
 if summary_stamps[date_idx]['date'] == stamp_pair[0].date():
 new_start = min(stamp_pair[0].time(), summary_stamps[date_idx]['start']) 
 new_stop = max(stamp_pair[1].time(), summary_stamps[date_idx]['stop'])
 summary_stamps[date_idx]['start'] = new_start
 summary_stamps[date_idx]['stop'] = new_stop
 summary_stamps[date_idx]['total'] += stamp_pair[1] - stamp_pair[0]
 else:
 # if not, add a summary_stamp with start time, end time and time
 summary_stamps.append({'date':stamp_pair[0].date(),
 'start':stamp_pair[0].time(),
 'stop':stamp_pair[1].time(),
 'total':stamp_pair[1] - stamp_pair[0]
 })
 # add break field
 for s, sumst in enumerate(summary_stamps):
 stop_start_diff = datetime.datetime.combine(sumst['date'], sumst['stop']) - datetime.datetime.combine(sumst['date'], sumst['start'])
 pause_time = stop_start_diff - sumst['total']
 summary_stamps[s]['pause_min'], _ = divmod(pause_time.seconds, 60)
 return summary_stamps
def date_is_present(timestamp, summary_stamps):
 if summary_stamps == []:
 return False
 for summary_stamp in summary_stamps:
 if summary_stamp['date'] == timestamp.date():
 return True
 # if no date is present:
 return False
def get_date_index(timestamp, summary_stamps):
 for s, summary_stamp in enumerate(summary_stamps):
 if summary_stamp['date'] == timestamp.date():
 return s
def parse_summary_stamps_to_entries(summary_stamps):
 entry_list = [[] for i in range(len(summary_stamps))]
 for s, sumst in enumerate(summary_stamps):
 total_hours, rem = divmod(sumst['total'].seconds, 3600)
 total_minutes, _ = divmod(rem, 60)
 entry_list[s] = [
 sumst['date'].strftime('%d.%m.'),
 sumst['start'].strftime('%H:%M'),
 sumst['stop'].strftime('%H:%M'),
 sumst['pause_min'],
 f'{total_hours:02}:{total_minutes:02}'
 ]
 return entry_list
def sort_entries_by_date(entry_list):
 return sorted(entry_list, key=itemgetter(0))
def write_times_to_csv(sorted_entries, fname_out, delim=','):
 with open(fname_out, mode='w') as file:
 writer = csv.writer(file, delimiter=delim)
 for entry in sorted_entries:
 writer.writerow(entry)
 print(f'wrote csv file: {fname_out}')
if __name__ == '__main__':
 fname_in = 'testoutput.csv'
 fname_out = 'testoutput_parsed.csv'
 timestamp_pairs = read_timestamps_from_csv(fname_in)
 summary_stamps =summarize_timestamps(timestamp_pairs)
 entry_list = parse_summary_stamps_to_entries(summary_stamps)
 sorted_entries = sort_entries_by_date(entry_list)
 header = ['date', 'start', 'stop', 'pause (minutes)', 'total (hours)']
 #print(header)
 #for entry in sorted_entries:
 # print(entry)
 sorted_entries = [header, *sorted_entries]
 write_times_to_csv(sorted_entries, fname_out)

Question 2

Pathlib

Rather than accepting csv_filename as a string, accept it as a Path. Then

with open(csv_filename, 'r') as file:

becomes

with csv_filename.open() as file:

Generator

Turn read_timestamps_from_csv into a generator that yields 2-tuples:

from datetime import datetime
# ...
DATE_FMT = '%Y-%m-%d %H:%M'
def read_timestamps_from_csv(csv_filename: Path, delim: str=',') -> Iterable[Tuple[datetime, datetime]]:
 with csv_filename.open() as file:
 for line in file:
 # skip header
 if 'task' in line:
 continue
 try:
 start_str, stop_str = line.split(delim)[3:5]
 start_time = datetime.strptime(start_str, DATE_FMT)
 stop_time = datetime.strptime(stop_str, DATE_FMT)
 yield start_time, stop_time
 except Exception:
 print(f'Unable to parse this line: "{line}"')

Also note:

Factor out a formatting constant
Importing of the datetime symbol
Never except; at least catch an Exception if not a more narrow type
Some type hints

`summarize_timestamps` / `date_is_present`

This:

for summary_stamp in summary_stamps:
 if summary_stamp['date'] == timestamp.date():
 return True

is a search in linear time - O(n) complexity, which is slow. To reduce this to constant time, or O(1), use a set of dates and the in operator; or alternately maintain a dictionary whose keys are the dates.

The dictionary approach would simplify your code in summarize_timestamps. You have two loops. The first loop would still need to keep a dictionary since you're going back and mutating entries before being able to yield them.

Then, your last loop can further mutate to add the break field and yield there.

This can be more simplified if - instead of using a dictionary - you use an actual class, with attribute of date, start, stop and total. Also, this loop:

for stamp_pair in timestamp_pairs:

should immediately unpack that pair, i.e.

for start, stop in timestamp_pairs:

Time math

 summary_stamps[s]['pause_min'], _ = divmod(pause_time.seconds, 60)

is a red flag.

You're throwing out the second return value from divmod, so why call it at all? If you still wanted to do your own math, just use integer division - //. However, you should nearly never be doing your own time math.

This is one of the many things that C# does better out-of-the-box than Python, but anyway: reading this documentation, the recommended method (without bringing in a third-party lib) is:

summary_stamps[s]['pause_min'] = pause_time // timedelta(minutes=1)

The same goes for

 total_hours, rem = divmod(sumst['total'].seconds, 3600)
 total_minutes, _ = divmod(rem, 60)

Redundant `if`

This:

if summary_stamps == []:
 return False

should get deleted, because if that list is empty, the iteration will execute zero times and the return will be the same.

However, the whole function can be replaced with

td = timestamp.date()
return any(stamp['date'] == td for stamp in summary_stamps)

Sorting

Have you tried

return sorted(entry_list, key=itemgetter(0))

without the key? The default behaviour is to sort on the first item of a tuple.

Question 3

Thank you very much, your advice is very helpful to me! How exactly can summarize_timestamps or summary_stamps be turned into a generator? The difference to me is that for each function call, there may be several inputs necessary to generate a single output. This would require multiple runs of the outer for loop. Did I miss something?

Question 4

Also, the type hints for read_timestamps_from_csv seem to require me to import Iterable and Tuple. Is it generally recommended to add imports for type hints?

Question 5

Is it generally recommended to add imports for type hints? - Using type hints, in general, is a good idea. There are some differing opinions on how this is done. Some prefer string-style type hints (i.e. csv_filename: 'Path') and others bare symbols as I have shown. Some prefer that the import from typing be done in an unconditional manner, and others prefer to wrap their typing imports in an if typing.TYPE_CHECKING clause. Some prefer qualified type references like typing.Tuple and others just Tuple.

Question 6

You can experiment a little, but my preference is - bare, not string; unconditional; and unqualified.

Question 7

As for summary_stamps - I've made an edit.

Reinderien Reinderien 71k5 gold badges76 silver badges256 bronze badges · Accepted Answer · 2020-04-06 01:12:48Z

Pathlib

Rather than accepting csv_filename as a string, accept it as a Path. Then

with open(csv_filename, 'r') as file:

becomes

with csv_filename.open() as file:

Generator

Turn read_timestamps_from_csv into a generator that yields 2-tuples:

from datetime import datetime
# ...
DATE_FMT = '%Y-%m-%d %H:%M'
def read_timestamps_from_csv(csv_filename: Path, delim: str=',') -> Iterable[Tuple[datetime, datetime]]:
 with csv_filename.open() as file:
 for line in file:
 # skip header
 if 'task' in line:
 continue
 try:
 start_str, stop_str = line.split(delim)[3:5]
 start_time = datetime.strptime(start_str, DATE_FMT)
 stop_time = datetime.strptime(stop_str, DATE_FMT)
 yield start_time, stop_time
 except Exception:
 print(f'Unable to parse this line: "{line}"')

Also note:

Factor out a formatting constant
Importing of the datetime symbol
Never except; at least catch an Exception if not a more narrow type
Some type hints

`summarize_timestamps` / `date_is_present`

This:

for summary_stamp in summary_stamps:
 if summary_stamp['date'] == timestamp.date():
 return True

is a search in linear time - O(n) complexity, which is slow. To reduce this to constant time, or O(1), use a set of dates and the in operator; or alternately maintain a dictionary whose keys are the dates.

The dictionary approach would simplify your code in summarize_timestamps. You have two loops. The first loop would still need to keep a dictionary since you're going back and mutating entries before being able to yield them.

Then, your last loop can further mutate to add the break field and yield there.

This can be more simplified if - instead of using a dictionary - you use an actual class, with attribute of date, start, stop and total. Also, this loop:

for stamp_pair in timestamp_pairs:

should immediately unpack that pair, i.e.

for start, stop in timestamp_pairs:

Time math

 summary_stamps[s]['pause_min'], _ = divmod(pause_time.seconds, 60)

is a red flag.

You're throwing out the second return value from divmod, so why call it at all? If you still wanted to do your own math, just use integer division - //. However, you should nearly never be doing your own time math.

This is one of the many things that C# does better out-of-the-box than Python, but anyway: reading this documentation, the recommended method (without bringing in a third-party lib) is:

summary_stamps[s]['pause_min'] = pause_time // timedelta(minutes=1)

The same goes for

 total_hours, rem = divmod(sumst['total'].seconds, 3600)
 total_minutes, _ = divmod(rem, 60)

Redundant `if`

This:

if summary_stamps == []:
 return False

should get deleted, because if that list is empty, the iteration will execute zero times and the return will be the same.

However, the whole function can be replaced with

td = timestamp.date()
return any(stamp['date'] == td for stamp in summary_stamps)

Sorting

Have you tried

return sorted(entry_list, key=itemgetter(0))

without the key? The default behaviour is to sort on the first item of a tuple.

Thank you very much, your advice is very helpful to me! How exactly can summarize_timestamps or summary_stamps be turned into a generator? The difference to me is that for each function call, there may be several inputs necessary to generate a single output. This would require multiple runs of the outer for loop. Did I miss something?
Also, the type hints for read_timestamps_from_csv seem to require me to import Iterable and Tuple. Is it generally recommended to add imports for type hints?
Is it generally recommended to add imports for type hints? - Using type hints, in general, is a good idea. There are some differing opinions on how this is done. Some prefer string-style type hints (i.e. csv_filename: 'Path') and others bare symbols as I have shown. Some prefer that the import from typing be done in an unconditional manner, and others prefer to wrap their typing imports in an if typing.TYPE_CHECKING clause. Some prefer qualified type references like typing.Tuple and others just Tuple.
You can experiment a little, but my preference is - bare, not string; unconditional; and unqualified.

Stack Exchange Network

Parse org-mode clocktable CSVs into time sheets

1 Answer 1

Pathlib

Generator

`summarize_timestamps` / `date_is_present`

Time math

Redundant `if`

Sorting

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parse org-mode clocktable CSVs into time sheets

1 Answer 1

Pathlib

Generator

summarize_timestamps / date_is_present

Time math

Redundant if

Sorting

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`summarize_timestamps` / `date_is_present`

Redundant `if`