For my student job, I have been logging work times with the org-mode
in emacs for quite some time. Now since I can only work from remote, I figured it would be nice to automatically use the entries from the .org files into readily-formatted entries. I am doing this because the job requires me to write an Excel sheet with the hours worked each month in a specific format:
- For each day, there is one entry
- Each entry features the date, start and ending time, pause in minutes, and hours worked.
With the help of the Emacs package org-clock-csv
I was able to generate CSV output containing start and end times including dates. I wrote a Python script to parse these into the desired format, and I feel there is a lot of room for improvement.
The input looks like this (testoutput.csv
):
organization,,,2020年04月03日 10:49,2020年04月03日 13:19,,,
some stuff,,,2020年04月03日 10:39,2020年04月03日 10:49,,,
more stuff,,,2020年04月02日 12:25,2020年04月02日 12:25,,,
some stuff,,,2020年04月02日 09:43,2020年04月02日 09:47,,,
other stuff,,,2020年04月02日 09:35,2020年04月02日 09:43,,,
organization,,,2020年03月27日 14:00,2020年03月27日 14:28,,,
Orga,,,2020年03月27日 09:10,2020年03月27日 09:42,,,
Orga,,,2020年03月23日 09:13,2020年03月23日 09:25,,,
Orga,,,2020年03月22日 09:56,2020年03月22日 10:03,,,
There are several things the code needs to do: summarize entries for each day, parse the times and dates, and compute total time worked as well as pause times. Pause times are the result of the difference of (latest end - earliest start) and the actual total time worked.
The result should look like this (testoutput_parsed.csv
, actual output of my script):
date,start,stop,pause (minutes),total (hours)
02.04.,09:35,12:25,158,00:12
03.04.,10:39,13:19,0,02:40
22.03.,09:56,10:03,0,00:07
23.03.,09:13,09:25,0,00:12
27.03.,09:10,14:28,258,01:00
As far as I can tell, the output is correct. However, I am looking for comments on code quality in terms of structure, complying with conventions and such.
Here is the actual code:
import datetime
from operator import itemgetter
import csv
def read_timestamps_from_csv(csv_filename, delim=','):
with open(csv_filename, 'r') as file:
times_list = []
for line in file:
# skip header
if 'task' in line:
continue
try:
start_str, stop_str = line.split(delim)[3:5]
start_time = datetime.datetime.strptime(start_str, '%Y-%m-%d %H:%M')
stop_time = datetime.datetime.strptime(stop_str, '%Y-%m-%d %H:%M')
times_list.append([start_time, stop_time])
except:
print(f'unable to parse this line: {line}')
return times_list
def summarize_timestamps(timestamp_pairs):
summary_stamps = []
for stamp_pair in timestamp_pairs:
# check if date is already in summary_stamps
if date_is_present(stamp_pair[0], summary_stamps):
# if so, add time and change end time
date_idx = get_date_index(stamp_pair[0], summary_stamps)
if summary_stamps[date_idx]['date'] == stamp_pair[0].date():
new_start = min(stamp_pair[0].time(), summary_stamps[date_idx]['start'])
new_stop = max(stamp_pair[1].time(), summary_stamps[date_idx]['stop'])
summary_stamps[date_idx]['start'] = new_start
summary_stamps[date_idx]['stop'] = new_stop
summary_stamps[date_idx]['total'] += stamp_pair[1] - stamp_pair[0]
else:
# if not, add a summary_stamp with start time, end time and time
summary_stamps.append({'date':stamp_pair[0].date(),
'start':stamp_pair[0].time(),
'stop':stamp_pair[1].time(),
'total':stamp_pair[1] - stamp_pair[0]
})
# add break field
for s, sumst in enumerate(summary_stamps):
stop_start_diff = datetime.datetime.combine(sumst['date'], sumst['stop']) - datetime.datetime.combine(sumst['date'], sumst['start'])
pause_time = stop_start_diff - sumst['total']
summary_stamps[s]['pause_min'], _ = divmod(pause_time.seconds, 60)
return summary_stamps
def date_is_present(timestamp, summary_stamps):
if summary_stamps == []:
return False
for summary_stamp in summary_stamps:
if summary_stamp['date'] == timestamp.date():
return True
# if no date is present:
return False
def get_date_index(timestamp, summary_stamps):
for s, summary_stamp in enumerate(summary_stamps):
if summary_stamp['date'] == timestamp.date():
return s
def parse_summary_stamps_to_entries(summary_stamps):
entry_list = [[] for i in range(len(summary_stamps))]
for s, sumst in enumerate(summary_stamps):
total_hours, rem = divmod(sumst['total'].seconds, 3600)
total_minutes, _ = divmod(rem, 60)
entry_list[s] = [
sumst['date'].strftime('%d.%m.'),
sumst['start'].strftime('%H:%M'),
sumst['stop'].strftime('%H:%M'),
sumst['pause_min'],
f'{total_hours:02}:{total_minutes:02}'
]
return entry_list
def sort_entries_by_date(entry_list):
return sorted(entry_list, key=itemgetter(0))
def write_times_to_csv(sorted_entries, fname_out, delim=','):
with open(fname_out, mode='w') as file:
writer = csv.writer(file, delimiter=delim)
for entry in sorted_entries:
writer.writerow(entry)
print(f'wrote csv file: {fname_out}')
if __name__ == '__main__':
fname_in = 'testoutput.csv'
fname_out = 'testoutput_parsed.csv'
timestamp_pairs = read_timestamps_from_csv(fname_in)
summary_stamps =summarize_timestamps(timestamp_pairs)
entry_list = parse_summary_stamps_to_entries(summary_stamps)
sorted_entries = sort_entries_by_date(entry_list)
header = ['date', 'start', 'stop', 'pause (minutes)', 'total (hours)']
#print(header)
#for entry in sorted_entries:
# print(entry)
sorted_entries = [header, *sorted_entries]
write_times_to_csv(sorted_entries, fname_out)
1 Answer 1
Pathlib
Rather than accepting csv_filename
as a string, accept it as a Path
. Then
with open(csv_filename, 'r') as file:
becomes
with csv_filename.open() as file:
Generator
Turn read_timestamps_from_csv
into a generator that yields 2-tuples:
from datetime import datetime
# ...
DATE_FMT = '%Y-%m-%d %H:%M'
def read_timestamps_from_csv(csv_filename: Path, delim: str=',') -> Iterable[Tuple[datetime, datetime]]:
with csv_filename.open() as file:
for line in file:
# skip header
if 'task' in line:
continue
try:
start_str, stop_str = line.split(delim)[3:5]
start_time = datetime.strptime(start_str, DATE_FMT)
stop_time = datetime.strptime(stop_str, DATE_FMT)
yield start_time, stop_time
except Exception:
print(f'Unable to parse this line: "{line}"')
Also note:
- Factor out a formatting constant
- Importing of the
datetime
symbol - Never
except
; at least catch anException
if not a more narrow type - Some type hints
summarize_timestamps
/ date_is_present
This:
for summary_stamp in summary_stamps:
if summary_stamp['date'] == timestamp.date():
return True
is a search in linear time - O(n) complexity, which is slow. To reduce this to constant time, or O(1), use a set
of dates and the in
operator; or alternately maintain a dictionary whose keys are the dates.
The dictionary approach would simplify your code in summarize_timestamps
. You have two loops. The first loop would still need to keep a dictionary since you're going back and mutating entries before being able to yield them.
Then, your last loop can further mutate to add the break field and yield there.
This can be more simplified if - instead of using a dictionary - you use an actual class, with attribute of date
, start
, stop
and total
. Also, this loop:
for stamp_pair in timestamp_pairs:
should immediately unpack that pair, i.e.
for start, stop in timestamp_pairs:
Time math
summary_stamps[s]['pause_min'], _ = divmod(pause_time.seconds, 60)
is a red flag.
You're throwing out the second return value from divmod
, so why call it at all? If you still wanted to do your own math, just use integer division - //
. However, you should nearly never be doing your own time math.
This is one of the many things that C# does better out-of-the-box than Python, but anyway: reading this documentation, the recommended method (without bringing in a third-party lib) is:
summary_stamps[s]['pause_min'] = pause_time // timedelta(minutes=1)
The same goes for
total_hours, rem = divmod(sumst['total'].seconds, 3600)
total_minutes, _ = divmod(rem, 60)
Redundant if
This:
if summary_stamps == []:
return False
should get deleted, because if that list is empty, the iteration will execute zero times and the return will be the same.
However, the whole function can be replaced with
td = timestamp.date()
return any(stamp['date'] == td for stamp in summary_stamps)
Sorting
Have you tried
return sorted(entry_list, key=itemgetter(0))
without the key
? The default behaviour is to sort on the first item of a tuple.
-
\$\begingroup\$ Thank you very much, your advice is very helpful to me! How exactly can
summarize_timestamps
orsummary_stamps
be turned into a generator? The difference to me is that for each function call, there may be several inputs necessary to generate a single output. This would require multiple runs of the outer for loop. Did I miss something? \$\endgroup\$Jonas Schwarz– Jonas Schwarz2020年04月06日 14:52:57 +00:00Commented Apr 6, 2020 at 14:52 -
\$\begingroup\$ Also, the type hints for
read_timestamps_from_csv
seem to require me to importIterable
andTuple
. Is it generally recommended to add imports for type hints? \$\endgroup\$Jonas Schwarz– Jonas Schwarz2020年04月06日 14:55:24 +00:00Commented Apr 6, 2020 at 14:55 -
1\$\begingroup\$ Is it generally recommended to add imports for type hints? - Using type hints, in general, is a good idea. There are some differing opinions on how this is done. Some prefer string-style type hints (i.e.
csv_filename: 'Path'
) and others bare symbols as I have shown. Some prefer that the import fromtyping
be done in an unconditional manner, and others prefer to wrap their typing imports in anif typing.TYPE_CHECKING
clause. Some prefer qualified type references liketyping.Tuple
and others justTuple
. \$\endgroup\$Reinderien– Reinderien2020年04月06日 15:37:47 +00:00Commented Apr 6, 2020 at 15:37 -
1\$\begingroup\$ You can experiment a little, but my preference is - bare, not string; unconditional; and unqualified. \$\endgroup\$Reinderien– Reinderien2020年04月06日 15:37:50 +00:00Commented Apr 6, 2020 at 15:37
-
1\$\begingroup\$ As for
summary_stamps
- I've made an edit. \$\endgroup\$Reinderien– Reinderien2020年04月06日 16:00:21 +00:00Commented Apr 6, 2020 at 16:00