
I am reading a large data file where the time is given as the number of days since some epoch. I am currently converting this to Python's datetime format with the following function:

import datetime as dt
import numpy as np

def days2dt(days_since_epoch):
    epoch = dt.datetime(1980, 1, 6)
    return [epoch + dt.timedelta(days=x) for x in days_since_epoch]

# run with sample data (may be larger in real life; in the worst case,
# multiply the list by 40 instead of 6)
sample = list(np.arange(0, 3/24., 1/24./3600./50.)) * 6
dates = days2dt(sample)

Running this function takes 5x longer than reading the entire file with pandas.read_csv() (perhaps because the list comprehension performs an addition for each element). The returned list is used immediately as the index of a pandas DataFrame. Interestingly, using a generator expression instead of the list comprehension improves performance by ~35% (why?).

Aside from using a generator expression, can the performance of this function be improved in any way, e.g. by not performing this date conversion per-element or by using some NumPy feature I'm not aware of?

asked Jan 15, 2015 at 9:36
  • Do you have example input (in size and distribution)? – Commented Jan 15, 2015 at 20:37
  • @Veedrac See the updated question. The data come from GPS satellites: say they have 50 Hz resolution and the satellite is visible for 3 hours twice a day. The example code I added would then be 3 days of data from one satellite (the values would of course not repeat like that in real life, but the example gives the correct amount of data and reflects the fact that it is not entirely contiguous). – Commented Jan 15, 2015 at 21:16

1 Answer


You should try NumPy's datetime64 and timedelta64 support:

import numpy as np

def days2dt(days_since_epoch):
    # scale fractional days to integer microseconds, then add as a timedelta
    microseconds = np.around(np.asarray(days_since_epoch) * (24*60*60*10**6))
    return np.datetime64('1980-01-06') + microseconds.astype('timedelta64[us]')

I suggest you read the units section of the docs to make sure this is safe (both resolution and min/max dates), but it should be fine.
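As a quick sanity check, here is a minimal sketch (using the days2dt name from the question) that compares the vectorised version against the original per-element list comprehension on a few sample offsets:

```python
import datetime as dt
import numpy as np

# vectorised conversion of fractional days since 1980-01-06 to datetime64
def days2dt(days_since_epoch):
    microseconds = np.around(np.asarray(days_since_epoch) * (24*60*60*10**6))
    return np.datetime64('1980-01-06') + microseconds.astype('timedelta64[us]')

# original per-element version, for comparison
epoch = dt.datetime(1980, 1, 6)
sample = [0.0, 0.5, 1.25]
expected = [epoch + dt.timedelta(days=x) for x in sample]

result = days2dt(sample)  # datetime64[us] array
# .tolist() converts datetime64[us] scalars back to datetime.datetime
assert result.tolist() == expected
```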

Note that 90% of the time taken by days2dt goes into converting the input list to a numpy.array; if you pass in a numpy.array instead, it goes much faster. Even so, this version is already significantly faster than the list comprehension.
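To illustrate, a sketch of how the caller could build the input as an ndarray up front (np.asarray is then a no-op), using sample data shaped like the question's:

```python
import numpy as np

# vectorised conversion, as suggested in the answer
def days2dt(days_since_epoch):
    microseconds = np.around(np.asarray(days_since_epoch) * (24*60*60*10**6))
    return np.datetime64('1980-01-06') + microseconds.astype('timedelta64[us]')

# build the input as an ndarray directly, skipping the list round-trip:
# ~3 hours of 50 Hz samples, as in the question's example
sample = np.arange(0, 3/24., 1/24./3600./50.)
dates = days2dt(sample)
# dates can then be used directly as a DataFrame index,
# e.g. pd.DatetimeIndex(dates)
```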

answered Jan 15, 2015 at 23:39
  • That's amazing, thanks! I notice a ~30x increase in speed for a relatively large array using code based on your suggestion. – Commented Jan 16, 2015 at 10:15
