Read CSV, with date filtering and resampling

Question 1

I have written the following function to read multiple CSV files into pandas dataframes. Depending on the use case, the user can pass optional resampling (frequency and method) and/or a date range (start/end). For both options I'd like to check if both keywords were given and raise errors if not.

My issue is that reading the CSV files can take quite a bit of time, and it's frustrating to get the ValueError 5 minutes after calling the function. I could duplicate the if statements at the top of the function. However, I'd like to know if there is a more readable way, or a best practice, that avoids having the same if/else statements multiple times.

import pandas as pd

def file_loader(filep,freq:str = None, method: str =None ,d_debut: str =None,d_fin: str =None):
    df= pd.read_csv(filep,\
        index_col=0,infer_datetime_format=True,parse_dates=[0],\
        header=0,names=['date',filep.split('.')[0]])\
        .sort_index()
    if d_debut is not None:
        if d_fin is None:
            raise ValueError("Please provide an end timestamp!")
        else:
            df=df.loc[ (df.index >= d_debut) & (df.index <= d_fin)]

    if freq is not None:
        if method is None:
            raise ValueError("Please provide a resampling method for the given frequency eg. 'last' ,'mean'")
        else:
            ## getattr is used to call ...resample(freq).last() etc. with the kwargs given as strings, e.g. freq='1D' and method='mean'
            df= getattr(df.resample(freq), method)()
    return df

Answer 1

It's normal to do the parameter checking first:

def file_loader(filep, freq: str = None, method: str = None,
                d_debut: str = None, d_fin: str = None):
    if d_debut is not None and d_fin is None:
        raise ValueError("Please provide an end timestamp!")
    if freq is not None and method is None:
        raise ValueError("Please provide a resampling method for the given frequency e.g. 'last', 'mean'")

You might prefer to check that d_fin and d_debut are either both provided or both defaulted, rather than allowing d_fin without d_debut as at present:

    if (d_debut is None) != (d_fin is None):
        raise ValueError("Please provide both start and end timestamps!")
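
Comparing two booleans with `!=` behaves as an exclusive-or, so the check fires only when exactly one bound is missing. A minimal sketch that enumerates the four combinations; `dates_inconsistent` is a hypothetical helper name used for illustration, not part of the original function:

```python
from itertools import product


def dates_inconsistent(d_debut, d_fin):
    # True when exactly one of the two bounds is missing (boolean XOR).
    return (d_debut is None) != (d_fin is None)


# Walk through every combination of "given" and "not given".
for d_debut, d_fin in product([None, "2021-01-01"], [None, "2021-01-31"]):
    verdict = "rejected" if dates_inconsistent(d_debut, d_fin) else "accepted"
    print(f"d_debut={d_debut!r:14} d_fin={d_fin!r:14} {verdict}")
```

Only the two mixed rows are rejected; both bounds given, or both omitted, pass.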

Then after loading, the conditionals are simple:

    if d_debut is not None:
        assert d_fin is not None  # would have thrown earlier
        # Element-wise masks must be combined with &, not `and`
        df = df.loc[(df.index >= d_debut) & (df.index <= d_fin)]

    if freq is not None:
        assert method is not None  # would have thrown earlier
        df = getattr(df.resample(freq), method)()

The assertions are there to document something we know to be true. You could omit them safely, but they do aid understanding.
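
Putting the pieces together, here is a minimal runnable sketch of the validate-first layout, under a few assumptions: it uses the stricter both-or-neither date check, it leaves out `infer_datetime_format` (deprecated in recent pandas versions), and the sample CSV contents are invented purely for the demonstration.

```python
import os
import tempfile

import pandas as pd


def file_loader(filep, freq=None, method=None, d_debut=None, d_fin=None):
    # 1. Validate the options first: cheap and independent of the file.
    if (d_debut is None) != (d_fin is None):
        raise ValueError("Please provide both start and end timestamps!")
    if freq is not None and method is None:
        raise ValueError("Please provide a resampling method, e.g. 'last', 'mean'")

    # 2. Only now do the slow part.
    df = pd.read_csv(filep, index_col=0, parse_dates=[0], header=0,
                     names=["date", filep.split(".")[0]]).sort_index()

    # 3. Apply the options, which are known to be consistent at this point.
    if d_debut is not None:
        df = df.loc[(df.index >= d_debut) & (df.index <= d_fin)]
    if freq is not None:
        df = getattr(df.resample(freq), method)()
    return df


# The file does not exist, yet we get ValueError rather than
# FileNotFoundError: the checks run before any I/O is attempted.
try:
    file_loader("no_such_file.csv", freq="1D")
except ValueError as err:
    print(err)

# A small made-up CSV to exercise the happy path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w") as fh:
        fh.write("date,value\n"
                 "2021-01-01 00:00,1\n"
                 "2021-01-01 12:00,3\n"
                 "2021-01-02 00:00,5\n")
    daily = file_loader(path, freq="1D", method="mean")
    print(daily)
```

The first call is the case from the question: a bad argument combination is reported immediately instead of after the read has finished.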

It may make sense to provide a useful default method rather than None. Similarly, could the (df.index >= d_debut) & (df.index <= d_fin) test be adapted to cope with one or the other cutoff missing? For example:

    # Apply each bound independently, so either one may be omitted
    if d_debut is not None:
        df = df.loc[df.index >= d_debut]
    if d_fin is not None:
        df = df.loc[df.index <= d_fin]

Then we wouldn't need the parameter checks at all.
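
A rough sketch of that lenient variant, working on an already-loaded frame so it can be tried without a CSV file. The name `filter_and_resample`, the default `method="last"`, and the sample data are all assumptions made for illustration:

```python
import pandas as pd


def filter_and_resample(df, freq=None, method="last", d_debut=None, d_fin=None):
    # Each bound is applied only if given, so no cross-parameter check is needed.
    if d_debut is not None:
        df = df.loc[df.index >= d_debut]
    if d_fin is not None:
        df = df.loc[df.index <= d_fin]
    # "last" is an assumed default; pick whatever suits the data.
    if freq is not None:
        df = getattr(df.resample(freq), method)()
    return df


# Six made-up observations, twelve hours apart.
index = pd.to_datetime([
    "2021-01-01 00:00", "2021-01-01 12:00",
    "2021-01-02 00:00", "2021-01-02 12:00",
    "2021-01-03 00:00", "2021-01-03 12:00",
])
df = pd.DataFrame({"value": range(6)}, index=index)

print(filter_and_resample(df, d_debut="2021-01-02"))  # open-ended on the right
print(filter_and_resample(df, freq="1D"))             # falls back to method="last"
```

The trade-off is the one raised in the comments: this version never raises for a missing argument, so callers who want to be forced to state both bounds and a method should keep the strict checks.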

Code style - please take more care with whitespace (consult PEP-8) for maximum readability. Lots of this code seems unnecessarily bunched-up.

Comment 1

Thanks for your answer. I'd rather the function force the user to understand exactly what data he is getting, hence the strict checks. I don't understand why you have assert statements. As you've commented, errors would have been raised at the top. And my bad for the styling.

Comment 2

The asserts are there to document something we know to be true. You could omit them safely, but they do aid understanding.

Toby Speight · Accepted Answer · 2021-10-06 14:20:00Z

