I have written the following function to read multiple CSV files into pandas dataframes. Depending on the use case, the user can pass optional resampling (frequency and method) and/or a date range (start/end). For both options I'd like to check if both keywords were given and raise errors if not.
My issue is that reading the CSV files can potentially take quite a bit of time, and it's quite frustrating to get the ValueError 5 minutes after you've called the function. I could duplicate the if statements at the top of the function. However, I'd like to know if there is a more readable way or a best practice that avoids having the same if/else statements multiple times.
import pandas as pd

def file_loader(filep,freq:str = None, method: str =None ,d_debut: str =None,d_fin: str =None):
    df= pd.read_csv(filep,\
        index_col=0,infer_datetime_format=True,parse_dates=[0],\
        header=0,names=['date',filep.split('.')[0]])\
        .sort_index()
    if d_debut is not None:
        if d_fin is None:
            raise ValueError("Please provide an end timestamp!")
        else:
            df=df.loc[ (df.index >= d_debut) & (df.index <= d_fin)]
    if freq is not None:
        if method is None:
            raise ValueError("Please provide a resampling method for the given frequency eg. 'last' ,'mean'")
        else:
            ## getattr is used to call ...resample(freq).last() etc. with the kwargs given as strings, e.g. freq='1D' and method='mean'
            df= getattr(df.resample(freq), method)()
    return df
1 Answer
It's normal to do the parameter checking first:
def file_loader(filep, freq: str = None, method: str = None,
                d_debut: str = None, d_fin: str = None):
    if d_debut is not None and d_fin is None:
        raise ValueError("Please provide an end timestamp!")
    if freq is not None and method is None:
        raise ValueError("Please provide a resampling method for the given frequency e.g. 'last' ,'mean'")
You might prefer to check that d_fin and d_debut are either both provided or both defaulted, rather than allowing d_fin without d_debut as at present:
    if (d_debut is None) != (d_fin is None):
        raise ValueError("Please provide both start and end timestamps!")
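The != comparison in that check behaves as an exclusive-or on the two "is missing" flags: it is true only when exactly one of the arguments is None. A minimal sketch of the idiom in isolation, using a hypothetical helper name (check_both_or_neither) that is not part of the original function:

```python
def check_both_or_neither(first, second, message):
    """Raise ValueError when exactly one of two optional arguments is given."""
    # (first is None) != (second is None) is True only when exactly one is None.
    if (first is None) != (second is None):
        raise ValueError(message)


MESSAGE = "Please provide both start and end timestamps!"

# Both given, or both omitted: accepted silently.
check_both_or_neither("2021-01-01", "2021-01-31", MESSAGE)
check_both_or_neither(None, None, MESSAGE)

# Exactly one given: rejected immediately.
try:
    check_both_or_neither("2021-01-01", None, MESSAGE)
except ValueError as err:
    print(err)
```

The same helper could be reused for the freq/method pair if you decide those should also be all-or-nothing.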
Then after loading, the conditionals are simple:
    if d_debut is not None:
        assert(d_fin is not None)  # would have thrown earlier
        # Element-wise masks must be combined with &, not `and`
        df = df.loc[(df.index >= d_debut) & (df.index <= d_fin)]
    if freq is not None:
        assert(method is not None)  # would have thrown earlier
        df = getattr(df.resample(freq), method)()
The assertions are there to document something we know to be true. You could omit them safely, but they do aid understanding.
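Putting the pieces together, here is one possible sketch of the whole function with the checks moved ahead of the slow read. It assumes the stricter both-or-neither rule for the date range, leaves the assertions out for brevity, and drops infer_datetime_format, which recent pandas versions deprecate; none of these choices is the only reasonable one.

```python
import pandas as pd


def file_loader(filep, freq=None, method=None, d_debut=None, d_fin=None):
    # Validate the arguments before touching the file, so mistakes fail fast.
    if (d_debut is None) != (d_fin is None):
        raise ValueError("Please provide both start and end timestamps!")
    if freq is not None and method is None:
        raise ValueError("Please provide a resampling method for the "
                         "given frequency, e.g. 'last', 'mean'")

    # The slow part only runs once the arguments are known to be consistent.
    df = pd.read_csv(filep, index_col=0, parse_dates=[0], header=0,
                     names=['date', filep.split('.')[0]]).sort_index()

    if d_debut is not None:
        df = df.loc[(df.index >= d_debut) & (df.index <= d_fin)]
    if freq is not None:
        df = getattr(df.resample(freq), method)()
    return df


# The bad call is rejected before pandas even tries to open the file,
# so the (deliberately nonexistent) path is never looked at.
try:
    file_loader("does_not_exist.csv", d_debut="2021-01-01")
except ValueError as err:
    print("Rejected early:", err)
```

Because the ValueError is raised before read_csv is reached, the caller learns about the missing keyword immediately rather than after the load has finished.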
It may make sense to provide a useful default method rather than None. Similarly, could the (df.index >= d_debut) & (df.index <= d_fin) test be adapted to accept one or the other cutoff being missing? For example:
    if d_debut is not None or d_fin is not None:
        # & combines the masks element-wise; `and` would raise on arrays
        df = df.loc[(d_debut is None or df.index >= d_debut) &
                    (d_fin is None or df.index <= d_fin)]
Then we wouldn't need the parameter checks at all.
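If you go the lenient route, building the mask step by step may read more easily than a single combined expression. A minimal sketch, assuming a DataFrame with a DatetimeIndex; the helper name filter_dates and the sample data are made up for illustration:

```python
import numpy as np
import pandas as pd


def filter_dates(df, d_debut=None, d_fin=None):
    """Keep rows whose index lies within the optional, inclusive bounds."""
    # Start by keeping everything, then narrow for each bound that was given.
    mask = np.ones(len(df), dtype=bool)
    if d_debut is not None:
        mask &= df.index >= d_debut
    if d_fin is not None:
        mask &= df.index <= d_fin
    return df.loc[mask]


# Five daily rows, 2021-01-01 through 2021-01-05 (illustrative data only).
df = pd.DataFrame({"value": range(5)},
                  index=pd.date_range("2021-01-01", periods=5, freq="D"))

print(filter_dates(df, d_fin="2021-01-03"))    # only an upper bound
print(filter_dates(df, d_debut="2021-01-04"))  # only a lower bound
```

With neither bound given the mask stays all-True and the frame is returned unfiltered, so no special case is needed.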
Code style - please take more care with whitespace (consult PEP-8) for maximum readability. Lots of this code seems unnecessarily bunched-up.
Thanks for your answer. I'd rather the function force the user to understand exactly what data he is getting, hence the strict checks. I don't understand why you have assert statements. As you've commented, errors would have been raised at the top. And my bad for the styling. – kubatucka, Oct 6, 2021 at 14:29

The asserts are there to document something we know to be true. You could omit them safely, but they do aid understanding. – Toby Speight, Oct 6, 2021 at 14:40