I have made some adjustments to some code that you can see on this thread:
Read daily files and concatenate them
I would like to make some further refinements, make sure I am on the right track, and organize some of my files better. I added some functions and made some other changes to prevent errors and make my code easier to reuse (and more Pythonic).
#got rid of import *
import pandas as pd
import numpy as np
import datetime as dt
ftploc = r'C:\Users\FTP\\'
loc = r'C:\Users\\'
splitsname = 'Splits'
fcrname = 'fcr_report_'
npsname = 'csat_report_'
ahtname = 'aht_report_'
rostername = 'Daily_Roster'
vasname = 'vas_report_'
ext ='.csv'
#had to create some periods and date format parameters
start_period = '13 day'
end_period = '1 day'
fcr_period = '3 day'
date_format1 = '%m_%d_%Y'
date_format2 = '%Y_%m_%d'
start_date = dt.date.today() - pd.Timedelta(start_period)
end_date = dt.date.today() - pd.Timedelta(end_period)
fcr_end_date = end_date - pd.Timedelta(fcr_period)
daterange1 = pd.Timestamp(end_date) - pd.Timestamp(start_date)
daterange2 = pd.Timestamp(fcr_end_date) - pd.Timestamp(start_date)
daterange1 = int(daterange1 / np.timedelta64(1, 'D'))
daterange2 = int(daterange2 / np.timedelta64(1, 'D'))
print('Starting scrubbing file...')
#AHT files have a different date format in the filename so I made this function
def dateFormat(filename):
    if filename == ahtname:
        return date_format2
    else:
        return date_format1
#FCR is 3 days delayed (72 hour window) so I needed to create some logic to adjust for it
def dateRange(filename):
    if filename == fcrname:
        return daterange2
    else:
        return daterange1
#this function works on all of my files now. I just wonder if there is a better way to refer to the other functions? Is having a separate function for the date range and format ideal?
def readAndConcatFile(filename, daterange):
    df_list = []
    try:
        for date_range in (pd.Timestamp(start_date) + dt.timedelta(n) for n in range(dateRange(filename))):
            df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext, parse_dates = True)
            df_list.append(df)
        return pd.concat(df_list)
    except IOError:
        print('File does not exist: ', filename + date_range.strftime(dateFormat(filename)) + ext)
#this appears to work great. I lose the ability to read certain columns though or to specify dtypes for specific columns
nps = readAndConcatFile(npsname, daterange)
vas = readAndConcatFile(vasname, daterange)
fcr = readAndConcatFile(fcrname, daterange)
aht = readAndConcatFile(ahtname, daterange)
1 Answer
Please note that I'm not sure if I'm reading your script correctly. I compared this to the original to see what's lost from how you called the functions. If I've misunderstood entirely please let me know.
In your original script, one of your read_csv calls passed in a 'date_completed' key that you left out here to use one function for all files, but you can still get that information using a default value. Default values can be included in a function parameter list so that even if they're not supplied, the variable will exist. In your case it would be good to have a default value for parse_dates.
def readAndConcatFile(filename, daterange, parse_dates=True):
    df_list = []
    try:
        for date_range in (pd.Timestamp(start_date) + dt.timedelta(n) for n in range(dateRange(filename))):
            df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext, parse_dates = parse_dates)
            df_list.append(df)
        return pd.concat(df_list)
    except IOError:
        print('File does not exist: ', filename + date_range.strftime(dateFormat(filename)) + ext)
This means that in the absence of any value, True will be passed to parse_dates, as you're currently doing. However, you could also pass specific parameters like you did previously.
nps = readAndConcatFile(npsname, daterange, ['call_date','date_completed'])
vas = readAndConcatFile(vasname, daterange, ['Call_date'])
fcr = readAndConcatFile(fcrname, daterange, ['call_time'])
aht = readAndConcatFile(ahtname, daterange)
However, I noticed that you previously passed nothing at all to your call for aht, whereas you're now passing True. If that's something you'd like to avoid, that's easy with a slight modification. When you want to use a default to make a parameter optional, set the default to None, and then you can have a line that tests whether a parameter was passed or not.
def readAndConcatFile(filename, daterange, parse_dates=None):
    df_list = []
    try:
        for date_range in (pd.Timestamp(start_date) + dt.timedelta(n) for n in range(dateRange(filename))):
            # Only forward parse_dates when the caller actually supplied it
            if parse_dates is None:
                df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext)
            else:
                df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext, parse_dates = parse_dates)
            df_list.append(df)
        return pd.concat(df_list)
    except IOError:
        print('File does not exist: ', filename + date_range.strftime(dateFormat(filename)) + ext)
This means you no longer have to pass True to parse_dates for aht just to have the function work.
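The same None-default idea can also avoid the duplicated read_csv call, and it may help with the concern about losing per-file columns and dtypes. What follows is only a minimal sketch, not a drop-in replacement: the helper name read_and_concat, the **read_csv_kwargs parameter, and the demo file names and columns are all made up for illustration, and the function takes a ready-made list of paths instead of building them from your date logic. The only pandas features it relies on are the parse_dates, usecols and dtype arguments of read_csv.

```python
import os
import tempfile

import pandas as pd


def read_and_concat(paths, parse_dates=None, **read_csv_kwargs):
    """Read each CSV in `paths` and concatenate the results.

    parse_dates=None means "do not pass parse_dates at all", so pandas
    falls back to its own default for that argument. Any other keyword
    arguments (hypothetical pass-through) go straight to pd.read_csv.
    """
    if parse_dates is not None:
        read_csv_kwargs["parse_dates"] = parse_dates
    frames = [pd.read_csv(path, **read_csv_kwargs) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two throwaway files (made-up data).
tmp_dir = tempfile.mkdtemp()
paths = []
for day, score in (("2016-01-01", 5), ("2016-01-02", 3)):
    path = os.path.join(tmp_dir, "demo_" + day + ".csv")
    with open(path, "w") as handle:
        handle.write("call_date,agent_id,score\n")
        handle.write(day + ",007," + str(score) + "\n")
    paths.append(path)

# No optional arguments: pandas defaults apply to every column.
plain = read_and_concat(paths)

# Per-call options: parse one date column, keep two columns,
# and force agent_id to stay a string.
typed = read_and_concat(
    paths,
    parse_dates=["call_date"],
    usecols=["call_date", "agent_id"],
    dtype={"agent_id": str},
)
```

With this shape, each report can be read with its own options in a single call, while a report like aht can still be read with no extra arguments.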
I have used this function for so much now, thanks a ton. I did end up putting the try: except within the for date_range loop though. My issue was that sometimes a file was missing for a certain date (our client would just put that data as part of the next day) which would cause the df to be None. Placing the try: after the for loop allows me to create a dataframe still at least. – trench, Jan 27, 2016 at 14:52
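The change described in the comment above, moving the try/except inside the loop so that one missing file does not discard the whole result, could look roughly like this. It is a sketch under assumptions: read_available is a hypothetical name, the function takes a list of paths instead of building file names from dates, and returning the list of missing paths is an extra that the comment does not mention.

```python
import os
import tempfile

import pandas as pd


def read_available(paths, **read_csv_kwargs):
    """Concatenate every readable CSV in `paths`, skipping missing ones."""
    frames = []
    missing = []
    for path in paths:
        # The try sits inside the loop, so a missing file only skips
        # that one date instead of aborting the whole read.
        try:
            frames.append(pd.read_csv(path, **read_csv_kwargs))
        except IOError:  # FileNotFoundError is a subclass in Python 3
            print('File does not exist: ', path)
            missing.append(path)
    # pd.concat raises on an empty list, so fall back to an empty frame.
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, missing


# Demonstration: three expected files, the middle one never delivered.
tmp_dir = tempfile.mkdtemp()
paths = [os.path.join(tmp_dir, "demo_%d.csv" % n) for n in range(3)]
for n in (0, 2):
    with open(paths[n], "w") as handle:
        handle.write("call_id,score\n%d,5\n" % n)

combined, missing = read_available(paths)
```

Keeping the list of missing paths makes it easy to check afterwards which dates were absent, which seems useful when the client folds that data into the next day's file.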
read_csv and don't entirely follow what you're doing so it's my best attempt at using defaults without entirely understanding your intentions. If it's unclear please ask for clarifications but also please correct my assumptions so I can provide a better explanation.