Reading groups of files and concatenating them

Question 1

I have made some adjustments to some code that you can see on this thread:

I would like to make some further refinements and make sure I am on the right track to organize my files better. I added some functions and made some other changes to prevent errors and to make my code easier to reuse (and more Pythonic).

#got rid of import *
import pandas as pd
import numpy as np
import datetime as dt
ftploc = r'C:\Users\FTP\\'
loc = r'C:\Users\\'
splitsname = 'Splits'
fcrname = 'fcr_report_'
npsname = 'csat_report_'
ahtname = 'aht_report_'
rostername = 'Daily_Roster'
vasname = 'vas_report_'
ext ='.csv'
#had to create some periods and date format parameters
start_period = '13 day'
end_period = '1 day'
fcr_period = '3 day'
date_format1 = '%m_%d_%Y'
date_format2 = '%Y_%m_%d'
start_date = dt.date.today() - pd.Timedelta(start_period)
end_date = dt.date.today() - pd.Timedelta(end_period)
fcr_end_date = end_date - pd.Timedelta(fcr_period)
#.days gives the whole-day count of each range as a plain int
daterange1 = (pd.Timestamp(end_date) - pd.Timestamp(start_date)).days
daterange2 = (pd.Timestamp(fcr_end_date) - pd.Timestamp(start_date)).days
print('Starting scrubbing file...')
#AHT files have a different date format in the filename so I made this function
def dateFormat(filename):
    if filename == ahtname:
        return date_format2
    else:
        return date_format1
#FCR is 3 days delayed (72 hour window) so I needed to create some logic to adjust for it
def dateRange(filename):
    if filename == fcrname:
        return daterange2
    else:
        return daterange1
#this function works on all of my files now. I just wonder if there is a better way to refer to the other functions? Is having a separate function for the date range and format ideal?
def readAndConcatFile(filename, daterange):
    df_list = []
    try:
        for date_range in (pd.Timestamp(start_date) + dt.timedelta(n) for n in range(dateRange(filename))):
            df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext, parse_dates = True)
            df_list.append(df)
        return pd.concat(df_list)
    except IOError:
        print('File does not exist: ', filename + date_range.strftime(dateFormat(filename)) + ext)
#this appears to work great. I lose the ability to read certain columns though or to specify dtypes for specific columns
nps = readAndConcatFile(npsname, daterange1)
vas = readAndConcatFile(vasname, daterange1)
fcr = readAndConcatFile(fcrname, daterange2)
aht = readAndConcatFile(ahtname, daterange1)

Question 2

Looks good! When you say "I lose the ability to read certain columns though or to specify dtypes for specific columns", could you explain why that is? Do you know about how to add optional parameters to functions?

Question 3

I mean a certain column might have a group of numbers like 72627877, which should be read as an object instead of as integers. And one of my files has about 50 columns, but I only care about 10 or so of them. Can I make different options? Would it be the same approach I took for the date format?

Question 4

Yeah, if you could show me some examples of optional parameters using my example above, I think it would help me get started so I can think about them in the right way. I looked them up in the documentation and I see what they are, but I am having trouble applying them to my own code.

Question 5

I have an answer posted. I'm not at all familiar with pandas or read_csv and don't entirely follow what you're doing so it's my best attempt at using defaults without entirely understanding your intentions. If it's unclear please ask for clarifications but also please correct my assumptions so I can provide a better explanation.

score 2 · Accepted Answer · 2015-09-11 16:01:20Z

Please note that I'm not sure if I'm reading your script correctly. I compared this to the original to see what's lost from how you called the functions. If I've misunderstood entirely please let me know.

In your original script, one of your read_csv calls passed in a 'date_completed' key that you left out here so that one function could serve all files, but you can still get that information using a default value. Default values can be included in a function's parameter list, so even if an argument isn't supplied, the variable will still exist. In your case it would be good to have a value for parse_dates.

def readAndConcatFile(filename, daterange, parse_dates=True):
    df_list = []
    try:
        for date_range in (pd.Timestamp(start_date) + dt.timedelta(n) for n in range(dateRange(filename))):
            df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext, parse_dates = parse_dates)
            df_list.append(df)
        return pd.concat(df_list)
    except IOError:
        print('File does not exist: ', filename + date_range.strftime(dateFormat(filename)) + ext)

This means that, in the absence of any value, True will be passed to parse_dates, as you're currently doing. However, you could also pass specific parameters like you did previously.

nps = readAndConcatFile(npsname, daterange1, ['call_date','date_completed'])
vas = readAndConcatFile(vasname, daterange1, ['Call_date'])
fcr = readAndConcatFile(fcrname, daterange2, ['call_time'])
aht = readAndConcatFile(ahtname, daterange1)
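
As a small, self-contained illustration of how the default behaves (this is not the asker's real data): the sketch below uses a hypothetical helper, `read_report`, and made-up in-memory CSV text. According to the pandas documentation, `parse_dates=True` only attempts to parse the index, while a list of column names parses those columns, so the two calls can give different dtypes for the same column.

```python
import io

import pandas as pd

# Made-up sample data; the column names are assumptions for illustration.
csv_text = "call_date,score\n2015-09-01,9\n2015-09-02,7\n"


def read_report(source, parse_dates=True):
    """Hypothetical helper: parse_dates falls back to True when omitted."""
    return pd.read_csv(source, parse_dates=parse_dates)


# No third argument: the default (True) is used, which only targets the index.
default_df = read_report(io.StringIO(csv_text))

# Explicit list: the named column is converted to datetimes.
parsed_df = read_report(io.StringIO(csv_text), ['call_date'])

print(default_df['call_date'].dtype)
print(parsed_df['call_date'].dtype)
```

If the date columns matter, passing the list explicitly is the safer choice, since the `True` default leaves ordinary columns unparsed.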

However I noticed that you previously passed nothing at all to your call for aht whereas you're now passing True. If that's something you'd like to avoid, that's easy with a slight modification. When you want to use a default to make a parameter optional, set the default as None and then you can have a line where you test whether there was a parameter passed or not.

def readAndConcatFile(filename, daterange, parse_dates=None):
    df_list = []
    try:
        for date_range in (pd.Timestamp(start_date) + dt.timedelta(n) for n in range(dateRange(filename))):
            if parse_dates is None:
                df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext)
            else:
                df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext, parse_dates = parse_dates)
            df_list.append(df)
        return pd.concat(df_list)
    except IOError:
        print('File does not exist: ', filename + date_range.strftime(dateFormat(filename)) + ext)

This means you no longer have to pass True to parse_dates for aht just to have the function work.
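
The same idea may extend to the other options mentioned in the comments (reading an ID column as text and keeping only a few columns). One possible approach, sketched here under assumptions, is to collect any extra keyword arguments and forward them to `pd.read_csv`, whose `dtype` and `usecols` parameters cover those two cases. The function name `read_and_concat`, its parameters, and the demo files and column names are all hypothetical, not taken from the original script.

```python
import datetime as dt
import os
import tempfile

import pandas as pd


def read_and_concat(folder, prefix, start, days, date_format='%m_%d_%Y',
                    ext='.csv', **read_csv_kwargs):
    """Read one CSV per day and concatenate the results.

    Any extra keyword arguments (dtype, usecols, parse_dates, ...) are
    forwarded unchanged to pd.read_csv.
    """
    frames = []
    for n in range(days):
        day = start + dt.timedelta(days=n)
        path = os.path.join(folder, prefix + day.strftime(date_format) + ext)
        frames.append(pd.read_csv(path, **read_csv_kwargs))
    return pd.concat(frames, ignore_index=True)


# Demo with two throwaway files containing made-up data.
demo_dir = tempfile.mkdtemp()
for day, row in [('09_01_2015', '0072627877,9,ignored'),
                 ('09_02_2015', '0072627878,7,ignored')]:
    with open(os.path.join(demo_dir, 'csat_report_' + day + '.csv'), 'w') as fh:
        fh.write('account,score,unused\n' + row + '\n')

nps = read_and_concat(demo_dir, 'csat_report_', dt.date(2015, 9, 1), 2,
                      usecols=['account', 'score'],  # keep only these columns
                      dtype={'account': str})        # keep IDs as text
print(nps)
```

With this shape, a call that passes no extra keywords behaves like a plain `pd.read_csv` for each file, so there is no need for a separate `None` check per option.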

I have used this function for so much now, thanks a ton. I did end up putting the try/except inside the for date_range loop, though. My issue was that sometimes a file was missing for a certain date (our client would just include that data with the next day's file), which made the function return None instead of a dataframe. Placing the try: inside the for loop lets me still build a dataframe from the files that do exist.
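
A minimal sketch of that arrangement, with the try/except wrapped around each individual read so that one missing file does not discard the rest. The name `read_available`, its parameters, the returned list of missing paths, and the demo files are assumptions made for illustration; the commenter's actual code was not posted.

```python
import datetime as dt
import os
import tempfile

import pandas as pd


def read_available(folder, prefix, start, days, date_format='%m_%d_%Y',
                   ext='.csv', **read_csv_kwargs):
    """Read whichever daily files exist; report the ones that do not."""
    frames, missing = [], []
    for n in range(days):
        day = start + dt.timedelta(days=n)
        path = os.path.join(folder, prefix + day.strftime(date_format) + ext)
        try:
            frames.append(pd.read_csv(path, **read_csv_kwargs))
        except IOError:
            # In Python 3, IOError is an alias of OSError, which
            # FileNotFoundError subclasses, so a missing file lands here.
            missing.append(path)
    for path in missing:
        print('File does not exist:', path)
    # pd.concat raises ValueError on an empty list, so guard that case.
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, missing


# Demo: files exist for day 1 and day 3, but not for day 2 (made-up data).
demo_dir = tempfile.mkdtemp()
for day in ('09_01_2015', '09_03_2015'):
    with open(os.path.join(demo_dir, 'vas_report_' + day + '.csv'), 'w') as fh:
        fh.write('Call_date,calls\n2015-09-01,5\n')

vas, missing_files = read_available(demo_dir, 'vas_report_',
                                    dt.date(2015, 9, 1), 3)
print(len(vas), 'rows read;', len(missing_files), 'file(s) missing')
```

Returning the missing paths alongside the data is one design option; it lets the caller decide whether a gap is acceptable rather than relying on printed messages alone.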
