Hope you can help me out with this one, because it is really slow. Is there a way to do this without loading the whole .csv into memory?
The thing is... I have files containing timeseries data with 10 columns. The first column is a datetime, the last an integer, and the rest are floats.
I am trying to join pairs of .csv files together. The filenames are:
- Myfile_1withdata
- Myfile_1withdata1
- Myfile_2withdata
- Myfile_2withdata1
- Myfile5_1withdata
- Myfile5_1withdata1
etc...
The files with a "1" at the end are the new files containing updated data that I want to append to the files without the "1", like "Myfile5_1withdata.csv".
Files can weigh up to 500MB, there are many of them, and it takes a long time to finish this process... Can it be faster?
Currently I have tried to accomplish this by doing:
import inspect
import pandas as pd
import glob, os

currentpath = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))

type_names = {'1withdata': "super", '2withdata': "extra"}
file_names = ["Myfile", "Myfile5"]

for a in file_names:
    for x in type_names.keys():
        results = pd.DataFrame([])
        for counter, file in enumerate(glob.glob(str(a)+'_'+str(x)+"*")):
            namedf = pd.read_csv(file, index_col=0, skiprows=0, dtype=str, usecols=[0,1,2,3,4,5,6,7,8,9], float_precision='high')
            results = results.append(namedf)  # Dataframe with data of all file_names files with the same type_names key
        print("saving: ", a, x)
        results = results[~results.index.duplicated(keep='last')]  # Remove duplicate row (last row with incomplete timeseries data)
        results.to_csv(a+'_'+str(x)+'.csv')

print("DONE!")

# Cleanup by deleting data files with updated data (the ones ending with numbers)
files = [file for file in glob.glob(currentpath+"//*.csv") if file[-5:-4].isdigit() == True]
for file in files:
    os.remove(file)
Edit 1:
Here's an example of the data inside the files:
Myfile_1withdata.csv
The_Date_Time,Float1,Float2,Float3,Float4,Float5,Float6,Float7,Float8,Integer
31/10/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
30/11/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
31/12/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
31/01/2002 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
28/02/2002 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
Myfile_1withdata1.csv
The_Date_Time,Float1,Float2,Float3,Float4,Float5,Float6,Float7,Float8,Integer
28/02/2002 22:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/03/2002 22:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
30/04/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/05/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
30/06/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/07/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
So after the whole operation the Myfile_1withdata.csv should look like:
The_Date_Time,Float1,Float2,Float3,Float4,Float5,Float6,Float7,Float8,Integer
31/10/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
30/11/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
31/12/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
31/01/2002 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
28/02/2002 22:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/03/2002 22:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
30/04/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/05/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
30/06/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/07/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
2 Answers
It is better to divide the problem into separate parts. Here that's:
- finding out what files need to be combined
- combining the data
- writing the output
The easiest way to do this is a chain of generators, akin to how input and output are piped between different unix commands (the full pipeline is wired together at the end).
current directory
For finding files, pathlib.Path is the easiest way in most cases
from pathlib import Path
data_dir = Path('.')
searching which files need to be combined:
re and itertools.groupby to the rescue
import re
import itertools
my_pattern = re.compile(r'Myfile(\d*)_(\d*)withdata(\d*)')
def find_files(data_dir):
    for file in data_dir.glob('*.csv'):
        yield my_pattern.match(file.name).groups(), file

def group_files(files):
    sorted_files = sorted(files)
    for group, data in itertools.groupby(sorted_files, key=lambda x: x[0][:2]):
        yield group, list(data)
This groups and sorts the data according to the numbers present
file_data = [
    (('', '2', ''), 'file0'),
    (('', '2', '1'), 'file1'),
    (('', '1', ''), 'file2'),
    (('', '1', '1'), 'file3'),
    (('1', '2', '1'), 'file4'),
    (('1', '2', '1'), 'file5'),
]

list(group_files(file_data))
[(('', '1'), [(('', '1', ''), 'file2'), (('', '1', '1'), 'file3')]),
 (('', '2'), [(('', '2', ''), 'file0'), (('', '2', '1'), 'file1')]),
 (('1', '2'), [(('1', '2', '1'), 'file4'), (('1', '2', '1'), 'file5')])]
Combining the data
def read_file(file):
    return pd.read_csv(file, index_col=0, skiprows=0, dtype=str, usecols=range(10), float_precision='high')
you can use DataFrame.update to update the info
def combine_files(grouped_files):
    for group, data in grouped_files:
        master_data_file = data.pop(0)[1]  # The one without a suffix sorts first
        master_data = read_file(master_data_file)
        for info, file in data:
            update_data = read_file(file)  # don't shadow the `data` list we're iterating over
            master_data.update(update_data, overwrite=True)  # update works in place and returns None
        yield group, master_data
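(For reference, a minimal sketch of what DataFrame.update does, on toy data rather than the question's files: it overwrites values for index labels present in both frames, works in place, and returns None; it does not append labels that only exist in the update frame, which matters for the comments below.)

import pandas as pd

old = pd.DataFrame({'value': [1.0, 2.0]}, index=['2002-01-31', '2002-02-28'])
new = pd.DataFrame({'value': [9.0, 3.0]}, index=['2002-02-28', '2002-03-31'])

old.update(new)  # modifies `old` in place and returns None
print(old)       # '2002-02-28' becomes 9.0, but '2002-03-31' is NOT added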
alternative combine
If, as @Mathias Ettinger states, update doesn't add new keys, you can try something like this:
def combine_files(grouped_files):
    for group, data in grouped_files:
        dataframes = (read_file(filename) for detail, filename in data)
        result = pd.concat(dataframes).drop_duplicates(subset=<important_columns or index>, keep='last')
        yield group, result
This might mess with the column order. If that is a problem, you'll have to reindex the result with the wanted column order.
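A minimal sketch of that reindex, where wanted_columns is a hypothetical variable holding the column order of the original (suffix-less) file:

result = result.reindex(columns=wanted_columns)  # restore the original column order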
write the results
def write_results(combined_data):
    for group, data in combined_data:
        filename = 'result_Myfile{}_{}withdata.csv'.format(*group)
        data.to_csv(filename)
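Wiring it all together, the whole job then becomes a single generator pipeline (using the functions defined above):

from pathlib import Path

data_dir = Path('.')
write_results(combine_files(group_files(find_files(data_dir))))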
- If I understand the documentation of DataFrame.update correctly, this won't add new timestamps to the original data, only update existing ones, right? – 301_Moved_Permanently, Feb 1, 2018
- I have not used that function for this purpose, and without sample data it's hard to test. "overwrite : boolean, default True – If True then overwrite values for common keys in the calling frame" seems to suggest you can overwrite. If not, this part of the function should be replaced by a concat and drop_duplicates. – Maarten Fabré, Feb 1, 2018
- Yes, you can overwrite, it seems to be the point of the function, but it also seems you can't extend the original dataframe with new indexes; which, to me, would defeat the purpose of the script. – 301_Moved_Permanently, Feb 1, 2018
- Really sorry, but I'm having a hard time using your functions correctly. Just trying to list the files gives an error "AttributeError: 'str' object has no attribute 'glob'". I tried print([x for x in find_files(os.path.dirname(os.path.realpath(sys.argv[0])))]), and when using from pathlib import Path; data_dir = Path('.'); print([x for x in find_files(data_dir)]) it returns an empty list "[]". – Cactus, Feb 1, 2018
- Path('.') normally points to the working directory from where you launched the script. If you need some other directory, you'll need to adapt this path. You can see where it points exactly by doing print(data_dir.resolve()). Instead of [x for x in <generator>] you can use list(<generator>). You can test the glob by doing list(p.glob(<pattern>)). You might have to adapt the pattern, possibly by adding **/ to it to make the glob recursive. – Maarten Fabré, Feb 1, 2018
Your method of retrieving the directory of the script is unusual; you may want to use the __file__ magic variable, populated when importing or running your script, instead of relying on the inspect module.
The use of glob.glob in a for loop seems like a waste of resources; you may want to consider glob.iglob instead.
You also happen to not use the values of the type_names dictionary, so why bother storing them? I would also turn the script into a function that accepts both kinds of names as parameters, to ease reusability and testing.
Lastly, you use the currentpath variable when removing files but not when combining them; you may want to fix that.
import os
import glob

import pandas as pd

CURRENT_PATH = os.path.dirname(os.path.abspath(__file__))

def combine_csv(prefix):
    result = pd.DataFrame([])
    for file in glob.iglob(prefix + '*'):
        namedf = pd.read_csv(file, index_col=0, skiprows=0, dtype=str, usecols=range(10), float_precision='high')
        result = result.append(namedf)
    result = result[~result.index.duplicated(keep='last')]
    result.to_csv(prefix + '.csv')

def main(file_names=('MyFile', 'MyFile5'), type_names=('1withdata', '2withdata')):
    for name in file_names:
        for kind in type_names:
            file_prefix = '{}_{}'.format(name, kind)
            combine_csv(os.path.join(CURRENT_PATH, file_prefix))
            print('Saved', file_prefix + '.csv')
    for file in glob.iglob(os.path.join(CURRENT_PATH, '*.csv')):
        if os.path.splitext(file)[0][-1].isdigit():
            os.remove(file)

if __name__ == '__main__':
    main()
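As for the "without loading the whole .csv into memory" part of the question: both versions above still read each group of files fully. If the update files only ever contain newer rows, a rough sketch of a chunked append is possible (note this does not handle the duplicated boundary row, which would still need separate treatment):

import pandas as pd

def append_in_chunks(master_path, update_path, chunksize=100_000):
    # Stream the update file onto the master in pieces instead of
    # concatenating whole DataFrames in memory.
    for chunk in pd.read_csv(update_path, chunksize=chunksize, dtype=str):
        chunk.to_csv(master_path, mode='a', header=False, index=False)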
- Tried your code with 4 files: "Myfile_1withdata" / "Myfile_1withdata1" and "Myfile5_1withdata" / "Myfile5_1withdata1". The data in "Myfile" joined correctly in the resulting .csv "Myfile_1withdata", but there's also another csv created, "MyFile_2withdata", with nothing but an empty string in it. The "Myfile5" data did not merge... "Myfile5_1withdata" is the same, and "MyFile5_2withdata" is created with an empty string. All I would like to have left is just "Myfile_1withdata" and "Myfile5_1withdata", with the files ending in 1 ("Myfile_1withdata1" and "Myfile5_1withdata1") deleted. – Cactus, Feb 1, 2018
- "#Cleanup by deleting data files with updated data (the ones ending with numbers)" – If all my experience with handling data has taught me anything, it's: don't touch your raw data. Instead of overwriting the original, export it to another directory or something.
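A sketch of that suggestion, as a variant of write_results from the first answer (combined is an arbitrary directory name):

from pathlib import Path

def write_results(combined_data, out_dir=Path('combined')):
    out_dir.mkdir(exist_ok=True)  # keep results separate from the raw data
    for group, data in combined_data:
        filename = 'result_Myfile{}_{}withdata.csv'.format(*group)
        data.to_csv(out_dir / filename)  # the original files stay untouched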