Hope you can help me out with this one, because it is really slow. Is there a way to do this without loading the whole .csv into memory?
The thing is... I have files containing timeseries data with 10 columns. The first column is a datetime, the last an integer, and the rest are floats.
I am trying to join pairs of .csv files together. The filenames are:
- Myfile_1withdata
- Myfile_1withdata1
- Myfile_2withdata
- Myfile_2withdata1
- Myfile5_1withdata
- Myfile5_1withdata1
etc...
The files with a "1" at the end are the new files containing updated data that I want to append to the files without the "1", like "Myfile5_1withdata.csv".
Files can weigh up to 500MB, there are many of them, and it takes a long time to finish this process... Can it be faster?
Currently I have tried to accomplish this by doing:
import inspect
import pandas as pd
import glob, os

currentpath = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))

type_names = {'1withdata': "super", '2withdata': "extra"}
file_names = ["Myfile", "Myfile5"]

for a in file_names:
    for x in type_names.keys():
        results = pd.DataFrame([])
        for counter, file in enumerate(glob.glob(str(a)+'_'+str(x)+"*")):
            namedf = pd.read_csv(file, index_col=0, skiprows=0, dtype=str, usecols=[0,1,2,3,4,5,6,7,8,9], float_precision='high')
            results = results.append(namedf)  # Dataframe with data of all file_names files with the same type_names key
        print("saving: ", a, x)
        results = results[~results.index.duplicated(keep='last')]  # Remove duplicate row (last row with incomplete timeseries data)
        results.to_csv(a+'_'+str(x)+'.csv')

print("DONE!")

# Cleanup by deleting data files with updated data (the ones ending with numbers)
files = [file for file in glob.glob(currentpath+"//*.csv") if file[-5:-4].isdigit() == True]
for file in files:
    os.remove(file)
Edit 1:
Here's an example of the data inside the files:
Myfile_1withdata.csv
The_Date_Time,Float1,Float2,Float3,Float4,Float5,Float6,Float7,Float8,Integer
31/10/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
30/11/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
31/12/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
31/01/2002 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
28/02/2002 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
Myfile_1withdata1.csv
The_Date_Time,Float1,Float2,Float3,Float4,Float5,Float6,Float7,Float8,Integer
28/02/2002 22:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/03/2002 22:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
30/04/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/05/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
30/06/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/07/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
So after the whole operation the Myfile_1withdata.csv should look like:
The_Date_Time,Float1,Float2,Float3,Float4,Float5,Float6,Float7,Float8,Integer
31/10/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
30/11/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
31/12/2001 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
31/01/2002 22:00:00,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,1.11111,123456
28/02/2002 22:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/03/2002 22:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
30/04/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/05/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
30/06/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
31/07/2002 21:00:00,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,1.22222,678910
2 Answers
It is better to divide the problem into separate parts. Here that's:
- finding out what files need to be combined
- combining the data
- writing the output
The easiest way to do this is a chain of generators, akin to how input and output are piped between different unix commands (the full pipeline is wired together at the end).
current directory
For finding files, pathlib.Path is the easiest way in most cases
from pathlib import Path
data_dir = Path('.')
searching which files need to be combined:
re and itertools.groupby to the rescue
import re
import itertools
my_pattern = re.compile(r'Myfile(\d*)_(\d*)withdata(\d*)')
def find_files(data_dir):
    for file in data_dir.glob('*.csv'):
        yield my_pattern.match(file.name).groups(), file

def group_files(files):
    sorted_files = sorted(files)
    for group, data in itertools.groupby(sorted_files, key=lambda x: x[0][:2]):
        yield group, list(data)
This groups and sorts the data according to the numbers present
file_data = [
    (('', '2', ''), 'file0'),
    (('', '2', '1'), 'file1'),
    (('', '1', ''), 'file2'),
    (('', '1', '1'), 'file3'),
    (('1', '2', '1'), 'file4'),
    (('1', '2', '1'), 'file5'),
]

list(group_files(file_data))
[(('', '1'), [(('', '1', ''), 'file2'), (('', '1', '1'), 'file3')]),
 (('', '2'), [(('', '2', ''), 'file0'), (('', '2', '1'), 'file1')]),
 (('1', '2'), [(('1', '2', '1'), 'file4'), (('1', '2', '1'), 'file5')])]
Combining the data
def read_file(file):
    return pd.read_csv(file, index_col=0, skiprows=0, dtype=str, usecols=range(10), float_precision='high')
you can use DataFrame.update to update the info
def combine_files(grouped_files):
    for group, data in grouped_files:
        master_data_file = data.pop(0)[1]  # The one without a suffix sorts first
        master_data = read_file(master_data_file)
        for info, file in data:
            update_data = read_file(file)  # don't shadow the `data` list we're iterating over
            master_data.update(update_data, overwrite=True)  # update works in place and returns None
        yield group, master_data
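(For reference, a minimal sketch of what DataFrame.update does, on toy data rather than the question's files: it overwrites values for index labels present in both frames, works in place, and returns None; it does not append labels that only exist in the update frame, which matters for the comments below.)

import pandas as pd

old = pd.DataFrame({'value': [1.0, 2.0]}, index=['2002-01-31', '2002-02-28'])
new = pd.DataFrame({'value': [9.0, 3.0]}, index=['2002-02-28', '2002-03-31'])

old.update(new)  # modifies `old` in place and returns None
print(old)       # '2002-02-28' becomes 9.0, but '2002-03-31' is NOT added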
alternative combine
If, as @Mathias Ettinger states, update doesn't add new keys, you can try something like this:
def combine_files(grouped_files):
    for group, data in grouped_files:
        dataframes = (read_file(filename) for detail, filename in data)
        result = pd.concat(dataframes).drop_duplicates(subset=<important_columns or index>, keep='last')
        yield group, result
This might mess with the column order. If that is a problem, you'll have to reindex the result with the wanted column order.
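A minimal sketch of that reindex, where wanted_columns is a hypothetical variable holding the column order of the original (suffix-less) file:

result = result.reindex(columns=wanted_columns)  # restore the original column order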
write the results
def write_results(combined_data):
    for group, data in combined_data:
        filename = 'result_Myfile{}_{}withdata.csv'.format(*group)
        data.to_csv(filename)
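Wiring it all together, the whole job then becomes a single generator pipeline (using the functions defined above):

from pathlib import Path

data_dir = Path('.')
write_results(combine_files(group_files(find_files(data_dir))))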
- If I understand the documentation of DataFrame.update correctly, this won't add new timestamps to the original data, only update existing ones, right? – 301_Moved_Permanently, Feb 1, 2018
- I have not used that function for this purpose, and without sample data it's hard to test. "overwrite : boolean, default True – If True then overwrite values for common keys in the calling frame" seems to suggest you can overwrite. If not, this part of the function should be replaced by a concat and drop_duplicates. – Maarten Fabré, Feb 1, 2018
- Yes, you can overwrite, it seems to be the point of the function, but it also seems you can't extend the original dataframe with new indexes; which, to me, would defeat the purpose of the script. – 301_Moved_Permanently, Feb 1, 2018
- Really sorry, but I'm having a hard time using your functions correctly. Just trying to list the files gives an error "AttributeError: 'str' object has no attribute 'glob'". I tried print([x for x in find_files(os.path.dirname(os.path.realpath(sys.argv[0])))]), and when using from pathlib import Path; data_dir = Path('.'); print([x for x in find_files(data_dir)]) it returns an empty list "[]". – Cactus, Feb 1, 2018
- Path('.') normally points to the working directory from where you launched the script. If you need some other directory, you'll need to adapt this path. You can see where it points exactly by doing print(data_dir.resolve()). Instead of [x for x in <generator>] you can use list(<generator>). You can test the glob by doing list(p.glob(<pattern>)). You might have to adapt the pattern, possibly by adding **/ to it to make the glob recursive. – Maarten Fabré, Feb 1, 2018
Your method of retrieving the directory of the script is unusual; you may want to use the __file__ magic variable, populated when importing or running your script, instead of relying on the inspect module.
The use of glob.glob in a for loop seems like a waste of resources; you may want to consider glob.iglob instead.
You also happen to not use the values of the type_names dictionary, so why bother storing them? I would also turn the script into a function that accepts both kinds of names as parameters, to ease reusability and testing.
Lastly, you use the currentpath variable when removing files but not when combining them; you may want to fix that.
import os
import glob

import pandas as pd

CURRENT_PATH = os.path.dirname(os.path.abspath(__file__))

def combine_csv(prefix):
    result = pd.DataFrame([])
    for file in glob.iglob(prefix + '*'):
        namedf = pd.read_csv(file, index_col=0, skiprows=0, dtype=str, usecols=range(10), float_precision='high')
        result = result.append(namedf)
    result = result[~result.index.duplicated(keep='last')]
    result.to_csv(prefix + '.csv')

def main(file_names=('MyFile', 'MyFile5'), type_names=('1withdata', '2withdata')):
    for name in file_names:
        for kind in type_names:
            file_prefix = '{}_{}'.format(name, kind)
            combine_csv(os.path.join(CURRENT_PATH, file_prefix))
            print('Saved', file_prefix + '.csv')
    for file in glob.iglob(os.path.join(CURRENT_PATH, '*.csv')):
        if os.path.splitext(file)[0][-1].isdigit():
            os.remove(file)

if __name__ == '__main__':
    main()
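As for the "without loading the whole .csv into memory" part of the question: both versions above still read each group of files fully. If the update files only ever contain newer rows, a rough sketch of a chunked append is possible (note this does not handle the duplicated boundary row, which would still need separate treatment):

import pandas as pd

def append_in_chunks(master_path, update_path, chunksize=100_000):
    # Stream the update file onto the master in pieces instead of
    # concatenating whole DataFrames in memory.
    for chunk in pd.read_csv(update_path, chunksize=chunksize, dtype=str):
        chunk.to_csv(master_path, mode='a', header=False, index=False)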
- Tried your code with 4 files: "Myfile_1withdata" / "Myfile_1withdata1" and "Myfile5_1withdata" / "Myfile5_1withdata1". The data in "Myfile" joined correctly in the resulting .csv "Myfile_1withdata", but there's also another csv created, "MyFile_2withdata", with nothing but an empty string in it. The "Myfile5" data did not merge... "Myfile5_1withdata" is the same, and "MyFile5_2withdata" is created with an empty string. All I would like to have left is just "Myfile_1withdata" and "Myfile5_1withdata", with the files ending in 1 ("Myfile_1withdata1" and "Myfile5_1withdata1") deleted. – Cactus, Feb 1, 2018
- "#Cleanup by deleting data files with updated data (the ones ending with numbers)" – If all my experience with handling data has taught me anything, it's: don't touch your raw data. Instead of overwriting the original, export it to another directory or something.
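A sketch of that suggestion, as a variant of write_results from the first answer (combined is an arbitrary directory name):

from pathlib import Path

def write_results(combined_data, out_dir=Path('combined')):
    out_dir.mkdir(exist_ok=True)  # keep results separate from the raw data
    for group, data in combined_data:
        filename = 'result_Myfile{}_{}withdata.csv'.format(*group)
        data.to_csv(out_dir / filename)  # the original files stay untouched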