I have several pd.DataFrames that I'd like to write to an HDF store by passing them to a function. Is there a way to programmatically generate key names from the variable name of each DataFrame?
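from sklearn import datasets
import pandas as pd

df1 = pd.DataFrame(datasets.load_iris().data)
# Note: load_boston was removed in scikit-learn 1.2; use another dataset on newer versions.
df2 = pd.DataFrame(datasets.load_boston().data)

def save_to_hdf(df1):
    with pd.HDFStore('test.h5') as store:
        # The key 'df1' is hard-coded here; the goal is to derive it automatically.
        store.put('df1', df1)

save_to_hdf(df1)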
asked Feb 17, 2018 at 1:48
anon01
1 Answer
Do it the way np.savez() does, by passing each DataFrame as a keyword argument:
def save_to_hdf(filename, **kwargs):
    with pd.HDFStore(filename) as store:
        for name, df in kwargs.items():
            store.put(name, df)

save_to_hdf('test.h5', df1=df1, another_name=df2)
This is more efficient: the file is opened only once, no matter how many DataFrames you write. You can also use key names that differ from the variable names.
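To see why the key names need not match the variable names, here is a minimal, dependency-free sketch. `show_keys` is a hypothetical stand-in for `save_to_hdf` that returns the name/object pairs it would store instead of writing a file, and the lists are placeholders for DataFrames.

```python
def show_keys(filename, **kwargs):
    # Hypothetical stand-in for save_to_hdf: instead of writing to a file,
    # return the (key, object) pairs that would be passed to store.put().
    return [(name, obj) for name, obj in kwargs.items()]

# Placeholder objects; in the real code these would be DataFrames.
iris = [5.1, 3.5, 1.4, 0.2]
boston = [0.00632, 18.0, 2.31]

pairs = show_keys('test.h5', df1=iris, another_name=boston)
print([name for name, _ in pairs])  # ['df1', 'another_name']
```

The keys come from the keyword names chosen at the call site, not from the variables `iris` and `boston`.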
You can avoid having to name the variables twice by using a dict:
dfs = {
    'iris': pd.DataFrame(datasets.load_iris().data),
    'boston': pd.DataFrame(datasets.load_boston().data),
}

save_to_hdf('test.h5', **dfs)
answered Feb 17, 2018 at 2:01
John Zwinck
3 Comments
John Zwinck
@ConfusinglyCuriousTheThird: Programmatically creating names in files based on variables in Python is a very bad idea. Any Python programmer would be surprised to see such behavior, and you should drop the idea in favor of something more clear and explicit, like the above.
anon01
I agree. My question is, I suppose: is there a better alternative to maintaining a list of strings corresponding to variables of the same name?
John Zwinck
@ConfusinglyCuriousTheThird: I've added to my answer to show how you can avoid duplicate names in your code. Just store the data all together from the beginning.
df1, both the name, and the actual DataFrame? It's a tiny bit more work, but it makes things much clearer. Or use a dict, like {'df1': df1, 'df2': df2}, and iterate over the items. It's also more flexible.
globals()['df1'] to get the relevant DataFrame, but I wouldn't recommend it.
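For completeness, the name-based lookup mentioned in the comment above could be sketched roughly as follows. `collect_by_name` is a hypothetical helper, not part of pandas. The sketch passes an explicit dict as the namespace; at module level you could pass `globals()` instead, though the answer and comments advise against relying on variable names this way.

```python
def collect_by_name(names, namespace):
    """Build a {name: object} dict by looking each name up in a namespace."""
    missing = [n for n in names if n not in namespace]
    if missing:
        # Fail loudly instead of silently skipping a misspelled name.
        raise KeyError(f"names not found in namespace: {missing}")
    return {n: namespace[n] for n in names}

# Placeholder namespace; at module level this could be globals().
# The lists stand in for DataFrames to keep the sketch dependency-free.
namespace = {'df1': [5.1, 3.5], 'df2': [0.00632, 18.0], 'unrelated': 42}

frames = collect_by_name(['df1', 'df2'], namespace)
print(sorted(frames))  # ['df1', 'df2']
```

The resulting dict can then be unpacked into the keyword-argument version, as in `save_to_hdf('test.h5', **frames)`. Building the dict directly from the start avoids the lookup entirely.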