Storing many pandas DataFrames in SQLite with metadata

Question 1

I need some help with designing a database. My aim is to persistently store a number of pandas DataFrames in a searchable way, and from what I've read SQLite is perfect for this task.

Each DataFrame contains about a million rows of particle movement data like this:

 z y x frame particle
0 49.724138 45.642857 813.035714 0 0
3789 14.345679 2820.537500 4245.162500 0 1
3788 10.692308 2819.210526 1646.842105 0 2
3787 34.100000 2817.700000 1375.300000 0 3
3786 8.244898 2819.729167 1047.375000 0 4

Using sqlalchemy I can already store each DataFrame as a table in a new DataBase:

from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("sqlite:////mnt/storage/test.db")
exp1.to_sql("exp1", engine, if_exists="replace")
exp2.to_sql("exp2", engine, if_exists="replace")
exp3.to_sql("exp3", engine, if_exists="replace")

But this is too basic. How can I store each DataFrame/experiment with a couple of metadata fields like Name, Date in such a way that later on it's possible to return all experiments conducted by a certain person, or on a specific date?

I will add more columns over time. Assuming each DataFrame/experiment has a column velocity, how could I retrieve all experiments where the mean temperature value is below or above an arbitrary threshold?

Question 2

You've created 3 separate tables (well 2, pending the apparent typo?). If you want to unify the data, you probably shouldn't be forcibly overwriting target tables with if_exists="replace"

From the .to_sql() documentation

replace: Drop the table before inserting new values.

append: Insert new values to the existing table.

Assuming your similarly named files have the same schema, you can edit the last 3 lines as follows.

exp1.to_sql("exp", engine, if_exists="append")
exp2.to_sql("exp", engine, if_exists="append")
exp3.to_sql("exp", engine, if_exists="append")

This will insert all three datasets to a single table named exp instead of 3 separate tables.

If each csv isn't uniquely identified from the others within itself - for example if exp1.csv looks like this...

Name,Date,Temperature
Peter,2020年01月01日,50
Paul,2020年01月01日,55
Mary,2020年01月01日,53
Jane,2020年01月01日,49

...then you can append the experiment identifier to each dataset as needed in the dataframe. For example by...

>>> exp1['ExpName'] = 'exp1'
>>> exp1
 Name Date Temperature ExpName
0 Peter 2020年01月01日 50 exp1
1 Paul 2020年01月01日 55 exp1
2 Mary 2020年01月01日 53 exp1
3 Jane 2020年01月01日 49 exp1
>>>

...which will allow you to group by experiment in any follow-on SQL you may run against your database.

...how could I retrieve all experiments where the mean temperature value is below or above an arbitrary threshold?

...well given an arbitrary additional two datasets of...

➜ /tmp cat exp2.csv
Name,Date,Temperature
Peter,2020年01月02日,51
Paul,2020年01月02日,56
Mary,2020年01月02日,54
Jane,2020年01月02日,50
➜ /tmp cat exp3.csv
Name,Date,Temperature
Peter,2020年01月02日,52
Paul,2020年01月02日,57
Mary,2020年01月02日,55
Jane,2020年01月02日,51
➜ /tmp

...that you likewise appended the expN identifier to in the dataframe, then you would run the following SQL to retrieve experiments where the average temp was below 53

SELECT 
 ExpName,
 AVG(Temperature)
FROM exp
GROUP BY ExpName
HAVING AVG(Temperature) < 53;

Which I'll leave to you to plug into SQLAlchemy as you like :)

Question 3

Thanks! So there is no good way to store "single field" values? Filling a whole column with the same entry a million times seems sort of unclean. Apart from that this is much easier to work with than what I had before.

Question 4

The "append experiment" line is based on the initial stated premise of multiple csv files and was included in the event that your dataframes do not already uniquely isolate experiments within themselves. Appending values conditionally to a dataframe is likely another question (for SO, not dba.se), but it's also probably an answered question

Question 5

The DataFrames do not have a unique identifier at this time. With pandas, as far as I can see, the only way to add such an identifier is to add a whole column with the same value, like you did. I thought maybe since I'm moving to SQLite now that there might be a better solution, e.g. that it's possible to add "date" or "name" as a single field, as metadata basically.

Question 6

According to this, there is no one perfect way to do it and the answer there actually supports your way of adding a whole column.

score 2 · Accepted Answer · 2020-03-03 17:04:52Z

You've created 3 separate tables (well 2, pending the apparent typo?). If you want to unify the data, you probably shouldn't be forcibly overwriting target tables with if_exists="replace"

From the .to_sql() documentation

replace: Drop the table before inserting new values.

append: Insert new values to the existing table.

Assuming your similarly named files have the same schema, you can edit the last 3 lines as follows.

exp1.to_sql("exp", engine, if_exists="append")
exp2.to_sql("exp", engine, if_exists="append")
exp3.to_sql("exp", engine, if_exists="append")

This will insert all three datasets to a single table named exp instead of 3 separate tables.

If each csv isn't uniquely identified from the others within itself - for example if exp1.csv looks like this...

Name,Date,Temperature
Peter,2020年01月01日,50
Paul,2020年01月01日,55
Mary,2020年01月01日,53
Jane,2020年01月01日,49

...then you can append the experiment identifier to each dataset as needed in the dataframe. For example by...

>>> exp1['ExpName'] = 'exp1'
>>> exp1
 Name Date Temperature ExpName
0 Peter 2020年01月01日 50 exp1
1 Paul 2020年01月01日 55 exp1
2 Mary 2020年01月01日 53 exp1
3 Jane 2020年01月01日 49 exp1
>>>

...which will allow you to group by experiment in any follow-on SQL you may run against your database.

...how could I retrieve all experiments where the mean temperature value is below or above an arbitrary threshold?

...well given an arbitrary additional two datasets of...

➜ /tmp cat exp2.csv
Name,Date,Temperature
Peter,2020年01月02日,51
Paul,2020年01月02日,56
Mary,2020年01月02日,54
Jane,2020年01月02日,50
➜ /tmp cat exp3.csv
Name,Date,Temperature
Peter,2020年01月02日,52
Paul,2020年01月02日,57
Mary,2020年01月02日,55
Jane,2020年01月02日,51
➜ /tmp

...that you likewise appended the expN identifier to in the dataframe, then you would run the following SQL to retrieve experiments where the average temp was below 53

SELECT 
 ExpName,
 AVG(Temperature)
FROM exp
GROUP BY ExpName
HAVING AVG(Temperature) < 53;

Which I'll leave to you to plug into SQLAlchemy as you like :)

Thanks! So there is no good way to store "single field" values? Filling a whole column with the same entry a million times seems sort of unclean. Apart from that this is much easier to work with than what I had before.
The "append experiment" line is based on the initial stated premise of multiple csv files and was included in the event that your dataframes do not already uniquely isolate experiments within themselves. Appending values conditionally to a dataframe is likely another question (for SO, not dba.se), but it's also probably an answered question
The DataFrames do not have a unique identifier at this time. With pandas, as far as I can see, the only way to add such an identifier is to add a whole column with the same value, like you did. I thought maybe since I'm moving to SQLite now that there might be a better solution, e.g. that it's possible to add "date" or "name" as a single field, as metadata basically.
According to this, there is no one perfect way to do it and the answer there actually supports your way of adding a whole column.

Stack Exchange Network

Storing many pandas DataFrames in SQLite with metadata

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Storing many pandas DataFrames in SQLite with metadata

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions