I'm looking for a way to generate a new MySQL table based solely on the contents of a given CSV file. The CSV files I'll be using have the following properties:
- "|" delimited.
- First row specifies the column names (headers), also "|" delimited.
- Column names & order are not fixed.
- The number of columns is not fixed.
- Files are of a large size (1 mil rows / 50 columns).
In Excel this is all rather simple, but with MySQL it does not appear to be (no luck with Google). Any suggestions on what I should be looking at?
3 Answers
You can use csvsql, which is part of csvkit
(a suite of utilities for converting to and working with CSV files):
- Linux or Mac OS X
- free and open source

Install it with:

sudo pip install csvkit

Example:

csvsql --dialect mysql --snifflimit 100000 datawithheaders.csv > mytabledef.sql

Since your files are pipe-delimited, you will likely also need to pass -d '|'. The command creates a CREATE TABLE statement based on the file contents. Column names are taken from the first line of the CSV file.
To extend on ivansabik's answer using pandas, see How to insert pandas dataframe via mysqldb into database?.
- csvsql is too slow for a reasonably large file. In my case, a 7.8M CSV file takes 4+ minutes to finish. – Gang Liang, Sep 30, 2020 at 20:26
If you're OK with using Python, pandas worked great for me (csvsql hung forever in my case, with fewer columns and rows than yours). Something like:

from sqlalchemy import create_engine
import pandas as pd

df = pd.read_csv('/PATH/TO/FILE.csv', sep='|')

# Optional: set your indexes to get primary keys
# (if you do this, pass index=True to to_sql below so the columns are kept)
df = df.set_index(['COL A', 'COL B'])

engine = create_engine('mysql://user:pass@host/db', echo=False)
df.to_sql('table_name', engine, index=False)
- Where do you define dwh_engine? Is this a typo and you meant engine? – joanolo, Mar 28, 2017 at 7:00
- Yes, it should be engine! Corrected the answer, thanks for spotting. – ivansabik, Mar 28, 2017 at 21:37
- to_sql takes up too much time if the number of rows is high. For us, around 36000 rows took around 90 mins. A direct load statement was done in 3 seconds. – mvinayakam, Dec 3, 2018 at 10:32
You need to generate a CREATE TABLE statement based on the datatypes, sizes, etc. of the various columns.

Then you use LOAD DATA INFILE ... FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 1 LINES ...; (see the manual page for details).

Do likewise for each CSV file --> table.