
I'm looking for a way to generate a new MySQL table based solely on the contents of a specified CSV file. The CSV files I'll be using have the following properties:

  • "|" delimited.
  • First row specifies the column names (headers), also "|" delimited.
  • Column names & order are not fixed.
  • The number of columns is not fixed.
  • Files are large (~1 million rows, ~50 columns).

In Excel this is all rather simple, but with MySQL it does not appear to be (no luck with Google). Any suggestions on what I should be looking at?

Andriy M
asked Feb 14, 2015 at 20:37

3 Answers


You can use csvsql, which is part of csvkit (a suite of utilities for converting to and working with CSV files):

  • Linux or Mac OS X
  • free and open source
  • sudo pip install csvkit
  • Example: csvsql --dialect mysql --snifflimit 100000 datawithheaders.csv > mytabledef.sql
  • It creates a CREATE TABLE statement based on the file content. Column names are taken from the first line of the CSV file.

To extend on ivansabik's answer using pandas, see How to insert pandas dataframe via mysqldb into database?.

answered Dec 24, 2015 at 19:32
  • csvsql is too slow for a reasonably large file. In my case, a 7.8 MB CSV file took 4+ minutes to finish. Commented Sep 30, 2020 at 20:26

If you're OK with using Python, pandas worked great for me (csvsql hung forever, and my file had fewer columns and rows than in your case). Something like:

from sqlalchemy import create_engine
import pandas as pd

df = pd.read_csv('/PATH/TO/FILE.csv', sep='|')
# Optional: set an index if you want those columns written out as keys
df = df.set_index(['COL A', 'COL B'])
engine = create_engine('mysql://user:pass@host/db', echo=False)
# index=True writes the index columns back out; with no index set, use index=False
df.to_sql('table_name', engine, index=True)
answered Mar 28, 2017 at 4:18
  • Where do you define dwh_engine? Is this a typo and you meant engine? Commented Mar 28, 2017 at 7:00
  • Yes, it should be engine! Corrected the answer, thanks for spotting. Commented Mar 28, 2017 at 21:37
  • to_sql takes up too much time if the number of rows is high. For us, around 36000 rows took around 90 mins. A direct load statement was done in 3 seconds. Commented Dec 3, 2018 at 10:32
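As the last comment notes, `to_sql` can be slow when it emits one INSERT per row. Passing `chunksize` and `method='multi'` batches many rows into each INSERT, which often helps considerably (though a direct LOAD DATA INFILE will still be faster). A minimal sketch, using an in-memory SQLite engine purely for illustration; the table and column names are made up, and for real use you would swap in your MySQL connection string:

```python
import pandas as pd
from sqlalchemy import create_engine

# Sample frame standing in for the parsed CSV contents
df = pd.DataFrame({'col_a': range(1000), 'col_b': ['x'] * 1000})

# SQLite in-memory engine for illustration; replace with
# create_engine('mysql://user:pass@host/db') for MySQL
engine = create_engine('sqlite://')

# chunksize limits rows per round trip; method='multi' packs
# multiple rows into a single multi-values INSERT statement
df.to_sql('my_table', engine, index=False, chunksize=200, method='multi')

print(pd.read_sql('SELECT COUNT(*) AS n FROM my_table', engine)['n'][0])
```

Tune `chunksize` to your column count: each chunk's rows × columns must stay under the database's bound-parameter limit.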

You need to generate a CREATE TABLE statement based on the datatypes, sizes, etc. of the various columns.

Then use LOAD DATA INFILE ... FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 1 LINES ...; (see the manual page for details).

Repeat for each CSV --> table.
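The first step (deriving the CREATE TABLE from the header row) can be scripted. A minimal sketch that declares every column as TEXT, leaving type inference as an exercise; the function name and the demo file contents are assumptions for illustration:

```python
import tempfile

def create_table_sql(table_name, csv_path, delimiter='|'):
    """Build a CREATE TABLE statement from a CSV header row.

    All columns are declared TEXT; refine the types (INT, DECIMAL,
    VARCHAR(n), ...) once you know the data.
    """
    with open(csv_path, newline='') as f:
        header = f.readline().rstrip('\r\n').split(delimiter)
    cols = ',\n  '.join(f'`{name}` TEXT' for name in header)
    return f'CREATE TABLE `{table_name}` (\n  {cols}\n);'

# Demo with a tiny two-column pipe-delimited file
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('id|name\n1|alice\n')
    path = f.name

print(create_table_sql('demo', path))
```

Feed the resulting statement to mysql, then run the LOAD DATA INFILE above against the new table.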

answered Feb 14, 2015 at 23:01
