I need to load 4 million rows of data into a MySQL InnoDB table using LOAD DATA INFILE and would like to know whether there are server configuration options I can tweak to get a faster load.
It took me 15 minutes to load 2 million rows, which I found disappointing for LOAD DATA INFILE. My statement looks like this:
```sql
LOAD DATA LOCAL INFILE 'path/file.csv' INTO TABLE table
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  IGNORE 1 LINES
  (column1, column2, etc);
```
2 Answers
Although LOAD DATA INFILE works against InnoDB, there are many ways InnoDB gets pushed to its limits before swapping sets in and bottlenecks take over.
Here is a Pictorial Representation of InnoDB (from Percona CTO Vadim Tkachenko)
InnoDB Plumbing
The bottlenecks would be going through the following structures:
- InnoDB Buffer Pool
- Transaction Logs (ib_logfile0, ib_logfile1)
- Double Write Buffer
- Insert Buffer
- One Rollback Segment
- Log Buffer
Here are some of my past posts where I discuss LOAD DATA INFILE with InnoDB
- Feb 06, 2012: LOAD DATA (400k rows) INFILE takes about 7 minutes, cannot kill the "logging slow query" process?
- Jan 11, 2013: MySQL LOAD DATA INFILE slows by 80% after a few gigs of input with InnoDB engine
- Jan 12, 2013: What does 'system lock' mean in mysql profiling a LOAD DATA INFILE statement?
SUGGESTION #1
Break up the file into 20 smaller files. Instead of one LOAD DATA INFILE against a 2 million row file, perform 20 LOAD DATA INFILE statements against 20 files, each with 100 thousand rows.
The Benefit: less pressure on the InnoDB plumbing at any one time.
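The splitting step can be sketched in shell. This is a hypothetical, scaled-down example (GNU coreutils assumed; file names, column names, and the `yourdb` database are placeholders): it generates a small sample CSV, strips the header row (the job the IGNORE 1 LINES clause was doing), and splits the body into fixed-size chunks. Against the real 2 million row export, the chunk size would be 100000.

```shell
printf 'column1,column2\n' > file.csv                  # header row
seq 1 1000 | awk '{print $1 "," $1 * 2}' >> file.csv   # 1,000 sample data rows
tail -n +2 file.csv > body.csv                         # drop the header
split -l 100 -d body.csv chunk_                        # -> chunk_00 .. chunk_09
# Each chunk then gets its own load, with IGNORE 1 LINES removed
# since the header is already gone, e.g.:
#   for f in chunk_*; do
#     mysql -e "LOAD DATA LOCAL INFILE '$f' INTO TABLE table
#               FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
#               (column1, column2)" yourdb
#   done
```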
SUGGESTION #2 (Optional)
- Increase the Log Buffer (innodb_log_buffer_size = 256M)
- Increase the Write Threads (innodb_write_io_threads = 16)
- Increase the InnoDB Buffer Pool Size (innodb_buffer_pool_size)
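As a sketch, those settings would land in my.cnf roughly as follows. The buffer pool value here is an assumption, not a recommendation: size it to your working set and available RAM (a common rule of thumb is up to ~70% of RAM on a dedicated database server).

```ini
[mysqld]
# Illustrative values only; tune to your hardware.
innodb_buffer_pool_size = 4G      # assumption: scale to available RAM
innodb_log_buffer_size  = 256M
innodb_write_io_threads = 16
```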
I'll bet that you are currently I/O bound. This means that nothing can speed it up. (And Rolando's suggestions may be futile.)
Let's look deeper. Is this LOAD a recurring task? If so, how often? Is everything blocked waiting for the table to be reloaded? Simple solution: load into a different table, then do a double RENAME TABLE to swap it in. Only milliseconds of downtime.
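A sketch of that swap, with `table_new` and `table_old` as hypothetical staging names. RENAME TABLE performs both renames in one atomic operation, so queries never see a missing table:

```sql
CREATE TABLE table_new LIKE `table`;       -- empty copy with the same schema

LOAD DATA LOCAL INFILE 'path/file.csv' INTO TABLE table_new
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  IGNORE 1 LINES (column1, column2, etc);

RENAME TABLE `table` TO table_old,         -- atomic double rename
             table_new TO `table`;

DROP TABLE table_old;                      -- discard the stale copy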
Is the data coming from another machine? Use the network for the "input" side of the LOAD rather than having the one disk fighting for reads versus writes.
Do you have a lot of indexes? There are several directions to take this question. Let's see SHOW CREATE TABLE before barking up these tree(s).
Does the entire load need to be a single transaction? Multiple transactions may be faster because of not overflowing the log file. (I've seen 2x.)