Significant processing time creating indexes on a large MySQL table

Question 1

I'm creating indexes on a few tables I've recently created within MySQL.

The first (call below) completed within ~20 minutes and was applied to a table with a little over 20M records:

mysql> alter table fu.myTable add index(date), add index(id), add index(symbol), add index(method), algorithm=inplace, lock=none;
Query OK, 0 rows affected (20 min 20.52 sec)
Records: 0 Duplicates: 0 Warnings: 0

The second, applied to a much larger table consisting of nearly 1.1B rows, is taking considerably longer (~19 hours in at the moment). Sorting is obviously an O(n*log(n)) operation, so for a table 50x the size, by my calcs, processing on the larger table would be expected to take ~22 hours. Call I'm using for it below:

mysql> alter table op.myOtherTable add index(dataDate), add index(ticker), add index(symbol), add index(expDate), algorithm=inplace, lock=none;

With that, I wonder if there are any settings I may be able to adjust to speed things up some given my machine doesn't appear to be overly taxed at this point.

Edit: The dataset consists of ~20 years of historical data and is parsed into daily CSVs (of ~+/-1M rows) to make it a little easier to work with. I've successfully gotten the data into MySQL but now need to apply indexes to make it usable. Fields I will likely search on are those I'm indexing on (dataDate, ticker, symbol, expDate). Additional detail below:

mysql> select table_rows from information_schema.tables where table_name = 'options';
+------------+
| table_rows |
+------------+
| 1097863845 |
+------------+
1 row in set (2.70 sec)
mysql> show create table op.options;
+---------+------------------------------------------------------------+
| Table | Create Table |
+---------+------------------------------------------------------------+
| options | CREATE TABLE `options` (
 `dataDate` date NOT NULL,
 `ticker` varchar(8) NOT NULL,
 `symbol` varchar(30) NOT NULL,
 `expDate` date NOT NULL,
 `type` char(4) NOT NULL,
 `price` double DEFAULT NULL,
 `strike` double DEFAULT NULL,
 `last` double DEFAULT NULL,
 `bid` double DEFAULT NULL,
 `ask` double DEFAULT NULL,
 `volume` int(11) DEFAULT NULL,
 `OI` int(11) DEFAULT NULL,
 `IV` double DEFAULT NULL,
 `delta` double DEFAULT NULL,
 `gamma` double DEFAULT NULL,
 `theta` double DEFAULT NULL,
 `vega` double DEFAULT NULL,
 `aka` varchar(30) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+----------------------------------------------------------------------+
1 row in set (0.01 sec)

Problem table layout sample:

mysql> select * from op.options limit 10;
+------------+--------+------------------+------------+------+-------+--------+------+------+------+--------+------+--------+---------+--------+---------+--------+------+
| dataDate | ticker | optionSymbol | expDate | type | price | strike | last | bid | ask | volume | OI | IV | delta | gamma | theta | vega | aka |
+------------+--------+------------------+------------+------+-------+--------+------+------+------+--------+------+--------+---------+--------+---------+--------+------+
| 2003年02月26日 | A | A030322C00010000 | 2003年03月22日 | call | 12.8 | 10 | 3.7 | 2.8 || 3.1 | 0 | 66 | 0.8389 | 0.8987 | 0.0653 | -3.8774 | 0.5735 | ACB
| 2003年02月26日 | A | A030322P00010000 | 2003年03月22日 | put | 12.8 | 10 | 0.15 | 0. || 0.2 | 0 | 1436 | 0.8448 | -0.1027 | 0.0654 | -3.8059 | 0.5789 | AOB
| 2003年02月26日 | A | A030322C00012500 | 2003年03月22日 | call | 12.8 | 12.5 | 1.1 | 0.9 || 1.05 | 0 | 6274 | 0.6549 | 0.5914 | 0.1833 | -6.5268 | 1.2571 | ACV
| 2003年02月26日 | A | A030322P00012500 | 2003年03月22日 | put | 12.8 | 12.5 | 0.6 | 0.6 || 0.75 | 75 | 929 | 0.6644 | -0.4089 | 0.1807 | -6.4545 | 1.2573 | AOV
| 2003年02月26日 | A | A030322C00015000 | 2003年03月22日 | call | 12.8 | 15 | 0.15 | 0. || 0.2 | 45 | 903 | 0.5812 | 0.1586 | 0.1286 | -3.5849 | 0.7831 | ACC
| 2003年02月26日 | A | A030322P00015000 | 2003年03月22日 | put | 12.8 | 15 | 2.15 | 2.2 || 2.4 | 0 | 107 | 0.558 | -0.8526 | 0.1276 | -3.0784 | 0.7459 | AOC
| 2003年02月26日 | A | A030322C00017500 | 2003年03月22日 | call | 12.8 | 17.5 | 0.05 | || 0.1 | 0 | 439 | 0.7012 | 0.0474 | 0.0435 | -1.7616 | 0.3198 | ACW
| 2003年02月26日 | A | A030322P00017500 | 2003年03月22日 | put | 12.8 | 17.5 | 5 | 4. || 4.8 | 0 | 215 | 0.558 | -0.9838 | 0.0224 | -0.3395 | 0.1307 | AOW
| 2003年02月26日 | A | A030322C00020000 | 2003年03月22日 | call | 12.8 | 20 | 0.05 | || 0.05 | 0 | 224 | 0.8113 | 0.0193 | 0.0179 | -0.9668 | 0.1519 | ACD
| 2003年02月26日 | A | A030322P00020000 | 2003年03月22日 | put | 12.8 | 20 | 7.1 | 7. || 7.3 | 0 | 6 | 0.558 | -0.999 | 0.0019 | 0.2187 | 0.011 | AOD
+------------+--------+------------------+------------+------+-------+--------+------+------+------+--------+------+--------+---------+--------+---------+--------+------+
10 rows in set (0.44 sec)

Question 2

If id is the PRIMARY KEY, don't add a redundant INDEX(id).

Question 3

@RickJames, I created the table without indexes or keys to begin with because loading the data (housed in separate daily CSVs) required a reindex with completion of each CSV. Adding indexes now that all data is in the table

Question 4

So, the table was unusable during the load?

Question 5

@RickJames, Unusable? It was just taking for ever. The dataset consists of 20 years of daily data, with about 1M records per day (~4400 files). Creating the table with indexes to start required a reindex with the load of each file; loading absent indexes and then applying indexes to the entire dataset at once seemed preferable.

Question 6

Without indexes, you can't really use a multi-million row table for anything other than full table scans.

Question 7

Every day you bulk-insert 1M rows to add to a 1B-row table, correct?
There are several indexes on the table?
The goal is to add the rows reasonably fast?
The table is otherwise "readonly", so we are not worrying about writes?
The table is InnoDB, I hope.

Then...

Leave the indexes in place. The table is growing by only 0.1% each day. It would be faster to incrementally update the indexes than to drop and rebuild them.

InnoDB indexes are B+Trees.
Inserting one row, for example, will augment the B+Tree for each index.
The "Change buffer" delays the actual I/O involved in each index update, thereby making it not as bad as you might imagine.
Keep innodb_buffer_pool_size at a suitable size. (How much RAM do you have. The simple default of 70% of RAM may be close to optimal.) (The Change buffer is in the buffer_pool.)

Schema

For a billion-row table, space is important...

symbol and ticker? Do you need both? Consider normalizing the pair.
DOUBLE (8 bytes, 16 significant digits) is probably overkill for gamma, etc. Consider FLOAT (4 bytes, 7 significant digits)
Unless you have some limitations, volume can exceed a 32-bit INT (2B limit) or even INT UNSIGNED (4B limit)
I don't see id. If it is an AUTO_INCREMENT you add later, then I suggest you are risking overflow if it is INT.
A 'good' PRIMARY KEY would be (date, symbol_id) or possibly the opposite order.
What SELECTs do you have? If you are fetching "all info for one ticker", (symbol_id, date) is optimal. If you are talking about "today's data" then the opposite is better.
type could be an ENUM (1 byte)

INDEX(a), INDEX(b) is not the same as INDEX(a,b). The latter is significantly better in some situations. For querying a billion rows, you need to understand the differences. http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Question 8

Believe the above answers your Qs. I've upped innodb_buffer_pool_size because I have extra RAM as a 'just in case'--it hasn't seemed to help with indexing. I've also tried adjusting the read_buffer_size (2x), sort_buffer_size (2x) and tmp_table_size (4x) variables without significant effect.

Question 9

@Chris - Those other 3 buffer_sizes are not relevant to ADD INDEX. The buffer_pool_size is important for the change_buffer (incremental updates) and for general reads/writes. The bigger table is ~300GB? And you have less RAM than that?

Question 10

ADD INDEX mostly works with disk.

Question 11

Table in question is something like that. I have 16GB RAM total, usage is pushing 10 at the moment with the expansion of innodb_buffer_pool_size. All of this is being run off an external disk (moved mysql installation due to space constraints on my machine's SSD) where I have plenty of disk space (~2T+)

Question 12

@Chris - Good. Someday you may need that 2T to do an ALTER that can't be done INPLACE.

Question 13

If your goal is to replace a table with new data, I suggest the following:

CREATE TABLE new;
load the csv data
Cleanse the data (if needed)
ALTER TABLE new ADD INDEX ... ENGINE=InnoDB;
cleanse the data (if needed); fix datatypes for columns, etc
RENAME TALE real TO old, new TO real;
DROP TABLE old;

Step 6 is virtually instantaneous, so applications can always see the data.

If it takes 22 hours, be sure that it is not still running when tomorrow's copy is being loaded. That is have some sort of lock on the script.

Rick James Rick James 80.7k5 gold badges52 silver badges119 bronze badges · Answer 1 · 2020-03-15 04:53:09Z

Every day you bulk-insert 1M rows to add to a 1B-row table, correct?
There are several indexes on the table?
The goal is to add the rows reasonably fast?
The table is otherwise "readonly", so we are not worrying about writes?
The table is InnoDB, I hope.

Then...

Leave the indexes in place. The table is growing by only 0.1% each day. It would be faster to incrementally update the indexes than to drop and rebuild them.

InnoDB indexes are B+Trees.
Inserting one row, for example, will augment the B+Tree for each index.
The "Change buffer" delays the actual I/O involved in each index update, thereby making it not as bad as you might imagine.
Keep innodb_buffer_pool_size at a suitable size. (How much RAM do you have. The simple default of 70% of RAM may be close to optimal.) (The Change buffer is in the buffer_pool.)

Schema

For a billion-row table, space is important...

symbol and ticker? Do you need both? Consider normalizing the pair.
DOUBLE (8 bytes, 16 significant digits) is probably overkill for gamma, etc. Consider FLOAT (4 bytes, 7 significant digits)
Unless you have some limitations, volume can exceed a 32-bit INT (2B limit) or even INT UNSIGNED (4B limit)
I don't see id. If it is an AUTO_INCREMENT you add later, then I suggest you are risking overflow if it is INT.
A 'good' PRIMARY KEY would be (date, symbol_id) or possibly the opposite order.
What SELECTs do you have? If you are fetching "all info for one ticker", (symbol_id, date) is optimal. If you are talking about "today's data" then the opposite is better.
type could be an ENUM (1 byte)

INDEX(a), INDEX(b) is not the same as INDEX(a,b). The latter is significantly better in some situations. For querying a billion rows, you need to understand the differences. http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Believe the above answers your Qs. I've upped innodb_buffer_pool_size because I have extra RAM as a 'just in case'--it hasn't seemed to help with indexing. I've also tried adjusting the read_buffer_size (2x), sort_buffer_size (2x) and tmp_table_size (4x) variables without significant effect.
@Chris - Those other 3 buffer_sizes are not relevant to ADD INDEX. The buffer_pool_size is important for the change_buffer (incremental updates) and for general reads/writes. The bigger table is ~300GB? And you have less RAM than that?
Table in question is something like that. I have 16GB RAM total, usage is pushing 10 at the moment with the expansion of innodb_buffer_pool_size. All of this is being run off an external disk (moved mysql installation due to space constraints on my machine's SSD) where I have plenty of disk space (~2T+)
@Chris - Good. Someday you may need that 2T to do an ALTER that can't be done INPLACE.

Rick James Rick James 80.7k5 gold badges52 silver badges119 bronze badges · Answer 2 · 2020-03-15 04:36:03Z

If your goal is to replace a table with new data, I suggest the following:

CREATE TABLE new;
load the csv data
Cleanse the data (if needed)
ALTER TABLE new ADD INDEX ... ENGINE=InnoDB;
cleanse the data (if needed); fix datatypes for columns, etc
RENAME TALE real TO old, new TO real;
DROP TABLE old;

Step 6 is virtually instantaneous, so applications can always see the data.

If it takes 22 hours, be sure that it is not still running when tomorrow's copy is being loaded. That is have some sort of lock on the script.

Stack Exchange Network

Significant processing time creating indexes on a large MySQL table

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Significant processing time creating indexes on a large MySQL table

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions