I'm creating indexes on a few tables I've recently created within MySQL.
The first (call below) completed within ~20 minutes and was applied to a table with a little over 20M records:
mysql> alter table fu.myTable add index(date), add index(id), add index(symbol), add index(method), algorithm=inplace, lock=none;
Query OK, 0 rows affected (20 min 20.52 sec)
Records: 0 Duplicates: 0 Warnings: 0
The second, applied to a much larger table consisting of nearly 1.1B rows, is taking considerably longer (~19 hours in at the moment). Sorting is obviously an O(n*log(n)) operation, so for a table 50x the size, by my calcs, processing on the larger table would be expected to take ~22 hours. Call I'm using for it below:
mysql> alter table op.myOtherTable add index(dataDate), add index(ticker), add index(symbol), add index(expDate), algorithm=inplace, lock=none;
With that, I wonder if there are any settings I may be able to adjust to speed things up some given my machine doesn't appear to be overly taxed at this point.
Edit: The dataset consists of ~20 years of historical data and is parsed into daily CSVs (of ~+/-1M rows) to make it a little easier to work with. I've successfully gotten the data into MySQL but now need to apply indexes to make it usable. Fields I will likely search on are those I'm indexing on (dataDate, ticker, symbol, expDate). Additional detail below:
mysql> select table_rows from information_schema.tables where table_name = 'options';
+------------+
| table_rows |
+------------+
| 1097863845 |
+------------+
1 row in set (2.70 sec)
mysql> show create table op.options;
+---------+------------------------------------------------------------+
| Table | Create Table |
+---------+------------------------------------------------------------+
| options | CREATE TABLE `options` (
`dataDate` date NOT NULL,
`ticker` varchar(8) NOT NULL,
`symbol` varchar(30) NOT NULL,
`expDate` date NOT NULL,
`type` char(4) NOT NULL,
`price` double DEFAULT NULL,
`strike` double DEFAULT NULL,
`last` double DEFAULT NULL,
`bid` double DEFAULT NULL,
`ask` double DEFAULT NULL,
`volume` int(11) DEFAULT NULL,
`OI` int(11) DEFAULT NULL,
`IV` double DEFAULT NULL,
`delta` double DEFAULT NULL,
`gamma` double DEFAULT NULL,
`theta` double DEFAULT NULL,
`vega` double DEFAULT NULL,
`aka` varchar(30) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
+----------------------------------------------------------------------+
1 row in set (0.01 sec)
Problem table layout sample:
mysql> select * from op.options limit 10;
+------------+--------+------------------+------------+------+-------+--------+------+------+------+--------+------+--------+---------+--------+---------+--------+------+
| dataDate | ticker | optionSymbol | expDate | type | price | strike | last | bid | ask | volume | OI | IV | delta | gamma | theta | vega | aka |
+------------+--------+------------------+------------+------+-------+--------+------+------+------+--------+------+--------+---------+--------+---------+--------+------+
| 2003年02月26日 | A | A030322C00010000 | 2003年03月22日 | call | 12.8 | 10 | 3.7 | 2.8 || 3.1 | 0 | 66 | 0.8389 | 0.8987 | 0.0653 | -3.8774 | 0.5735 | ACB
| 2003年02月26日 | A | A030322P00010000 | 2003年03月22日 | put | 12.8 | 10 | 0.15 | 0. || 0.2 | 0 | 1436 | 0.8448 | -0.1027 | 0.0654 | -3.8059 | 0.5789 | AOB
| 2003年02月26日 | A | A030322C00012500 | 2003年03月22日 | call | 12.8 | 12.5 | 1.1 | 0.9 || 1.05 | 0 | 6274 | 0.6549 | 0.5914 | 0.1833 | -6.5268 | 1.2571 | ACV
| 2003年02月26日 | A | A030322P00012500 | 2003年03月22日 | put | 12.8 | 12.5 | 0.6 | 0.6 || 0.75 | 75 | 929 | 0.6644 | -0.4089 | 0.1807 | -6.4545 | 1.2573 | AOV
| 2003年02月26日 | A | A030322C00015000 | 2003年03月22日 | call | 12.8 | 15 | 0.15 | 0. || 0.2 | 45 | 903 | 0.5812 | 0.1586 | 0.1286 | -3.5849 | 0.7831 | ACC
| 2003年02月26日 | A | A030322P00015000 | 2003年03月22日 | put | 12.8 | 15 | 2.15 | 2.2 || 2.4 | 0 | 107 | 0.558 | -0.8526 | 0.1276 | -3.0784 | 0.7459 | AOC
| 2003年02月26日 | A | A030322C00017500 | 2003年03月22日 | call | 12.8 | 17.5 | 0.05 | || 0.1 | 0 | 439 | 0.7012 | 0.0474 | 0.0435 | -1.7616 | 0.3198 | ACW
| 2003年02月26日 | A | A030322P00017500 | 2003年03月22日 | put | 12.8 | 17.5 | 5 | 4. || 4.8 | 0 | 215 | 0.558 | -0.9838 | 0.0224 | -0.3395 | 0.1307 | AOW
| 2003年02月26日 | A | A030322C00020000 | 2003年03月22日 | call | 12.8 | 20 | 0.05 | || 0.05 | 0 | 224 | 0.8113 | 0.0193 | 0.0179 | -0.9668 | 0.1519 | ACD
| 2003年02月26日 | A | A030322P00020000 | 2003年03月22日 | put | 12.8 | 20 | 7.1 | 7. || 7.3 | 0 | 6 | 0.558 | -0.999 | 0.0019 | 0.2187 | 0.011 | AOD
+------------+--------+------------------+------------+------+-------+--------+------+------+------+--------+------+--------+---------+--------+---------+--------+------+
10 rows in set (0.44 sec)
2 Answers 2
- Every day you bulk-insert 1M rows to add to a 1B-row table, correct?
- There are several indexes on the table?
- The goal is to add the rows reasonably fast?
- The table is otherwise "readonly", so we are not worrying about writes?
- The table is InnoDB, I hope.
Then...
Leave the indexes in place. The table is growing by only 0.1% each day. It would be faster to incrementally update the indexes than to drop and rebuild them.
- InnoDB indexes are B+Trees.
- Inserting one row, for example, will augment the B+Tree for each index.
- The "Change buffer" delays the actual I/O involved in each index update, thereby making it not as bad as you might imagine.
- Keep
innodb_buffer_pool_size
at a suitable size. (How much RAM do you have. The simple default of 70% of RAM may be close to optimal.) (The Change buffer is in the buffer_pool.)
Schema
For a billion-row table, space is important...
symbol
andticker
? Do you need both? Consider normalizing the pair.DOUBLE
(8 bytes, 16 significant digits) is probably overkill forgamma
, etc. ConsiderFLOAT
(4 bytes, 7 significant digits)- Unless you have some limitations,
volume
can exceed a 32-bitINT
(2B limit) or evenINT UNSIGNED
(4B limit) - I don't see
id
. If it is anAUTO_INCREMENT
you add later, then I suggest you are risking overflow if it isINT
. - A 'good'
PRIMARY KEY
would be(date, symbol_id)
or possibly the opposite order. - What
SELECTs
do you have? If you are fetching "all info for one ticker",(symbol_id, date)
is optimal. If you are talking about "today's data" then the opposite is better. type
could be anENUM
(1 byte)
INDEX(a), INDEX(b)
is not the same as INDEX(a,b)
. The latter is significantly better in some situations. For querying a billion rows, you need to understand the differences. http://mysql.rjweb.org/doc.php/index_cookbook_mysql
-
Believe the above answers your Qs. I've upped
innodb_buffer_pool_size
because I have extra RAM as a 'just in case'--it hasn't seemed to help with indexing. I've also tried adjusting theread_buffer_size
(2x),sort_buffer_size
(2x) andtmp_table_size
(4x) variables without significant effect.Chris– Chris2020年03月15日 05:07:22 +00:00Commented Mar 15, 2020 at 5:07 -
@Chris - Those other 3 buffer_sizes are not relevant to
ADD INDEX
. The buffer_pool_size is important for the change_buffer (incremental updates) and for general reads/writes. The bigger table is ~300GB? And you have less RAM than that?Rick James– Rick James2020年03月15日 05:15:52 +00:00Commented Mar 15, 2020 at 5:15 -
ADD INDEX
mostly works with disk.Rick James– Rick James2020年03月15日 05:17:03 +00:00Commented Mar 15, 2020 at 5:17 -
Table in question is something like that. I have 16GB RAM total, usage is pushing 10 at the moment with the expansion of
innodb_buffer_pool_size
. All of this is being run off an external disk (moved mysql installation due to space constraints on my machine's SSD) where I have plenty of disk space (~2T+)Chris– Chris2020年03月15日 05:22:01 +00:00Commented Mar 15, 2020 at 5:22 -
@Chris - Good. Someday you may need that 2T to do an ALTER that can't be done INPLACE.Rick James– Rick James2020年03月15日 05:27:08 +00:00Commented Mar 15, 2020 at 5:27
If your goal is to replace a table with new data, I suggest the following:
CREATE TABLE new
;- load the csv data
- Cleanse the data (if needed)
ALTER TABLE new ADD INDEX ... ENGINE=InnoDB;
- cleanse the data (if needed); fix datatypes for columns, etc
RENAME TALE real TO old, new TO real;
DROP TABLE old;
Step 6 is virtually instantaneous, so applications can always see the data.
If it takes 22 hours, be sure that it is not still running when tomorrow's copy is being loaded. That is have some sort of lock on the script.
id
is thePRIMARY KEY
, don't add a redundantINDEX(id)
.