
I have a large table with a timestamp column; it holds a few million rows.

Whenever I run a range-filtered query on the timestamp column of this table, it takes a considerable amount of time (2+ seconds), depending on the size of the range. The column has an index, which leads me to believe the query should return faster.

I noticed that by dropping the index and re-adding it, the query performs easily 40% better.

Below is a simple example of how the table is structured:

create table foo (
 id int(11) not null auto_increment,
 fk_id int(11) not null,
 datecolumn timestamp,
 primary key (id, fk_id),
 key idx_timestamp (datecolumn),
 constraint fk_fk_id foreign key (fk_id) references bla (id)
) engine = InnoDB;

Any reason why re-creating the index makes the query run faster, and how can I make this part of a maintenance routine? I understand that MySQL's default InnoDB index type is a B-tree and that it can get fragmented, but there should be a way to optimize or reorganize it so that query performance is not so bad.

Note: I have run

optimize table foo;
flush tables;

And these didn't seem to solve anything related to the index.
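For reference, the drop-and-recreate that gives the improvement amounts to something like this (using the idx_timestamp name from the table definition above):

alter table foo drop index idx_timestamp;
alter table foo add index idx_timestamp (datecolumn);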

When I use EXPLAIN, it shows some indexes being used, but mainly it shows the access type range, and under that plan a very large number of rows is examined to return just a few of them.

fk_id is part of the WHERE clause (it needs to be). It is also part of the composite primary key, which means the results are deterministic.

The query is basically this:

select foo.id, foo.fk_id, foo.datecolumn, bar.id, bar.name, bar.type
 from foo
 join bar on bar.id = foo.fk_id
 where foo.datecolumn between '2016-01-01 05:00:00' and '2017-01-01 04:59:59'
 and foo.fk_id in (1, 2, 4);

Even when the query is a simple SELECT with only the datecolumn BETWEEN range, I get the same result.
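For illustration, the stripped-down variant I mean would be something like:

explain
select id, fk_id, datecolumn
 from foo
 where datecolumn between '2016-01-01 05:00:00' and '2017-01-01 04:59:59';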

asked Nov 2, 2016 at 21:20
  • When you use EXPLAIN, does it show that the index is being used? There is a good chance that it isn't, because the index probably doesn't cover your query. use-the-index-luke.com/sql/clustering/… Additional info: dev.mysql.com/doc/refman/5.7/en/… Commented Nov 2, 2016 at 22:58
  • I have a similar case: index performance decreases by a factor of 50 within 30 hours, and the query cost goes from 1.1 to 1 million in that time. I tried HASH and BTREE, with no difference except that hash takes longer to kill the server. MySQL is deeply flawed in terms of indexing when you have a write-heavy environment. Commented Jan 26, 2019 at 18:01

2 Answers


BTree blocks inherently go from 100% full (actually about 15/16) to about 69% full after a lot of random INSERTs. This may account for the 40% you are seeing. But note, it won't get any slower over time. Hence, I see frequent OPTIMIZE TABLE as a waste of time, especially if you need the table to be accessible all the time.

And your 40% improvement will rapidly degrade.

I would expect FLUSH TABLES to actually hurt -- since it will remove cached info from RAM.

In order to discuss the query, please show us the query, its EXPLAIN SELECT ..., and perhaps SHOW CREATE TABLE bla.

foo is essentially two tables, one ordered by (id, fk_id, datecolumn), the other ordered by (datecolumn, id, fk_id). Plus, I think, a secondary index of (fk_id, id) (no datecolumn).

If fk_id is part of the WHERE clause, none of them may be optimal.

Addenda

(From OP's comment:)

select foo.id, foo.fk_id, foo.datecolumn, bar.id, bar.name, bar.type
 from foo
 join bar on bar.id = foo.fk_id
 where foo.datecolumn between '2016-01-01 05:00:00'
 and '2017-01-01 04:59:59'
 and foo.fk_id in (1, 2, 4);

The optimal indexes are

foo: INDEX(fk_id, datecolumn) -- 3 disjoint ranges
bar: PRIMARY KEY(id) -- I assume it is already the PK; if not, then INDEX(id)

Note that you do not currently have the optimal index for foo. Add it, and your need for OPTIMIZE TABLE should diminish.
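For example (the index name fk_date is just illustrative):

ALTER TABLE foo ADD INDEX fk_date (fk_id, datecolumn);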

As for why the optimize seemed to help -- probably caching. How much RAM do you have? What is the value of innodb_buffer_pool_size?
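You can check the current setting with:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';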

I prefer this pattern:

 where foo.datecolumn between '2016-01-01 05:00:00'
 AND '2016-01-01 05:00:00' + INTERVAL 1 YEAR

(No hassles with leap years, datatypes, etc)
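A related variant, sketched here rather than taken from the pattern above, uses a half-open range so no inclusive end point is needed at all:

 where foo.datecolumn >= '2016-01-01 05:00:00'
 AND foo.datecolumn < '2016-01-01 05:00:00' + INTERVAL 1 YEAR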

More on indexing.

answered Nov 3, 2016 at 0:13
  • Stumbled over this one while researching my issue; take a look here: dba.stackexchange.com/questions/228115/… Index performance drops by a factor of 50 within less than 2 days, and the query cost increases 750,000-fold in that time. Recreating the index solves it (cost 1.1). It happens with HASH and BTREE indexes. Recreating the index is the ONLY way (except for OPTIMIZE) that fixes the issue for another day. Commented Jan 26, 2019 at 18:11
  • @John - I suspect there is something about indexing virtual columns that led to that problem. I have been studying MySQL indexes for nearly 2 decades; I think I know most of the quirks. Commented Jan 26, 2019 at 18:18
  • I thought so too in the beginning, but I have invested about 20 hours into this single problem by now. I have recreated the table countless times and created all variants of fields. I tried VIRTUAL, STORED, and a normal TINYINT(), with the same results. I tried BTREE and HASH indexes and can reproduce the issue at any moment. I don't know what the issue is, but it's not related to virtual/generated columns anymore. Currently I use a trigger to write the value into a TINYINT(1), and I replaced it so it only holds NULLs and a few 1's. I know you are experienced, but the issue is in front of my eyes :( Commented Jan 26, 2019 at 18:28
  • InnoDB has no HASH indexes; that keyword is silently ignored. Unfortunately, I do not have enough experience specifically with Virtual/Stored/Generated + Indexed to spot the issue. One thing to note is that low-cardinality indexes (e.g. flags, or a TINYINT with few distinct values) are rarely used. You seem to be depending on using one? Commented Jan 26, 2019 at 19:44
  • I am not using a virtual/stored/generated column anymore; it's a normal TINYINT(1) column. Thanks for the info on HASH; that explains why it didn't solve the problem. I am using it because it should be the fastest way to identify whether a row is ready to be enqueued: it switches from NULL to 1 and, once finished, from 1 back to NULL. Commented Jan 26, 2019 at 20:08

To me, the table design looks broken.

The PK is (id, fk_id), which allows multiple rows per 'id'. Is that really wanted?

The timestamp is a fully dynamic one (DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP); is that needed?

The name 'datecolumn' suggests it is a 'date', so why not use the 'date' type for it?
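For example, the column could be declared explicitly so the behaviour is a deliberate choice (a sketch; pick whichever semantics you actually need):

alter table foo modify datecolumn timestamp not null default current_timestamp; -- keeps the time, drops the implicit ON UPDATE auto-change
alter table foo modify datecolumn date not null; -- if only the calendar date matters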

answered Feb 24, 2017 at 10:47
