Can I speed up a large MySQL/MariaDB join through partitioning or merging?

Question 1

I'm trying to run the following join query as part of a more complex one on MariaDB 10.1.26.

select distinct
 project_commits.project_id,
 date_format(created_at, '%x%v1') as week_commit
 from project_commits
 left join commits
 on project_commits.commit_id = commits.id;

Both join fields are indexed. However, the join involves a full scan of project_commits and an index lookup on commits. This is corroborated by the output of EXPLAIN.

+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| 1 | SIMPLE | project_commits | ALL | NULL | NULL | NULL | NULL | 5417294109 | Using temporary |
| 1 | SIMPLE | commits | eq_ref | PRIMARY | PRIMARY | 4 | ghtorrent.project_commits.commit_id | 1 | |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+

The sizes of the two tables are relatively large: project_commits contains 5 billion rows and commits 847 million rows. Also the server's memory size is relatively small (16GB). This probably means that index lookups hit the (unfortunately magnetic) disk, and therefore performance takes a heavy hit. According to the output of pmonitor run on the generated temporary table, the query, which has already run for more than half a day, will take another 373 hours to complete.

/home/mysql/ghtorrent/project_commits#P#p0.MYD 6.68% ETA 373:38:11

Could I somehow increase the query's performance either by partitioning the tables, so that the join can be performed in memory, or by forcing MySQL to perform a sort-merge join? As the time involved for running alternative strategies could be many hours, I'd rather have a suggestion, instead of trying things out.

Question 2

why are you taking project_commits.project_id instead of commits.project_id?

Question 3

Please provide SHOW CREATE TABLE for each table.

Question 4

PARTITIONs will not help.

Question 5

Do you really need LEFT? It inhibits starting with the other table.

Question 6

From the looks of the EXPLAIN plan, you are doing a full table scan on project_commits.

From the looks of the query, I surmise there is a one-to-many relationship from commits to project_commits. The problem I see is that your query is gathering data as many-on-one.

You are also using LEFT JOIN.

Your query is saying:

Show me all project_commits that are connected to or orphaned from commits

Instead, you might want the query to say:

Show me all project_commits that are connected to commits

SUGGESTION #1 : Flip Table Order

EXPLAIN select distinct
 project_commits.project_id,
 date_format(created_at, '%x%v1') as week_commit
 from commits
 left join project_commits
 on commits.id = project_commits.commit_id;

SUGGESTION #2 : Use `INNER JOIN`

EXPLAIN select distinct
 project_commits.project_id,
 date_format(created_at, '%x%v1') as week_commit
 from project_commits
 inner join commits
 on project_commits.commit_id = commits.id;

SUGGESTION #3 : Add `created_at` index

If you are already have an index on created_at or if you already have a compound index whose first column is created_at, skip this suggestion.

ALTER TABLE `project_commits` ADD INDEX created_at_ndx (`created_at`);

Adding an index will change EXPLAIN plan to do an index scan instead of a table scan

SUGGESTION #4 : Alter the `JOIN` behavior

You could manipulate the style of the join operation by tweeking the optimizer

According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins

When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.

For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.

This same page recommends doing this:

mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

GIVE IT A TRY !!!

Note : I have no idea what to expect from my suggestions. After all, your LEFT JOIN is like an iterative Cartesian Join with the potential of make a temp table that is the following

4.573 Quintillion rows (5.4 Billion X 0.847 Billion) being looked up

Have fun and let us know what you find

Question 7

Thank you for the many suggestions! The relationship is indeed a one-to-many relationship from commits to project_commits. According to EXPLAIN the inner join alternative will perform a scan on the smaller commit table, so I'm trying that out.

Question 8

Unfortunately, after 20 hours of processing no results have come out. I'm terminating the query, because back-of-the envelope calculations tell me that this is a task that could be performed in less than 10 hours.

Question 9

As MariaDB doesn't seem to support sort-merge joins, I ended up exporting the required fields into two sorted files, performing the JOIN and DISTINCT with the Unix join and uniq tools, and importing the result back into the database. The operation took fewer than 12 hours to complete. Full details are here.

Question 10

creating a covering index on table commits and project_commits should enable an index scan in the correct order can be done.

ALTER TABLE `commits` ADD INDEX created_at_ndx (`id`, `created_at`); 
ALTER TABLE `project_commits` ADD INDEX commit_id_ndx (`commit_id`, `project_id`);

Then you can do

select distinct project_id, date_format(created_at, '%x%v1') as week_commit
 from (select 
 project_commits.project_id,
 commits.created_at
 from commits 
 inner join project_commits
 on project_commits.commit_id = commits.id);

PLane should now look like this

+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
| 1 | PRIMARY | | ALL | | | | | ... | Using temporary |
| 2 | DERIVED | project_commits | index | commit_id_ndx | PRIMARY | 8 | | ... | Using index |
| 2 | DERIVED | commits | eq_ref | PRIMARY,created_at_ndx | PRIMARY | 4 | db_9_050611.project_commits.commit_id | ... | |
+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+

Question 11

Other people have also recommended this index. I will investigate it. Why the nested SELECT?

Question 12

its just an assumption but the query optimizer might not be smart enough to know it need to stream created_at field out of the join and might try to execute date_format before doing the join or pull all collumn from the "commits" table instead of the only 2 we need.

Question 13

I tried building the combined index. As I commented on the corresponding Reddit discussion that this index would take more than one hundred hours to build, which is impractical for a one-off query. reddit.com/r/programming/comments/94r6w6/…

score 6 · Answer 1 · 2018-08-03 13:03:41Z

From the looks of the EXPLAIN plan, you are doing a full table scan on project_commits.

From the looks of the query, I surmise there is a one-to-many relationship from commits to project_commits. The problem I see is that your query is gathering data as many-on-one.

You are also using LEFT JOIN.

Your query is saying:

Show me all project_commits that are connected to or orphaned from commits

Instead, you might want the query to say:

Show me all project_commits that are connected to commits

SUGGESTION #1 : Flip Table Order

EXPLAIN select distinct
 project_commits.project_id,
 date_format(created_at, '%x%v1') as week_commit
 from commits
 left join project_commits
 on commits.id = project_commits.commit_id;

SUGGESTION #2 : Use `INNER JOIN`

EXPLAIN select distinct
 project_commits.project_id,
 date_format(created_at, '%x%v1') as week_commit
 from project_commits
 inner join commits
 on project_commits.commit_id = commits.id;

SUGGESTION #3 : Add `created_at` index

If you are already have an index on created_at or if you already have a compound index whose first column is created_at, skip this suggestion.

ALTER TABLE `project_commits` ADD INDEX created_at_ndx (`created_at`);

Adding an index will change EXPLAIN plan to do an index scan instead of a table scan

SUGGESTION #4 : Alter the `JOIN` behavior

You could manipulate the style of the join operation by tweeking the optimizer

According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins

When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.

For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.

This same page recommends doing this:

mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

GIVE IT A TRY !!!

Note : I have no idea what to expect from my suggestions. After all, your LEFT JOIN is like an iterative Cartesian Join with the potential of make a temp table that is the following

4.573 Quintillion rows (5.4 Billion X 0.847 Billion) being looked up

Have fun and let us know what you find

Thank you for the many suggestions! The relationship is indeed a one-to-many relationship from commits to project_commits. According to EXPLAIN the inner join alternative will perform a scan on the smaller commit table, so I'm trying that out.
Unfortunately, after 20 hours of processing no results have come out. I'm terminating the query, because back-of-the envelope calculations tell me that this is a task that could be performed in less than 10 hours.

Diomidis Spinellis Diomidis Spinellis 2232 silver badges10 bronze badges · Answer 2 · 2018-08-05 11:28:06Z

As MariaDB doesn't seem to support sort-merge joins, I ended up exporting the required fields into two sorted files, performing the JOIN and DISTINCT with the Unix join and uniq tools, and importing the result back into the database. The operation took fewer than 12 hours to complete. Full details are here.

skyde skyde 1721 silver badge9 bronze badges · Answer 3 · 2018-08-06 18:17:53Z

creating a covering index on table commits and project_commits should enable an index scan in the correct order can be done.

ALTER TABLE `commits` ADD INDEX created_at_ndx (`id`, `created_at`); 
ALTER TABLE `project_commits` ADD INDEX commit_id_ndx (`commit_id`, `project_id`);

Then you can do

select distinct project_id, date_format(created_at, '%x%v1') as week_commit
 from (select 
 project_commits.project_id,
 commits.created_at
 from commits 
 inner join project_commits
 on project_commits.commit_id = commits.id);

PLane should now look like this

+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
| 1 | PRIMARY | | ALL | | | | | ... | Using temporary |
| 2 | DERIVED | project_commits | index | commit_id_ndx | PRIMARY | 8 | | ... | Using index |
| 2 | DERIVED | commits | eq_ref | PRIMARY,created_at_ndx | PRIMARY | 4 | db_9_050611.project_commits.commit_id | ... | |
+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+

Other people have also recommended this index. I will investigate it. Why the nested SELECT?
its just an assumption but the query optimizer might not be smart enough to know it need to stream created_at field out of the join and might try to execute date_format before doing the join or pull all collumn from the "commits" table instead of the only 2 we need.
I tried building the combined index. As I commented on the corresponding Reddit discussion that this index would take more than one hundred hours to build, which is impractical for a one-off query. reddit.com/r/programming/comments/94r6w6/…

Stack Exchange Network

Can I speed up a large MySQL/MariaDB join through partitioning or merging?

3 Answers 3

SUGGESTION #1 : Flip Table Order

SUGGESTION #2 : Use `INNER JOIN`

SUGGESTION #3 : Add `created_at` index

SUGGESTION #4 : Alter the `JOIN` behavior

GIVE IT A TRY !!!

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Can I speed up a large MySQL/MariaDB join through partitioning or merging?

3 Answers 3

SUGGESTION #1 : Flip Table Order

SUGGESTION #2 : Use INNER JOIN

SUGGESTION #3 : Add created_at index

SUGGESTION #4 : Alter the JOIN behavior

GIVE IT A TRY !!!

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

SUGGESTION #2 : Use `INNER JOIN`

SUGGESTION #3 : Add `created_at` index

SUGGESTION #4 : Alter the `JOIN` behavior