3

I'm trying to run the following join query as part of a more complex one on MariaDB 10.1.26.

select distinct
 project_commits.project_id,
 date_format(created_at, '%x%v1') as week_commit
 from project_commits
 left join commits
 on project_commits.commit_id = commits.id;

Both join fields are indexed. However, the join involves a full scan of project_commits and an index lookup on commits. This is corroborated by the output of EXPLAIN.

+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| 1 | SIMPLE | project_commits | ALL | NULL | NULL | NULL | NULL | 5417294109 | Using temporary |
| 1 | SIMPLE | commits | eq_ref | PRIMARY | PRIMARY | 4 | ghtorrent.project_commits.commit_id | 1 | |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+

The sizes of the two tables are relatively large: project_commits contains 5 billion rows and commits 847 million rows. Also the server's memory size is relatively small (16GB). This probably means that index lookups hit the (unfortunately magnetic) disk, and therefore performance takes a heavy hit. According to the output of pmonitor run on the generated temporary table, the query, which has already run for more than half a day, will take another 373 hours to complete.

/home/mysql/ghtorrent/project_commits#P#p0.MYD 6.68% ETA 373:38:11

Could I somehow increase the query's performance either by partitioning the tables, so that the join can be performed in memory, or by forcing MySQL to perform a sort-merge join? As the time involved for running alternative strategies could be many hours, I'd rather have a suggestion, instead of trying things out.

asked Aug 3, 2018 at 8:00
4
  • why are you taking project_commits.project_id instead of commits.project_id? Commented Aug 9, 2018 at 21:42
  • Please provide SHOW CREATE TABLE for each table. Commented Aug 21, 2018 at 6:38
  • PARTITIONs will not help. Commented Aug 21, 2018 at 6:39
  • Do you really need LEFT? It inhibits starting with the other table. Commented Aug 21, 2018 at 6:40

3 Answers 3

6

From the looks of the EXPLAIN plan, you are doing a full table scan on project_commits.

From the looks of the query, I surmise there is a one-to-many relationship from commits to project_commits. The problem I see is that your query is gathering data as many-on-one.

You are also using LEFT JOIN.

Your query is saying:

Show me all project_commits that are connected to or orphaned from commits

Instead, you might want the query to say:

Show me all project_commits that are connected to commits

SUGGESTION #1 : Flip Table Order

EXPLAIN select distinct
 project_commits.project_id,
 date_format(created_at, '%x%v1') as week_commit
 from commits
 left join project_commits
 on commits.id = project_commits.commit_id;

SUGGESTION #2 : Use INNER JOIN

EXPLAIN select distinct
 project_commits.project_id,
 date_format(created_at, '%x%v1') as week_commit
 from project_commits
 inner join commits
 on project_commits.commit_id = commits.id;

SUGGESTION #3 : Add created_at index

If you are already have an index on created_at or if you already have a compound index whose first column is created_at, skip this suggestion.

ALTER TABLE `project_commits` ADD INDEX created_at_ndx (`created_at`);

Adding an index will change EXPLAIN plan to do an index scan instead of a table scan

SUGGESTION #4 : Alter the JOIN behavior

You could manipulate the style of the join operation by tweeking the optimizer

According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins

When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.

For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.

This same page recommends doing this:

mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

GIVE IT A TRY !!!

Note : I have no idea what to expect from my suggestions. After all, your LEFT JOIN is like an iterative Cartesian Join with the potential of make a temp table that is the following

4.573 Quintillion rows (5.4 Billion X 0.847 Billion) being looked up

Have fun and let us know what you find

answered Aug 3, 2018 at 13:03
2
  • Thank you for the many suggestions! The relationship is indeed a one-to-many relationship from commits to project_commits. According to EXPLAIN the inner join alternative will perform a scan on the smaller commit table, so I'm trying that out. Commented Aug 3, 2018 at 17:09
  • Unfortunately, after 20 hours of processing no results have come out. I'm terminating the query, because back-of-the envelope calculations tell me that this is a task that could be performed in less than 10 hours. Commented Aug 4, 2018 at 12:37
1

As MariaDB doesn't seem to support sort-merge joins, I ended up exporting the required fields into two sorted files, performing the JOIN and DISTINCT with the Unix join and uniq tools, and importing the result back into the database. The operation took fewer than 12 hours to complete. Full details are here.

answered Aug 5, 2018 at 11:28
0

creating a covering index on table commits and project_commits should enable an index scan in the correct order can be done.

ALTER TABLE `commits` ADD INDEX created_at_ndx (`id`, `created_at`); 
ALTER TABLE `project_commits` ADD INDEX commit_id_ndx (`commit_id`, `project_id`); 

Then you can do

select distinct project_id, date_format(created_at, '%x%v1') as week_commit
 from (select 
 project_commits.project_id,
 commits.created_at
 from commits 
 inner join project_commits
 on project_commits.commit_id = commits.id);

PLane should now look like this

+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
| 1 | PRIMARY | | ALL | | | | | ... | Using temporary |
| 2 | DERIVED | project_commits | index | commit_id_ndx | PRIMARY | 8 | | ... | Using index |
| 2 | DERIVED | commits | eq_ref | PRIMARY,created_at_ndx | PRIMARY | 4 | db_9_050611.project_commits.commit_id | ... | |
+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
answered Aug 6, 2018 at 18:17
3
  • Other people have also recommended this index. I will investigate it. Why the nested SELECT? Commented Aug 9, 2018 at 13:01
  • its just an assumption but the query optimizer might not be smart enough to know it need to stream created_at field out of the join and might try to execute date_format before doing the join or pull all collumn from the "commits" table instead of the only 2 we need. Commented Aug 9, 2018 at 21:38
  • I tried building the combined index. As I commented on the corresponding Reddit discussion that this index would take more than one hundred hours to build, which is impractical for a one-off query. reddit.com/r/programming/comments/94r6w6/… Commented Aug 11, 2018 at 8:06

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.