I'm trying to run the following join query as part of a more complex one on MariaDB 10.1.26.
select distinct
project_commits.project_id,
date_format(created_at, '%x%v1') as week_commit
from project_commits
left join commits
on project_commits.commit_id = commits.id;
Both join fields are indexed. However, the join involves a full scan of project_commits
and an index lookup on commits
. This is corroborated by the output of EXPLAIN
.
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
| 1 | SIMPLE | project_commits | ALL | NULL | NULL | NULL | NULL | 5417294109 | Using temporary |
| 1 | SIMPLE | commits | eq_ref | PRIMARY | PRIMARY | 4 | ghtorrent.project_commits.commit_id | 1 | |
+------+-------------+-----------------+--------+---------------+---------+---------+-------------------------------------+------------+-----------------+
The sizes of the two tables are relatively large: project_commits
contains 5 billion rows and commits
847 million rows. Also the server's memory size is relatively small (16GB). This probably means that index lookups hit the (unfortunately magnetic) disk, and therefore performance takes a heavy hit. According to the output of pmonitor run on the generated temporary table, the query, which has already run for more than half a day, will take another 373 hours to complete.
/home/mysql/ghtorrent/project_commits#P#p0.MYD 6.68% ETA 373:38:11
Could I somehow increase the query's performance either by partitioning the tables, so that the join can be performed in memory, or by forcing MySQL to perform a sort-merge join? As the time involved for running alternative strategies could be many hours, I'd rather have a suggestion, instead of trying things out.
3 Answers 3
From the looks of the EXPLAIN plan, you are doing a full table scan on project_commits
.
From the looks of the query, I surmise there is a one-to-many relationship from commits
to project_commits
. The problem I see is that your query is gathering data as many-on-one.
You are also using LEFT JOIN
.
Your query is saying:
Show me all
project_commits
that are connected to or orphaned fromcommits
Instead, you might want the query to say:
Show me all
project_commits
that are connected tocommits
SUGGESTION #1 : Flip Table Order
EXPLAIN select distinct
project_commits.project_id,
date_format(created_at, '%x%v1') as week_commit
from commits
left join project_commits
on commits.id = project_commits.commit_id;
SUGGESTION #2 : Use INNER JOIN
EXPLAIN select distinct
project_commits.project_id,
date_format(created_at, '%x%v1') as week_commit
from project_commits
inner join commits
on project_commits.commit_id = commits.id;
SUGGESTION #3 : Add created_at
index
If you are already have an index on created_at
or if you already have a compound index whose first column is created_at
, skip this suggestion.
ALTER TABLE `project_commits` ADD INDEX created_at_ndx (`created_at`);
Adding an index will change EXPLAIN
plan to do an index scan instead of a table scan
SUGGESTION #4 : Alter the JOIN
behavior
You could manipulate the style of the join operation by tweeking the optimizer
According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins
When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.
For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.
This same page recommends doing this:
mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
GIVE IT A TRY !!!
Note : I have no idea what to expect from my suggestions. After all, your LEFT JOIN
is like an iterative Cartesian Join with the potential of make a temp table that is the following
4.573 Quintillion rows (5.4 Billion X 0.847 Billion) being looked up
Have fun and let us know what you find
-
Thank you for the many suggestions! The relationship is indeed a one-to-many relationship from
commits
toproject_commits
. According toEXPLAIN
theinner join
alternative will perform a scan on the smallercommit
table, so I'm trying that out.Diomidis Spinellis– Diomidis Spinellis2018年08月03日 17:09:09 +00:00Commented Aug 3, 2018 at 17:09 -
Unfortunately, after 20 hours of processing no results have come out. I'm terminating the query, because back-of-the envelope calculations tell me that this is a task that could be performed in less than 10 hours.Diomidis Spinellis– Diomidis Spinellis2018年08月04日 12:37:20 +00:00Commented Aug 4, 2018 at 12:37
As MariaDB doesn't seem to support sort-merge joins, I ended up exporting the required fields into two sorted files, performing the JOIN
and DISTINCT
with the Unix join and uniq tools, and importing the result back into the database. The operation took fewer than 12 hours to complete. Full details are here.
creating a covering index on table commits and project_commits should enable an index scan in the correct order can be done.
ALTER TABLE `commits` ADD INDEX created_at_ndx (`id`, `created_at`);
ALTER TABLE `project_commits` ADD INDEX commit_id_ndx (`commit_id`, `project_id`);
Then you can do
select distinct project_id, date_format(created_at, '%x%v1') as week_commit
from (select
project_commits.project_id,
commits.created_at
from commits
inner join project_commits
on project_commits.commit_id = commits.id);
PLane should now look like this
+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
| 1 | PRIMARY | | ALL | | | | | ... | Using temporary |
| 2 | DERIVED | project_commits | index | commit_id_ndx | PRIMARY | 8 | | ... | Using index |
| 2 | DERIVED | commits | eq_ref | PRIMARY,created_at_ndx | PRIMARY | 4 | db_9_050611.project_commits.commit_id | ... | |
+------+-------------+-----------------+--------+------------------------+---------+---------+---------------------------------------+------------+-----------------+
-
Other people have also recommended this index. I will investigate it. Why the nested SELECT?Diomidis Spinellis– Diomidis Spinellis2018年08月09日 13:01:40 +00:00Commented Aug 9, 2018 at 13:01
-
its just an assumption but the query optimizer might not be smart enough to know it need to stream created_at field out of the join and might try to execute date_format before doing the join or pull all collumn from the "commits" table instead of the only 2 we need.skyde– skyde2018年08月09日 21:38:11 +00:00Commented Aug 9, 2018 at 21:38
-
I tried building the combined index. As I commented on the corresponding Reddit discussion that this index would take more than one hundred hours to build, which is impractical for a one-off query. reddit.com/r/programming/comments/94r6w6/…Diomidis Spinellis– Diomidis Spinellis2018年08月11日 08:06:40 +00:00Commented Aug 11, 2018 at 8:06
SHOW CREATE TABLE
for each table.PARTITIONs
will not help.LEFT
? It inhibits starting with the other table.