MySQL optimize range duplicate query

Question 1

I need to eliminate duplicates from a time sheet. I found this solution and have adapted it for my own needs:

DROP TABLE IF EXISTS `activity`;
CREATE TABLE IF NOT EXISTS `activity` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `planned_start` datetime DEFAULT NULL,
 `planned_end` datetime DEFAULT NULL,
 `actual_start` datetime DEFAULT NULL,
 `actual_end` datetime DEFAULT NULL,
 `code_id` int(11) DEFAULT NULL,
 `setting_id` int(11) DEFAULT NULL,
 `notes` text,
 `travel_distance` decimal(8,2) DEFAULT NULL,
 `created_by` int(11) NOT NULL,
 `updated_by` int(11) DEFAULT NULL,
 `submitted` tinyint(1) DEFAULT NULL,
 `approved` datetime DEFAULT NULL,
 `approved_by` int(11) DEFAULT NULL,
 `created` datetime NOT NULL,
 `updated` datetime NOT NULL,
 `peer_engagement_id` int(11) DEFAULT NULL,
 `person_id` int(11) DEFAULT NULL,
 `travel_notes` varchar(8000) DEFAULT NULL,
 PRIMARY KEY (`id`),
 UNIQUE KEY `id` (`id`),
 KEY `code_id_idx` (`code_id`),
 KEY `setting_id_idx` (`setting_id`),
 KEY `created_by_idx` (`created_by`),
 KEY `updated_by_idx` (`updated_by`),
 KEY `approved_by_idx` (`approved_by`),
 KEY `activity_peer_engagement_id_fk` (`peer_engagement_id`),
 KEY `activity_person_id_fk` (`person_id`),
 KEY `actual_start` (`actual_start`,`actual_end`),
 KEY `created` (`created`),
 KEY `person_id` (`person_id`,`actual_start`,`actual_end`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=165796 ;
SELECT
 COUNT(*) as occurrence
 , sub.id
 , SEC_TO_TIME(SUM(
 IF(a2start > a1start, a1end - a2start, a2end - a1start))) as duration
FROM
 ( SELECT
 a1.id
 , UNIX_TIMESTAMP(a1.actual_start) as a1start
 , UNIX_TIMESTAMP(a1.actual_end) as a1end
 , UNIX_TIMESTAMP(a2.actual_start) as a2start
 , UNIX_TIMESTAMP(a2.actual_end) as a2end
 FROM activity a1
 INNER JOIN activity a2
 ON (a1.id <> a2.id and a1.person_id=a2.person_id
 AND NOT(a1.actual_start > a2.actual_end OR a1.actual_end < a2.actual_start))
 ) sub

Problem is I can't even run explain on my query, my mysql server goes into 100% CPU usage and seems to stay there for minutes.

I can run explain on the inner query:

explain SELECT
 a1.id
 , UNIX_TIMESTAMP(a1.actual_start) as a1start
 , UNIX_TIMESTAMP(a1.actual_end) as a1end
 , UNIX_TIMESTAMP(a2.actual_start) as a2start
 , UNIX_TIMESTAMP(a2.actual_end) as a2end
FROM activity a1
INNER JOIN activity a2
 ON (a1.id <> a2.id and a1.person_id=a2.person_id
 AND NOT(a1.actual_start > a2.actual_end OR a1.actual_end < a2.actual_start))
+----+-------------+-------+-------+----------------------------------------------+-----------+---------+--------------------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------------------+-----------+---------+--------------------------------------+--------+--------------------------+
| 1 | SIMPLE | a1 | index | activity_person_id_fk,actual_start,person_id | person_id | 23 | NULL | 176586 | Using index |
| 1 | SIMPLE | a2 | ref | activity_person_id_fk,actual_start,person_id | person_id | 5 | mabel_mindandbody_co_nz.a1.person_id | 19705 | Using where; Using index |
+----+-------------+-------+-------+----------------------------------------------+-----------+---------+--------------------------------------+--------+--------------------------+
2 rows in set (0.00 sec)

My questions:

Why doesn't explain work here?
How can I optimize this query to give acceptable speed results?

In regards to optimization - I cannot find anything else than the indexes I already used in my table.

One other option I have thought about is to add an additional field, encoding every day into a number. I do know that a time sheet entry is never longer than 24 hours and I'm sure excluding midnight spanning time sheet entries would be acceptable. So with this I would hope to use a smaller index on this additional column on the inside query.

Question 2

Not sure, but your join criteria looks quite complicated. How about moving some of it to WHERE clause?

Question 3

@Manny, tried this, but no change. I did not expect this to hav any effect as existing indexes were already used, but the resultset was still too big

Question 4

With the assumption of no timesheet entry spanning midnight, I have added a column

person_date varchar(30) not null

This is being calculated overnight with

update activity set person_date = concat(person_id , '_',date(actual_start) ) where person_date='';

I have also simplified the inside query to:

SELECT a1.id, 
a1.person_id, 
UNIX_TIMESTAMP( a1.actual_start ) AS a1start,
UNIX_TIMESTAMP( a1.actual_end ) AS a1end, 
UNIX_TIMESTAMP( a2.actual_start ) AS a2start, 
UNIX_TIMESTAMP( a2.actual_end ) AS a2end
FROM activity a1
INNER JOIN activity a2
WHERE (
a1.id < a2.id
AND a1.person_date = a2.person_date
AND a1.actual_start < a2.actual_end
AND a1.actual_end > a2.actual_start

Firstly, because my users record their time to hourly boundaries a lot and we have '0:00' overlap then, secondly with a1.id<>a2.id, we find every entry twice, when we need it only one.

I have concluded that there is no good way to optimize for a1.id < a2.id, so with person_date I'm optimizing for something else.

Question 5

You also have an index starting with person_date?

automatem automatem 1612 silver badges12 bronze badges · Answer 1 · 2014-11-10 01:16:10Z

With the assumption of no timesheet entry spanning midnight, I have added a column

person_date varchar(30) not null

This is being calculated overnight with

update activity set person_date = concat(person_id , '_',date(actual_start) ) where person_date='';

I have also simplified the inside query to:

SELECT a1.id, 
a1.person_id, 
UNIX_TIMESTAMP( a1.actual_start ) AS a1start,
UNIX_TIMESTAMP( a1.actual_end ) AS a1end, 
UNIX_TIMESTAMP( a2.actual_start ) AS a2start, 
UNIX_TIMESTAMP( a2.actual_end ) AS a2end
FROM activity a1
INNER JOIN activity a2
WHERE (
a1.id < a2.id
AND a1.person_date = a2.person_date
AND a1.actual_start < a2.actual_end
AND a1.actual_end > a2.actual_start

Firstly, because my users record their time to hourly boundaries a lot and we have '0:00' overlap then, secondly with a1.id<>a2.id, we find every entry twice, when we need it only one.

I have concluded that there is no good way to optimize for a1.id < a2.id, so with person_date I'm optimizing for something else.

You also have an index starting with person_date?

Rick James
– Rick James

2016年12月15日 19:46:52 +00:00
Commented Dec 15, 2016 at 19:46

Stack Exchange Network

MySQL optimize range duplicate query

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

MySQL optimize range duplicate query

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions