0

I need to eliminate duplicates from a time sheet. I found this solution and have adapted it for my own needs:

DROP TABLE IF EXISTS `activity`;
CREATE TABLE IF NOT EXISTS `activity` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `planned_start` datetime DEFAULT NULL,
 `planned_end` datetime DEFAULT NULL,
 `actual_start` datetime DEFAULT NULL,
 `actual_end` datetime DEFAULT NULL,
 `code_id` int(11) DEFAULT NULL,
 `setting_id` int(11) DEFAULT NULL,
 `notes` text,
 `travel_distance` decimal(8,2) DEFAULT NULL,
 `created_by` int(11) NOT NULL,
 `updated_by` int(11) DEFAULT NULL,
 `submitted` tinyint(1) DEFAULT NULL,
 `approved` datetime DEFAULT NULL,
 `approved_by` int(11) DEFAULT NULL,
 `created` datetime NOT NULL,
 `updated` datetime NOT NULL,
 `peer_engagement_id` int(11) DEFAULT NULL,
 `person_id` int(11) DEFAULT NULL,
 `travel_notes` varchar(8000) DEFAULT NULL,
 PRIMARY KEY (`id`),
 UNIQUE KEY `id` (`id`),
 KEY `code_id_idx` (`code_id`),
 KEY `setting_id_idx` (`setting_id`),
 KEY `created_by_idx` (`created_by`),
 KEY `updated_by_idx` (`updated_by`),
 KEY `approved_by_idx` (`approved_by`),
 KEY `activity_peer_engagement_id_fk` (`peer_engagement_id`),
 KEY `activity_person_id_fk` (`person_id`),
 KEY `actual_start` (`actual_start`,`actual_end`),
 KEY `created` (`created`),
 KEY `person_id` (`person_id`,`actual_start`,`actual_end`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=165796 ;
SELECT
 COUNT(*) as occurrence
 , sub.id
 , SEC_TO_TIME(SUM(
 IF(a2start > a1start, a1end - a2start, a2end - a1start))) as duration
FROM
 ( SELECT
 a1.id
 , UNIX_TIMESTAMP(a1.actual_start) as a1start
 , UNIX_TIMESTAMP(a1.actual_end) as a1end
 , UNIX_TIMESTAMP(a2.actual_start) as a2start
 , UNIX_TIMESTAMP(a2.actual_end) as a2end
 FROM activity a1
 INNER JOIN activity a2
 ON (a1.id <> a2.id and a1.person_id=a2.person_id
 AND NOT(a1.actual_start > a2.actual_end OR a1.actual_end < a2.actual_start))
 ) sub

Problem is I can't even run explain on my query, my mysql server goes into 100% CPU usage and seems to stay there for minutes.

I can run explain on the inner query:

explain SELECT
 a1.id
 , UNIX_TIMESTAMP(a1.actual_start) as a1start
 , UNIX_TIMESTAMP(a1.actual_end) as a1end
 , UNIX_TIMESTAMP(a2.actual_start) as a2start
 , UNIX_TIMESTAMP(a2.actual_end) as a2end
FROM activity a1
INNER JOIN activity a2
 ON (a1.id <> a2.id and a1.person_id=a2.person_id
 AND NOT(a1.actual_start > a2.actual_end OR a1.actual_end < a2.actual_start))
+----+-------------+-------+-------+----------------------------------------------+-----------+---------+--------------------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------------------+-----------+---------+--------------------------------------+--------+--------------------------+
| 1 | SIMPLE | a1 | index | activity_person_id_fk,actual_start,person_id | person_id | 23 | NULL | 176586 | Using index |
| 1 | SIMPLE | a2 | ref | activity_person_id_fk,actual_start,person_id | person_id | 5 | mabel_mindandbody_co_nz.a1.person_id | 19705 | Using where; Using index |
+----+-------------+-------+-------+----------------------------------------------+-----------+---------+--------------------------------------+--------+--------------------------+
2 rows in set (0.00 sec)

My questions:

  • Why doesn't explain work here?
  • How can I optimize this query to give acceptable speed results?

In regards to optimization - I cannot find anything else than the indexes I already used in my table.

One other option I have thought about is to add an additional field, encoding every day into a number. I do know that a time sheet entry is never longer than 24 hours and I'm sure excluding midnight spanning time sheet entries would be acceptable. So with this I would hope to use a smaller index on this additional column on the inside query.

asked Nov 7, 2014 at 2:25
2
  • Not sure, but your join criteria looks quite complicated. How about moving some of it to WHERE clause? Commented Nov 7, 2014 at 19:43
  • @Manny, tried this, but no change. I did not expect this to hav any effect as existing indexes were already used, but the resultset was still too big Commented Nov 10, 2014 at 1:11

1 Answer 1

0

With the assumption of no timesheet entry spanning midnight, I have added a column

person_date varchar(30) not null

This is being calculated overnight with

update activity set person_date = concat(person_id , '_',date(actual_start) ) where person_date='';

I have also simplified the inside query to:

SELECT a1.id, 
a1.person_id, 
UNIX_TIMESTAMP( a1.actual_start ) AS a1start,
UNIX_TIMESTAMP( a1.actual_end ) AS a1end, 
UNIX_TIMESTAMP( a2.actual_start ) AS a2start, 
UNIX_TIMESTAMP( a2.actual_end ) AS a2end
FROM activity a1
INNER JOIN activity a2
WHERE (
a1.id < a2.id
AND a1.person_date = a2.person_date
AND a1.actual_start < a2.actual_end
AND a1.actual_end > a2.actual_start

Firstly, because my users record their time to hourly boundaries a lot and we have '0:00' overlap then, secondly with a1.id<>a2.id, we find every entry twice, when we need it only one.

I have concluded that there is no good way to optimize for a1.id < a2.id, so with person_date I'm optimizing for something else.

answered Nov 10, 2014 at 1:16
1
  • You also have an index starting with person_date? Commented Dec 15, 2016 at 19:46

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.