Count consecutive null rows from a joined table

Question 1

I have the following tables:

members:
----------------------------------------------
| member_id | member_name | member_join_date |
---------------------------------------------- 
events:
-------------------------
| event_id | event_name |
-------------------------
event_dates:
--------------------------------------
| event_id | event_date | event_week |
--------------------------------------
attendance:
------------------------------------------------------------
| member_id | event_id | attendance_date | attendance_week |
------------------------------------------------------------

So I have members who joined the organization at a certain time and they may attend weekly events (the occurrences of which are monitored by the event_dates table)

What I want to do is get all members that have at least 3-consecutive absences

I did all I could and I think I have each part of the query but don't know how to merge them together.

Get all events that they could've attended:

 SELECT
 m.member_id AS id,
 ed.event_id,
 ed.event_week
 FROM
 members AS m
 LEFT JOIN
 event_dates AS ed
 ON
 ed.event_date >= m.member_join_date

Get all the events they actually attended:

 SELECT
 m.member_id AS id,
 a.event_id
 FROM
 members AS m
 LEFT JOIN
 attendance AS a
 ON
 a.member_id = m.member_id

And probably the answer to this question if I manage to merge the two queries into one:

https://stackoverflow.com/questions/1321670/count-of-consecutive-not-null-values

But this time I want to count NULL values

Question 2

This is essentially a gaps-and-islands problem. And when I have my SQL Server hat on, I often solve this kind of problem with two ROW_NUMBER() calls. Sadly, MySQL, unlike many other major SQL products, does not support ROW_NUMBER(), nor any other ranking function. To make up for that, however, you can use variable assignment in SELECTs, which MySQL does support (unlike many other major SQL products).

Below is a solution followed by an explanation:

SELECT
 member_id,
 member_name,
 event_id,
 COUNT(*) AS consecutive_times_missed,
 MIN(event_date) AS first_date_missed,
 MAX(event_date) AS last_date_missed
FROM (
 SELECT
 member_id,
 member_name,
 event_id,
 event_date,
 is_missed,
 @occ_ranking := (event_id = @last_event) * (member_id = @last_member) * @occ_ranking + 1,
 @att_ranking := (event_id = @last_event) * (member_id = @last_member)
 * (is_missed = @last_missed) * @att_ranking + 1,
 @occ_ranking - @att_ranking AS grp,
 @last_member := member_id,
 @last_event := event_id,
 @last_missed := is_missed
 FROM (
 SELECT
 m.member_id,
 m.member_name,
 e.event_id,
 e.event_date,
 (a.attendance_date IS NULL) AS is_missed
 FROM members m
 INNER JOIN event_dates e ON m.member_join_date <= e.event_date
 LEFT JOIN attendance a ON m.member_id = a.member_id
 AND e.event_id = a.event_id
 AND e.event_date = a.attendance_date,
 (
 SELECT
 @occ_ranking := 0,
 @att_ranking := 0,
 @last_member := 0,
 @last_event := 0,
 @last_missed := 0
 ) v
 ORDER BY
 m.member_id,
 e.event_date
 ) s
) s
WHERE
 is_missed = 1
GROUP BY
 member_id,
 member_name,
 event_id,
 grp
HAVING
 COUNT(*) >= 3
;

Basically, you start with joining members and event_dates to get all the event occurrences the members could have attended based on their membership dates. Then you throw in the attendance table (via a left join) to flag the missed occurrences. Here's an example of what you get by this time:

member_id event_id event_date is_missed
--------- -------- ---------- ---------
1 1 2012年07月10日 0
2 1 2012年07月10日 1
1 1 2012年07月14日 0
2 1 2012年07月14日 1
... ... ... ...

At this point, the resulting set needs to be sorted by member_id, event_id, because that is crucial to the subsequent rankings calculation.

Two different rankings are produced for every row. One is a ranking within the row's partition of (member_id, event_id) (it is reset as soon as a new event or a new member is encountered). The other is a ranking within the specific group of consecutive event occurrences, either attended or missed, that the row belongs to (this ranking, in addition to being reset upon coming across a new member or event, also gets reset whenever the other group is encountered). And so you get something like this:

member_id event_id event_date is_missed ranking1 ranking2
--------- -------- ---------- --------- -------- --------
... ... ... ... ... ...
1 1 2012年07月27日 0 4 4
1 1 2012年07月28日 0 5 5
1 1 2012年07月29日 1 6 1
2 1 2012年07月10日 1 1 1
2 1 2012年07月14日 1 2 2
2 1 2012年07月25日 1 3 3
2 1 2012年07月27日 0 4 1
... ... ... ... ... ...

As you may have noticed, the difference between the two ranking numbers is constant throughout the particular group of consecutive event occurrences of the same kind ("attended" or "missed") and is also unique for that group within its partition. Therefore, every such group can be identified by member_id, event_id and the just mentioned difference.

And it now remains simply to filter the events leaving just the missed ones, group the rows and get the necessary aggregated data, like the number of rows and, possibly, as in the query above, the dates of the first and the last event occurrence in the group. The number of rows is also used in an additional filter condition to omit groups with fewer rows than required.

Question 3

Sorry it took a while to select the answer, I was modifying the query a bit more. I noticed that derived tables do not use indeces though. So if the lowest level JOIN returns a LOT of rows I'd take a big performance hit right?

Question 4

@RolandoCruz: Seems probable. Not sure, however, what index or indices to suggest in this situation to make sure they would be efficiently used (not very good at query tuning yet, sorry). If I had to tune this myself, I'd likely be taking into account the GROUP BY columns as well as the ORDER BY ones, although either would seem to me a tricky bit to account for, as those columns belong to different tables.

Question 5

(Assuming that the (event_id, event_date) is the primary or a unique key in event_dates table):

SELECT
 m.member_id AS id,
 ed_start.event_id,
 ed_start.event_date AS miss_start,
 ed_end.event_date AS miss_end,
 COUNT(*) AS consecutive_misses
FROM
 members AS m
 JOIN
 event_dates AS ed_start
 ON
 ed_start.event_date >= m.member_join_date
 JOIN
 event_dates AS ed_end
 ON ed_end.event_id = ed_start.event_id
 AND ed_end.event_date >= ed_start.event_date
 JOIN
 event_dates AS ed_mid
 ON ed_mid.event_id = ed_start.event_id
 AND ed_mid.event_date >= ed_start.event_date
 AND ed_mid.event_date <= ed_end.event_date
 LEFT JOIN
 attendance AS a
 ON a.member_id = m.member_id
 AND a.event_id = ed_mid.event_id
 AND a.attendance_date = ed_mid.event_date
GROUP BY 
 m.member_id,
 ed_start.event_id,
 ed_start.event_date,
 ed_end.event_date
HAVING
 COUNT(*) >= 3
 AND
 COUNT(a.attendance_date) = 0 ;

Test in SQL-Fiddle

The above may give (too many) multiple results if the consecutive misses are more than 3. So, If you want only the members and events they missed but not when they missed attendance, replace the SELECT with (test: SQL-Fiddle-2 ):

SELECT DISTINCT
 m.member_id AS id,
 ed.event_id,
FROM
 ...

You could also wrap the query (as a derived table) inside an external query and do another grouping for this to work (test: SQL-Fiddle-3 )

And if you want only the members and not the events they missed nor when they missed attendance:

SELECT DISTINCT
 m.member_id AS id
FROM
 ...

Question 6

Hmmmm.. It returns 0 rows :) sqlfiddle.com/#!2/3f509/1/0

Question 7

@Rolando: Yes, wrong column in the COUNT()=0. Corrected now.

Question 8

Haven't yet figured out how this works, but it doesn't seem to find the employees who missed the first 3 occurrences of an event. (In that example, I expected the employee ID=2 to be returned.)

Question 9

@Andriy: Yeah, it was wrong. It needed (one more!) join. I don't expect this to be efficient. No idea how it could be rewritten in an elegant manner ...

Question 10

And if the misses are more than 3 in a row, it returns too many results. It needs to be embedded as a derived table to get those out (or use the last 2 DISTINCT versions)

Andriy M Andriy M 23.3k6 gold badges60 silver badges104 bronze badges · Accepted Answer · 2012-09-26 20:00:03Z

This is essentially a gaps-and-islands problem. And when I have my SQL Server hat on, I often solve this kind of problem with two ROW_NUMBER() calls. Sadly, MySQL, unlike many other major SQL products, does not support ROW_NUMBER(), nor any other ranking function. To make up for that, however, you can use variable assignment in SELECTs, which MySQL does support (unlike many other major SQL products).

Below is a solution followed by an explanation:

SELECT
 member_id,
 member_name,
 event_id,
 COUNT(*) AS consecutive_times_missed,
 MIN(event_date) AS first_date_missed,
 MAX(event_date) AS last_date_missed
FROM (
 SELECT
 member_id,
 member_name,
 event_id,
 event_date,
 is_missed,
 @occ_ranking := (event_id = @last_event) * (member_id = @last_member) * @occ_ranking + 1,
 @att_ranking := (event_id = @last_event) * (member_id = @last_member)
 * (is_missed = @last_missed) * @att_ranking + 1,
 @occ_ranking - @att_ranking AS grp,
 @last_member := member_id,
 @last_event := event_id,
 @last_missed := is_missed
 FROM (
 SELECT
 m.member_id,
 m.member_name,
 e.event_id,
 e.event_date,
 (a.attendance_date IS NULL) AS is_missed
 FROM members m
 INNER JOIN event_dates e ON m.member_join_date <= e.event_date
 LEFT JOIN attendance a ON m.member_id = a.member_id
 AND e.event_id = a.event_id
 AND e.event_date = a.attendance_date,
 (
 SELECT
 @occ_ranking := 0,
 @att_ranking := 0,
 @last_member := 0,
 @last_event := 0,
 @last_missed := 0
 ) v
 ORDER BY
 m.member_id,
 e.event_date
 ) s
) s
WHERE
 is_missed = 1
GROUP BY
 member_id,
 member_name,
 event_id,
 grp
HAVING
 COUNT(*) >= 3
;

Basically, you start with joining members and event_dates to get all the event occurrences the members could have attended based on their membership dates. Then you throw in the attendance table (via a left join) to flag the missed occurrences. Here's an example of what you get by this time:

member_id event_id event_date is_missed
--------- -------- ---------- ---------
1 1 2012年07月10日 0
2 1 2012年07月10日 1
1 1 2012年07月14日 0
2 1 2012年07月14日 1
... ... ... ...

At this point, the resulting set needs to be sorted by member_id, event_id, because that is crucial to the subsequent rankings calculation.

Two different rankings are produced for every row. One is a ranking within the row's partition of (member_id, event_id) (it is reset as soon as a new event or a new member is encountered). The other is a ranking within the specific group of consecutive event occurrences, either attended or missed, that the row belongs to (this ranking, in addition to being reset upon coming across a new member or event, also gets reset whenever the other group is encountered). And so you get something like this:

member_id event_id event_date is_missed ranking1 ranking2
--------- -------- ---------- --------- -------- --------
... ... ... ... ... ...
1 1 2012年07月27日 0 4 4
1 1 2012年07月28日 0 5 5
1 1 2012年07月29日 1 6 1
2 1 2012年07月10日 1 1 1
2 1 2012年07月14日 1 2 2
2 1 2012年07月25日 1 3 3
2 1 2012年07月27日 0 4 1
... ... ... ... ... ...

As you may have noticed, the difference between the two ranking numbers is constant throughout the particular group of consecutive event occurrences of the same kind ("attended" or "missed") and is also unique for that group within its partition. Therefore, every such group can be identified by member_id, event_id and the just mentioned difference.

And it now remains simply to filter the events leaving just the missed ones, group the rows and get the necessary aggregated data, like the number of rows and, possibly, as in the query above, the dates of the first and the last event occurrence in the group. The number of rows is also used in an additional filter condition to omit groups with fewer rows than required.

Sorry it took a while to select the answer, I was modifying the query a bit more. I noticed that derived tables do not use indeces though. So if the lowest level JOIN returns a LOT of rows I'd take a big performance hit right?
@RolandoCruz: Seems probable. Not sure, however, what index or indices to suggest in this situation to make sure they would be efficiently used (not very good at query tuning yet, sorry). If I had to tune this myself, I'd likely be taking into account the GROUP BY columns as well as the ORDER BY ones, although either would seem to me a tricky bit to account for, as those columns belong to different tables.

Stack Exchange Network

Count consecutive null rows from a joined table

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Count consecutive null rows from a joined table

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions