I have been trying to create a "gaps and islands" identification routine in MySQL. See the previous question regarding setup here and computational resources required here.
I increased the RAM of the database server to 4 Gb. The query was running ok, but now I am having an issue where the database server runs out of disk space while running the query.
The query looks at the entire database and inserts islands that it finds into a table called shutdown_events
. I use INSERT IGNORE
and look at the whole database so that I don't cut off islands that span some sort of a timestamp constraint (I try adding this in, read on..).
I created the view with the following SQL (there are ~30 dr.*_sd
values to include)
create view sd_pivot as
select dr.wellsite_id, 'EngOilP_sd' as sd, dr.timestamp, dr.EngOilP_sd as val from datarecords dr
union all
select dr.wellsite_id, 'Stg1ScrbLVL_sd', dr.timestamp, dr.Stg1ScrbLVL_sd from datarecords dr
union all
select dr.wellsite_id, 'Stg2ScrbLVL_sd', dr.timestamp, dr.Stg2ScrbLVL_sd from datarecords dr
union all
...
and then the SQL that populates the shutdownevents
table is here:
INSERT IGNORE INTO shutdownevents (wellsite_id, sd_name, start, end)
SELECT t.*
FROM (
SELECT wellsite_id, sd, MIN(timestamp) AS starttime, MAX(timestamp) AS endtime
FROM (
SELECT
sd_p.*
, @val_change := IF(@prev_val != sd_p.val, @val_change + 1, @val_change) AS vc
, @prev_val := sd_p.val
FROM
sd_pivot sd_p
, (SELECT @prev_val := NULL, @val_change := 0) var_init_subquery
ORDER BY wellsite_id, sd, timestamp
) sq
WHERE val = 1
# AND timestamp > '{two_weeks_ago}'
GROUP BY sd, vc
) t
ORDER BY wellsite_id, sd, starttime
In my testing database, I added the line (commented out in the above SQL) AND timestamp > '{two_weeks_ago}'
to only find islands that occur after that date. Note that I sacrifice the beginning of islands that span that date. I thought that this would prevent the query from trying to look at so much data and thus using so much disk space.
The problem is that trying to run the above query still eventually uses up all of the available disk space and subsequently fails. The database server has 30 Gb of disk space. Before I run the query, there is ~2 Gb used up (28 Gb free).
The database is about 1.5 Gb in size. 99.99% of the database size is the datarecords
table. It has> 2M rows, the other tables are a few hundred rows at most.
Here is a shot from htop
on the database server once the disk has run out of memory and the query is about to fail.
The actual disk space is being taken up in the folder /tmp
and the file names are like below
-rw-rw---- 1 mysql mysql 5.9G Nov 30 16:40 #sql_3a1_0.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_0.MAI
-rw-rw---- 1 mysql mysql 2.6G Nov 30 17:09 #sql_3a1_1.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:40 #sql_3a1_1.MAI
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_3.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_3.MAI
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_5.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_5.MAI
-rw-rw---- 1 mysql mysql 5.9G Nov 30 16:44 #sql_3a1_6.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:29 #sql_3a1_6.MAI
-rw-rw---- 1 mysql mysql 2.3G Nov 30 17:09 #sql_3a1_7.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:45 #sql_3a1_7.MAI
-rw-rw---- 1 mysql mysql 5.9G Nov 30 16:45 #sql_3a1_8.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:30 #sql_3a1_8.MAI
-rw-rw---- 1 mysql mysql 2.1G Nov 30 17:09 #sql_3a1_9.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:45 #sql_3a1_9.MAI
If I $ service mysql restart
, all of the above files disappear and I get my disk space back.
So, I this question has a couple answers addressing one or more of the following;
- How might I rewrite the query to not be so resource intensive?
- Is there a different way to approach this problem?
At the end of the day, I simply need to identify events that occur and their duration using their respective boolean column that updates every few minutes
EDIT
Here is a link to the EXPLAIN
output of the query. On the second tab is a DESCRIBE
of datarecords
The line where I attempt to limit the date range of the query is calculated in the python
script. The actual SQL to the database ends up being
...
AND timestamp > '2015-11-20'
...
1 Answer 1
Your problem is due to 2 reported bugs:
- When using
UNION
orUNION ALL
MySQL creates a temporary working table. While it is justifiable forUNION
, it is unnecessary forUNION ALL
mysqld uses a temp table for both UNION and UNION ALL processing. The temp table for UNION ALL is not needed as results can be immediately returned to the client. That change would save the cost of writing the temp table and possibly spilling it to disk. It would also return the first-N rows to the client much faster.
Bug #50674: Do not create temporary tables for UNION ALL
This bug was scheduled to be fixed in ver. 5.7.3.
- Furthermore, self-join tables are reopened for every call:
In self joins, unions, etc. - tables are re-opened for each mention.
If a UNION (or self-joins) have e.g. 1000 mentions of same table, then 1000 internal structures will be created, with additional file descriptors and buffers per each.
This leads to a simple 100k-sized query to allocate 500MB of memory just by mentioning same (even empty) table in it (e.g. SELECT * FROM table UNION SELECT * FROM table ...).
Bug #44626: Tables are reopened multiple times if used in same query, buffers allocated too
So in your case it creates 38 temporary tables consuming lots of resources.
One possible solution
is to use dynamic SQL and CURSOR
to go through every of 38 columns. This way you don't need a view with 38 UNION ALL
statements.
-
do you have any resources you could point me to that would get me started on how to use dynamic SQL as you mentioned in your possible solution?Brian Leach– Brian Leach2015年12月10日 19:20:40 +00:00Commented Dec 10, 2015 at 19:20
-
First of all make a vary simple query working. If a query looking at a single sd does not run there is not going going further. Would you consider loading your data to another hosting like Azure or Amazon?Stoleg– Stoleg2015年12月11日 10:08:58 +00:00Commented Dec 11, 2015 at 10:08
-
the query runs on a small test data set. What advantages would hosting the data in a different place be?Brian Leach– Brian Leach2015年12月11日 15:56:59 +00:00Commented Dec 11, 2015 at 15:56
-
Could hosting would give you much more processing power than you have. I'm used to prod machines with at least 16 cores, 64GB RAM, 1TB space. Cloud gives you more processing power even for cheapest/free plans.Stoleg– Stoleg2015年12月11日 16:42:16 +00:00Commented Dec 11, 2015 at 16:42
-
Do you have an incrementing identity column in core table? You can do a join on
i=i+1
to align sequential time periods and see when status changes from 0 to 1 and back. It does not require ordering so my fit your resource limitations.Stoleg– Stoleg2015年12月11日 16:44:53 +00:00Commented Dec 11, 2015 at 16:44
Explore related questions
See similar questions with these tags.
datarecords
table. Concerning the unions, can you clarify about only 4-5 columns being used? there are 38 different values I am trying to identify islands for. Would you be able to point me to some resources that would help me to better construct this query? I built it using guidance from some of my previous questions and TBH some of its aspects are beyond my understanding (like what you mention concerning the full outer join and nesting). I also don't fully understand your last comment.SELECT t.*
andORDER BY
, I am still trying to figure out the query itself. There may be artifacts of things I have been trying still included. I tried to make a view for each type of the 38 sd values, although that did not successfully execute (eventually killedmysqld
). What do you mean when you say "optimize the joins"? What about them is not optimal (I do not understand this aspect fully)