6

I have been trying to create a "gaps and islands" identification routine in MySQL. See the previous question regarding setup here and computational resources required here.

I increased the RAM of the database server to 4 Gb. The query was running ok, but now I am having an issue where the database server runs out of disk space while running the query.

The query looks at the entire database and inserts islands that it finds into a table called shutdown_events. I use INSERT IGNORE and look at the whole database so that I don't cut off islands that span some sort of a timestamp constraint (I try adding this in, read on..).

I created the view with the following SQL (there are ~30 dr.*_sd values to include)

create view sd_pivot as
 select dr.wellsite_id, 'EngOilP_sd' as sd, dr.timestamp, dr.EngOilP_sd as val from datarecords dr
 union all
 select dr.wellsite_id, 'Stg1ScrbLVL_sd', dr.timestamp, dr.Stg1ScrbLVL_sd from datarecords dr
 union all
 select dr.wellsite_id, 'Stg2ScrbLVL_sd', dr.timestamp, dr.Stg2ScrbLVL_sd from datarecords dr
 union all
...

and then the SQL that populates the shutdownevents table is here:

INSERT IGNORE INTO shutdownevents (wellsite_id, sd_name, start, end)
SELECT t.*
FROM (
 SELECT wellsite_id, sd, MIN(timestamp) AS starttime, MAX(timestamp) AS endtime
 FROM (
 SELECT
 sd_p.*
 , @val_change := IF(@prev_val != sd_p.val, @val_change + 1, @val_change) AS vc
 , @prev_val := sd_p.val
 FROM
 sd_pivot sd_p
 , (SELECT @prev_val := NULL, @val_change := 0) var_init_subquery
 ORDER BY wellsite_id, sd, timestamp
 ) sq
 WHERE val = 1
 # AND timestamp > '{two_weeks_ago}'
 GROUP BY sd, vc
) t
ORDER BY wellsite_id, sd, starttime

In my testing database, I added the line (commented out in the above SQL) AND timestamp > '{two_weeks_ago}' to only find islands that occur after that date. Note that I sacrifice the beginning of islands that span that date. I thought that this would prevent the query from trying to look at so much data and thus using so much disk space.

The problem is that trying to run the above query still eventually uses up all of the available disk space and subsequently fails. The database server has 30 Gb of disk space. Before I run the query, there is ~2 Gb used up (28 Gb free).

The database is about 1.5 Gb in size. 99.99% of the database size is the datarecords table. It has> 2M rows, the other tables are a few hundred rows at most.

Here is a shot from htop on the database server once the disk has run out of memory and the query is about to fail.

htop right before fail

The actual disk space is being taken up in the folder /tmp and the file names are like below

-rw-rw---- 1 mysql mysql 5.9G Nov 30 16:40 #sql_3a1_0.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_0.MAI
-rw-rw---- 1 mysql mysql 2.6G Nov 30 17:09 #sql_3a1_1.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:40 #sql_3a1_1.MAI
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_3.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_3.MAI
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_5.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_5.MAI
-rw-rw---- 1 mysql mysql 5.9G Nov 30 16:44 #sql_3a1_6.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:29 #sql_3a1_6.MAI
-rw-rw---- 1 mysql mysql 2.3G Nov 30 17:09 #sql_3a1_7.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:45 #sql_3a1_7.MAI
-rw-rw---- 1 mysql mysql 5.9G Nov 30 16:45 #sql_3a1_8.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:30 #sql_3a1_8.MAI
-rw-rw---- 1 mysql mysql 2.1G Nov 30 17:09 #sql_3a1_9.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:45 #sql_3a1_9.MAI

If I $ service mysql restart, all of the above files disappear and I get my disk space back.

So, I this question has a couple answers addressing one or more of the following;

  • How might I rewrite the query to not be so resource intensive?
  • Is there a different way to approach this problem?

At the end of the day, I simply need to identify events that occur and their duration using their respective boolean column that updates every few minutes

EDIT

Here is a link to the EXPLAIN output of the query. On the second tab is a DESCRIBE of datarecords

The line where I attempt to limit the date range of the query is calculated in the python script. The actual SQL to the database ends up being

...
AND timestamp > '2015-11-20'
...
asked Nov 30, 2015 at 22:25
7
  • @ypercube The view has 38 union all statements in total Commented Dec 3, 2015 at 16:00
  • @Stoleg This is exactly what I did with a multipb db that would run out of space. Just made smaller loops until it worked and optimized the joins. Commented Dec 4, 2015 at 16:40
  • Your view has 38 unions doing pivot of original table, but you are using only 4-5 columns. Can you create another view with only coluns you need for this exercise? Commented Dec 4, 2015 at 16:43
  • @Stoleg there is an index on the timestamp column in the datarecords table. Concerning the unions, can you clarify about only 4-5 columns being used? there are 38 different values I am trying to identify islands for. Would you be able to point me to some resources that would help me to better construct this query? I built it using guidance from some of my previous questions and TBH some of its aspects are beyond my understanding (like what you mention concerning the full outer join and nesting). I also don't fully understand your last comment. Commented Dec 7, 2015 at 18:12
  • @AliRazeghi, the subselect might not need the SELECT t.* and ORDER BY, I am still trying to figure out the query itself. There may be artifacts of things I have been trying still included. I tried to make a view for each type of the 38 sd values, although that did not successfully execute (eventually killed mysqld). What do you mean when you say "optimize the joins"? What about them is not optimal (I do not understand this aspect fully) Commented Dec 7, 2015 at 18:24

1 Answer 1

3
+50

Your problem is due to 2 reported bugs:

  1. When using UNION or UNION ALL MySQL creates a temporary working table. While it is justifiable for UNION, it is unnecessary for UNION ALL

mysqld uses a temp table for both UNION and UNION ALL processing. The temp table for UNION ALL is not needed as results can be immediately returned to the client. That change would save the cost of writing the temp table and possibly spilling it to disk. It would also return the first-N rows to the client much faster.

Bug #50674: Do not create temporary tables for UNION ALL

This bug was scheduled to be fixed in ver. 5.7.3.

  1. Furthermore, self-join tables are reopened for every call:

In self joins, unions, etc. - tables are re-opened for each mention.

If a UNION (or self-joins) have e.g. 1000 mentions of same table, then 1000 internal structures will be created, with additional file descriptors and buffers per each.

This leads to a simple 100k-sized query to allocate 500MB of memory just by mentioning same (even empty) table in it (e.g. SELECT * FROM table UNION SELECT * FROM table ...).

Bug #44626: Tables are reopened multiple times if used in same query, buffers allocated too

So in your case it creates 38 temporary tables consuming lots of resources.

One possible solution

is to use dynamic SQL and CURSOR to go through every of 38 columns. This way you don't need a view with 38 UNION ALL statements.

answered Dec 8, 2015 at 8:55
7
  • do you have any resources you could point me to that would get me started on how to use dynamic SQL as you mentioned in your possible solution? Commented Dec 10, 2015 at 19:20
  • First of all make a vary simple query working. If a query looking at a single sd does not run there is not going going further. Would you consider loading your data to another hosting like Azure or Amazon? Commented Dec 11, 2015 at 10:08
  • the query runs on a small test data set. What advantages would hosting the data in a different place be? Commented Dec 11, 2015 at 15:56
  • Could hosting would give you much more processing power than you have. I'm used to prod machines with at least 16 cores, 64GB RAM, 1TB space. Cloud gives you more processing power even for cheapest/free plans. Commented Dec 11, 2015 at 16:42
  • Do you have an incrementing identity column in core table? You can do a join on i=i+1 to align sequential time periods and see when status changes from 0 to 1 and back. It does not require ordering so my fit your resource limitations. Commented Dec 11, 2015 at 16:44

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.