I have a query that's behaving a bit oddly. In my database I have a table called "records". It tells me a bunch of information about what applications a user ran on my company's machines. I'm trying to aggregate some statistics, but am having some odd issues with a query.

This query runs in about 6.5 minutes (~30 million entries in "records"). I would expect it to take longer when divisionName isn't specified, but it seems to be taking an unreasonable amount of time to finish (overnight and still chugging).

select divisionName, programName, count(usageID) 
 from records R 
 right join Programs P 
 on P.programID=R.usageProgramID 
 right join locate L 
 on L.computerID=R.usageComputerID 
 where divisionName="umbrella"
 group by programName
 order by programName asc
 INTO OUTFILE '/tmp/lab_prog_umbrella.csv'
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n';

Is there an alternate structure to speed up the query? I have an index on (computerID,divisionName) in locate and (programID,programName) in Programs as well as a multitude of indexes in records.

Note: Programs contains 4 fields and locate contains 2. I don't think the joins are exceptionally large.

Edit:

Explain:

+----+-------------+-------+------+-----------------+-----------+---------+----------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------+-----------+---------+----------------------+------+----------------------------------------------+
| 1 | SIMPLE | L | ref | loc | loc | 27 | const | 1195 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | R | ref | uprog,computers | computers | 34 | scf.L.computerID | 1627 | |
| 1 | SIMPLE | P | ref | pid_name | pid_name | 43 | scf.R.usageProgramID | 1 | Using index |
+----+-------------+-------+------+-----------------+-----------+---------+----------------------+------+----------------------------------------------+

Records Description:

+-----------------+-------------+------+-----+---------------------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------------------+-------+
| usageID | varchar(24) | NO | PRI | NULL | |
| usageWhen | datetime | NO | PRI | 0000-00-00 00:00:00 | |
| usageEnum | int(11) | YES | | NULL | |
| usageServerID | int(11) | YES | | NULL | |
| usageServerType | int(11) | YES | | NULL | |
| usageProgramID | varchar(40) | NO | PRI | | |
| usageLicenseID | varchar(18) | YES | | NULL | |
| usageComputerID | varchar(31) | YES | MUL | NULL | |
| usageExpansion | varchar(0) | YES | | NULL | |
| usageUser | varchar(31) | YES | MUL | NULL | |
| usageAddress | varchar(28) | YES | | NULL | |
| usageGroup | varchar(16) | YES | | NULL | |
| usageEvent | int(11) | YES | | NULL | |
| usageReason | int(11) | YES | | NULL | |
| usageTime | int(11) | YES | | NULL | |
| usageOtherTime | varchar(25) | YES | | NULL | |
| usageGMTOffset | int(11) | YES | | NULL | |
| usageCount | int(11) | YES | | NULL | |
+-----------------+-------------+------+-----+---------------------+-------+

Locate Description:

+--------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-------------+------+-----+---------+-------+
| computerID | varchar(31) | YES | MUL | NULL | |
| divisionName | varchar(24) | YES | MUL | NULL | |
+--------------+-------------+------+-----+---------+-------+

Programs Description:

+----------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| programID | varchar(40) | YES | MUL | NULL | |
| programName | varchar(63) | YES | MUL | NULL | |
| programVersion | varchar(31) | YES | | NULL | |
| category | varchar(30) | YES | | NULL | |
+----------------+-------------+------+-----+---------+-------+
asked Aug 13, 2013 at 17:15
  • Writing to disc is much slower than reading from memory, that could be the bottleneck. How long does the query take if you output to console instead of a CSV? Commented Aug 13, 2013 at 22:03
  • Your index probably won't be used since you're not filtering on computerID. This is discussed on SO: stackoverflow.com/questions/179085/…. What does EXPLAIN PLAN show you about this query? Commented Aug 13, 2013 at 22:04
  • One last note: your query would be easier to read, and harder to accidentally break, with consistent aliases. Having to look between the query and your text to discover that divisionName is part of locate and programName is a field of Programs is needless friction. Commented Aug 13, 2013 at 22:09
  • The bottleneck isn't the file I/O. After hours of running, nothing has actually been written to the /tmp file. About the database: trust me, this thing makes my eyes bleed. It's an inherited database and has worse faults than you see... In the schema every field starts with the table name. It's called divisionName because I created a temporary table so I didn't have to join Divisions->Computer->records just to get the login location. Commented Aug 13, 2013 at 22:13
  • Legacy systems can be very painful. So locate is a temp table? Is populating it part of the long run time, or just the query you're showing? Diagnosing a problem script is much like any other technical problem: strip out elements until you've isolated the specific problem. With that in mind, what does EXPLAIN PLAN say? Does it run OK if you just include Programs and locate? If so, you might want to aggregate records and then join to the remaining two tables. Commented Aug 13, 2013 at 22:44
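
The last suggestion above (aggregate records first, then join to the other two tables) might be sketched roughly as follows. This is an untested sketch that assumes MySQL and the column names shown in the question; the alias A and the column name uses are made up for illustration. It also assumes computerID is unique in locate and programID is unique in Programs, otherwise the joins multiply the counts. It uses inner joins, so computers or programs with no usage rows drop out of the result instead of appearing with a count of 0.

```sql
-- Sketch only: collapse records to one row per (computer, program) pair,
-- then join that much smaller result to locate and Programs.
SELECT L.divisionName, P.programName, SUM(A.uses) AS total
FROM (
    SELECT usageComputerID, usageProgramID, COUNT(*) AS uses
    FROM records
    GROUP BY usageComputerID, usageProgramID
) AS A
JOIN locate L ON L.computerID = A.usageComputerID
JOIN Programs P ON P.programID = A.usageProgramID
WHERE L.divisionName = 'umbrella'
GROUP BY L.divisionName, P.programName
ORDER BY P.programName ASC;
```

The inner query scans all of records once, so for a single division it may well be slower than the original query. It is more likely to pay off in the case without the divisionName filter, where every row of records has to be read anyway.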

1 Answer

  • Create foreign keys from RECORDS to PROGRAMS and LOCATE (you don't mention whether they exist).
  • Use LEFT JOIN instead of RIGHT JOIN. After all, RECORDS is the "strong" table in this query.
  • Group by R.usageProgramID instead of by programName.

select divisionName, programName, count(usageID) 
 from records R 
 left join Programs P 
 on P.programID=R.usageProgramID 
 left join locate L 
 on L.computerID=R.usageComputerID 
 where divisionName="umbrella"
 group by R.usageProgramID 
 order by programName asc

Another alternative is to try this:

select
 t.divisionName, P.programName, count(*) as total
from (
 -- usageProgramID has to be selected here so the outer join can use it
 select L.divisionName, R.usageComputerID, R.usageProgramID
 from records R 
 left join locate L 
 on L.computerID=R.usageComputerID 
 where L.divisionName="umbrella"
 ) t 
 left join Programs P 
 on P.programID=t.usageProgramID 
 group by P.programName
 order by P.programName asc

The absence of foreign keys may not be helping.
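
In both queries above the WHERE clause filters on L.divisionName, so rows from records that have no match in locate are discarded anyway, and the outer join to locate behaves like an inner join. A variant with explicit inner joins and fully qualified column names could look like the sketch below. It assumes MySQL and the schema from the question, and it is not something the answer itself proposes. Unlike the LEFT JOIN version, it also drops usage rows whose program is missing from Programs. The leading EXPLAIN only prints the plan; remove it to run the query.

```sql
-- Sketch only: same aggregation, written with inner joins so the
-- optimizer is free to start from the small locate table.
EXPLAIN
SELECT L.divisionName, P.programName, COUNT(*) AS total
FROM locate L
JOIN records R ON R.usageComputerID = L.computerID
JOIN Programs P ON P.programID = R.usageProgramID
WHERE L.divisionName = 'umbrella'
GROUP BY L.divisionName, P.programName
ORDER BY P.programName ASC;
```

Grouping by programName rather than usageProgramID keeps one output row per program name, which matters here because several program IDs share the same name.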

answered Aug 14, 2013 at 16:18
  • Not possible. The tables aren't set up with support for foreign keys, so that's out. And each program has dozens of IDs. For some reason it was set up in a way that gives each version of an application a unique ID. Chrome, for example, has a few hundred... Commented Aug 14, 2013 at 16:26
  • @Jacobm001 Won't different versions of an application have unique program names anyway? Also, are R.usageProgramID and R.usageComputerID indexed? Commented Aug 14, 2013 at 16:30
  • No, the programNames are condensed and are not unique. There is programVersion in the Program table that correlates what version it's at. Yes the two fields are indexed. Commented Aug 14, 2013 at 16:32
  • @Jacobm001 Try the left join. The right join is doing things the other way around. Commented Aug 14, 2013 at 16:35
  • Will do. I'll let you know how it goes. Commented Aug 14, 2013 at 16:36
