Query Optimization for added value

Question 1

I have a query that's behaving a bit oddly. In my database I have a table called "records". It tells me a bunch of information about what applications a user ran on my company's machines. I'm trying to aggregate some statistics, but am having some odd issues with a query.

This query runs in about 6.5 minutes (~30 million entries in "records"). I would expect it to take longer when divisionName isn't specified, but it seems to be taking an unreasonable amount of time to finish (overnight and still chugging).

select divisionName, programName, count(usageID) 
 from records R 
 right join Programs P 
 on P.programID=R.usageProgramID 
 right join locate L 
 on L.computerID=R.usageComputerID 
 where divisionName="umbrella"
 group by programName
 order by programName asc
 INTO OUTFILE '/tmp/lab_prog_umbrella.csv'
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n';

Is there an alternate structure to speed up the query? I have an index on (computerID,divisionName) in locate and (programID,programName) in Programs as well as a multitude of indexes in records.

Note: Programs contains 4 fields and locate contains 2. I don't think the joins are exceptionally large.

Edit:

Explain:

+----+-------------+-------+------+-----------------+-----------+---------+----------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------+-----------+---------+----------------------+------+----------------------------------------------+
| 1 | SIMPLE | L | ref | loc | loc | 27 | const | 1195 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | R | ref | uprog,computers | computers | 34 | scf.L.computerID | 1627 | |
| 1 | SIMPLE | P | ref | pid_name | pid_name | 43 | scf.R.usageProgramID | 1 | Using index |
+----+-------------+-------+------+-----------------+-----------+---------+----------------------+------+----------------------------------------------+

Records Description:

+-----------------+-------------+------+-----+---------------------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------------------+-------+
| usageID | varchar(24) | NO | PRI | NULL | |
| usageWhen | datetime | NO | PRI | 0000-00-00 00:00:00 | |
| usageEnum | int(11) | YES | | NULL | |
| usageServerID | int(11) | YES | | NULL | |
| usageServerType | int(11) | YES | | NULL | |
| usageProgramID | varchar(40) | NO | PRI | | |
| usageLicenseID | varchar(18) | YES | | NULL | |
| usageComputerID | varchar(31) | YES | MUL | NULL | |
| usageExpansion | varchar(0) | YES | | NULL | |
| usageUser | varchar(31) | YES | MUL | NULL | |
| usageAddress | varchar(28) | YES | | NULL | |
| usageGroup | varchar(16) | YES | | NULL | |
| usageEvent | int(11) | YES | | NULL | |
| usageReason | int(11) | YES | | NULL | |
| usageTime | int(11) | YES | | NULL | |
| usageOtherTime | varchar(25) | YES | | NULL | |
| usageGMTOffset | int(11) | YES | | NULL | |
| usageCount | int(11) | YES | | NULL | |
+-----------------+-------------+------+-----+---------------------+-------+

Locate Description:

+--------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-------------+------+-----+---------+-------+
| computerID | varchar(31) | YES | MUL | NULL | |
| divisionName | varchar(24) | YES | MUL | NULL | |
+--------------+-------------+------+-----+---------+-------+

Programs Description:

+----------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+-------------+------+-----+---------+-------+
| programID | varchar(40) | YES | MUL | NULL | |
| programName | varchar(63) | YES | MUL | NULL | |
| programVersion | varchar(31) | YES | | NULL | |
| category | varchar(30) | YES | | NULL | |
+----------------+-------------+------+-----+---------+-------+

Question 2

Writing to disc is much slower than reading from memory, that could be the bottleneck. How long does the query take if you output to console instead of a CSV?

Question 3

Your index probably won't be used since you're not filtering on computerID. This is discussed on SO: stackoverflow.com/questions/179085/…. What does EXPLAIN PLAN show you about this query?

Question 4

One last note: your query would be easier to read, and harder to accidentally break, with consistent aliases. Having to look between the query and your text to discover that divisionName is part of locate and programName is a field of Programs is needless friction.

Question 5

The bottle next isn't due to the file IO. After hours of running it doesn't actually have anything in the /tmp file. About database: Trust me, this thing makes my eyes bleed. It's an inherited database and has worse faults than you see... In the schema every field starts with the table name. It's called divisionName because I created a temporary table so I didn't have to join Divisions->Computer->records just to get the login location.

Question 6

Legacy systems can be very painful. So locate is a temp table? Is populating it part of the long run time, or just the query you're showing? Diagnosing a problem script is much like any other technical problem: strip out elements until you've isolated the specific problem. With that in mind, what does EXPLAIN PLAN say? Does it run OK if you just include Programs and locate? If so, you might want to aggregate records and then join to the remaining two tables.

Question 7

Create foreign keys from RECORDS to PROGRAMS ans LOCATE ( you don't mention if they exist ).
Use LEFT JOIN instead of RIGHT JOIN. After all RECORDS is the "strong" table in this query.
Group by R.usageProgramID instead of by ProgramName.

select divisionName, programName, count(usageID) 
 from records R 
 left join Programs P 
 on P.programID=R.usageProgramID 
 left join locate L 
 on L.computerID=R.usageComputerID 
 where divisionName="umbrella"
 group by R.usageProgramID 
 order by programName asc

Another alternative is to try this:

select
 t.divisionName, P.programName, count(*) as total
from (
 select L.divisionName, R.usageComputerID
 from records R 
 left join locate L 
 on L.computerID=R.usageComputerID 
 where L.divisionName="umbrella"
 ) t 
 left join Programs P 
 on P.programID=t.usageProgramID 
group by
 group by P.programName
 order by P.programName asc

Since the absence of FK maybe not helping.

Question 8

Not possible. The tables aren't setup with support for foreign keys so that's out. And each program has dozens of id's. For some reason it was setup in a way that gives each version of an application a unique ID. Chrome for example has a few hundred...

Question 9

@Jacobm001 Will not different versions of an application have unique program names anyway ? Also, are R.usageProgramID and R.usageComputerID indexed ?

Question 10

No, the programNames are condensed and are not unique. There is programVersion in the Program table that correlates what version it's at. Yes the two fields are indexed.

Question 11

@Jacobm001 Try the left join. The right join is doing things the other way around.

Question 12

Will do. I'll let you know how it goes.

score 1 · Accepted Answer · 2013-08-14 16:18:30Z

Create foreign keys from RECORDS to PROGRAMS ans LOCATE ( you don't mention if they exist ).
Use LEFT JOIN instead of RIGHT JOIN. After all RECORDS is the "strong" table in this query.
Group by R.usageProgramID instead of by ProgramName.

select divisionName, programName, count(usageID) 
 from records R 
 left join Programs P 
 on P.programID=R.usageProgramID 
 left join locate L 
 on L.computerID=R.usageComputerID 
 where divisionName="umbrella"
 group by R.usageProgramID 
 order by programName asc

Another alternative is to try this:

select
 t.divisionName, P.programName, count(*) as total
from (
 select L.divisionName, R.usageComputerID
 from records R 
 left join locate L 
 on L.computerID=R.usageComputerID 
 where L.divisionName="umbrella"
 ) t 
 left join Programs P 
 on P.programID=t.usageProgramID 
group by
 group by P.programName
 order by P.programName asc

Since the absence of FK maybe not helping.

Not possible. The tables aren't setup with support for foreign keys so that's out. And each program has dozens of id's. For some reason it was setup in a way that gives each version of an application a unique ID. Chrome for example has a few hundred...
@Jacobm001 Will not different versions of an application have unique program names anyway ? Also, are R.usageProgramID and R.usageComputerID indexed ?
No, the programNames are condensed and are not unique. There is programVersion in the Program table that correlates what version it's at. Yes the two fields are indexed.
@Jacobm001 Try the left join. The right join is doing things the other way around.

Stack Exchange Network

Query Optimization for added value

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Query Optimization for added value

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions