@nhgrif and I were curious about a couple different ways to aggregate a number of records based on a duplicate/common datum. Two different ideas will be shown. Any advice on all code but especially making it perform better is appreciated.
Note: @nhgrif did not write most of this, so if it's wrong it's my fault...
Make a bunch of data
Code:
set search_path = test;
-- create tables to seed a large number of records
drop table if exists contract_number_seed;
create table contract_number_seed(
contract_number int
);
insert into contract_number_seed
(contract_number)
select * from generate_series(1, 100000)
;
drop table if exists serial_number_seed;
create table serial_number_seed(
serial_number int
);
insert into serial_number_seed
(serial_number)
select * from generate_series(1, 10)
;
drop table if exists status_seed;
create table status_seed(
status text
);
insert into status_seed (status)
values (NULL), ('one'), ('two'), ('three');
-- generate a whole bunch of records from seed tables
drop table if exists seeded;
select
contract_number_seed.contract_number,
serial_number_seed.serial_number,
status_seed.status
into seeded
from contract_number_seed
cross join serial_number_seed
cross join status_seed;
-- Query returned successfully: 4000000 rows affected, 11368 ms execution time.
Just for the record:
set search_path = test;
select * from seeded;
-- Total query runtime: 8056 ms.
-- 4000000 rows retrieved.
Using LEFT OUTER JOIN
PS: Capitalization is different in this; @nhgrif wrote it and I was lazy.
PPS: Table aliases are not that great but I'm just benchmarking, please forgive...
SET SEARCH_PATH = test;
SELECT
seeded.Serial_Number,
COUNT(one.Contract_Number) as One,
COUNT(two.Contract_Number) as Two,
COUNT(three.Contract_Number) as Three
FROM seeded
LEFT JOIN seeded as one
ON one.Contract_Number = seeded.Contract_Number
AND seeded.Status = 'one'
LEFT JOIN seeded as two
ON two.Contract_Number = seeded.Contract_Number
AND seeded.Status = 'two'
LEFT JOIN seeded as three
ON three.Contract_Number = seeded.Contract_Number
AND seeded.Status = 'three'
GROUP BY seeded.Serial_Number, seeded.Status;
I found the performance of this to be abysmal:
Total query runtime: 3096218 ms. -- ~52 minutes
10 rows retrieved.
A faster way... Using case when ... then
set search_path = test;
select
seeded.Serial_Number,
case
when seeded.Status = 'one' then 1
else 0
end,
case
when seeded.Status = 'two' then 2
else 0
end,
case
when seeded.Status = 'three' then 3
else 0
end
from seeded
group by seeded.Serial_number, seeded.Status;
Performance:
Total query runtime: 5609 ms.
40 rows retrieved.
Are there better ways to do this? I found the first way to work OK on small queries but to fall far behind on big ones. I found the 2nd one to be fast; however, any new column in seeded.Serial_Number requires a refactoring of the queries...
2 Answers
There is a bug in your second version. Here is one row from the result:
1;0;2;0
That is, Serial_Number 1 has zero ones, two twos, and zero threes.
But let's do a quick sanity check:
select count(*) from seeded
where serial_number = 1 and status = 'one';
This query returns 100000.
There are also multiple rows with the same Serial_Number, which I think is not what you want.
I think you want a query like this:
select
seeded.Serial_Number,
sum(case when seeded.Status = 'one' then 1 else 0 end) as ones,
sum(case when seeded.Status = 'two' then 1 else 0 end) as twos,
sum(case when seeded.Status = 'three' then 1 else 0 end) as threes
from seeded
group by seeded.Serial_Number;
This takes ~2.1s on my machine.
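As an alternative spelling, on PostgreSQL 9.4 or later the same counts can be written with the aggregate FILTER clause, which some find easier to read than sum(case ...). This is only a sketch against the seeded table from the question; it has not been timed here, so treat any performance difference as unknown.

```sql
-- Assumes PostgreSQL 9.4+ (aggregate FILTER clause)
-- and the seeded table built in the question.
select
    seeded.serial_number,
    count(*) filter (where seeded.status = 'one')   as ones,
    count(*) filter (where seeded.status = 'two')   as twos,
    count(*) filter (where seeded.status = 'three') as threes
from seeded
group by seeded.serial_number
order by seeded.serial_number;
```

With the seed data from the question, every one of the ten rows should show 100000 in each column, which agrees with the sanity check above.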
Just for fun, let's look at what pgAdmin's EXPLAIN visualisation looks like for this query
[image: pgAdmin EXPLAIN plan for the query above]
versus the LEFT JOIN query
[image: pgAdmin EXPLAIN plan for the LEFT JOIN query]
Wouldn't SUM( Status = 'one' ) etc. work? – hjpotter92, Aug 19, 2014 at 4:23
I needed your sanity check. Did not realize the code was broken/not yielding expected results, to begin with. Thank you & accepted. – Phrancis, Aug 19, 2014 at 4:26
@hjpotter92 no, but sum(cast(seeded.Status = 'one' as integer)) would (and appears to be slightly faster!). – mjolka, Aug 19, 2014 at 4:27
One thing that's absolutely certain to speed up both queries is for the Status field to be an integer (or perhaps even a smallint or tinyint depending on what type of DB you're using and what's available).
If you feel it necessary to include a plain-English title for the status itself, add a StatusCodes table with two columns, StatusID and StatusCode, and have the table referenced in the question's queries store the StatusID in its Status column.
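A minimal sketch of that layout in PostgreSQL might look like the following. The names status_codes and seeded_int and the particular id values are made up for illustration, and since PostgreSQL has no tinyint, smallint is used. Nothing here has been benchmarked.

```sql
-- Hypothetical lookup table: one row per status, keyed by a small integer.
create table status_codes (
    status_id   smallint primary key,
    status_code text not null unique
);

insert into status_codes (status_id, status_code)
values (1, 'one'), (2, 'two'), (3, 'three');

-- Hypothetical integer-status copy of the question's seeded table.
-- Rows whose text status is NULL find no match in the left join,
-- so their integer status stays NULL.
create table seeded_int as
select
    s.contract_number,
    s.serial_number,
    sc.status_id as status
from seeded as s
left join status_codes as sc
    on sc.status_code = s.status;

-- Per-serial counts, now comparing integers instead of text.
select
    serial_number,
    sum(case when status = 1 then 1 else 0 end) as ones,
    sum(case when status = 2 then 1 else 0 end) as twos,
    sum(case when status = 3 then 1 else 0 end) as threes
from seeded_int
group by serial_number;
```

Joining back to status_codes is only needed when the plain-English name has to be displayed; the aggregation itself never touches the text.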
Depending on your database settings, text will most likely be UTF-8, UTF-16, or UTF-32. The number part of these refers to the minimum bits per character. That means, best case scenario, your status of 'one' takes 3 bytes.
Meanwhile, a smallint takes just 2 bytes.
At this scale, memory probably isn't much of an issue. Even with a million rows, you only save a million bytes by having smallints instead of 'one' or 'two' in the column.
But there are only so many statuses you can represent with just those 3 bytes, and we're using plain-English statuses so they're more readable.
The bigger concern though is comparison time. It will generally take the query longer to compare varchar fields than integer fields because the varchar fields will generally have more bytes on average per row to compare.
Meanwhile, with a 1 byte tinyint, you can describe 256 statuses. If that's somehow not enough, a 2 byte smallint will describe 32768 statuses if you only want to use the non-negatives (because negatives seem slightly weird in this case), but if you truly needed it, you could describe 65,536 statuses with the 2 byte small int.
Technically, you could make exactly the same number of unique varchars that are just 2 bytes, but many of the characters are whitespace characters, and several others would just be funny-looking characters that wouldn't be very distinguishable to most of us.
Basically, there's not a particularly good reason for the Status field to be anything but an integer type.