Benchmarking of SQL data crunching using a common datum

Question 1

@nhgrif and I were curious about a couple of different ways to aggregate records that share a common datum. Two approaches are shown below. Any advice on the code is appreciated, especially on making it perform better.

Note: @nhgrif did not write most of this, so if it's wrong it's my fault...

Make a bunch of data

Code:

set search_path = test;
-- create tables to seed a large number of records
drop table if exists contract_number_seed;
create table contract_number_seed(
 contract_number int
 );
insert into contract_number_seed 
 (contract_number)
 select * from generate_series(1, 100000)
 ;
drop table if exists serial_number_seed;
create table serial_number_seed(
 serial_number int
 );
insert into serial_number_seed
 (serial_number)
 select * from generate_series(1, 10)
 ;
drop table if exists status_seed;
create table status_seed(
 status text
 );
insert into status_seed (status)
values (NULL), ('one'), ('two'), ('three');
-- generate a whole bunch of records from seed tables
drop table if exists seeded;
select 
 contract_number_seed.contract_number,
 serial_number_seed.serial_number,
 status_seed.status
into seeded
from contract_number_seed
cross join serial_number_seed
cross join status_seed;
-- Query returned successfully: 4000000 rows affected, 11368 ms execution time.
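
As a possible shortcut, the three seed tables can be skipped by cross joining the generators directly. This is only a sketch, assuming PostgreSQL and the same test schema; the aliases c, s, and st are made up for illustration, and the result should be the same 4,000,000 rows.

```sql
-- Sketch: build seeded in one statement, without intermediate seed tables.
-- Assumes PostgreSQL; aliases c, s, st are illustrative only.
set search_path = test;
drop table if exists seeded;
create table seeded as
select
 c.contract_number,
 s.serial_number,
 st.status
from generate_series(1, 100000) as c(contract_number)
cross join generate_series(1, 10) as s(serial_number)
-- the explicit cast keeps the status column typed as text
cross join (values (null::text), ('one'), ('two'), ('three')) as st(status);
```

create table ... as select is the standard-SQL spelling of the select ... into form used above.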

Just for the record:

set search_path = test;
select * from seeded;
-- Total query runtime: 8056 ms.
-- 4000000 rows retrieved.

Using LEFT OUTER JOIN

PS: Capitalization is different in this; @nhgrif wrote it and I was lazy.
PPS: Table aliases are not that great but I'm just benchmarking, please forgive...

SET SEARCH_PATH = test;
SELECT 
 seeded.Serial_Number, 
 COUNT(one.Contract_Number) as One, 
 COUNT(two.Contract_Number) as Two, 
 COUNT(three.Contract_Number) as Three
FROM seeded
 LEFT JOIN seeded as one 
 ON one.Contract_Number = seeded.Contract_Number
 AND seeded.Status = 'one'
 LEFT JOIN seeded as two 
 ON two.Contract_Number = seeded.Contract_Number
 AND seeded.Status = 'two'
 LEFT JOIN seeded as three 
 ON three.Contract_Number = seeded.Contract_Number
 AND seeded.Status = 'three'
GROUP BY seeded.Serial_Number, seeded.Status;

I found the performance of this to be abysmal:

Total query runtime: 3096218 ms. -- ~52 minutes
10 rows retrieved.

A faster way... Using case when ... then

set search_path = test;
select 
 seeded.Serial_Number, 
 case
 when seeded.Status = 'one' then 1
 else 0
 end,
 case
 when seeded.Status = 'two' then 2
 else 0
 end,
 case
 when seeded.Status = 'three' then 3
 else 0
 end
from seeded
group by seeded.Serial_number, seeded.Status;

Performance:

Total query runtime: 5609 ms.
40 rows retrieved.

Are there better ways to do this? The first approach works OK on small tables but really lags on big ones. The second one is fast; however, every new status value needs its own column, so the query has to be refactored...

Question 2

There is a bug in your second version. Here is one row from the result:

1;0;2;0

That is, Serial_Number 1 has zero ones, two twos, and zero threes.

But let's do a quick sanity check:

select count(*) from seeded
where serial_number = 1 and status = 'one';

This query returns 100000.

There are also multiple rows with the same Serial_Number, which I think is not what you want.

I think you want a query like this:

select 
 seeded.Serial_Number, 
 sum(case when seeded.Status = 'one' then 1 else 0 end) as ones,
 sum(case when seeded.Status = 'two' then 1 else 0 end) as twos,
 sum(case when seeded.Status = 'three' then 1 else 0 end) as threes
from seeded
group by seeded.Serial_Number;

This takes ~2.1s on my machine.
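
One way to cross-check the pivoted counts, and to avoid touching the query when a new status value appears, is a plain long-format count. This is a sketch rather than part of the original answer; the column alias n is made up. With the seed data above it should return 40 rows (10 serial numbers times 4 status values, NULL included), each with a count of 100000.

```sql
-- Sketch: one row per (serial_number, status) pair instead of one column per status.
set search_path = test;
select
 serial_number,
 status,
 count(*) as n
from seeded
group by serial_number, status
order by serial_number, status;
```

The trade-off is that the statuses come back as rows, so any pivoting into columns has to happen in the client.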

Just for fun, here is pgAdmin's EXPLAIN visualisation for this query:

[image: pgAdmin EXPLAIN plan for the sum(case ...) query]

versus the slow LEFT JOIN query:

[image: pgAdmin EXPLAIN plan for the LEFT JOIN query]
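
The same plan information can be captured as text, which is easier to paste into a post than a screenshot. A minimal sketch, assuming PostgreSQL 9.0 or later for the parenthesised option syntax:

```sql
-- Sketch: textual plan for the sum(case ...) query.
-- Note: analyze actually executes the statement, so running it on the
-- LEFT JOIN version will take as long as that query itself.
set search_path = test;
explain (analyze, buffers)
select
 seeded.Serial_Number,
 sum(case when seeded.Status = 'one' then 1 else 0 end) as ones,
 sum(case when seeded.Status = 'two' then 1 else 0 end) as twos,
 sum(case when seeded.Status = 'three' then 1 else 0 end) as threes
from seeded
group by seeded.Serial_Number;
```

Dropping analyze from the option list gives the estimated plan without running the query.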

Question 3

Wouldn't SUM( Status = 'one' ) etc. work?

Question 4

I needed your sanity check. Did not realize the code was broken/not yielding expected results, to begin with. Thank you & accepted.

Question 5

@hjpotter92 no, but sum(cast(seeded.Status = 'one' as integer)) would (and appears to be slightly faster!).
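
Spelled out, the cast variant looks like the first query below. PostgreSQL has no sum over boolean, which is why the cast is needed. The second query is an alternative not mentioned in the thread: the aggregate filter clause, which assumes PostgreSQL 9.4 or later. Both are sketches; neither has been timed here.

```sql
set search_path = test;

-- Variant 1: cast the boolean comparison to integer and sum it.
-- A NULL status yields a NULL comparison, which sum() ignores.
select
 seeded.Serial_Number,
 sum(cast(seeded.Status = 'one' as integer)) as ones,
 sum(cast(seeded.Status = 'two' as integer)) as twos,
 sum(cast(seeded.Status = 'three' as integer)) as threes
from seeded
group by seeded.Serial_Number;

-- Variant 2 (assumes PostgreSQL 9.4+): count only the rows matching each filter.
select
 seeded.Serial_Number,
 count(*) filter (where seeded.Status = 'one') as ones,
 count(*) filter (where seeded.Status = 'two') as twos,
 count(*) filter (where seeded.Status = 'three') as threes
from seeded
group by seeded.Serial_Number;
```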

Question 6

One thing that's absolutely certain to speed up both queries is for the Status field to be an integer (or perhaps even a smallint or tinyint depending on what type of DB you're using and what's available).

If you want a plain-English title for each status, add a StatusCodes table with two columns, StatusID and StatusCode, and have the table queried in the question store the StatusID in its Status column.
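
A minimal sketch of that layout, assuming PostgreSQL (which offers smallint but no tinyint). The names status_codes and seeded_int and the id values 1 to 3 are hypothetical, chosen only for illustration.

```sql
set search_path = test;

-- Hypothetical lookup table: one row per status.
drop table if exists status_codes;
create table status_codes (
 status_id smallint primary key,
 status_code text not null unique
);
insert into status_codes (status_id, status_code)
values (1, 'one'), (2, 'two'), (3, 'three');

-- Hypothetical copy of seeded with an integer status.
-- The left join keeps rows whose status is NULL.
drop table if exists seeded_int;
create table seeded_int as
select
 s.contract_number,
 s.serial_number,
 sc.status_id as status
from seeded as s
left join status_codes as sc on sc.status_code = s.status;

-- The aggregate then compares integers instead of text.
select
 serial_number,
 sum(case when status = 1 then 1 else 0 end) as ones,
 sum(case when status = 2 then 1 else 0 end) as twos,
 sum(case when status = 3 then 1 else 0 end) as threes
from seeded_int
group by serial_number;
```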

Depending on your database settings, text will most likely be UTF-8, UTF-16, or UTF-32. The number in each name refers to the minimum bits per character. That means that, in the best case, your status of 'one' takes 3 bytes.

Meanwhile, a smallint takes just 2 bytes.

At this size, memory probably isn't much of an issue. Even with a million rows, you save only about a million bytes by storing smallints instead of 'one' or 'two' in the column.
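
Rather than estimating, the per-value and per-table sizes can be measured. This is a sketch assuming PostgreSQL; note that PostgreSQL stores a length header with every text value, so the measured size of 'one' will be larger than its 3 character bytes.

```sql
set search_path = test;

-- Bytes used by one stored text status versus a smallint value.
select
 pg_column_size(status) as status_bytes,
 pg_column_size(1::smallint) as smallint_bytes
from seeded
where status = 'one'
limit 1;

-- Total on-disk size of the table, including any indexes and TOAST data.
select pg_size_pretty(pg_total_relation_size('seeded')) as seeded_size;
```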

But there are only so many statuses you can represent with just those 3 bytes, and we're using plain-English statuses so they're more readable.

The bigger concern though is comparison time. It will generally take the query longer to compare varchar fields than integer fields because the varchar fields will generally have more bytes on average per row to compare.

Meanwhile, with a 1-byte tinyint, you can describe 256 statuses. If that's somehow not enough, a 2-byte smallint will describe 32,768 statuses if you only want to use the non-negatives (because negatives seem slightly weird in this case), but if you truly needed it, you could describe 65,536 statuses with the 2-byte smallint.

Technically, you could make exactly the same number of unique varchars that are just 2 bytes, but many of the characters are whitespace characters, and several others would just be funny-looking characters that wouldn't be very distinguishable to most of us.

Basically, there's not a particularly good reason for the Status field to be anything but an integer type.

Accepted answer (the one that begins "There is a bug in your second version"): mjolka, 2014-08-19 04:13:51Z
