I'm using PostgreSQL 9.5 on High Sierra.
I have two tables:

request_logs  - ~26K rows
response_logs - ~9K rows

I have the following query (with a JOIN):
SELECT req.uuid, res.status, req.method, req.requesturi, req.accessed,
       req.payload reqpayload, res.payload respayload,
       COUNT(*) OVER() AS total_rows
FROM request_logs req
LEFT OUTER JOIN response_logs res ON req.uuid = res.uuid
WHERE req.accountid = 2
  AND req.requesturi NOT ILIKE '/v1/sessions%'
  AND req.accessed BETWEEN '2018-01-01 15:04:05 +0000' AND '2019-01-02 15:04:05 +0000'
  AND res.status IS NOT NULL
  AND req.applicationid = 1
ORDER BY accessed DESC
LIMIT 1000;
While trying to optimise the query, I've experimented with different indexes. Here's what I tried:
Configuration 1:
1. request_logs.uuid (pkey, unique)
2. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 260 ms

Configuration 2:
1. request_logs.uuid (pkey, unique)
2. request_logs.applicationid
3. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 230 ms

Configuration 3:
1. request_logs.uuid (pkey, unique)
2. request_logs.applicationid
3. request_logs.accessed (timestamptz)
4. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 230 ms

Configuration 4:
1. request_logs.uuid (pkey, unique)
2. request_logs.applicationid
3. request_logs.accessed (timestamptz)
4. request_logs.accountid
5. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 230 ms

Configuration 5:
1. request_logs.uuid (pkey, unique)
2. request_logs.applicationid, request_logs.accessed (combined)
3. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 240 ms
As visible from the results, indexing by applicationid (an int8) did help a little, while indexing by the timestamptz column accessed didn't help at all.
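For reference, configuration 5's combined index might have been created roughly like this (a sketch; the index name and exact column order are my assumptions based on the query above, not something stated in the post):

-- Assumed DDL for the combined index of configuration 5
CREATE INDEX request_logs_applicationid_accessed_idx
    ON request_logs (applicationid, accessed);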
Maybe the bad performance is due to the JOIN?
Altogether, it seems quite slow, and I try not to think about what will happen when these tables contain millions of records (10M+).
What would be a better way to index these tables to make this query run faster?
EDIT:

Here is EXPLAIN ANALYZE for the last configuration:
Limit  (cost=3489.80..3490.69 rows=356 width=823) (actual time=241.152..241.345 rows=1000 loops=1)
  ->  Sort  (cost=3489.80..3490.69 rows=356 width=823) (actual time=241.150..241.288 rows=1000 loops=1)
        Sort Key: req.accessed DESC
        Sort Method: top-N heapsort  Memory: 2064kB
        ->  WindowAgg  (cost=1829.41..3474.71 rows=356 width=823) (actual time=230.040..237.993 rows=3951 loops=1)
              ->  Hash Join  (cost=1829.41..3470.26 rows=356 width=823) (actual time=8.622..17.974 rows=3951 loops=1)
                    Hash Cond: (res.uuid = req.uuid)
                    ->  Seq Scan on response_logs res  (cost=0.00..1604.21 rows=8821 width=758) (actual time=0.006..4.527 rows=9124 loops=1)
                          Filter: (status IS NOT NULL)
                    ->  Hash  (cost=1816.39..1816.39 rows=1042 width=102) (actual time=8.243..8.243 rows=4046 loops=1)
                          Buckets: 4096 (originally 2048)  Batches: 1 (originally 1)  Memory Usage: 1122kB
                          ->  Bitmap Heap Scan on request_logs req  (cost=105.85..1816.39 rows=1042 width=102) (actual time=0.581..6.449 rows=4046 loops=1)
Execution time: 242.154 ms
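For future measurements, adding the BUFFERS option also reports shared-buffer hits and reads per plan node, which makes I/O behaviour visible; a sketch against the same query (not part of the original post):

-- Same statement as above, run with buffer statistics
EXPLAIN (ANALYZE, BUFFERS)
SELECT req.uuid, res.status, req.method, req.requesturi, req.accessed,
       req.payload reqpayload, res.payload respayload,
       COUNT(*) OVER() AS total_rows
FROM request_logs req
LEFT OUTER JOIN response_logs res ON req.uuid = res.uuid
WHERE req.accountid = 2
  AND req.requesturi NOT ILIKE '/v1/sessions%'
  AND req.accessed BETWEEN '2018-01-01 15:04:05 +0000' AND '2019-01-02 15:04:05 +0000'
  AND res.status IS NOT NULL
  AND req.applicationid = 1
ORDER BY accessed DESC
LIMIT 1000;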
1 Answer
For starters, get your LEFT JOIN right and try to get rid of total_rows (as discussed in comments):
Query
Updated to your comment:

"I only want to return rows from request_logs that have a parallel entry in response_logs with a non-null status."
SELECT q.uuid
, s.status
, q.method
, q.requesturi
, q.accessed
, q.payload reqpayload
, s.payload respayload
-- , COUNT(*) OVER() AS total_rows -- see below
FROM request_logs q
JOIN response_logs s USING (uuid) -- not LEFT JOIN
WHERE q.accountid = 2
AND q.requesturi NOT ILIKE '/v1/sessions%'
AND q.accessed BETWEEN '2018-01-01 15:04:05 +0' AND '2019-01-02 15:04:05+0'
AND q.applicationid = 1
AND s.status IS NOT NULL -- see below
ORDER BY q.accessed DESC
LIMIT 1001; -- see below
Your LEFT [OUTER] JOIN burns down to a plain [INNER] JOIN: the status IS NOT NULL predicate in the WHERE clause discards exactly the NULL-extended rows a LEFT JOIN would otherwise preserve.
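If unmatched request_logs rows were actually wanted, the status filter would have to move from the WHERE clause into the join condition instead. A minimal sketch, not part of the original answer, with the column list shortened for brevity:

-- Hypothetical variant: keeps request_logs rows with no qualifying response,
-- because the status filter is part of the join condition, not the WHERE clause.
SELECT q.uuid, s.status, q.accessed
FROM   request_logs q
LEFT   JOIN response_logs s ON s.uuid = q.uuid
                           AND s.status IS NOT NULL
WHERE  q.accountid = 2
  AND  q.applicationid = 1;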
COUNT(*) OVER() AS total_rows is expensive for counts substantially bigger than LIMIT (and you are expecting "10M+ rows"). Maybe it's good enough to use LIMIT 1001, only use the first 1000 rows, check the row count, and if it's 1001 you know there are "more than 1000 matching rows".
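If an exact total is still needed occasionally, it can be fetched with a separate, minimal count query instead of the window aggregate. A sketch reusing the predicates from the rewritten query above (an illustration, not part of the original answer):

-- Companion count query: no payload columns, no window function
SELECT count(*) AS total_rows
FROM   request_logs q
JOIN   response_logs s USING (uuid)
WHERE  q.accountid = 2
  AND  q.requesturi NOT ILIKE '/v1/sessions%'
  AND  q.accessed BETWEEN '2018-01-01 15:04:05 +0' AND '2019-01-02 15:04:05+0'
  AND  q.applicationid = 1
  AND  s.status IS NOT NULL;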
Indexes
If (like you commented) the only 'moving parts' in this query are the dates (from and to) that are compared with accessed, and more than a few rows are excluded by the other predicates on applicationid, requesturi and accountid, then a partial index should help read performance (a lot):
CREATE INDEX ON request_logs (uuid, accessed DESC)
WHERE accountid = 2
AND requesturi NOT ILIKE '/v1/sessions%'
AND applicationid = 1;
And if a substantial percentage of rows is eliminated by status IS NOT NULL, also:
CREATE INDEX ON response_logs (uuid)
WHERE status IS NOT NULL;
It may pay to append the columns status and payload as index columns if you can get index-only scans out of it. Some preconditions apply.
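What that could look like on 9.5, where the extra columns have to be appended as ordinary key columns (the INCLUDE clause only arrived in PostgreSQL 11); a sketch, not taken from the original answer:

-- Covering variant: uuid stays the leading column for the join,
-- status and payload are appended only so index-only scans can serve the SELECT list.
CREATE INDEX ON response_logs (uuid, status, payload)
WHERE status IS NOT NULL;

Note that if payload values are large this index can grow considerably (and very long values won't fit in a btree entry at all), so it only pays off when the payloads are small.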
- Actually, I only want to return rows from request_logs that have a parallel entry in response_logs with a non-nil status. (Alechko, Jul 27, 2018 at 8:27)
- Regarding the suggestion to LIMIT 1001 - that's a good approach and I use it when the table is very big. (Alechko, Jul 27, 2018 at 8:29)
- @AlexGordon: I adapted to your updates. (Erwin Brandstetter, Jul 27, 2018 at 13:55)
- I did update the indexes as you suggested and it did improve the performance a little bit. However, the main improvement came from replacing COUNT(*) OVER() AS total_rows. Because the SELECT part specified a lot of items to be copied (mainly respayload and reqpayload), OVER() caused those columns to be copied for the entire table just to get the count, then discarded, and the query executed again. Replacing the COUNT part with a copy of this query and a minimal SELECT part improved the performance by a factor of five! (Alechko, Jul 27, 2018 at 14:04)
- EXPLAIN ANALYSE of your statements would help. An index on response_logs.status may help, if there are many entries with NULL. Also it may be wise to test indexes on all columns used in WHERE. I think I have to ask for \d of both tables or an sqlfiddle.com.
- The LEFT JOIN produces NULLs for res entries that don't match the join clause, but the res.status IS NOT NULL clause tells Postgres to remove them. What you probably want is an INNER JOIN instead.
- Regarding COUNT(*) OVER() AS total_rows: do you really need every row to contain the total number of rows? Most interfaces that call a SELECT query know the number of rows from the metadata (i.e. PQntuples() in C).
- count(*) OVER() computes the total count before LIMIT is applied; PQntuples() returns the number of rows actually returned. They are only the same while the total count happens to be <= LIMIT. Still, it would help performance a lot to get rid of it, especially if your tables grow to 10M+ rows. Maybe it's good enough to use LIMIT 1001, only use the first 1000 rows, check the row count, and if it's 1001 you know there are "more than 1000 hits".
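To make that distinction concrete, a small illustration (hypothetical table t, assumed to hold 5000 matching rows; not taken from the original discussion):

SELECT id, count(*) OVER() AS total_rows   -- total_rows = 5000 in every returned row (pre-LIMIT count)
FROM   t
ORDER  BY id
LIMIT  10;                                 -- only 10 rows are returned (what PQntuples() reports)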