I'm using PostgreSQL 9.5 on High Sierra.
I have two tables:

request_logs  - ~26K rows
response_logs - ~9K rows

I have the following query (with a JOIN):
SELECT req.uuid, res.status, req.method, req.requesturi, req.accessed,
       req.payload reqpayload, res.payload respayload,
       COUNT(*) OVER() AS total_rows
FROM request_logs req
LEFT OUTER JOIN response_logs res ON req.uuid = res.uuid
WHERE req.accountid = 2
  AND req.requesturi NOT ILIKE '/v1/sessions%'
  AND req.accessed BETWEEN '2018-01-01 15:04:05 +0000' AND '2019-01-02 15:04:05 +0000'
  AND res.status IS NOT NULL
  AND req.applicationid = 1
ORDER BY accessed DESC
LIMIT 1000;
While trying to optimise the query, I've experimented with different indexes. Here's what I tried:
Configuration 1:
1. request_logs.uuid (pkey, unique)
2. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 260 ms

Configuration 2:
1. request_logs.uuid (pkey, unique)
2. request_logs.applicationid
3. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 230 ms

Configuration 3:
1. request_logs.uuid (pkey, unique)
2. request_logs.applicationid
3. request_logs.accessed (timestamptz)
4. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 230 ms

Configuration 4:
1. request_logs.uuid (pkey, unique)
2. request_logs.applicationid
3. request_logs.accessed (timestamptz)
4. request_logs.accountid
5. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 230 ms

Configuration 5:
1. request_logs.uuid (pkey, unique)
2. request_logs.applicationid, request_logs.accessed (combined)
3. response_logs.uuid (pkey, unique, foreign key)
Avg. response time: 240 ms
As visible from the results, indexing by applicationid (an int8) did help a little, while indexing by the timestamptz column accessed didn't help at all.
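For reference, configuration 5's combined index might have been created roughly like this (a sketch; the index name and exact column order are my assumptions based on the query above, not something stated in the post):

-- Assumed DDL for the combined index of configuration 5
CREATE INDEX request_logs_applicationid_accessed_idx
    ON request_logs (applicationid, accessed);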
Maybe the bad performance is due to the JOIN?
Altogether, it seems quite slow, and I try not to think about what will happen when these tables contain millions of records (10M+).
What would be a better way to index these tables to make this query run faster?
EDIT:

Here is EXPLAIN ANALYZE for the last configuration:
Limit  (cost=3489.80..3490.69 rows=356 width=823) (actual time=241.152..241.345 rows=1000 loops=1)
  ->  Sort  (cost=3489.80..3490.69 rows=356 width=823) (actual time=241.150..241.288 rows=1000 loops=1)
        Sort Key: req.accessed DESC
        Sort Method: top-N heapsort  Memory: 2064kB
        ->  WindowAgg  (cost=1829.41..3474.71 rows=356 width=823) (actual time=230.040..237.993 rows=3951 loops=1)
              ->  Hash Join  (cost=1829.41..3470.26 rows=356 width=823) (actual time=8.622..17.974 rows=3951 loops=1)
                    Hash Cond: (res.uuid = req.uuid)
                    ->  Seq Scan on response_logs res  (cost=0.00..1604.21 rows=8821 width=758) (actual time=0.006..4.527 rows=9124 loops=1)
                          Filter: (status IS NOT NULL)
                    ->  Hash  (cost=1816.39..1816.39 rows=1042 width=102) (actual time=8.243..8.243 rows=4046 loops=1)
                          Buckets: 4096 (originally 2048)  Batches: 1 (originally 1)  Memory Usage: 1122kB
                          ->  Bitmap Heap Scan on request_logs req  (cost=105.85..1816.39 rows=1042 width=102) (actual time=0.581..6.449 rows=4046 loops=1)
Execution time: 242.154 ms
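For future measurements, adding the BUFFERS option also reports shared-buffer hits and reads per plan node, which makes I/O behaviour visible; a sketch against the same query (not part of the original post):

-- Same statement as above, run with buffer statistics
EXPLAIN (ANALYZE, BUFFERS)
SELECT req.uuid, res.status, req.method, req.requesturi, req.accessed,
       req.payload reqpayload, res.payload respayload,
       COUNT(*) OVER() AS total_rows
FROM request_logs req
LEFT OUTER JOIN response_logs res ON req.uuid = res.uuid
WHERE req.accountid = 2
  AND req.requesturi NOT ILIKE '/v1/sessions%'
  AND req.accessed BETWEEN '2018-01-01 15:04:05 +0000' AND '2019-01-02 15:04:05 +0000'
  AND res.status IS NOT NULL
  AND req.applicationid = 1
ORDER BY accessed DESC
LIMIT 1000;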
1 Answer
For starters, get your LEFT JOIN right and try to get rid of total_rows (as discussed in comments):
Query
Updated to your comment:

"I only want to return rows from request_logs that have a parallel entry in response_logs with a non-null status."
SELECT q.uuid
, s.status
, q.method
, q.requesturi
, q.accessed
, q.payload reqpayload
, s.payload respayload
-- , COUNT(*) OVER() AS total_rows -- see below
FROM request_logs q
JOIN response_logs s USING (uuid) -- not LEFT JOIN
WHERE q.accountid = 2
AND q.requesturi NOT ILIKE '/v1/sessions%'
AND q.accessed BETWEEN '2018-01-01 15:04:05 +0' AND '2019-01-02 15:04:05+0'
AND q.applicationid = 1
AND s.status IS NOT NULL -- see below
ORDER BY q.accessed DESC
LIMIT 1001; -- see below
Your LEFT [OUTER] JOIN burns down to a plain [INNER] JOIN: the status IS NOT NULL predicate in the WHERE clause discards exactly the NULL-extended rows a LEFT JOIN would otherwise preserve.
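If unmatched request_logs rows were actually wanted, the status filter would have to move from the WHERE clause into the join condition instead. A minimal sketch, not part of the original answer, with the column list shortened for brevity:

-- Hypothetical variant: keeps request_logs rows with no qualifying response,
-- because the status filter is part of the join condition, not the WHERE clause.
SELECT q.uuid, s.status, q.accessed
FROM   request_logs q
LEFT   JOIN response_logs s ON s.uuid = q.uuid
                           AND s.status IS NOT NULL
WHERE  q.accountid = 2
  AND  q.applicationid = 1;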
COUNT(*) OVER() AS total_rows is expensive for counts substantially bigger than LIMIT (and you are expecting "10M+ rows"). Maybe it's good enough to use LIMIT 1001, only use the first 1000 rows, check the row count, and if it's 1001 you know there are "more than 1000 matching rows".
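If an exact total is still needed occasionally, it can be fetched with a separate, minimal count query instead of the window aggregate. A sketch reusing the predicates from the rewritten query above (an illustration, not part of the original answer):

-- Companion count query: no payload columns, no window function
SELECT count(*) AS total_rows
FROM   request_logs q
JOIN   response_logs s USING (uuid)
WHERE  q.accountid = 2
  AND  q.requesturi NOT ILIKE '/v1/sessions%'
  AND  q.accessed BETWEEN '2018-01-01 15:04:05 +0' AND '2019-01-02 15:04:05+0'
  AND  q.applicationid = 1
  AND  s.status IS NOT NULL;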
Indexes
If (like you commented) the only 'moving parts' in this query are the dates (from and to) that are compared with accessed, and more than a few rows are excluded by the other predicates on applicationid, requesturi and accountid, then a partial index should help read performance (a lot):
CREATE INDEX ON request_logs (uuid, accessed DESC)
WHERE accountid = 2
AND requesturi NOT ILIKE '/v1/sessions%'
AND applicationid = 1;
And if a substantial percentage of rows is eliminated by status IS NOT NULL, also:
CREATE INDEX ON response_logs (uuid)
WHERE status IS NOT NULL;
It may pay to append the columns status and payload as index columns if you can get index-only scans out of it. Some preconditions apply.
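What that could look like on 9.5, where the extra columns have to be appended as ordinary key columns (the INCLUDE clause only arrived in PostgreSQL 11); a sketch, not taken from the original answer:

-- Covering variant: uuid stays the leading column for the join,
-- status and payload are appended only so index-only scans can serve the SELECT list.
CREATE INDEX ON response_logs (uuid, status, payload)
WHERE status IS NOT NULL;

Note that if payload values are large this index can grow considerably (and very long values won't fit in a btree entry at all), so it only pays off when the payloads are small.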
- Actually, I only want to return rows from request_logs that have a parallel entry in response_logs with a non-nil status. (Alechko, Jul 27, 2018 at 8:27)
- Regarding the suggestion to LIMIT 1001 - that's a good approach and I use it when the table is very big. (Alechko, Jul 27, 2018 at 8:29)
- @AlexGordon: I adapted to your updates. (Erwin Brandstetter, Jul 27, 2018 at 13:55)
- I did update the indexes as you suggested and it did improve the performance a little bit. However, the main improvement came from replacing COUNT(*) OVER() AS total_rows. Because the SELECT part specified a lot of items to be copied (mainly respayload and reqpayload), OVER() caused those columns to be copied for the entire table just to get the count, then discarded, and the query executed again. Replacing the COUNT part with a copy of this query and a minimal SELECT part improved the performance by a factor of five! (Alechko, Jul 27, 2018 at 14:04)
- EXPLAIN ANALYSE of your statements would help. An index on response_logs.status may help, if there are many entries with NULL. Also it may be wise to test indexes on all columns used in WHERE. I think I have to ask for \d of both tables or an sqlfiddle.com.
- The LEFT JOIN produces NULLs for res entries that don't match the join clause, but the res.status IS NOT NULL clause tells Postgres to remove them. What you probably want is an INNER JOIN instead.
- Regarding COUNT(*) OVER() AS total_rows: do you really need every row to contain the total number of rows? Most interfaces that call a SELECT query know the number of rows from the metadata (i.e. PQntuples() in C).
- count(*) OVER() computes the total count before LIMIT is applied; PQntuples() returns the number of rows actually returned. They are only the same while the total count happens to be <= LIMIT. Still, it would help performance a lot to get rid of it, especially if your tables grow to 10M+ rows. Maybe it's good enough to use LIMIT 1001, only use the first 1000 rows, check the row count, and if it's 1001 you know there are "more than 1000 hits".
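To make that distinction concrete, a small illustration (hypothetical table t, assumed to hold 5000 matching rows; not taken from the original discussion):

SELECT id, count(*) OVER() AS total_rows   -- total_rows = 5000 in every returned row (pre-LIMIT count)
FROM   t
ORDER  BY id
LIMIT  10;                                 -- only 10 rows are returned (what PQntuples() reports)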