I have a mildly complex query with rather poor performance:
UPDATE
    web_pages
SET
    state = 'fetching'
WHERE
    web_pages.id = (
        SELECT
            web_pages.id
        FROM
            web_pages
        WHERE
            web_pages.state = 'new'
        AND
            normal_fetch_mode = true
        AND
            web_pages.priority = (
                SELECT
                    min(priority)
                FROM
                    web_pages
                WHERE
                    state = 'new'::dlstate_enum
                AND
                    distance < 1000000
                AND
                    normal_fetch_mode = true
                AND
                    web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
            )
        AND
            web_pages.distance < 1000000
        AND
            web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
        LIMIT 1
    )
AND
    web_pages.state = 'new'
RETURNING
    web_pages.id;
EXPLAIN ANALYZE output:
                                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=2.12..10.14 rows=1 width=798) (actual time=2312.127..2312.127 rows=0 loops=1)
   InitPlan 3 (returns 2ドル)
     ->  Limit  (cost=1.21..1.56 rows=1 width=4) (actual time=2312.118..2312.118 rows=0 loops=1)
           InitPlan 2 (returns 1ドル)
             ->  Result  (cost=0.77..0.78 rows=1 width=0) (actual time=2312.109..2312.110 rows=1 loops=1)
                   InitPlan 1 (returns 0ドル)
                     ->  Limit  (cost=0.43..0.77 rows=1 width=4) (actual time=2312.106..2312.106 rows=0 loops=1)
                           ->  Index Scan using ix_web_pages_distance_filtered on web_pages web_pages_1  (cost=0.43..176587.44 rows=509043 width=4) (actual time=2312.103..2312.103 rows=0 loops=1)
                                 Index Cond: (priority IS NOT NULL)
                                 Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
           ->  Index Scan using ix_web_pages_distance_filtered on web_pages web_pages_2  (cost=0.43..35375.47 rows=101809 width=4) (actual time=2312.116..2312.116 rows=0 loops=1)
                 Index Cond: (priority = 1ドル)
                 Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
   ->  Index Scan using ix_web_pages_id on web_pages  (cost=0.56..8.58 rows=1 width=798) (actual time=2312.124..2312.124 rows=0 loops=1)
         Index Cond: (id = 2ドル)
         Filter: (state = 'new'::dlstate_enum)
 Planning time: 1.712 ms
 Execution time: 2313.699 ms
(18 rows)
Table Schema:
Table "public.web_pages"
Column | Type | Modifiers
-------------------+-----------------------------+---------------------------------------------------------------------
id | integer | not null default nextval('web_pages_id_seq'::regclass)
state | dlstate_enum | not null
errno | integer |
url | text | not null
starturl | text | not null
netloc | text | not null
file | integer |
priority | integer | not null
distance | integer | not null
is_text | boolean |
limit_netloc | boolean |
title | citext |
mimetype | text |
type | itemtype_enum |
content | text |
fetchtime | timestamp without time zone |
addtime | timestamp without time zone |
tsv_content | tsvector |
normal_fetch_mode | boolean | default true
ignoreuntiltime | timestamp without time zone | not null default '1970-01-01 00:00:00'::timestamp without time zone
Indexes:
"web_pages_pkey" PRIMARY KEY, btree (id)
"ix_web_pages_url" UNIQUE, btree (url)
"idx_web_pages_title" gin (to_tsvector('english'::regconfig, title::text))
"ix_web_pages_distance" btree (distance)
"ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000 AND normal_fetch_mode = true
"ix_web_pages_id" btree (id)
"ix_web_pages_netloc" btree (netloc)
"ix_web_pages_priority" btree (priority)
"ix_web_pages_state" btree (state)
"ix_web_pages_url_ops" btree (url text_pattern_ops)
"web_pages_state_netloc_idx" btree (state, netloc)
Foreign-key constraints:
"web_pages_file_fkey" FOREIGN KEY (file) REFERENCES web_files(id)
Triggers:
update_row_count_trigger BEFORE INSERT OR UPDATE ON web_pages FOR EACH ROW EXECUTE PROCEDURE web_pages_content_update_func()
I've experimented with creating compound indexes on multiple columns to try to improve the query performance, without much luck. I had VACUUM ANALYZEd the table before running the above EXPLAIN.
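For illustration, the sort of compound partial index I experimented with looked roughly like this (the index name is just a placeholder, not something in the schema below):

CREATE INDEX ix_web_pages_fetch_compound
    ON web_pages (priority, ignoreuntiltime)
    WHERE state = 'new'::dlstate_enum
      AND distance < 1000000
      AND normal_fetch_mode = true;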
The cardinality of the priority column is quite low: it has about 5 distinct values. The overall table is fairly large (55,659,673 rows).
Query execution time is rather variable: generally 2 seconds worst-case, 600 milliseconds best-case when the entire index is cached in RAM (i.e. when the DB isn't under other load).
It seems that the major load is the min(priority) subselect, but I haven't had much luck creating indexes that improve its performance, though that may entirely be operator error:
EXPLAIN ANALYZE
SELECT
    min(priority)
FROM
    web_pages
WHERE
    state = 'new'::dlstate_enum
AND
    distance < 1000000
AND
    normal_fetch_mode = true
AND
    web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval;
                                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.77..0.78 rows=1 width=0) (actual time=625.380..625.381 rows=1 loops=1)
   InitPlan 1 (returns 0ドル)
     ->  Limit  (cost=0.43..0.77 rows=1 width=4) (actual time=625.375..625.375 rows=0 loops=1)
           ->  Index Scan using ix_web_pages_distance_filtered on web_pages  (cost=0.43..176587.44 rows=509043 width=4) (actual time=625.373..625.373 rows=0 loops=1)
                 Index Cond: (priority IS NOT NULL)
                 Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
 Planning time: 0.475 ms
 Execution time: 625.408 ms
(8 rows)
Are there any easy ways to improve the performance of this query? I've thought about maintaining a running count of each sub-value in the column with an append-only count table that's updated with triggers, but that's complex and a fair bit of effort, and I want to be sure there isn't a simpler approach before implementing all that. A sketch of what I have in mind is below.
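The trigger-based approach would look roughly like this (table, function, and trigger names are all placeholders; none of this exists yet):

-- Hypothetical append-only table: one delta row per state transition.
CREATE TABLE web_page_priority_deltas (
    priority integer NOT NULL,
    delta    integer NOT NULL  -- +1 when a row enters state 'new', -1 when it leaves
);

CREATE OR REPLACE FUNCTION web_pages_priority_delta_func() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        IF OLD.state = 'new' THEN
            INSERT INTO web_page_priority_deltas (priority, delta) VALUES (OLD.priority, -1);
        END IF;
        RETURN NULL;
    END IF;
    IF TG_OP = 'UPDATE' AND OLD.state = 'new' THEN
        INSERT INTO web_page_priority_deltas (priority, delta) VALUES (OLD.priority, -1);
    END IF;
    IF NEW.state = 'new' THEN
        INSERT INTO web_page_priority_deltas (priority, delta) VALUES (NEW.priority, +1);
    END IF;
    RETURN NULL;  -- AFTER trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER web_pages_priority_delta_trigger
    AFTER INSERT OR UPDATE OR DELETE ON web_pages
    FOR EACH ROW EXECUTE PROCEDURE web_pages_priority_delta_func();

-- min(priority) then becomes a scan over the (much smaller) delta table:
SELECT min(priority)
FROM (
    SELECT priority, sum(delta) AS pending
    FROM web_page_priority_deltas
    GROUP BY priority
) t
WHERE pending > 0;

Part of why this feels like a lot of effort is that it only tracks state/priority, not the distance, normal_fetch_mode, or ignoreuntiltime filters, and the delta table would need periodic compaction.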
1 Answer
As I see from the query, you are simply updating one page with the minimum priority and some other conditions. I suggest building a partial B-tree index on the priority column, i.e.:
CREATE INDEX some_idx ON web_pages (priority)
WHERE state = 'new'::dlstate_enum
  AND distance < 1000000
  AND normal_fetch_mode = true;
- I literally have that exact index on the table already. Did you not look at the schema at all? `"ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000 AND normal_fetch_mode = true` – Fake Name, Aug 13, 2016 at 23:51
- Ok, sorry, I really didn't read your whole post carefully. If indexes won't help and the cardinality of priority is low, then if I were you I'd split this table into 5 partitions (one per priority level). Technically, I would inherit the child tables from the base table and then add a constraint on priority to each, so the planner can choose the right one. Then you can create a partial index on the remaining conditions, i.e. `WHERE state = 'new'::dlstate_enum AND distance < 1000000 AND normal_fetch_mode = true`, and then a B-tree index on ignoreuntiltime... – ArtemP, Aug 14, 2016 at 0:12
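For reference, the inheritance-based scheme ArtemP describes would look roughly like this sketch (child-table and index names are placeholders, and it assumes the five priority values are 1 through 5):

-- One child table per priority level; the CHECK constraint lets the
-- planner exclude the other partitions via constraint_exclusion.
CREATE TABLE web_pages_priority_1 (
    CHECK (priority = 1)
) INHERITS (web_pages);

CREATE INDEX web_pages_priority_1_fetch_idx
    ON web_pages_priority_1 (ignoreuntiltime)
    WHERE state = 'new'::dlstate_enum
      AND distance < 1000000
      AND normal_fetch_mode = true;

-- ...repeat for priorities 2 through 5...

Rows would also need to be routed into the correct child table on INSERT (typically via a trigger on the parent), and the existing rows migrated.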
- You may be able to eliminate the min subquery by adding ORDER BY priority right before the LIMIT in the first subquery. I would hope that Postgres can use the index to just read the min-priority record and stop after 1. I think this is right because your query seems to assume a unique priority per webpage and both subqueries have the same filters.
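Concretely, that rewrite would look something like this sketch (assuming the two subqueries really do share the same filters, so the min() subselect can be dropped):

UPDATE
    web_pages
SET
    state = 'fetching'
WHERE
    web_pages.id = (
        SELECT
            web_pages.id
        FROM
            web_pages
        WHERE
            web_pages.state = 'new'
        AND
            normal_fetch_mode = true
        AND
            web_pages.distance < 1000000
        AND
            web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
        ORDER BY
            priority
        LIMIT 1
    )
AND
    web_pages.state = 'new'
RETURNING
    web_pages.id;

With ORDER BY priority LIMIT 1, Postgres can walk ix_web_pages_distance_filtered in priority order and stop at the first row that passes the ignoreuntiltime filter, rather than computing min(priority) in a separate full pass.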