I have a mildly complex query with rather poor performance:
UPDATE
    web_pages
SET
    state = 'fetching'
WHERE
    web_pages.id = (
        SELECT
            web_pages.id
        FROM
            web_pages
        WHERE
            web_pages.state = 'new'
        AND
            normal_fetch_mode = true
        AND
            web_pages.priority = (
                SELECT
                    min(priority)
                FROM
                    web_pages
                WHERE
                    state = 'new'::dlstate_enum
                AND
                    distance < 1000000
                AND
                    normal_fetch_mode = true
                AND
                    web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
            )
        AND
            web_pages.distance < 1000000
        AND
            web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
        LIMIT 1
    )
AND
    web_pages.state = 'new'
RETURNING
    web_pages.id;
EXPLAIN ANALYZE output:
                                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Update on web_pages  (cost=2.12..10.14 rows=1 width=798) (actual time=2312.127..2312.127 rows=0 loops=1)
   InitPlan 3 (returns 2ドル)
     ->  Limit  (cost=1.21..1.56 rows=1 width=4) (actual time=2312.118..2312.118 rows=0 loops=1)
           InitPlan 2 (returns 1ドル)
             ->  Result  (cost=0.77..0.78 rows=1 width=0) (actual time=2312.109..2312.110 rows=1 loops=1)
                   InitPlan 1 (returns 0ドル)
                     ->  Limit  (cost=0.43..0.77 rows=1 width=4) (actual time=2312.106..2312.106 rows=0 loops=1)
                           ->  Index Scan using ix_web_pages_distance_filtered on web_pages web_pages_1  (cost=0.43..176587.44 rows=509043 width=4) (actual time=2312.103..2312.103 rows=0 loops=1)
                                 Index Cond: (priority IS NOT NULL)
                                 Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
           ->  Index Scan using ix_web_pages_distance_filtered on web_pages web_pages_2  (cost=0.43..35375.47 rows=101809 width=4) (actual time=2312.116..2312.116 rows=0 loops=1)
                 Index Cond: (priority = 1ドル)
                 Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
   ->  Index Scan using ix_web_pages_id on web_pages  (cost=0.56..8.58 rows=1 width=798) (actual time=2312.124..2312.124 rows=0 loops=1)
         Index Cond: (id = 2ドル)
         Filter: (state = 'new'::dlstate_enum)
 Planning time: 1.712 ms
 Execution time: 2313.699 ms
(18 rows)
Table Schema:
Table "public.web_pages"
Column | Type | Modifiers
-------------------+-----------------------------+---------------------------------------------------------------------
id | integer | not null default nextval('web_pages_id_seq'::regclass)
state | dlstate_enum | not null
errno | integer |
url | text | not null
starturl | text | not null
netloc | text | not null
file | integer |
priority | integer | not null
distance | integer | not null
is_text | boolean |
limit_netloc | boolean |
title | citext |
mimetype | text |
type | itemtype_enum |
content | text |
fetchtime | timestamp without time zone |
addtime | timestamp without time zone |
tsv_content | tsvector |
normal_fetch_mode | boolean | default true
ignoreuntiltime | timestamp without time zone | not null default '1970-01-01 00:00:00'::timestamp without time zone
Indexes:
"web_pages_pkey" PRIMARY KEY, btree (id)
"ix_web_pages_url" UNIQUE, btree (url)
"idx_web_pages_title" gin (to_tsvector('english'::regconfig, title::text))
"ix_web_pages_distance" btree (distance)
"ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000 AND normal_fetch_mode = true
"ix_web_pages_id" btree (id)
"ix_web_pages_netloc" btree (netloc)
"ix_web_pages_priority" btree (priority)
"ix_web_pages_state" btree (state)
"ix_web_pages_url_ops" btree (url text_pattern_ops)
"web_pages_state_netloc_idx" btree (state, netloc)
Foreign-key constraints:
"web_pages_file_fkey" FOREIGN KEY (file) REFERENCES web_files(id)
Triggers:
update_row_count_trigger BEFORE INSERT OR UPDATE ON web_pages FOR EACH ROW EXECUTE PROCEDURE web_pages_content_update_func()
I've experimented with creating compound indexes on multiple columns to try to improve the query performance, without much luck. I had VACUUM ANALYZEd the table before running the above EXPLAIN.
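For illustration, the sort of compound partial index I experimented with looked roughly like this (the index name is just a placeholder, not something in the schema below):

CREATE INDEX ix_web_pages_fetch_compound
    ON web_pages (priority, ignoreuntiltime)
    WHERE state = 'new'::dlstate_enum
      AND distance < 1000000
      AND normal_fetch_mode = true;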
The cardinality of the priority column is quite low: it has about 5 distinct values. The overall table is fairly large (55,659,673 rows).
Query execution time is rather variable: generally 2 seconds worst-case, 600 milliseconds best-case when the entire index is cached in RAM (i.e. when the DB isn't under other load).
It seems that the major load is the min(priority) subselect, but I haven't had much luck creating indexes that improve its performance, though that may entirely be operator error:
EXPLAIN ANALYZE
SELECT
    min(priority)
FROM
    web_pages
WHERE
    state = 'new'::dlstate_enum
AND
    distance < 1000000
AND
    normal_fetch_mode = true
AND
    web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval;
                                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.77..0.78 rows=1 width=0) (actual time=625.380..625.381 rows=1 loops=1)
   InitPlan 1 (returns 0ドル)
     ->  Limit  (cost=0.43..0.77 rows=1 width=4) (actual time=625.375..625.375 rows=0 loops=1)
           ->  Index Scan using ix_web_pages_distance_filtered on web_pages  (cost=0.43..176587.44 rows=509043 width=4) (actual time=625.373..625.373 rows=0 loops=1)
                 Index Cond: (priority IS NOT NULL)
                 Filter: (ignoreuntiltime < (now() + '00:05:00'::interval))
 Planning time: 0.475 ms
 Execution time: 625.408 ms
(8 rows)
Are there any easy ways to improve the performance of this query? I've thought about maintaining a running count of each sub-value in the column with an append-only count table that's updated with triggers, but that's complex and a fair bit of effort, and I want to be sure there isn't a simpler approach before implementing all that. A sketch of what I have in mind is below.
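The trigger-based approach would look roughly like this (table, function, and trigger names are all placeholders; none of this exists yet):

-- Hypothetical append-only table: one delta row per state transition.
CREATE TABLE web_page_priority_deltas (
    priority integer NOT NULL,
    delta    integer NOT NULL  -- +1 when a row enters state 'new', -1 when it leaves
);

CREATE OR REPLACE FUNCTION web_pages_priority_delta_func() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        IF OLD.state = 'new' THEN
            INSERT INTO web_page_priority_deltas (priority, delta) VALUES (OLD.priority, -1);
        END IF;
        RETURN NULL;
    END IF;
    IF TG_OP = 'UPDATE' AND OLD.state = 'new' THEN
        INSERT INTO web_page_priority_deltas (priority, delta) VALUES (OLD.priority, -1);
    END IF;
    IF NEW.state = 'new' THEN
        INSERT INTO web_page_priority_deltas (priority, delta) VALUES (NEW.priority, +1);
    END IF;
    RETURN NULL;  -- AFTER trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER web_pages_priority_delta_trigger
    AFTER INSERT OR UPDATE OR DELETE ON web_pages
    FOR EACH ROW EXECUTE PROCEDURE web_pages_priority_delta_func();

-- min(priority) then becomes a scan over the (much smaller) delta table:
SELECT min(priority)
FROM (
    SELECT priority, sum(delta) AS pending
    FROM web_page_priority_deltas
    GROUP BY priority
) t
WHERE pending > 0;

Part of why this feels like a lot of effort is that it only tracks state/priority, not the distance, normal_fetch_mode, or ignoreuntiltime filters, and the delta table would need periodic compaction.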
1 Answer
As I see from the query, you are simply updating one page with the minimum priority and some other conditions. I suggest building a partial B-tree index on the priority column, i.e.:
CREATE INDEX some_idx ON web_pages (priority)
WHERE state = 'new'::dlstate_enum
  AND distance < 1000000
  AND normal_fetch_mode = true;
- I literally have that exact index on the table already. Did you not look at the schema at all? `"ix_web_pages_distance_filtered" btree (priority) WHERE state = 'new'::dlstate_enum AND distance < 1000000 AND normal_fetch_mode = true` – Fake Name, Aug 13, 2016 at 23:51
- Ok, sorry, I really didn't read your whole post carefully. If indexes won't help and the cardinality of priority is low, then if I were you I'd split this table into 5 partitions (one per priority level). Technically, I would inherit the child tables from the base table and then add a constraint on priority to each, so the planner can choose the right one. Then you can create a partial index on the remaining conditions, i.e. `WHERE state = 'new'::dlstate_enum AND distance < 1000000 AND normal_fetch_mode = true`, and then a B-tree index on ignoreuntiltime... – ArtemP, Aug 14, 2016 at 0:12
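For reference, the inheritance-based scheme ArtemP describes would look roughly like this sketch (child-table and index names are placeholders, and it assumes the five priority values are 1 through 5):

-- One child table per priority level; the CHECK constraint lets the
-- planner exclude the other partitions via constraint_exclusion.
CREATE TABLE web_pages_priority_1 (
    CHECK (priority = 1)
) INHERITS (web_pages);

CREATE INDEX web_pages_priority_1_fetch_idx
    ON web_pages_priority_1 (ignoreuntiltime)
    WHERE state = 'new'::dlstate_enum
      AND distance < 1000000
      AND normal_fetch_mode = true;

-- ...repeat for priorities 2 through 5...

Rows would also need to be routed into the correct child table on INSERT (typically via a trigger on the parent), and the existing rows migrated.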
- You may be able to eliminate the min subquery by adding ORDER BY priority right before the LIMIT in the first subquery. I would hope that Postgres can use the index to just read the min-priority record and stop after 1. I think this is right because your query seems to assume a unique priority per webpage and both subqueries have the same filters.
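Concretely, that rewrite would look something like this sketch (assuming the two subqueries really do share the same filters, so the min() subselect can be dropped):

UPDATE
    web_pages
SET
    state = 'fetching'
WHERE
    web_pages.id = (
        SELECT
            web_pages.id
        FROM
            web_pages
        WHERE
            web_pages.state = 'new'
        AND
            normal_fetch_mode = true
        AND
            web_pages.distance < 1000000
        AND
            web_pages.ignoreuntiltime < current_timestamp + '5 minutes'::interval
        ORDER BY
            priority
        LIMIT 1
    )
AND
    web_pages.state = 'new'
RETURNING
    web_pages.id;

With ORDER BY priority LIMIT 1, Postgres can walk ix_web_pages_distance_filtered in priority order and stop at the first row that passes the ignoreuntiltime filter, rather than computing min(priority) in a separate full pass.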