Need help with count query

Question 1

I'm currently working with a service with a count query running longer than I want. The query doesn't have serious performance issues but I believe there is an improvement.

Let me provide a little bit of context. There is a table with multiple columns, but I'm focused only on a few. Let's say that the table looks like this:

id	domain	user_id
1	somedomain.com	1
2	subdomain.somedomain.com	1
3	www.somedomain.com	1
4	test.com	1

As you may notice the first 3 rows in the table have something in common - the domains in 2nd and 3rd rows are subdomains on the domain in 1st row (the root domain). The business department wants us to store the root domain and the subdomains. We have a function inside our database that can remove subdomains from stored domains using some inner rules. For simplicity let's say it does the following work:

SELECT 
get_root_domain('subdomain.somedomain.com') as domain_1,
get_root_domain('www.somedomain.com') as domain_2

domain_1	domain_2
somedomain.com	somedomain.com

I'm trying to find the total number of unique domains across the table. For example, in the table below we have only 3 unique domains (according to our business needs) - somedomain.com and test.com, and subdomain.somedomain.com.

To be more precisely the root domain (again, according to our needs) in the table is any domain for which the following rule satisfies:

domain = get_root_domain(domain) OR domain = 'www.' || get_root_domain(domain)

The current solution counts the total number using the following query:

WITH roots AS (SELECT DISTINCT get_root_domain(domain) as "root_domain"
 FROM user_domains
 WHERE "user_id" = 1
 AND (domain = get_root_domain(domain) OR domain = 'www.' || get_root_domain(domain))),
 t AS (SELECT first_value("id") OVER w AS "id",
-- some other columns which out of the context
 FROM user_domains ud
 LEFT JOIN roots r
 ON get_root_domain(ud.domain) = r."root_domain"
 WHERE ud."user_ud" = 1
 WINDOW w AS ( PARTITION BY (CASE
 WHEN r."root_domain" IS NULL THEN ud."id"::TEXT
 ELSE format('%s|%s', get_root_domain(ud.domain), ud."status_id") END)
 ORDER BY length(domain) ))
SELECT count(DISTINCT t."id") AS "count"
FROM t

As I said, there are no serious performance issues but the planners say that the left join is the slowest suboperation here. The only reason why I'm looking for another solution is that the query is used to select all fields (I marked them as the comment because they are out of the question context) for another business purpose and the previous team has adopted the query for the counting.

Execution plan: explain.depesz

Question 2

You need to post the results of EXPLAIN ANALYZE. It's more convenient to use this site: explain.depesz.com

Question 3

I updated the post and added the link. Thanks for the highlighting!

Question 4

I think the problem is simply that the function get_root_domain() is slow. Can you time it by doing EXPLAIN ANALYZE select get_root_domain(domain) FROM user_domains WHERE (set a condition where the domain is not null nor empty and will make the function actually work) LIMIT 100 -- since the table scan will be very fast most of the time should be in the function. If it doesn't take long enough for a proper measurement, increase the LIMIT.

Question 5

Your plan in internally inconsistent, and also doesn't match the query you show.

Question 6

Same result though, I misplaced the comma in both queries (lol)! 69µs, most of your query time is in that function. It takes 4µs with a regexp. You can cache the result of this function in a generated column: postgresql.org/docs/current/ddl-generated-columns.html (and then you can index it if that helps some other queries)

Question 7

From our discussion in the comments, it turns out most of the time is spent in the get_root_domain() function.

Currently you have 3-4 calls per row, 2 in roots CTE and 1-2 in the query depending what the CASE does.

Quick band-aid:

If it's a simple string function that will always return the same value for the same parameter, after checking the code to make sure, you can mark it as IMMUTABLE, either with CREATE OR REPLACE FUNCTION or ALTER FUNCTION. So when it is used twice with the same parameter as in your query, postgres will only call it once.

However, it will still be called in the CTE and in the query, so this will reduce the number of calls from 3-4 to 2.

Quick band-aid v2:

Since you're going to execute it once per row anyway, cache it in a CTE:

WITH ur AS (SELECT *, get_root_domain(domain) as "root_domain"
 FROM user_domains
 WHERE "user_id" = 1),

...then the rest of the query uses that CTE instead of table user_domains, with the pre-filled root_domain column. This will reduce the number of calls to 1.

Check if you can optimize the function
Put it in a generated column

ALTER TABLE ... ADD root_domain TEXT GENERATED ALWAYS AS (get_root_domain(domain)) STORED

This will lock the table during update, but once it's done it's done. You can then index the generated column if needed.

Question 8

The function is declared as immutable. It's a good idea to look into this and thanks again. I was hoping that there is a pattern for queries like that when you need to exclude duplicates by a criteria and then merge it with unique values.

Question 9

Not really a pattern except "if it's expensive, do it as little as possible", of which all of the points in my answer are variants lol

bobflux bobflux 1,7761 gold badge9 silver badges7 bronze badges · Answer 1 · 2023-12-15 11:24:43Z

From our discussion in the comments, it turns out most of the time is spent in the get_root_domain() function.

Currently you have 3-4 calls per row, 2 in roots CTE and 1-2 in the query depending what the CASE does.

Quick band-aid:

If it's a simple string function that will always return the same value for the same parameter, after checking the code to make sure, you can mark it as IMMUTABLE, either with CREATE OR REPLACE FUNCTION or ALTER FUNCTION. So when it is used twice with the same parameter as in your query, postgres will only call it once.

However, it will still be called in the CTE and in the query, so this will reduce the number of calls from 3-4 to 2.

Quick band-aid v2:

Since you're going to execute it once per row anyway, cache it in a CTE:

WITH ur AS (SELECT *, get_root_domain(domain) as "root_domain"
 FROM user_domains
 WHERE "user_id" = 1),

...then the rest of the query uses that CTE instead of table user_domains, with the pre-filled root_domain column. This will reduce the number of calls to 1.

Check if you can optimize the function
Put it in a generated column

ALTER TABLE ... ADD root_domain TEXT GENERATED ALWAYS AS (get_root_domain(domain)) STORED

This will lock the table during update, but once it's done it's done. You can then index the generated column if needed.

The function is declared as immutable. It's a good idea to look into this and thanks again. I was hoping that there is a pattern for queries like that when you need to exclude duplicates by a criteria and then merge it with unique values.
Not really a pattern except "if it's expensive, do it as little as possible", of which all of the points in my answer are variants lol

Stack Exchange Network

Need help with count query

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Need help with count query

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions