1

I'm currently working with a service with a count query running longer than I want. The query doesn't have serious performance issues but I believe there is an improvement.

Let me provide a little bit of context. There is a table with multiple columns, but I'm focused only on a few. Let's say that the table looks like this:

id domain user_id
1 somedomain.com 1
2 subdomain.somedomain.com 1
3 www.somedomain.com 1
4 test.com 1

As you may notice the first 3 rows in the table have something in common - the domains in 2nd and 3rd rows are subdomains on the domain in 1st row (the root domain). The business department wants us to store the root domain and the subdomains. We have a function inside our database that can remove subdomains from stored domains using some inner rules. For simplicity let's say it does the following work:

SELECT 
get_root_domain('subdomain.somedomain.com') as domain_1,
get_root_domain('www.somedomain.com') as domain_2
domain_1 domain_2
somedomain.com somedomain.com

I'm trying to find the total number of unique domains across the table. For example, in the table below we have only 3 unique domains (according to our business needs) - somedomain.com and test.com, and subdomain.somedomain.com.

To be more precisely the root domain (again, according to our needs) in the table is any domain for which the following rule satisfies:

domain = get_root_domain(domain) OR domain = 'www.' || get_root_domain(domain)

The current solution counts the total number using the following query:

WITH roots AS (SELECT DISTINCT get_root_domain(domain) as "root_domain"
 FROM user_domains
 WHERE "user_id" = 1
 AND (domain = get_root_domain(domain) OR domain = 'www.' || get_root_domain(domain))),
 t AS (SELECT first_value("id") OVER w AS "id",
-- some other columns which out of the context
 FROM user_domains ud
 LEFT JOIN roots r
 ON get_root_domain(ud.domain) = r."root_domain"
 WHERE ud."user_ud" = 1
 WINDOW w AS ( PARTITION BY (CASE
 WHEN r."root_domain" IS NULL THEN ud."id"::TEXT
 ELSE format('%s|%s', get_root_domain(ud.domain), ud."status_id") END)
 ORDER BY length(domain) ))
SELECT count(DISTINCT t."id") AS "count"
FROM t

As I said, there are no serious performance issues but the planners say that the left join is the slowest suboperation here. The only reason why I'm looking for another solution is that the query is used to select all fields (I marked them as the comment because they are out of the question context) for another business purpose and the previous team has adopted the query for the counting.

Execution plan: explain.depesz

asked Dec 14, 2023 at 19:23
11
  • You need to post the results of EXPLAIN ANALYZE. It's more convenient to use this site: explain.depesz.com Commented Dec 14, 2023 at 21:57
  • I updated the post and added the link. Thanks for the highlighting! Commented Dec 14, 2023 at 23:09
  • I think the problem is simply that the function get_root_domain() is slow. Can you time it by doing EXPLAIN ANALYZE select get_root_domain(domain) FROM user_domains WHERE (set a condition where the domain is not null nor empty and will make the function actually work) LIMIT 100 -- since the table scan will be very fast most of the time should be in the function. If it doesn't take long enough for a proper measurement, increase the LIMIT. Commented Dec 15, 2023 at 0:09
  • Your plan in internally inconsistent, and also doesn't match the query you show. Commented Dec 15, 2023 at 0:52
  • 1
    Same result though, I misplaced the comma in both queries (lol)! 69µs, most of your query time is in that function. It takes 4µs with a regexp. You can cache the result of this function in a generated column: postgresql.org/docs/current/ddl-generated-columns.html (and then you can index it if that helps some other queries) Commented Dec 15, 2023 at 1:37

1 Answer 1

1

From our discussion in the comments, it turns out most of the time is spent in the get_root_domain() function.

Currently you have 3-4 calls per row, 2 in roots CTE and 1-2 in the query depending what the CASE does.

  • Quick band-aid:

If it's a simple string function that will always return the same value for the same parameter, after checking the code to make sure, you can mark it as IMMUTABLE, either with CREATE OR REPLACE FUNCTION or ALTER FUNCTION. So when it is used twice with the same parameter as in your query, postgres will only call it once.

However, it will still be called in the CTE and in the query, so this will reduce the number of calls from 3-4 to 2.

  • Quick band-aid v2:

Since you're going to execute it once per row anyway, cache it in a CTE:

WITH ur AS (SELECT *, get_root_domain(domain) as "root_domain"
 FROM user_domains
 WHERE "user_id" = 1),

...then the rest of the query uses that CTE instead of table user_domains, with the pre-filled root_domain column. This will reduce the number of calls to 1.

  • Check if you can optimize the function

  • Put it in a generated column

    ALTER TABLE ... ADD root_domain TEXT GENERATED ALWAYS AS (get_root_domain(domain)) STORED

This will lock the table during update, but once it's done it's done. You can then index the generated column if needed.

answered Dec 15, 2023 at 11:24
2
  • The function is declared as immutable. It's a good idea to look into this and thanks again. I was hoping that there is a pattern for queries like that when you need to exclude duplicates by a criteria and then merge it with unique values. Commented Dec 15, 2023 at 12:24
  • 1
    Not really a pattern except "if it's expensive, do it as little as possible", of which all of the points in my answer are variants lol Commented Dec 15, 2023 at 13:51

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.