Optimize Postgres query for finding string collisions

Question 1

We store discount codes as strings in a Postgres table. The system supports multiple workspaces that share the same database, and the special workspace_id value '*' acts as a wildcard.

For discount codes, we store the following information:

CREATE TABLE discount_codes (
  id uuid PRIMARY KEY
, workspace_id character varying
, code character varying
, case_sensitive bool
);

Sometimes we have to generate thousands of codes to be exported to external systems, where they can be sent out in e-mails and such. When generating these codes, we need to make sure that none of them collide.

Currently, when inserting a code, we query existing codes like this:

The new discount code is case sensitive:

SELECT
  COUNT(*)
FROM
  discount_codes
WHERE
  (
    workspace_id = '*' OR
    workspace_id = :workspaceId
  ) AND
  (
    (
      LOWER(code) = LOWER(:code) AND
      case_sensitive = false
    ) OR (
      code = :code AND
      case_sensitive = true
    )
  )

The new discount code is not case sensitive:

SELECT
 COUNT(*)
FROM
 discount_codes
WHERE
 (workspace_id = '*' OR workspace_id = :workspaceId) AND
 LOWER(code) = LOWER(:code)

If this query returns a count of more than 0, we know that there is a collision.

I would like to know whether it would be useful to create an index on the length of the code, so that we can filter out all rows where code has a different length. We're talking about hundreds of thousands of codes in the database. If it would help, how would I create such an index?

During bulk generation, I would like to generate 1000 codes at a time, and query the database to see if there is overlap with any of these codes. Would it be better to do this per 100 codes, or per 10000 codes?

Comment

Use EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) to get the query plan and see where the time is spent. You can also create an index on LOWER(code):

CREATE INDEX idx_discount_codes_lower_code ON discount_codes (LOWER(code));
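As a sketch of what that could look like for the case-insensitive query from the question: the bind parameters are replaced by literal sample values here ('acme' and 'CHRISTMAS10' are placeholders), and the SETTINGS option assumes PostgreSQL 12 or later.

```sql
-- ANALYZE actually runs the statement, which is harmless for a SELECT.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS)
SELECT count(*)
FROM   discount_codes
WHERE  (workspace_id = '*' OR workspace_id = 'acme')
AND    lower(code) = lower('CHRISTMAS10');
```

Running it once before and once after creating the index shows whether the planner switches from a sequential scan to an index scan.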

Comment

By the way, why don't you use a UUID? That is unique.

Comment

Why do you think you need to optimise anything? How is your current process unsatisfactory? "We're talking about hundred thousands of codes" -- this isn't really all that many.

Comment

@FrankHeikens because these are codes that need to stick to a pattern requested by our partners. Sometimes they contain prefixes or suffixes, and we don't want them to be unnecessarily long.

Comment

@mustaccio I might be optimizing for a case that isn't a problem right now, but I want to make sure bulk inserting runs as fast as possible, so that my endpoints don't take seconds to complete. If I can filter out a lot of codes by just removing all codes of a different length, that seemed like a proper optimization to me, and that's what I'd like advice on.

Question 7

This set of indices and queries should give you the best overall performance:

Queries

New discount code is case sensitive:

SELECT EXISTS (
 SELECT FROM discount_codes
 WHERE case_sensitive
 AND code = :code
 AND workspace_id IN ('*', :workspaceId)
 )
 OR EXISTS (
 SELECT FROM discount_codes
 WHERE NOT case_sensitive
 AND lower(code) = lower(:code)
 AND workspace_id IN ('*', :workspaceId)
 );

Counting is generally more expensive than EXISTS.
And both subqueries would always be executed to get a count. You only need to know if there is any conflict at all. This query will not even execute the second subquery if the first one returns true.

New discount code is not case sensitive:

SELECT EXISTS (
 SELECT FROM discount_codes
 WHERE lower(code) = lower(:code)
 AND workspace_id IN ('*', :workspaceId)
 );

Indices

Note that the UNIQUE aspect in below indices enforces your requirements only in parts and is hence optional. I would still throw it in as very cheap second layer of defense.

CREATE INDEX discount_codes_idx1 ON discount_codes (lower(code), workspace_id); -- can't be unique
CREATE UNIQUE INDEX discount_codes_idx2 ON discount_codes (code, workspace_id)
WHERE case_sensitive;

Theoretically, you might add another one:

CREATE UNIQUE INDEX discount_codes_idx3 ON discount_codes (lower(code), workspace_id)
WHERE NOT case_sensitive;

But, assuming the combination (lower(code), workspace_id) is already hugely selective,discount_codes_idx1 should cover the job of discount_codes_idx3 pretty well, and you don't have to maintain another index in your write-heavy table.

fiddle

Question 8

Hi, the only issue I see with this is that workspace_id = '*', code = 'CHRISTMAS10' and workspace_id = 'acme', code = 'CHRISTMAS10' can't co-exist, so that's why that unique index might not be right. I might be optimizing for a case I shouldn't, but if I know that all my codes generated in a batch are of length x, I can already filter out all codes that are not length x, and that's the index I'd be looking for.

Question 9

@Ruben: ('*', 'CHRISTMAS10') and ('acme', 'CHRISTMAS10') can co-exist with either of my multicolumn indices, which only enforce your requirements in part. (That's why we still need the sophisticated queries.) UNIQUE is really optional. But I would keep it as second layer of defense. I updated to clarify.

Question 10

If you want good performance for your first query, rewrite it to avoid OR:

SELECT (SELECT count(*)
 FROM discount_codes
 WHERE workspace_id = '*'
 AND LOWER(code) = lower(:code)
 AND case_sensitive)
 + (SELECT count(*)
 FROM discount_codes
 WHERE workspace_id = '*'
 AND code = :code
 AND NOT case_sensitive)
 + (SELECT count(*)
 FROM discount_codes
 WHERE workspace_id = :workspaceId
 AND LOWER(code) = lower(:code)
 AND case_sensitive)
 + (SELECT count(*)
 FROM discount_codes
 WHERE workspace_id = :workspaceId
 AND code = :code
 AND NOT case_sensitive);

For the best performance, use two partial indexes:

CREATE INDEX ON discount_codes (workspace_id, lower(code))
 WHERE case_sensitive;
CREATE INDEX ON discount_codes (workspace_id, code)
 WHERE NOT case_sensitive;

score 1 · Answer 1 · 2024-10-24 07:53:49Z

This set of indices and queries should give you the best overall performance:

Queries

New discount code is case sensitive:

SELECT EXISTS (
 SELECT FROM discount_codes
 WHERE case_sensitive
 AND code = :code
 AND workspace_id IN ('*', :workspaceId)
 )
 OR EXISTS (
 SELECT FROM discount_codes
 WHERE NOT case_sensitive
 AND lower(code) = lower(:code)
 AND workspace_id IN ('*', :workspaceId)
 );

Counting is generally more expensive than EXISTS, and to get a count both branches would always have to be executed. You only need to know whether there is any conflict at all. This query will not even execute the second subquery if the first one returns true.

New discount code is not case sensitive:

SELECT EXISTS (
 SELECT FROM discount_codes
 WHERE lower(code) = lower(:code)
 AND workspace_id IN ('*', :workspaceId)
 );
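For the bulk case in the question, the same EXISTS test can be applied to a whole batch in one round trip. This is a minimal sketch, not part of the original answer. It assumes that all new codes are case-insensitive and that the driver can bind the hypothetical parameter :codes as a text array.

```sql
-- Candidate codes that collide with a code already stored.
SELECT c.code
FROM   unnest(CAST(:codes AS text[])) AS c(code)
WHERE  EXISTS (
   SELECT FROM discount_codes d
   WHERE  lower(d.code) = lower(c.code)
   AND    d.workspace_id IN ('*', :workspaceId)
   );

-- Candidate codes that collide with each other inside the batch.
SELECT lower(c.code) AS colliding_code
FROM   unnest(CAST(:codes AS text[])) AS c(code)
GROUP  BY lower(c.code)
HAVING count(*) > 1;
```

An empty result from both statements means the batch is free of collisions. Each candidate is probed separately through the index on lower(code), so the batch size mainly trades round trips against statement size.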

Indices

Note that the UNIQUE aspect of the indices below enforces your requirements only in part and is hence optional. I would still add it as a very cheap second layer of defense.

CREATE INDEX discount_codes_idx1 ON discount_codes (lower(code), workspace_id); -- can't be unique
CREATE UNIQUE INDEX discount_codes_idx2 ON discount_codes (code, workspace_id)
WHERE case_sensitive;

Theoretically, you might add another one:

CREATE UNIQUE INDEX discount_codes_idx3 ON discount_codes (lower(code), workspace_id)
WHERE NOT case_sensitive;

But, assuming the combination (lower(code), workspace_id) is already highly selective, discount_codes_idx1 should cover the job of discount_codes_idx3 pretty well, and you don't have to maintain another index on your write-heavy table.


Hi, the only issue I see with this is that workspace_id = '*', code = 'CHRISTMAS10' and workspace_id = 'acme', code = 'CHRISTMAS10' can't co-exist, so that's why that unique index might not be right. I might be optimizing for a case I shouldn't, but if I know that all my codes generated in a batch are of length x, I can already filter out all codes that are not length x, and that's the index I'd be looking for.
@Ruben: ('*', 'CHRISTMAS10') and ('acme', 'CHRISTMAS10') can co-exist with either of my multicolumn indices, which only enforce your requirements in part. (That's why we still need the sophisticated queries.) UNIQUE is really optional. But I would keep it as second layer of defense. I updated to clarify.
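
To make the point about partial enforcement concrete, here is a small sketch. It assumes the table from the question with only discount_codes_idx1 and discount_codes_idx2 in place, and it uses gen_random_uuid(), which is built in from PostgreSQL 13 on (an assumption about the server version).

```sql
-- Succeeds: first row for this code.
INSERT INTO discount_codes (id, workspace_id, code, case_sensitive)
VALUES (gen_random_uuid(), '*', 'CHRISTMAS10', true);

-- Succeeds: same code but a different workspace_id, so the index allows it.
INSERT INTO discount_codes (id, workspace_id, code, case_sensitive)
VALUES (gen_random_uuid(), 'acme', 'CHRISTMAS10', true);

-- Fails with a unique violation on discount_codes_idx2:
-- same (code, workspace_id) among case-sensitive rows.
INSERT INTO discount_codes (id, workspace_id, code, case_sensitive)
VALUES (gen_random_uuid(), 'acme', 'CHRISTMAS10', true);

-- Succeeds as far as the indexes go, although it is a logical collision:
-- only the EXISTS queries catch this case.
INSERT INTO discount_codes (id, workspace_id, code, case_sensitive)
VALUES (gen_random_uuid(), 'acme', 'christmas10', false);
```

The second and fourth statements show why the collision queries are still needed before inserting.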

Laurenz Albe · Answer 2 · 2024-10-23 04:31:46Z

If you want good performance for your first query, rewrite it to avoid OR:

-- Existing case-insensitive codes are compared in lower case,
-- existing case-sensitive codes are compared exactly.
SELECT (SELECT count(*)
        FROM discount_codes
        WHERE workspace_id = '*'
          AND lower(code) = lower(:code)
          AND NOT case_sensitive)
     + (SELECT count(*)
        FROM discount_codes
        WHERE workspace_id = '*'
          AND code = :code
          AND case_sensitive)
     + (SELECT count(*)
        FROM discount_codes
        WHERE workspace_id = :workspaceId
          AND lower(code) = lower(:code)
          AND NOT case_sensitive)
     + (SELECT count(*)
        FROM discount_codes
        WHERE workspace_id = :workspaceId
          AND code = :code
          AND case_sensitive);

For the best performance, use two partial indexes, one for each kind of comparison:

CREATE INDEX ON discount_codes (workspace_id, lower(code))
   WHERE NOT case_sensitive;
CREATE INDEX ON discount_codes (workspace_id, code)
   WHERE case_sensitive;
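
The question also asked how an index on the code length would be created. A minimal sketch follows, using a hypothetical index name. It is an expression index, and the planner can only consider it when the query filters on the same expression.

```sql
-- Hypothetical index name; indexes the character length of each code.
CREATE INDEX discount_codes_code_length_idx
    ON discount_codes (length(code));

-- The length filter has to appear in the query for the index to be usable.
SELECT EXISTS (
   SELECT FROM discount_codes
   WHERE  length(code) = length(:code)
   AND    lower(code) = lower(:code)
   AND    workspace_id IN ('*', :workspaceId)
   );
```

An equality lookup on lower(code) narrows the candidates far more than a length filter does, so this extra index would likely add write cost without speeding up the check. That is an expectation to verify with EXPLAIN, not a measured result.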
