4

In my PostgreSQL 12.8 database, I have a relatively simple table the_table with a column value with type varchar:

CREATE TABLE public.the_table (
 id uuid DEFAULT gen_random_uuid() NOT NULL,
 label character varying,
 value character varying,
 created_at timestamp without time zone NOT NULL,
 updated_at timestamp without time zone NOT NULL,
);

I would like to query for all rows with a value that is formatted as an email address. The query looks like this: SELECT * FROM the_table WHERE value ~ '^[a-zA-Z0-9.$%&*+/=?^_{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]+$'.

As there are a several million rows in that table, I try to speed up this query by adding a matching expression index with CREATE INDEX index_the_table_on_email_values ON the_table ((value ~ '^[a-zA-Z0-9.$%&*+/=?^_{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]+$'));

Unfortunately, the query planner does not utilize the index and performs a full scan on the table instead, which is very slow.

Can anybody help me fixing the index or tell me what other options I have? I already considered a generated boolean column is_email instead. I could add an index to that generated column and query it directly. But this seems like a weird workaround for the original problem, which should be solvable with a matching index, correct?

asked Sep 11, 2022 at 12:51
3
  • "A lot of rows" is not a useful indication of quantity. And please always disclose your version of Postgres. And the (relevant part of) the table definition as CREATE TABLE script. Commented Sep 11, 2022 at 13:00
  • @ErwinBrandstetter, I updated the details accordingly. Thanks! Commented Sep 11, 2022 at 13:06
  • Works for me. Did you ANALYZE the table after creating the index? Commented Sep 11, 2022 at 15:30

1 Answer 1

6

To be fair, your index on a boolean expression basically works, too.

The point is this: if there is a large percentage of "email" rows, no index is going to help (much) - except for special cases. Postgres will typically chose a faster sequential scan instead. (I suspect that's your case.)
And if there are only few "email" rows, a partial index instead is much more efficient as it excludes most rows to begin with:

CREATE INDEX the_table_email_idx ON the_table ((true))
WHERE value ~ '^[a-zA-Z0-9.$%&*+/=?^_{|}~-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]+$';

(true) is just an arbitrary constant, since there is no obvious index column. Typically, you have a useful index column on top of the "email filter" that can replace that constant - to make the index even more useful. Related:

Of course, the idea with a generated column is_email isn't that bad either. You would then create a partial index with the condition on that generated column instead. There are pros and cons to this.

answered Sep 11, 2022 at 13:17
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.