This is my current table definition in a Postgres 10.1-1 database:
CREATE TYPE CUSTOMER_TYPE AS ENUM
('enum1', 'enum2', 'enum3', '...', 'enum15'); -- max length of enum names ~15
CREATE TABLE CUSTOMER(
CUSTOMER_ONE TEXT PRIMARY KEY NOT NULL, -- max 35 char String
ATTRIBUTE_ONE TEXT UNIQUE, -- max 35 char String
ATTRIBUTE_TWO TEXT, -- 1-80 char String
PRIVATEKEYTYPE CUSTOMER_TYPE -- see enum
);
It results in about 4.3x more database size compared to the size of the inserted data. (50 MB, 700.000 lines --> database size is 210 MB)
Attribute_One
is computed as hash(Customer_One)
.
Requirements: fast searches (using algorithms) for columns CUSTOMER_ONE
and ATTRIBUTE_ONE
. (That's why I think I need an index.)
Typical search query:
select * from customer
where Customer_One='XXX' OR Attribute_One='XXX';
Each SELECT
can find a maximum of 1 or 0 matching rows in millions of rows.
Is it possible to further decrease the DB size? I have been told to use an expression index but don't fully understand how this works. A short explanation with an example index or other solution would be great
Is the insert speed effected by those indexes? The faster the better. (To be clear: search speed is more important than insert speed.)
1 Answer 1
If hash()
is an IMMUTABLE
function (which should be the case for a function called "hash"!) you can omit storing the functionally dependent attribute_one
in the table altogether and add an expression index to support queries on the expression hash(customer_one)
:
CREATE TABLE customer (
privatekeytype customer_type -- move the enum to 1st pos to save some more
, customer_one text PRIMARY KEY
, attribute_two text
);
Expression index:
CREATE INDEX customer_attribute_one_idx ON customer (hash(customer_one));
This is exactly as big (identical) as the index supporting your original UNIQUE
constraint on the redundant column attribute_one
.
Query:
SELECT *
FROM customer
WHERE 'XXX' IN (customer_one, hash(customer_one));
Testing with EXPLAIN
you'll see index or bitmap index scans like:
-> BitmapOr (cost=5.34..5.34 rows=5 width=0) -> Bitmap Index Scan on customer_pkey (cost=0.00..2.66 rows=1 width=0) Index Cond: ('XXX'::text = customer.customer_one) -> Bitmap Index Scan on customer_attribute_one_idx (cost=0.00..2.68 rows=4 width=0) Index Cond: ('XXX'::text = hash(customer.customer_one))
About the same performance as with the redundant table column or faster since the table is smaller, yet - which helps overall performance in various ways.
Moving the enum
to first position saves a few bytes of alignment padding per row as explained in my previous answer:
Why does the function have to be IMMUTABLE
? See:
Explore related questions
See similar questions with these tags.
Attribute_One is calculated from Customer_One
. How exactly? And what "algorithms" do you use in your "fast searches"? Show a typical example query. Also: your version of Postgres, please.