I have a very large table (35GB) that is unique over a combination of four of its columns.
The table isn't very wide, and the four columns it is unique over are its larger ones (in bytes). The end result is that the index enforcing uniqueness is 21GB. This isn't the result of the index bloating over time; it is the size of the index immediately after it is created.
I don't need to optimize for insert speed at all, as inserts will only happen in batches once per month. There won't be any updates to any rows once they are inserted.
I'm running PostgreSQL 9.5.0.
Is there a way to not duplicate such a large portion of the database just to enforce a unique constraint? Possibly using something like a clustered index?
Full table description:
CREATE TABLE medi_cal_base_eligibility (
    client_index_number text NOT NULL,
    medi_cal_date date NOT NULL,
    eligibility_date date NOT NULL,
    aidcode text,
    responsible_county text,
    status text,
    cardinal smallint NOT NULL,
    id SERIAL PRIMARY KEY
);
Indexes:
"medi_cal_base_eligibility_pkey" PRIMARY KEY, btree
(id)
"medi_cal_base_eligibility_uq_dates_cin_cardinal" UNIQUE CONSTRAINT, btree
(eligibility_date, client_index_number, medi_cal_date, cardinal)
1 Answer
With PostgreSQL 9.5 you can use a BRIN index (which keeps the index very small, yet functional) and handle duplicate rejection via a trigger, like this:
CREATE INDEX ON medi_cal_base_eligibility USING BRIN (client_index_number);
CREATE OR REPLACE FUNCTION tf_medi_cal_base_eligibility_insert() RETURNS trigger AS
$BODY$
BEGIN
    IF (TG_OP = 'INSERT'
        OR (NEW.client_index_number, NEW.eligibility_date, NEW.medi_cal_date, NEW.cardinal)
           IS DISTINCT FROM (OLD.client_index_number, OLD.eligibility_date, OLD.medi_cal_date, OLD.cardinal))
       AND EXISTS (SELECT 1 FROM medi_cal_base_eligibility
                   WHERE (client_index_number, eligibility_date, medi_cal_date, cardinal)
                       = (NEW.client_index_number, NEW.eligibility_date, NEW.medi_cal_date, NEW.cardinal)) THEN
        -- RAISE without a level defaults to EXCEPTION, aborting the statement
        RAISE 'Duplicate key: %, %, %, %', NEW.client_index_number, NEW.eligibility_date, NEW.medi_cal_date, NEW.cardinal;
    END IF;
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql VOLATILE SECURITY DEFINER;
CREATE TRIGGER t_medi_cal_base_eligibility_insert
    BEFORE INSERT OR UPDATE ON medi_cal_base_eligibility
    FOR EACH ROW EXECUTE PROCEDURE tf_medi_cal_base_eligibility_insert();
As stated by dezso, a BRIN index is only useful when client_index_number correlates with the physical position of the row in the table.
If the BRIN solution above won't cut the mustard, a good alternative is to index a hash of the data. The size of the hash determines how many rows have to be scanned when checking uniqueness; also, the bigger the hash, the bigger the index. A 32-bit hash would most likely render unique results (or at most a handful of collisions) and would be just as large as your integer primary key. In the following example I derive the 32-bit hash from the last 8 hexadecimal digits of the md5 of your four unique columns concatenated together.
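As a sanity check on the "handful of collisions" claim, here is a quick back-of-envelope estimate. The row count below is a made-up figure for illustration only (the question doesn't state one), and the estimate assumes the hash values are uniformly distributed:

```python
# Rough estimate of how many *other* rows share a given 32-bit hash value,
# assuming a uniformly distributed hash (illustrative only).
def expected_matches(n_rows: int, hash_bits: int) -> float:
    """Expected number of existing rows colliding with one new row's hash."""
    return n_rows / 2 ** hash_bits

# With a hypothetical 100 million rows and a 32-bit hash,
# a uniqueness probe re-checks ~0.02 rows on average, i.e. almost always zero.
print(expected_matches(100_000_000, 32))
```

So even on a very large table, each insert's uniqueness check stays cheap: the index lookup narrows the search to essentially zero or one candidate row.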
CREATE OR REPLACE FUNCTION f_medi_cal_base_eligibility_to_int
    (p_client_index_number text, p_medi_cal_date date, p_eligibility_date date, p_cardinal smallint)
RETURNS int AS $BODY$
    SELECT ('x' || right(md5(1ドル || to_char(2,ドル 'YYYYMMDD') || to_char(3,ドル 'YYYYMMDD') || 4ドル::text), 8))::bit(32)::int
$BODY$ LANGUAGE SQL IMMUTABLE SECURITY DEFINER;
CREATE INDEX ON medi_cal_base_eligibility (f_medi_cal_base_eligibility_to_int(client_index_number, medi_cal_date, eligibility_date, cardinal));
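For reference, the arithmetic the SQL function performs (take the last 8 hex digits of the md5, then reinterpret them as a signed 32-bit integer via ::bit(32)::int) can be reproduced outside the database. This Python sketch mirrors that logic; the function name and sample values are made up for illustration, and it assumes the dates are passed in already formatted as YYYYMMDD, as to_char produces:

```python
import hashlib

def hash32(cin: str, medi_cal_yyyymmdd: str, eligibility_yyyymmdd: str, cardinal: int) -> int:
    """Mirror of the SQL expression: last 8 hex digits of md5, cast to signed int32."""
    s = cin + medi_cal_yyyymmdd + eligibility_yyyymmdd + str(cardinal)
    v = int(hashlib.md5(s.encode()).hexdigest()[-8:], 16)  # low-order 32 bits
    return v - 2 ** 32 if v >= 2 ** 31 else v  # ::bit(32)::int reinterprets as signed

# Deterministic: the same four values always map to the same index key.
print(hash32("12345678A", "20160101", "20160201", 1))
```

Having such a mirror can be handy for pre-computing hash keys in a batch-loading pipeline before the rows ever reach the database.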
CREATE OR REPLACE FUNCTION tf_medi_cal_base_eligibility_insert() RETURNS trigger AS
$BODY$
BEGIN
    IF (TG_OP = 'INSERT'
        OR (NEW.client_index_number, NEW.eligibility_date, NEW.medi_cal_date, NEW.cardinal)
           IS DISTINCT FROM (OLD.client_index_number, OLD.eligibility_date, OLD.medi_cal_date, OLD.cardinal))
       AND EXISTS (SELECT 1 FROM medi_cal_base_eligibility
                   WHERE (f_medi_cal_base_eligibility_to_int(client_index_number, medi_cal_date, eligibility_date, cardinal),
                          client_index_number, medi_cal_date, eligibility_date, cardinal)
                       = (f_medi_cal_base_eligibility_to_int(NEW.client_index_number, NEW.medi_cal_date, NEW.eligibility_date, NEW.cardinal),
                          NEW.client_index_number, NEW.medi_cal_date, NEW.eligibility_date, NEW.cardinal)) THEN
        -- RAISE without a level defaults to EXCEPTION, aborting the statement
        RAISE 'Duplicate key: %, %, %, %', NEW.client_index_number, NEW.medi_cal_date, NEW.eligibility_date, NEW.cardinal;
    END IF;
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql VOLATILE SECURITY DEFINER;
CREATE TRIGGER t_medi_cal_base_eligibility_insert
    BEFORE INSERT OR UPDATE ON medi_cal_base_eligibility
    FOR EACH ROW EXECUTE PROCEDURE tf_medi_cal_base_eligibility_insert();
Comments:
- "Please note the following (from the documentation page I've added to your post): 'BRIN is designed for handling very large tables in which certain columns have some natural correlation with their physical location within the table.'" – András Váczi, Mar 30, 2016
- "I've added an alternative to the solution relying on the BRIN index." – Ezequiel Tolnay, Mar 31, 2016
- "@ZiggyCrueltyfreeZeitgeister Thank you for a very involved answer. I'm not going to have a chance to really look into using it until next week, but I will let you know how it goes." – Gregory Arenius, Apr 1, 2016