I have a very large table (35GB) that is unique over a combination of four of its columns.
The table isn't very wide, and the four columns it is unique over are its larger ones (in bytes). The end result is that the index enforcing uniqueness is 21GB. This isn't the result of the index bloating over time; it is the size of the index immediately after it is created.
I don't need to optimize for insert speed at all, as inserts will only happen in batches once per month. There won't be any updates to any rows once they are inserted.
I'm running PostgreSQL 9.5.0.
Is there a way to not duplicate such a large portion of the database just to enforce a unique constraint? Possibly using something like a clustered index?
Full table description:
CREATE TABLE medi_cal_base_eligibility (
    client_index_number text NOT NULL,
    medi_cal_date date NOT NULL,
    eligibility_date date NOT NULL,
    aidcode text,
    responsible_county text,
    status text,
    cardinal smallint NOT NULL,
    id SERIAL PRIMARY KEY
);
Indexes:
"medi_cal_base_eligibility_pkey" PRIMARY KEY, btree
(id)
"medi_cal_base_eligibility_uq_dates_cin_cardinal" UNIQUE CONSTRAINT, btree
(eligibility_date, client_index_number, medi_cal_date, cardinal)
1 Answer
With PostgreSQL 9.5 you can use a BRIN index (which keeps the index very small, yet functional) and handle duplicate rejection via a trigger, like this:
CREATE INDEX ON medi_cal_base_eligibility USING BRIN (client_index_number);
CREATE OR REPLACE FUNCTION tf_medi_cal_base_eligibility_insert() RETURNS trigger AS
$BODY$
BEGIN
    IF (TG_OP = 'INSERT'
        OR (NEW.client_index_number, NEW.eligibility_date, NEW.medi_cal_date, NEW.cardinal)
           IS DISTINCT FROM (OLD.client_index_number, OLD.eligibility_date, OLD.medi_cal_date, OLD.cardinal))
       AND EXISTS (SELECT 1 FROM medi_cal_base_eligibility
                   WHERE (client_index_number, eligibility_date, medi_cal_date, cardinal)
                       = (NEW.client_index_number, NEW.eligibility_date, NEW.medi_cal_date, NEW.cardinal)) THEN
        -- RAISE without a level defaults to EXCEPTION, aborting the statement
        RAISE 'Duplicate key: %, %, %, %', NEW.client_index_number, NEW.eligibility_date, NEW.medi_cal_date, NEW.cardinal;
    END IF;
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql VOLATILE SECURITY DEFINER;
CREATE TRIGGER t_medi_cal_base_eligibility_insert
    BEFORE INSERT OR UPDATE ON medi_cal_base_eligibility
    FOR EACH ROW EXECUTE PROCEDURE tf_medi_cal_base_eligibility_insert();
As stated by dezso, a BRIN index is only useful when client_index_number correlates with the physical position of the row in the table.
If the BRIN solution above won't cut the mustard, a good alternative is to index a hash of the data. The size of the hash determines how many rows have to be scanned when checking uniqueness; also, the bigger the hash, the bigger the index. A 32-bit hash would most likely render unique results (or at most a handful of collisions) and would be just as large as your integer primary key. In the following example I derive the 32-bit hash from the last 8 hexadecimal digits of the md5 of your four unique columns concatenated together.
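As a sanity check on the "handful of collisions" claim, here is a quick back-of-envelope estimate. The row count below is a made-up figure for illustration only (the question doesn't state one), and the estimate assumes the hash values are uniformly distributed:

```python
# Rough estimate of how many *other* rows share a given 32-bit hash value,
# assuming a uniformly distributed hash (illustrative only).
def expected_matches(n_rows: int, hash_bits: int) -> float:
    """Expected number of existing rows colliding with one new row's hash."""
    return n_rows / 2 ** hash_bits

# With a hypothetical 100 million rows and a 32-bit hash,
# a uniqueness probe re-checks ~0.02 rows on average, i.e. almost always zero.
print(expected_matches(100_000_000, 32))
```

So even on a very large table, each insert's uniqueness check stays cheap: the index lookup narrows the search to essentially zero or one candidate row.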
CREATE OR REPLACE FUNCTION f_medi_cal_base_eligibility_to_int
    (p_client_index_number text, p_medi_cal_date date, p_eligibility_date date, p_cardinal smallint)
RETURNS int AS $BODY$
    SELECT ('x' || right(md5(1ドル || to_char(2,ドル 'YYYYMMDD') || to_char(3,ドル 'YYYYMMDD') || 4ドル::text), 8))::bit(32)::int
$BODY$ LANGUAGE SQL IMMUTABLE SECURITY DEFINER;
CREATE INDEX ON medi_cal_base_eligibility (f_medi_cal_base_eligibility_to_int(client_index_number, medi_cal_date, eligibility_date, cardinal));
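For reference, the arithmetic the SQL function performs (take the last 8 hex digits of the md5, then reinterpret them as a signed 32-bit integer via ::bit(32)::int) can be reproduced outside the database. This Python sketch mirrors that logic; the function name and sample values are made up for illustration, and it assumes the dates are passed in already formatted as YYYYMMDD, as to_char produces:

```python
import hashlib

def hash32(cin: str, medi_cal_yyyymmdd: str, eligibility_yyyymmdd: str, cardinal: int) -> int:
    """Mirror of the SQL expression: last 8 hex digits of md5, cast to signed int32."""
    s = cin + medi_cal_yyyymmdd + eligibility_yyyymmdd + str(cardinal)
    v = int(hashlib.md5(s.encode()).hexdigest()[-8:], 16)  # low-order 32 bits
    return v - 2 ** 32 if v >= 2 ** 31 else v  # ::bit(32)::int reinterprets as signed

# Deterministic: the same four values always map to the same index key.
print(hash32("12345678A", "20160101", "20160201", 1))
```

Having such a mirror can be handy for pre-computing hash keys in a batch-loading pipeline before the rows ever reach the database.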
CREATE OR REPLACE FUNCTION tf_medi_cal_base_eligibility_insert() RETURNS trigger AS
$BODY$
BEGIN
    IF (TG_OP = 'INSERT'
        OR (NEW.client_index_number, NEW.eligibility_date, NEW.medi_cal_date, NEW.cardinal)
           IS DISTINCT FROM (OLD.client_index_number, OLD.eligibility_date, OLD.medi_cal_date, OLD.cardinal))
       AND EXISTS (SELECT 1 FROM medi_cal_base_eligibility
                   WHERE (f_medi_cal_base_eligibility_to_int(client_index_number, medi_cal_date, eligibility_date, cardinal),
                          client_index_number, medi_cal_date, eligibility_date, cardinal)
                       = (f_medi_cal_base_eligibility_to_int(NEW.client_index_number, NEW.medi_cal_date, NEW.eligibility_date, NEW.cardinal),
                          NEW.client_index_number, NEW.medi_cal_date, NEW.eligibility_date, NEW.cardinal)) THEN
        -- RAISE without a level defaults to EXCEPTION, aborting the statement
        RAISE 'Duplicate key: %, %, %, %', NEW.client_index_number, NEW.medi_cal_date, NEW.eligibility_date, NEW.cardinal;
    END IF;
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql VOLATILE SECURITY DEFINER;
CREATE TRIGGER t_medi_cal_base_eligibility_insert
    BEFORE INSERT OR UPDATE ON medi_cal_base_eligibility
    FOR EACH ROW EXECUTE PROCEDURE tf_medi_cal_base_eligibility_insert();
Comments:
- "Please note the following (from the documentation page I've added to your post): 'BRIN is designed for handling very large tables in which certain columns have some natural correlation with their physical location within the table.'" – András Váczi, Mar 30, 2016
- "I've added an alternative to the solution relying on the BRIN index." – Ezequiel Tolnay, Mar 31, 2016
- "@ZiggyCrueltyfreeZeitgeister Thank you for a very involved answer. I'm not going to have a chance to really look into using it until next week, but I will let you know how it goes." – Gregory Arenius, Apr 1, 2016