I have a table with one billion rows and more than 50 columns. I need to reduce its size and speed up queries, backups, exports, etc. Some columns contain only a few hundred distinct values, for example long URLs (text data type), application names, and similar repeated information.

Is there a tool or script for PostgreSQL 9.3+ that can, for selected columns, easily create dictionaries of distinct values in separate tables and then replace the original values with a smallint identifier from that dictionary? Or do I have to write the SQL for that manually?

TableOriginal
1;VeryLongURLText
2;VeryLongURLText
3;LoooongURLText
4;LoooongURLText
5;LoooongURLText
TableDictionary
1;VeryLongURLText
2;LoooongURLText
TableUpdated
1;1
2;1
3;2
4;2
5;2

Thank you.

asked May 19, 2016 at 7:04
  • Is CREATE DOMAIN (postgresql.org/docs/9.5/static/sql-createdomain.html) the feature that should be used here? Commented May 19, 2016 at 7:30
  • A domain is a kind of data type. It's a shorthand for commonly used column types where you can e.g. enforce check constraints that should apply to all columns storing the same type of information. Commented May 19, 2016 at 8:11
  • For a billion rows, you had better create all of your dictionaries first and then make a single copy of your original table using all of them. Also, consider using smallint (i.e. int2) rather than int4 for dictionaries with few expected values. Commented May 19, 2016 at 8:55

1 Answer

Do I have to write SQL for that manually?

Yes, but it's not that hard:

create table original (id integer, url text);
insert into original 
values
(1,'VeryLongURLText'),
(2,'VeryLongURLText'),
(3,'LoooongURLText'),
(4,'LoooongURLText'),
(5,'LoooongURLText');

Create the dictionary:

create table dictionary (id serial, url text);
insert into dictionary (url)
select distinct url
from original;

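This creates the table with the following content (which id each URL gets is not guaranteed, because select distinct returns rows in no particular order):

id | url
---+----------------
 1 | LoooongURLText 
 2 | VeryLongURLText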
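
Following the smallint suggestion from the comments, the dictionary could also be declared with a 2-byte key. This is only a sketch: the table name dictionary_small is made up for illustration, and it assumes the column will never hold more than 32767 distinct values and that every URL is short enough to fit into a btree index entry.

```sql
-- Hypothetical variant of the dictionary table with a smallint key.
create table dictionary_small (
    id  smallserial primary key,  -- smallint fed by a sequence (PostgreSQL 9.2+)
    url text not null unique      -- the unique index also supports the join on url
);

-- Rows with a null url get no dictionary entry.
insert into dictionary_small (url)
select distinct url
from original
where url is not null;
```

Whether the smaller key actually saves space in the final table depends on how the surrounding columns are aligned, so it is worth measuring on a sample before converting all one billion rows.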

Now create a new table based on the dictionary:

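create table compressed 
as
-- some_column, other_column stand for the remaining columns of the real table
select o.id, o.some_column, o.other_column, d.id as dictionary_id
from original o
 left join dictionary d on o.url = d.url; -- left join keeps rows whose url is null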

As your goal is to reduce the space overhead, it's better to create a new table with the dictionary id rather than altering the existing one. An update writes a new version of every row, so the existing table would grow before it could shrink. Creating a new table will also be a lot faster than updating all rows of the existing table (with a billion rows it will still take some time, however).
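
Before dropping the original table, it may help to verify the copy and to give existing applications something to read from. The following is a rough sketch, assuming the sample tables original and dictionary from above and a table compressed(id, dictionary_id) built from them; the view name original_view is hypothetical.

```sql
-- 1. Row counts should match before the original table is dropped.
select (select count(*) from original)   as original_rows,
       (select count(*) from compressed) as compressed_rows;

-- 2. Hypothetical view that exposes the old (id, url) shape to applications.
create view original_view as
select c.id, d.url
from compressed c
 left join dictionary d on d.id = c.dictionary_id;

-- 3. Compare the on-disk size, including indexes and TOAST data.
select pg_size_pretty(pg_total_relation_size('original'))   as original_size,
       pg_size_pretty(pg_total_relation_size('compressed')) as compressed_size;
```

The view only covers reads; anything that writes to the table still has to be changed to look up or insert the dictionary id first.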

answered May 19, 2016 at 7:43
  • In your compressed table you still have the URL. You should not use o.* but name all fields except the url field. @DavidK Do not forget to change your applications to use this new structure. Commented May 19, 2016 at 11:29
  • @marco: you are right, corrected ;) Commented May 19, 2016 at 11:35
