I have a very simple but very big table. Its schema looks like this:

(yadda int, yadda1 int, yaddate date, ... other stuff).

Now, yaddate has an index of its own and also appears in other indexes together with other columns (e.g. (yadda1, date)).

The table itself is some 100M rows.

When I run

 select distinct date from mybigtable;

the time needed to get the list is in the range of 200 seconds. EXPLAIN ANALYZE tells me it's doing a seq scan, and I don't understand why, since the index is there.

The first thing I am trying is a REINDEX of the index on the date column alone.

  1. Am I doing something wrong?
  2. Since obviously there's something I am missing about seq and index scan, can someone shed some light?
  3. How can I make that query faster?

TIA.

asked Apr 15, 2014 at 17:10

2 Answers

There is a trick you can try that makes DISTINCT fast by walking the index one value at a time. It involves creating a function like this:

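CREATE OR REPLACE FUNCTION small_distinct(IN tablename character varying, IN fieldname character varying, IN sample anyelement DEFAULT '1800-01-01'::date)
 RETURNS SETOF anyelement AS
$BODY$
DECLARE
  -- Same type as the sample argument, i.e. the type of the column being scanned.
  result sample%TYPE;
BEGIN
  -- Smallest value: a single probe at the start of the index.
  -- tablename and fieldname are concatenated as-is, so pass trusted identifiers only.
  EXECUTE 'SELECT '||fieldname||' FROM '||tablename||' ORDER BY '||fieldname
    ||' LIMIT 1' INTO result;
  WHILE result IS NOT NULL LOOP
    RETURN NEXT result;
    -- Next distinct value: the first entry strictly greater than the previous one.
    EXECUTE 'SELECT '||fieldname||' FROM '||tablename
      ||' WHERE '||fieldname||' > $1 ORDER BY '||fieldname||' LIMIT 1'
      INTO result USING result;
  END LOOP;
END;
$BODY$
 LANGUAGE plpgsql VOLATILE
 COST 100
 ROWS 1000;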

Then create an index on the column whose distinct values you want. After that, select small_distinct('yourtable', 'yaddate'); should return those values with one index lookup per distinct value, without scanning the whole table.

Try it, but beware: I'm not sure it will work right out of the box, as I quickly adapted it from a varchar function.
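
The same idea can also be written as a single query. The following is a sketch of the recursive CTE emulation of a loose index scan described on the Postgres Wiki, not the function above. It assumes the column is actually named yaddate, that mybigtable has a b-tree index on it, and it has not been run against the table in the question.

```sql
-- Emulated loose index scan: one index probe per distinct value.
WITH RECURSIVE t AS (
   -- Anchor: the smallest yaddate in the table.
   (SELECT yaddate FROM mybigtable ORDER BY yaddate LIMIT 1)
   UNION ALL
   -- Step: the smallest yaddate strictly greater than the previous one.
   -- The scalar subquery yields NULL once no larger value exists,
   -- and the WHERE clause below then stops the recursion.
   SELECT (SELECT yaddate
           FROM   mybigtable
           WHERE  yaddate > t.yaddate
           ORDER  BY yaddate
           LIMIT  1)
   FROM   t
   WHERE  t.yaddate IS NOT NULL
)
SELECT yaddate
FROM   t
WHERE  yaddate IS NOT NULL;  -- drop the terminating NULL row
```

This only pays off when the number of distinct values is small relative to the row count, as it would be for a table of 100M rows holding one date per month.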

answered Apr 21, 2014 at 16:13
  • Nice trick! Might try that myself! Commented Oct 29, 2017 at 9:18

For this query:

select distinct date from mybigtable;

or its twin:

select date from mybigtable group by 1;

... the whole table has to be read. Postgres is not going to use any index, except, possibly, a covering index that is substantially smaller than the table itself. See the Postgres Wiki on slow counting.

Also, to be precise, that's not a count. If you are after an actual count, an estimate might be enough, and that can be had much faster. See the Postgres Wiki on count estimates.
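
As an illustration of such an estimate, the planner's row count for a table can be read from the system catalog. This is a minimal sketch that assumes the table is named mybigtable and is visible on the current search path. The figure is only as fresh as the last VACUUM, ANALYZE, or autovacuum run, and it is a row count, not a count of distinct dates.

```sql
-- Estimated number of rows, taken from planner statistics instead of a scan.
SELECT reltuples::bigint AS estimated_rows
FROM   pg_class
WHERE  oid = 'mybigtable'::regclass;
```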

If you provide more details of what you have and what you want, there might be workarounds with a materialized view or a lookup table ...
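
A materialized view workaround could look roughly like the sketch below. The view name mybigtable_months is hypothetical, the column is assumed to be named yaddate, and materialized views require Postgres 9.3 or later.

```sql
-- Pay for the full scan once and store the small result.
-- mybigtable_months is a made-up name for illustration.
CREATE MATERIALIZED VIEW mybigtable_months AS
SELECT DISTINCT yaddate
FROM   mybigtable;

-- The view is not kept up to date automatically; refresh it after loading data.
REFRESH MATERIALIZED VIEW mybigtable_months;

-- Reading the list of dates is now a scan of a tiny relation.
SELECT yaddate
FROM   mybigtable_months
ORDER  BY yaddate;
```

The trade-off is staleness: between refreshes the view can miss dates that were added to the table.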

answered Apr 15, 2014 at 19:21
  • Hi, thanks. What I have is: various data with dates. Data always refers to the first of the month (date is always in the format YYYY-MM-1), so what I want is the set of months this dataset covers. I am not counting, I really want the set of N dates that are in the dataset. Does this help? Commented Apr 15, 2014 at 22:25
