I have a slow query that generates a report of account activity per week over the past year. The table currently has nearly 5 million rows, and the query takes about 8 seconds to execute. The bottleneck is the sequential scan over the timestamp range.
account=> EXPLAIN ANALYZE SELECT to_timestamp(to_char(date_trunc('week', event_time), 'IYYY-IW'), 'IYYY-IW')::date AS date, count(DISTINCT account) FROM account_history WHERE event_time BETWEEN now() - interval '51 weeks' AND now() GROUP BY date ORDER BY date;
GroupAggregate (cost=450475.76..513465.44 rows=2290534 width=12) (actual time=7524.474..8003.291 rows=52 loops=1)
Group Key: ((to_timestamp(to_char(date_trunc('week'::text, event_time), 'IYYY-IW'::text), 'IYYY-IW'::text))::date)
-> Sort (cost=450475.76..456202.09 rows=2290534 width=12) (actual time=7519.053..7691.924 rows=2314164 loops=1)
Sort Key: ((to_timestamp(to_char(date_trunc('week'::text, event_time), 'IYYY-IW'::text), 'IYYY-IW'::text))::date)
Sort Method: external sort Disk: 40704kB
-> Seq Scan on account_history (cost=0.00..169364.81 rows=2290534 width=12) (actual time=1470.438..6222.076 rows=2314164 loops=1)
Filter: ((event_time <= now()) AND (event_time >= (now() - '357 days'::interval)))
Rows Removed by Filter: 2591679
Planning time: 0.126 ms
Execution time: 8011.160 ms
The table:
account=> \d account_history
Table "public.account_history"
Column | Type | Modifiers
-------------+-----------------------------+---------------------------
account | integer | not null
event_code | text | not null
event_time | timestamp without time zone | not null default now()
description | text | not null default ''::text
Indexes:
"account_history_idx" btree (account, event_time DESC)
"account_id_idx" btree (account, event_code, event_time)
Foreign-key constraints:
"account_fk" FOREIGN KEY (account) REFERENCES account(id) ON UPDATE CASCADE ON DELETE RESTRICT
"event_code_fk" FOREIGN KEY (event_code) REFERENCES domain_account_event(code) ON UPDATE CASCADE ON DELETE RESTRICT
When I originally created this table, I added the timestamp column as part of a btree index, but I figured that the sequential scan was due to the (then) small number of rows in the table (see related question).
However, now that the table has grown into the millions, I've noticed a performance issue with the query and discovered that the index is not being used in the query.
I tried adding an ordered index as recommended here, but that is clearly not being used in the execution plan either.
Is there a better way to index this table, or is there something inherent in my query that is bypassing both of these indices?
Update: When I add an index over only the timestamp column, that index is used, but it only reduces the execution time by about 25%:
account=> CREATE INDEX account_history_time_idx ON account_history (event_time DESC);
account=> EXPLAIN ANALYZE VERBOSE SELECT to_timestamp(to_char(date_trunc('week', event_time), 'IYYY-IW'), 'IYYY-IW')::date AS date, count(DISTINCT account) FROM account_history WHERE event_time BETWEEN now() - interval '51 weeks' AND now() GROUP BY date ORDER BY date;
GroupAggregate (cost=391870.30..454870.16 rows=2290904 width=12) (actual time=5481.930..6104.838 rows=52 loops=1)
Output: ((to_timestamp(to_char(date_trunc('week'::text, event_time), 'IYYY-IW'::text), 'IYYY-IW'::text))::date), count(DISTINCT account)
Group Key: ((to_timestamp(to_char(date_trunc('week'::text, account_history.event_time), 'IYYY-IW'::text), 'IYYY-IW'::text))::date)
-> Sort (cost=391870.30..397597.56 rows=2290904 width=12) (actual time=5474.181..5771.903 rows=2314038 loops=1)
Output: ((to_timestamp(to_char(date_trunc('week'::text, event_time), 'IYYY-IW'::text), 'IYYY-IW'::text))::date), account
Sort Key: ((to_timestamp(to_char(date_trunc('week'::text, account_history.event_time), 'IYYY-IW'::text), 'IYYY-IW'::text))::date)
Sort Method: external merge Disk: 40688kB
-> Index Scan using account_history_time_idx on public.account_history (cost=0.44..110710.59 rows=2290904 width=12) (actual time=0.108..4352.143 rows=2314038 loops=1)
Output: (to_timestamp(to_char(date_trunc('week'::text, event_time), 'IYYY-IW'::text), 'IYYY-IW'::text))::date, account
Index Cond: ((account_history.event_time >= (now() - '357 days'::interval)) AND (account_history.event_time <= now()))
Planning time: 0.204 ms
Execution time: 6112.832 ms
https://explain.depesz.com/s/PSfU
I also tried VACUUM FULL as suggested here, but it made no difference in the execution time.
Here are execution plans for some simpler queries over the same table:
Simply counting the rows takes 0.5 seconds:
account=> EXPLAIN ANALYZE VERBOSE SELECT COUNT(*) FROM account_history;
Aggregate (cost=97401.04..97401.05 rows=1 width=0) (actual time=551.179..551.179 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.account_history (cost=0.00..85136.43 rows=4905843 width=0) (actual time=0.039..344.675 rows=4905843 loops=1)
Output: account, event_code, event_time, description
Planning time: 0.075 ms
Execution time: 551.209 ms
And using the same time range clause takes less than one second:
account=> EXPLAIN ANALYZE VERBOSE SELECT COUNT(*) FROM account_history WHERE event_time BETWEEN now() - interval '51 weeks' AND now();
Aggregate (cost=93527.57..93527.58 rows=1 width=0) (actual time=997.436..997.436 rows=1 loops=1)
Output: count(*)
-> Index Only Scan using account_history_time_idx on public.account_history (cost=0.44..87800.45 rows=2290849 width=0) (actual time=0.100..897.776 rows=2313987 loops=1)
Output: event_time
Index Cond: ((account_history.event_time >= (now() - '357 days'::interval)) AND (account_history.event_time <= now()))
Heap Fetches: 2313987
Planning time: 0.239 ms
Execution time: 997.473 ms
Based on the comments, I tried a simplified form of the query:
account=> EXPLAIN ANALYZE VERBOSE SELECT date_trunc('week', event_time) AS date, count(DISTINCT account) FROM account_history
WHERE event_time BETWEEN now() - interval '51 weeks' AND now() GROUP BY date ORDER BY date;
GroupAggregate (cost=374676.22..420493.00 rows=2290839 width=12) (actual time=2475.556..3078.191 rows=52 loops=1)
Output: (date_trunc('week'::text, event_time)), count(DISTINCT account)
Group Key: (date_trunc('week'::text, account_history.event_time))
-> Sort (cost=374676.22..380403.32 rows=2290839 width=12) (actual time=2468.654..2763.739 rows=2313977 loops=1)
Output: (date_trunc('week'::text, event_time)), account
Sort Key: (date_trunc('week'::text, account_history.event_time))
Sort Method: external merge Disk: 49720kB
-> Index Scan using account_history_time_idx on public.account_history (cost=0.44..93527.35 rows=2290839 width=12) (actual time=0.094..1537.488 rows=2313977 loops=1)
Output: date_trunc('week'::text, event_time), account
Index Cond: ((account_history.event_time >= (now() - '357 days'::interval)) AND (account_history.event_time <= now()))
Planning time: 0.220 ms
Execution time: 3086.828 ms
(12 rows)
account=> SELECT date_trunc('week', current_date) AS date, count(DISTINCT account) FROM account_history WHERE event_time BETWEEN now() - interval '51 weeks' AND now() GROUP BY date ORDER BY date;
date | count
------------------------+-------
2017-10-23 00:00:00-04 | 132
(1 row)
Indeed, that cuts the execution time in half, but unfortunately it does not give the desired results, as in:
account=> SELECT to_timestamp(to_char(date_trunc('week', event_time), 'IYYY-IW'), 'IYYY-IW')::date AS date, count(DISTINCT account) FROM account_history WHERE event_time BETWEEN now() - interval '51 weeks' AND now() GROUP BY date ORDER BY date;
date | count
------------+-------
2016-10-31 | 14
...
2017-10-23 | 584
(52 rows)
If I can find a cheaper way to aggregate these records by week, that will go a long way toward solving this problem.
I am open to any suggestions on improving the performance of the weekly query with the GROUP BY clause, including altering the table.
I created a materialized view as a test, but of course refreshing it takes exactly as long as the original query, so unless I refresh it only a few times per day it doesn't really help, and it adds complexity:
account=> CREATE MATERIALIZED VIEW account_activity_weekly AS SELECT to_timestamp(to_char(date_trunc('week', event_time), 'IYYY-IW'), 'IYYY-IW')::date AS date, count(DISTINCT account) FROM account_history WHERE event_time BETWEEN now() - interval '51 weeks' AND now() GROUP BY date ORDER BY date;
SELECT 52
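If I do go that route, the periodic refresh itself is simple; here is a sketch (the CONCURRENTLY form only becomes an option if I add a plain unique index to the view):

-- Refresh a few times per day, e.g. from cron; this locks out readers while it runs.
REFRESH MATERIALIZED VIEW account_activity_weekly;

-- Non-blocking variant, which requires a unique index on the view first:
-- CREATE UNIQUE INDEX ON account_activity_weekly (date);
-- REFRESH MATERIALIZED VIEW CONCURRENTLY account_activity_weekly;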
Based on an additional comment, I have revised my query as follows, which cut the execution time in half and does deliver the expected result set:
account=> EXPLAIN ANALYZE VERBOSE SELECT to_timestamp(to_char(date_trunc('week', event_time), 'IYYY-IW'), 'IYYY-IW')::date AS date, count(DISTINCT account) FROM account_history WHERE event_time BETWEEN now() - interval '51 weeks' AND now() GROUP BY date_trunc('week', event_time) ORDER BY date;
Sort (cost=724523.11..730249.97 rows=2290745 width=12) (actual time=3188.495..3188.496 rows=52 loops=1)
Output: ((to_timestamp(to_char((date_trunc('week'::text, event_time)), 'IYYY-IW'::text), 'IYYY-IW'::text))::date), (count(DISTINCT account)), (date_trunc('week'::text, event_time))
Sort Key: ((to_timestamp(to_char((date_trunc('week'::text, account_history.event_time)), 'IYYY-IW'::text), 'IYYY-IW'::text))::date)
Sort Method: quicksort Memory: 29kB
-> GroupAggregate (cost=374662.50..443384.85 rows=2290745 width=12) (actual time=2573.694..3188.451 rows=52 loops=1)
Output: (to_timestamp(to_char((date_trunc('week'::text, event_time)), 'IYYY-IW'::text), 'IYYY-IW'::text))::date, count(DISTINCT account), (date_trunc('week'::text, event_time))
Group Key: (date_trunc('week'::text, account_history.event_time))
-> Sort (cost=374662.50..380389.36 rows=2290745 width=12) (actual time=2566.086..2859.590 rows=2313889 loops=1)
Output: (date_trunc('week'::text, event_time)), event_time, account
Sort Key: (date_trunc('week'::text, account_history.event_time))
Sort Method: external merge Disk: 67816kB
-> Index Scan using account_history_time_idx on public.account_history (cost=0.44..93524.23 rows=2290745 width=12) (actual time=0.090..1503.985 rows=2313889 loops=1)
Output: date_trunc('week'::text, event_time), event_time, account
Index Cond: ((account_history.event_time >= (now() - '357 days'::interval)) AND (account_history.event_time <= now()))
Planning time: 0.205 ms
Execution time: 3198.125 ms
(16 rows)
2 Answers
Thanks to those who contributed in the comments, I have reduced the query time from ~8000 ms to ~1650 ms by:
- Adding an index on the timestamp column only (~2000 ms improvement).
- Removing the extra timestamp-to-char-to-timestamp conversion (or adding date_trunc('week', event_time) to the GROUP BY clause) (~3000 ms improvement); see the sketch below.
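Concretely, the grouping now happens directly on the truncated timestamp, along these lines (the ::date cast is only needed if the report wants a plain date rather than a timestamp):

SELECT date_trunc('week', event_time)::date AS date,
       count(DISTINCT account)
FROM   account_history
WHERE  event_time BETWEEN now() - interval '51 weeks' AND now()
GROUP  BY date_trunc('week', event_time)
ORDER  BY date;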
For reference, the current table structure and execution plan are below.
I did play around with the other variations of indexing over multiple columns, but none of those indices were used by the execution plan.
In addition, I took the advice of another comment and performed the following steps (followed by VACUUM and REINDEX):
- Dropped the constraints from the description column and set all empty strings to NULL.
- Converted the timestamp column from WITHOUT TIME ZONE to WITH TIME ZONE.
- Increased work_mem to 100MB (via postgresql.conf).
ALTER TABLE account_history ALTER event_time TYPE timestamptz USING event_time AT TIME ZONE 'UTC';
ALTER TABLE account_history ALTER COLUMN description DROP NOT NULL;
ALTER TABLE account_history ALTER COLUMN description DROP DEFAULT;
UPDATE account_history SET description=NULL WHERE description='';
VACUUM FULL;
REINDEX TABLE account_history;
account=> show work_mem;
work_mem
----------
100MB
These additional changes shaved another 400 ms off the execution time and also cut down the planning time. One thing to note is that the sort method changed from "external sort" to "external merge". Since disk was still being used for the sort, I increased work_mem to 200MB, which resulted in the in-memory quicksort method being used (176MB). This dropped a full second off the execution time (although 200MB is really too much to allocate on our server instances).
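Rather than raising work_mem globally, one option (just a sketch, not something I have rolled out) is to bump it only for the session or transaction that runs this report:

-- Raise work_mem for this session only, then put it back:
SET work_mem = '200MB';
-- ... run the weekly report query here ...
RESET work_mem;

-- Or scope it to a single transaction; SET LOCAL reverts automatically at COMMIT/ROLLBACK:
BEGIN;
SET LOCAL work_mem = '200MB';
-- ... run the weekly report query here ...
COMMIT;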
The updated table and execution plan are below.
account=> \d account_history
Table "public.account_history"
Column | Type | Modifiers
-------------+--------------------------+------------------------
account | integer | not null
event_code | text | not null
event_time | timestamp with time zone | not null default now()
description | text |
Indexes:
"account_history_account_idx" btree (account)
"account_history_account_time_idx" btree (event_time DESC, account)
"account_history_time_idx" btree (event_time DESC)
Foreign-key constraints:
"account_fk" FOREIGN KEY (account) REFERENCES account(id) ON UPDATE CASCADE ON DELETE RESTRICT
"event_code_fk" FOREIGN KEY (event_code) REFERENCES domain_account_event(code) ON UPDATE CASCADE ON DELETE RESTRICT
account=> EXPLAIN ANALYZE VERBOSE SELECT date_trunc('week', event_time) AS date, count(DISTINCT account) FROM account_history WHERE event_time BETWEEN now() - interval '51 weeks' AND now() GROUP BY date ORDER BY date;
GroupAggregate (cost=334034.60..380541.52 rows=2325346 width=12) (actual time=1307.742..1685.676 rows=52 loops=1)
Output: (date_trunc('week'::text, event_time)), count(DISTINCT account)
Group Key: (date_trunc('week'::text, account_history.event_time))
-> Sort (cost=334034.60..339847.97 rows=2325346 width=12) (actual time=1303.565..1361.540 rows=2312418 loops=1)
Output: (date_trunc('week'::text, event_time)), account
Sort Key: (date_trunc('week'::text, account_history.event_time))
Sort Method: quicksort Memory: 176662kB
-> Index Only Scan using account_history_account_time_idx on public.account_history (cost=0.44..88140.73 rows=2325346 width=12) (actual time=0.028..980.822 rows=2312418 loops=1)
Output: date_trunc('week'::text, event_time), account
Index Cond: ((account_history.event_time >= (now() - '357 days'::interval)) AND (account_history.event_time <= now()))
Heap Fetches: 0
Planning time: 0.153 ms
Execution time: 1697.824 ms
I am very happy with the improvements thus far, but I welcome any other contributions to improving the performance of this query, since this is still the slowest query I have in one of my views.
- Why are you still doing to_char? Get rid of to_timestamp(to_char()) and just use date_trunc(). – Evan Carroll, Oct 24, 2017 at 15:53
- You shouldn't be using timestamp without time zone, you should be using timestamp with time zone. You should also permit description to be null and set '' to null. You should cluster on event_time. Also, what is work_mem? Run show work_mem;. – Evan Carroll, Oct 24, 2017 at 17:40
- So if they're in different time zones, you would want them to pull out the timestamps in the time zones specified in the clients' TIME ZONE. That's what with time zone does. As for 1 MB work_mem, that's way too low. Raise that to 100 MB: ALTER SYSTEM SET work_mem = '100MB'; SELECT pg_reload_conf(); – Evan Carroll, Oct 24, 2017 at 18:40
- Try restarting the service and checking to see if work_mem is still 100MB. Did you cluster on event_time? – Evan Carroll, Oct 24, 2017 at 19:28
- That's all the easy options, though. Consider dropping account_history_time_idx because you're not using it. If you need something substantially faster than 1697.824 ms, you'll have to materialize it, such as caching the dates and unique customers for every day, and then joining from the cache to get the last completed day and then finishing up with the query from there. – Evan Carroll, Oct 24, 2017 at 19:46
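A rough sketch of the daily-cache idea from that last comment (the table and column names here are illustrative, not part of the actual schema): keep one row per day per active account, fill it once per completed day, and aggregate the weekly distinct counts from the much smaller cache.

-- Hypothetical daily cache: one row per (day, account) pair
CREATE TABLE account_activity_daily (
    day     date    NOT NULL,
    account integer NOT NULL REFERENCES account(id),
    PRIMARY KEY (day, account)
);

-- Run once per day for the previous (completed) day
INSERT INTO account_activity_daily (day, account)
SELECT event_time::date, account
FROM   account_history
WHERE  event_time >= current_date - 1
AND    event_time <  current_date
GROUP  BY 1, 2
ON CONFLICT DO NOTHING;

-- The weekly report then reads the small cache instead of the 5M-row history table
SELECT date_trunc('week', day)::date AS date, count(DISTINCT account)
FROM   account_activity_daily
WHERE  day >= current_date - interval '51 weeks'
GROUP  BY 1
ORDER  BY 1;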
To solve a similar slow date-range query, I converted the dates to Unix time (UTC); I only needed second accuracy, but you could go finer if you need to. Create a method to convert your dates to a bigint/long (include your timezone conversion there), then run your query and just search between the two integers. It may sound a bit wild, but it works like a dream.
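In sketch form, the approach described here would look something like the following (the column and index names are made up for illustration; see the comments below on whether it actually beats a plain index on the timestamp column):

-- Store seconds since the Unix epoch (UTC) alongside the timestamp
ALTER TABLE account_history ADD COLUMN event_epoch bigint;
UPDATE account_history SET event_epoch = extract(epoch FROM event_time)::bigint;
CREATE INDEX account_history_epoch_idx ON account_history (event_epoch);

-- Range searches then compare plain integers
SELECT count(*)
FROM   account_history
WHERE  event_epoch BETWEEN extract(epoch FROM now() - interval '51 weeks')::bigint
                       AND extract(epoch FROM now())::bigint;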
- I don't think that converting a timestamp to an epoch and comparing those bigints really improved things. I would rather guess that while you were doing that, you probably changed the way the query works or created additional indexes that weren't there to begin with. An index on a timestamp column will be just as efficient as an index on a bigint column. And you don't really need to create a function to convert between the Unix epoch and a proper timestamp; Postgres already provides functions for that. – user1822, Mar 4, 2019 at 10:04
- I managed to get a 55x speedup on a query by converting part of a WHERE clause from "ts between timestamp <iso stamp1> and <iso stamp2>" to "extract(epoch from ts) between <unix int1> and <unix int2>". The "ts" column in this case is a timestamp that is not timezone aware. This seems kinda insane. – jlucier, Jan 13, 2022 at 19:52