
I have a somewhat involved query that splits strings and outputs each word as a record.

I ran a quick test, one version with a CTE and one with a subquery, and was somewhat surprised to see that the CTE takes twice as long to execute.

Here is the gist of what the query does:

-- 1. translate replaces each character from the given list of symbols in comment with a comma.
-- 2. string_to_array splits the string on commas into an array, turning empty elements into NULL.
-- 3. unnest unpacks the array into rows (see the sketch below).
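
A minimal sketch of that pipeline on a made-up literal (symbol list shortened, same mechanics as the real query):

SELECT unnest(
        string_to_array(
          translate('great service, would recommend!', ' ,!', ',,,'), -- symbols become commas
          ','::text,
          ''::text -- empty elements (from consecutive delimiters) become NULL
        )
      ) AS word;
-- rows: great, service, NULL, would, recommend, NULL
-- which is why the outer query filters on word IS NOT NULL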

Inline subquery

SELECT
 sub_query.word,
 sub_query._created_at
FROM 
( SELECT unnest(string_to_array(translate(nps_reports.comment::text, ' ,.<>?/;:@#~[{]}=+-_)("*&^%$£!`\|}'::text, ',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'::text), ','::text, ''::text)) AS word,
 nps_reports.comment,
 nps_reports._id,
 nps_reports._created_at
 FROM nps_reports
 WHERE nps_reports.comment::text <> 'undefined'::text
) sub_query 
WHERE sub_query.word IS NOT NULL AND NOT (sub_query.word IN ( SELECT stop_words.stop_word FROM stop_words))
ORDER BY sub_query._created_at DESC;

CTE

WITH split AS
(
SELECT unnest(string_to_array(translate(nps_reports.comment::text, ' ,.<>?/;:@#~[{]}=+-_)("*&^%$£!`\|}'::text, ',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'::text), ','::text, ''::text)) AS word,
 nps_reports.comment,
 nps_reports._id,
 nps_reports._created_at
FROM nps_reports
WHERE nps_reports.comment::text <> 'undefined'::text
)
SELECT
 split.word,
 split._created_at
FROM split
WHERE split.word IS NOT NULL AND NOT (split.word IN ( SELECT stop_words.stop_word FROM stop_words))
ORDER BY split._created_at DESC;

And here are the EXPLAINs for each:

Subquery Explain

Sort  (cost=15921589.76..16082302.91 rows=64285258 width=40) (actual time=16299.150..17697.914 rows=4394788 loops=1)
  Sort Key: sub_query._created_at DESC
  Sort Method: external merge  Disk: 116112kB
  Buffers: shared hit=22915 read=7627, temp read=34281 written=34281
  ->  Subquery Scan on sub_query  (cost=2.49..2311035.10 rows=64285258 width=40) (actual time=0.177..13274.895 rows=4394788 loops=1)
        Filter: ((sub_query.word IS NOT NULL) AND (NOT (hashed SubPlan 1)))
        Rows Removed by Filter: 3676303
        Buffers: shared hit=22915 read=7627
        ->  Seq Scan on nps_reports  (cost=0.00..695825.11 rows=129216600 width=88) (actual time=0.073..9781.244 rows=8071091 loops=1)
              Filter: ((comment)::text <> 'undefined'::text)
              Rows Removed by Filter: 844360
              Buffers: shared hit=22914 read=7627
        SubPlan 1
          ->  Seq Scan on stop_words  (cost=0.00..2.19 rows=119 width=4) (actual time=0.016..0.034 rows=119 loops=1)
                Buffers: shared hit=1
Planning time: 0.115 ms
Execution time: 18451.245 ms

CTE Explain

Sort  (cost=17213755.76..17374468.91 rows=64285258 width=40) (actual time=44008.467..45508.786 rows=4394788 loops=1)
  Sort Key: split._created_at DESC
  Sort Method: external merge  Disk: 116112kB
  Buffers: shared hit=23031 read=7531, temp read=34281 written=353942
  CTE split
    ->  Seq Scan on nps_reports  (cost=0.00..695825.11 rows=129216600 width=135) (actual time=0.057..10451.951 rows=8071091 loops=1)
          Filter: ((comment)::text <> 'undefined'::text)
          Rows Removed by Filter: 844360
          Buffers: shared hit=23027 read=7531
  ->  CTE Scan on split  (cost=2.49..2907375.99 rows=64285258 width=40) (actual time=0.162..37888.364 rows=4394788 loops=1)
        Filter: ((word IS NOT NULL) AND (NOT (hashed SubPlan 2)))
        Rows Removed by Filter: 3676303
        Buffers: shared hit=23028 read=7531, temp written=319661
        SubPlan 2
          ->  Seq Scan on stop_words  (cost=0.00..2.19 rows=119 width=4) (actual time=0.009..0.030 rows=119 loops=1)
                Buffers: shared hit=1
Planning time: 0.649 ms
Execution time: 46297.825 ms
asked Oct 10, 2017 at 11:02
  • no json please. just show the text plan. Commented Oct 10, 2017 at 15:34
  • @Petar check out the query in my example, tell me how it performs. Commented Oct 10, 2017 at 15:44

2 Answers


CTEs in PostgreSQL are an optimization fence. That means the query planner doesn't push optimizations across a CTE boundary: the CTE is materialized in full before the outer filter runs, which is where the extra temp written=319661 on the CTE Scan in your plan comes from.
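
On PostgreSQL 12 and later (newer than anything available when these plans were produced), the fence can be lifted per CTE. A minimal sketch, with the symbol list shortened for brevity:

-- NOT MATERIALIZED (PostgreSQL 12+) lets the planner inline the CTE
-- just like the subquery version.
WITH split AS NOT MATERIALIZED (
    SELECT unnest(string_to_array(translate(comment::text, ' ,.', ',,,'), ',', '')) AS word,
           _created_at
    FROM nps_reports
    WHERE comment::text <> 'undefined'
)
SELECT word, _created_at
FROM split
WHERE word IS NOT NULL
ORDER BY _created_at DESC;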

I think a lot of this is silly, though; you can just write it like this. Here we use CROSS JOIN LATERAL rather than the complex wrapping, and NOT EXISTS rather than NOT IN (NOT IN behaves surprisingly when the subquery can yield NULLs, and NOT EXISTS generally plans better):

SELECT word,
 _created_at
FROM nps_reports
CROSS JOIN LATERAL unnest(regexp_split_to_array(
 nps_reports.comment,
 '[^a-zA-Z0-9]+'
)) AS word
WHERE nps_reports.comment <> 'undefined'
 AND nps_reports.comment IS NOT NULL
 AND NOT EXISTS (
 SELECT 1
 FROM stop_words
 WHERE stop_words.stop_word = word
 )
ORDER BY _created_at DESC;

All of that said, whatever you're doing seems to be reinventing full-text search (FTS). So that's also a bad idea.

answered Oct 10, 2017 at 15:36
  • Execution times: Yours: 22640.389 ms, Subquery: 18451.245 ms, CTE: 46297.825 ms Commented Oct 10, 2017 at 17:10
  • @Petar are you just trying to split all strings by [^a-zA-Z0-9] (non-alphanumerics)? Commented Oct 10, 2017 at 19:03
  • @Petar try the updated query, that should be substantially faster. Commented Oct 10, 2017 at 19:07
  • Hmm, I don't think regex is the way to go - it is an expensive operation - the updated query took 43638.959 ms. I am fairly confident that my original subquery is the optimal solution (apart from what FTS can offer). Commented Oct 11, 2017 at 9:40
  • This question was more towards CTE vs Subquery and I think you have answered it. For anyone that comes here looking for an optimum solution, see this question stackoverflow.com/questions/46687065/ And also see my answer below. I was able to reduce the whole operation to 10 secs Commented Oct 11, 2017 at 15:19

@Evan Carroll explained why the CTE takes longer, but here is an improved query that is faster than all of the solutions listed above.

See this question for more background.

-- create custom dict (you don't necessarily need to do this)
CREATE TEXT SEARCH DICTIONARY simple_with_stop_words (TEMPLATE = pg_catalog.simple, STOPWORDS = english);
CREATE TEXT SEARCH CONFIGURATION public.simple_with_stop_words (COPY = pg_catalog.simple);
ALTER TEXT SEARCH CONFIGURATION public.simple_with_stop_words ALTER MAPPING FOR asciiword WITH simple_with_stop_words;
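-- Quick sanity check of the custom configuration (hypothetical sample text):
-- English stop words such as 'the' and 'was' should be dropped.
SELECT to_tsvector('simple_with_stop_words', 'the service was great');
-- returns: 'great':4 'service':2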
-- the actual query
SELECT 
 token.word, 
 nps._created_at
FROM nps_reports nps
CROSS JOIN LATERAL UNNEST(to_tsvector('simple_with_stop_words', nps.comment)) token(word)
WHERE nps.comment IS NOT NULL AND
 nps.comment <> 'undefined' AND
 nps.language = 'en-US';

This utilises PostgreSQL's to_tsvector function, which does several things depending on the configuration given to it. If used with the simple dictionary instead of the custom one I made, it will simply split any string into words.
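
For illustration, here is what unnest over a tsvector produces for a made-up comment. unnest(tsvector) (available since PostgreSQL 9.6) returns lexeme, positions, and weights columns; the query above aliases only the first as word. Note that a tsvector stores its lexemes sorted alphabetically:

SELECT * FROM unnest(to_tsvector('simple', 'Great service, would recommend!'));
--   lexeme   | positions | weights
-- -----------+-----------+---------
--  great     | {1}       | {D}
--  recommend | {4}       | {D}
--  service   | {2}       | {D}
--  would     | {3}       | {D}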

I am also utilising a feature from Postgres 9.3+, the LATERAL keyword, which enables me to pass a value from the left side of the join to the right side of the join, i.e. I can pass the comment into UNNEST.
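
A minimal sketch of that behaviour on inline sample data (hypothetical values, not the real table):

-- each row's txt column feeds the set-returning function on the right side
SELECT t.id, w.word
FROM (VALUES (1, 'a b'), (2, 'c')) AS t(id, txt)
CROSS JOIN LATERAL regexp_split_to_table(t.txt, ' ') AS w(word);
--  id | word
-- ----+------
--   1 | a
--   1 | b
--   2 | c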

This takes about 10 seconds to execute over the whole database, compared with the previous fastest method (the subquery), which took 18 seconds.

answered Oct 11, 2017 at 15:26
