Postgres Full Text Search With Short Word Matching

Question 1

I'm using Postgres' full text search to return results, but it seems any query with 3 characters or fewer returns no results. Is there a way to configure Postgres so that I can query with 2 and 3 character long strings? My data set is extremely small, so performance would not be an issue.

For reference, here's my table definition:

CREATE TABLE doctokens (
 id serial primary key,
 inner_text TEXT NOT NULL,
 tokens tsvector
)

INSERT INTO doctokens (inner_text) VALUES ('Route builders give you a fluent syntax')

After an insert, I create the token tsvector:

UPDATE doctokens SET tokens = setweight(to_tsvector('english', COALESCE(inner_text, '')), 'A')

And here's how I'm querying the table:

SELECT ts_headline('english', inner_text, query, 'StartSel = <em>, StopSel = </em>, ShortWord = 1') as inner_text_highlights
FROM (SELECT inner_text, ts_rank_cd(tokens, query) AS rank, query
 FROM doctokens, plainto_tsquery('english', 'rou') AS query
 WHERE tokens @@ query
 ORDER BY rank DESC
 LIMIT 5) AS query_results;

Question 2

I have other rows that contain something like Routing (inserted the same way as above), and searching for routi does not yield any results, either. So, maybe what I'm running into is just a limitation of the capabilities of Postgres' full text search?

Question 3

'route' and 'routing' stem to 'rout'. Neither 'rou' nor 'routi' do. So, they don't match.

You can use the prefix match to make 'rou' match, but plainto_tsquery doesn't understand that so you have to make the tsquery yourself:

SELECT ts_headline('english', inner_text, query, 'StartSel = <em>, StopSel = </em>, ShortWord = 1') as inner_text_highlights
FROM (SELECT inner_text, ts_rank_cd(tokens, query) AS rank, query
 FROM doctokens, (select 'rou:*'::tsquery as query ) AS query
 WHERE tokens @@ query
 ORDER BY rank DESC
 LIMIT 5) AS query_results;
 inner_text_highlights 
--------------------------------------------------
 <em>Route</em> builders give you a fluent syntax
 Does <em>Routing</em> rhyme with looting?

Prefix matching won't let you match 'routi:*', because the things it would match against get the 'i' stripped off when stemming.

Question 4

If I wanted to get the benefits of both search methods, should I just have two separate queries in my SELECT - once with plainto_tsquery() and one with prefix matching?

Question 5

@Dave You could do that, probably by applying a UNION ALL (rather than an OR). But I think that the ts_rank_cd function, executed on two different query strings, might not return comparable results, so you might not want to sort them together. I'd probably just execute the plainto_tsquery one first, and then if it returned no rows, try the other one. (Or give the user a checkbox, 'plain' or 'advanced')

Question 6

Is there any way to make routi also match, or is that not really possible with tsquery? I know I could resort to using WHERE inner_text LIKE '%routi%', but was curious if there is any way to make it work without using LIKE.

Question 7

You could use a configuration which doesn't do stemming, like 'simple' rather than 'english'. But at that point, I don't know if you are really doing full text search anymore. If is really important, you could keep two tsvector columns or FTS indexes, with different configurations to choose between them. Or one FTS and one pg_trgm: You can search with LIKE and rank with <->, but pg_trgm doesn't offer something like ts_headline.

Question 8

Thanks. That's the route I ultimately took - I created an English tsvector and a simple tsvector at insertion. Then, I query the English lexemes with plainto_tsquery('english', :query) and the simple lexemes using an approach similar to what you laid out, union the two data sets together, and grab the top 5 with the highest rank.

jjanes jjanes 42.4k3 gold badges44 silver badges54 bronze badges · Accepted Answer · 2019-09-02 20:59:48Z

2

'route' and 'routing' stem to 'rout'. Neither 'rou' nor 'routi' do. So, they don't match.

You can use the prefix match to make 'rou' match, but plainto_tsquery doesn't understand that so you have to make the tsquery yourself:

SELECT ts_headline('english', inner_text, query, 'StartSel = <em>, StopSel = </em>, ShortWord = 1') as inner_text_highlights
FROM (SELECT inner_text, ts_rank_cd(tokens, query) AS rank, query
 FROM doctokens, (select 'rou:*'::tsquery as query ) AS query
 WHERE tokens @@ query
 ORDER BY rank DESC
 LIMIT 5) AS query_results;
 inner_text_highlights 
--------------------------------------------------
 <em>Route</em> builders give you a fluent syntax
 Does <em>Routing</em> rhyme with looting?

Prefix matching won't let you match 'routi:*', because the things it would match against get the 'i' stripped off when stemming.

Share

Improve this answer

answered Sep 2, 2019 at 20:59

jjanes's user avatar

jjanes jjanes

42.4k3 gold badges44 silver badges54 bronze badges

5

If I wanted to get the benefits of both search methods, should I just have two separate queries in my SELECT - once with plainto_tsquery() and one with prefix matching?

Dave
– Dave

2019年09月02日 21:53:23 +00:00
Commented Sep 2, 2019 at 21:53
@Dave You could do that, probably by applying a UNION ALL (rather than an OR). But I think that the ts_rank_cd function, executed on two different query strings, might not return comparable results, so you might not want to sort them together. I'd probably just execute the plainto_tsquery one first, and then if it returned no rows, try the other one. (Or give the user a checkbox, 'plain' or 'advanced')

jjanes
– jjanes

2019年09月02日 22:05:26 +00:00
Commented Sep 2, 2019 at 22:05
Is there any way to make routi also match, or is that not really possible with tsquery? I know I could resort to using WHERE inner_text LIKE '%routi%', but was curious if there is any way to make it work without using LIKE.

Dave
– Dave

2019年09月02日 22:20:31 +00:00
Commented Sep 2, 2019 at 22:20
You could use a configuration which doesn't do stemming, like 'simple' rather than 'english'. But at that point, I don't know if you are really doing full text search anymore. If is really important, you could keep two tsvector columns or FTS indexes, with different configurations to choose between them. Or one FTS and one pg_trgm: You can search with LIKE and rank with <->, but pg_trgm doesn't offer something like ts_headline.

jjanes
– jjanes

2019年09月03日 01:11:16 +00:00
Commented Sep 3, 2019 at 1:11
Thanks. That's the route I ultimately took - I created an English tsvector and a simple tsvector at insertion. Then, I query the English lexemes with plainto_tsquery('english', :query) and the simple lexemes using an approach similar to what you laid out, union the two data sets together, and grab the top 5 with the highest rank.

Dave
– Dave

2019年09月03日 14:50:09 +00:00
Commented Sep 3, 2019 at 14:50

Add a comment |

Stack Exchange Network

Postgres Full Text Search With Short Word Matching

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Postgres Full Text Search With Short Word Matching

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions