How to search hyphenated words in PostgreSQL full text search?

Question 1

I have to search for hyphenated words like 'good-morning', 'good-evening', etc.

My query is:

select id, ts_headline(content,
 to_tsquery('english','good-morning'),
 'HighlightAll=true MaxFragments=100 FragmentDelimiter=$') 
from table 
where ts_content @@ to_tsquery('english','good-morning');

When I execute this query, I also get results for 'good' and 'morning' separately, but I want only exact matches of the words and fragments.
(For ts_content I used the same default config english to create the tsvector.)

How can I search such hyphenated words in PostgreSQL full text search?

Question 2

Assuming you run at least Postgres 9.6? (Please always declare your version of Postgres.)

score 10 · Accepted Answer · 2018-04-21 12:54:41Z

The key word here is phrase search, introduced with Postgres 9.6.

Use the "FOLLOWED BY" operator <-> or one of the related <N> operators. Or better yet, use the function phraseto_tsquery() to generate your tsquery.
Quoting the manual, it ...

produces tsquery that searches for a phrase, ignoring punctuation

And:

phraseto_tsquery behaves much like plainto_tsquery, except that it inserts the <-> (FOLLOWED BY) operator between surviving words instead of the & (AND) operator. Also, stop words are not simply discarded, but are accounted for by inserting <N> operators rather than <-> operators. This function is useful when searching for exact lexeme sequences, since the FOLLOWED BY operators check lexeme order not just the presence of all the lexemes.
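
To see the difference the manual describes, the query constructors can be compared side by side. This is a minimal sketch, assuming the built-in english configuration; the column aliases are only illustrative:

```sql
-- Compare AND semantics with phrase (FOLLOWED BY) semantics.
SELECT plainto_tsquery('english', 'good morning')  AS and_query      -- 'good' & 'morn'
     , phraseto_tsquery('english', 'good morning') AS phrase_query   -- 'good' <-> 'morn'
     , to_tsquery('english', 'good <-> morning')   AS manual_phrase; -- same as phrase_query
```

The first query matches any document containing both lexemes; the other two additionally require "morn" to directly follow "good".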

Your query would work like this:

select id
 , ts_headline(content, phraseto_tsquery('english', 'good-morning')
 , 'HighlightAll=true MaxFragments=100 FragmentDelimiter=$') 
from tbl 
where ts_content @@ phraseto_tsquery('english','good-morning');
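
To try the query, a small test table helps. The sketch below is an assumption about the setup: the names tbl, id, content and ts_content follow the query above, while the sample rows and the index name tbl_ts_content_gin_idx are made up for illustration.

```sql
-- Hypothetical test setup; only the table and column names come from the query above.
CREATE TABLE tbl (
   id         serial PRIMARY KEY
 , content    text NOT NULL
 , ts_content tsvector
);

INSERT INTO tbl (content) VALUES
   ('Good-morning, everyone!')
 , ('What a good morning.')
 , ('The morning was good.');

-- Build the tsvector with the same configuration that the query uses.
UPDATE tbl SET ts_content = to_tsvector('english', content);

-- A GIN index lets the @@ operator use an index scan on larger tables.
CREATE INDEX tbl_ts_content_gin_idx ON tbl USING gin (ts_content);
```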

phraseto_tsquery('english', 'good-morning') generates this tsquery:

'good-morn' <-> 'good' <-> 'morn'

Since "good-morning" is identified as asciihword (hyphenated ASCII word), the stemmed complete word is added before the components. The manual:

It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component: (followed by an example)
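
The overlapping tokens can be inspected directly. A short sketch using the built-in ts_debug() function; it should list the token type asciihword for the whole word and hword_asciipart for each component:

```sql
-- Token types reported by the default parser for a hyphenated word.
SELECT alias, token, lexemes
FROM   ts_debug('english', 'good-morning');

-- Lexemes and positions stored in the tsvector for the same input.
SELECT to_tsvector('english', 'good-morning');
```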

to_tsvector() basically does the same on the other end, so everything matches up. This allows for fine-grained options with hyphenated words.

The above only finds "good-morning" with a hyphen (or variants stemming to the same). To find all strings with "good" followed by "morn" (or variants stemming to the same), use phraseto_tsquery('english', 'good morning'), which generates this tsquery:

'good' <-> 'morn'
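
The two variants can be compared on a few sample strings. This is only a sketch; the sample documents and the column aliases are made up for illustration. Going by the explanation above, one would expect the first row to match both queries, the second row to match only the query without a hyphen, and the third row to match neither:

```sql
SELECT doc
     , to_tsvector('english', doc) @@ phraseto_tsquery('english', 'good-morning') AS hyphenated_only
     , to_tsvector('english', doc) @@ phraseto_tsquery('english', 'good morning') AS any_sequence
FROM  (
   VALUES ('Good-morning, everyone!')
        , ('What a good morning.')
        , ('The morning was good.')
   ) AS t(doc);
```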

On the other hand, you can enforce exact matches by adding another filter like:

...
AND content ~* 'good-morning' -- case insensitive regexp match

Or:

...
AND content ILIKE '%good-morning%'

This may seem a bit redundant to the human eye, but this way you get fast full text index support and exact matches.

The ILIKE variant is mostly equivalent to the regular expression match, but different (fewer) characters have special meaning in a LIKE pattern and might need escaping.
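
Put together, the combined filter might look like the sketch below. It assumes the table tbl from the query above and an index on ts_content; the sample strings in the second part are hypothetical and only illustrate escaping.

```sql
-- The full text match can use an index on ts_content;
-- ILIKE then keeps only rows that contain the literal string.
SELECT id
FROM   tbl
WHERE  ts_content @@ phraseto_tsquery('english', 'good-morning')
AND    content ILIKE '%good-morning%';

-- In a LIKE pattern only % and _ (plus the escape character) are special.
-- Backslash is the default escape character.
SELECT 'Now 100%_off!' ILIKE '%100\%\_off%' AS like_match;

-- In a regular expression more characters are special, for example the dot.
SELECT 'Puls., Rhus-t., Sabad.' ~* 'rhus-t\.' AS regexp_match;
```

The escaped patterns assume standard_conforming_strings is on, which is the default in current Postgres versions.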

Example to demonstrate the operator <N>:

phraseto_tsquery('english', 'Juliet and the Licks') generates this tsquery:

'juliet' <3> 'lick'

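<3> means that lick must appear exactly three positions after juliet: the stop words "and" and "the" are dropped from the query, but their positions still count.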
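
The distance operator can be checked against the tsvector of the same string. A minimal sketch, assuming the english configuration; the column aliases are illustrative. The first check should match, while the second should not, because "lick" does not directly follow "juliet":

```sql
-- Show the generated query next to the lexeme positions of the document.
SELECT phraseto_tsquery('english', 'Juliet and the Licks') AS query
     , to_tsvector('english', 'Juliet and the Licks')      AS document;

-- Compare a distance of three positions with direct adjacency.
SELECT to_tsvector('english', 'Juliet and the Licks') @@ to_tsquery('english', 'juliet <3> lick')  AS distance_3
     , to_tsvector('english', 'Juliet and the Licks') @@ to_tsquery('english', 'juliet <-> lick') AS adjacent;
```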

Query:

select id , ts_headline(content, phraseto_tsquery('english', 'rhus-t') , 'HighlightAll=true MaxFragments=100 FragmentDelimiter=$') from vqbooks where ts_content @@ phraseto_tsquery('english','rhus-t');

Result: " Lyss..,, Puls., Rhus-t., Sabad., " and " 'infant may have a Rhus toxicodendron picture. (NB: Rhus-t desires milk)". I don't want "have a Rhus toxicodendron" to be highlighted; I want only the first fragment to be highlighted.

@user3098231: A small setting for the option MaxFragments might help some. But I am afraid that phrase search is not currently supported well in ts_headline(). A bug has been reported. See: dba.stackexchange.com/q/204856/3684

phraseto_tsquery('english', 'good-morning') produces 'good-morn' <-> 'good' <-> 'morn', not 'good' <-> 'morn'. How are you getting this result? (I'm on Postgres 10, Windows.)

@dtheodor: Good catch. I rectified the error and added proper information.
