I have to search for hyphenated words like 'good-morning', 'good-evening', etc.

My query is:

select id, ts_headline(content,
 to_tsquery('english','good-morning'),
 'HighlightAll=true MaxFragments=100 FragmentDelimiter=$') 
from tbl 
where ts_content @@ to_tsquery('english','good-morning');

When I execute this query, I also get matches for 'good' and 'morning' separately. But I want only exact matches, for both the returned rows and the highlighted fragments.
(For ts_content I used the same default config english to create the tsvector.)

How can I search such hyphenated words in PostgreSQL full text search?

Erwin Brandstetter
asked Apr 21, 2018 at 7:45
  • Assuming you run at least Postgres 9.6? (Please always declare your version of Postgres.) Commented Apr 21, 2018 at 13:23

1 Answer

The key word here is phrase search, introduced with Postgres 9.6.

Use the "FOLLOWED BY" operator <-> or one of the related <N> operators. Or better yet, use the function phraseto_tsquery() to generate your tsquery.
Quoting the manual, it ...

produces tsquery that searches for a phrase, ignoring punctuation

And:

phraseto_tsquery behaves much like plainto_tsquery, except that it inserts the <-> (FOLLOWED BY) operator between surviving words instead of the & (AND) operator. Also, stop words are not simply discarded, but are accounted for by inserting <N> operators rather than <-> operators. This function is useful when searching for exact lexeme sequences, since the FOLLOWED BY operators check lexeme order not just the presence of all the lexemes.
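
To see the difference the manual describes, the three ways of building a tsquery can be compared side by side. This is a minimal sketch using the manual's own sample phrase 'The Fat Rats'; it assumes Postgres 9.6 or later and touches no tables. The column aliases are only illustrative.

```sql
-- Compare how each function combines the surviving lexemes.
SELECT plainto_tsquery('english', 'The Fat Rats')  AS plain_q   -- lexemes joined with & (AND)
     , phraseto_tsquery('english', 'The Fat Rats') AS phrase_q  -- lexemes joined with <-> (FOLLOWED BY)
     , to_tsquery('english', 'fat <-> rats')       AS manual_q; -- same phrase, operator written by hand
```

plain_q matches any document containing both lexemes anywhere, while phrase_q and manual_q should only match where the second lexeme directly follows the first.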

Your query would work like this:

select id
 , ts_headline(content, phraseto_tsquery('english', 'good-morning')
 , 'HighlightAll=true MaxFragments=100 FragmentDelimiter=$') 
from tbl 
where ts_content @@ phraseto_tsquery('english','good-morning');

phraseto_tsquery('english', 'good-morning') generates this tsquery:

'good-morn' <-> 'good' <-> 'morn'

Since "good-morning" is identified as asciihword (hyphenated ASCII word), the stemmed complete word is added before the components. The manual:

It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component: (followed by an example)
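
To check how the parser splits a particular hyphenated word, ts_debug() can be run on the string directly. A small sketch, assuming the default 'english' configuration; the row with alias asciihword should carry the whole word, and the hword_asciipart rows its components.

```sql
-- One row per token: the token type (alias), the raw token,
-- and the lexemes the dictionaries produce for it.
SELECT alias, token, lexemes
FROM   ts_debug('english', 'good-morning');
```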

to_tsvector() basically does the same on the other end, so everything matches up. This allows for fine-grained options with hyphenated words.

The above only finds "good-morning" with a hyphen (or variants stemming to the same). To find all strings with "good" followed by "morn" (or variants stemming to the same), use phraseto_tsquery('english', 'good morning'), which generates this tsquery:

'good' <-> 'morn'
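
The two query variants can be tried against a few literal strings without any table. This is only a sketch: the sample strings and column aliases are made up for illustration. If the behaviour described above holds, hyphen_query should be true only for the hyphenated string, and space_query should be true for both the hyphenated and the space-separated string.

```sql
-- Inspect the tsvector built for the hyphenated word:
-- it should contain the whole word plus both components.
SELECT to_tsvector('english', 'good-morning');

-- Test both query variants against sample strings (illustrative values).
SELECT txt
     , to_tsvector('english', txt) @@ phraseto_tsquery('english', 'good-morning') AS hyphen_query
     , to_tsvector('english', txt) @@ phraseto_tsquery('english', 'good morning') AS space_query
FROM  (VALUES ('good-morning')
            , ('good morning')
            , ('morning, good')   -- wrong order, neither query should match
      ) t(txt);
```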

OTOH, you can enforce exact matches by adding another filter like:

...
AND content ~* 'good-morning' -- case insensitive regexp match

Or:

...
AND content ILIKE '%good-morning%'

Seems a bit redundant to the human eye, but this way you get fast full text index support and exact matches.

The latter is mostly equivalent, but different (fewer) characters have special meaning in the LIKE pattern and might need escaping.
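
Put together, the combined filter might look like the sketch below. Table and column names are taken from the question. The second search term, '100%-sure', is a hypothetical example chosen only to show escaping. It assumes the default LIKE escape character (backslash) and standard_conforming_strings = on.

```sql
-- Full text search narrows the candidates (can use an index on ts_content);
-- ILIKE then keeps only rows containing the literal hyphenated string.
SELECT id
FROM   tbl
WHERE  ts_content @@ phraseto_tsquery('english', 'good-morning')
AND    content ILIKE '%good-morning%';

-- In a LIKE / ILIKE pattern, % and _ are wildcards.
-- To match them literally, escape them with a backslash.
SELECT id
FROM   tbl
WHERE  ts_content @@ phraseto_tsquery('english', '100%-sure')
AND    content ILIKE '%100\%-sure%';
```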

Example to demonstrate the operator <N>:

phraseto_tsquery('english', 'Juliet and the Licks') generates this tsquery:

'juliet' <3> 'lick'

<3> meaning that lick must be the third lexeme after juliet.
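
The distance requirement can be verified without a table. A minimal sketch with two made-up sample strings: stop words are dropped from the tsvector but still count as positions, so the first string should match and the second should not.

```sql
SELECT txt
     , to_tsvector('english', txt)
       @@ phraseto_tsquery('english', 'Juliet and the Licks') AS matches
FROM  (VALUES ('Juliet and the Licks')  -- 'lick' is 3 positions after 'juliet'
            , ('Juliet Licks')          -- 'lick' is only 1 position after 'juliet'
      ) t(txt);
```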

answered Apr 21, 2018 at 12:54
  • Query: select id , ts_headline(content, phraseto_tsquery('english', 'rhus-t') , 'HighlightAll=true MaxFragments=100 FragmentDelimiter=$') from vqbooks where ts_content @@ phraseto_tsquery('english','rhus-t'); result: " Lyss..,, Puls., <b>Rhus</b>-t., Sabad., " and " 'infant may have a <b>Rhus</b> toxicodendron picture. (NB: <b>Rhus</b>-t desires milk) I don't want to highlight have a <b>Rhus</b> toxicodendron". I want only first fragment to be highlighted. Commented Apr 26, 2018 at 8:24
  • @user3098231: A small setting for the option MaxFragments might help some. But I am afraid that phrase search is not currently supported well in ts_headline(). A bug has been reported. See: dba.stackexchange.com/q/204856/3684 Commented Apr 26, 2018 at 11:22
  • phraseto_tsquery('english', 'good-morning') produces 'good-morn' <-> 'good' <-> 'morn', not 'good' <-> 'morn'. How are you getting this result? (I'm on Postgres 10, Windows) Commented Feb 19, 2019 at 11:09
  • @dtheodor: Good catch. I rectified the error and added proper information. Commented Feb 20, 2019 at 2:10
