Fulltext-Search (FTS) contains function · hibernate/hibernate-orm · Discussion #5899

beikov
Jan 4, 2023
Maintainer

Add support for the full-text function contains which is documented nicely for SQL Server: https://docs.microsoft.com/en-us/sql/t-sql/queries/contains-transact-sql?view=sql-server-ver15

Also see https://en.wikibooks.org/wiki/Structured_Query_Language/Like_Predicate

Determine if there is overlap with Hibernate Search here and if we can translate contains() on ElasticSearch/OpenSearch and Lucene.

Some links:

Possible grammar for the query syntax based on an adapted form of https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc01268.1520/html/iquda/BABDFAJI.htm:

contains_function: CONTAINS "(" (path ",")+ contains_query_string, analyzer_config=string_literal? ")";
contains_query_string 
	: simple_expression
	| or_expression
	;
simple_expression
	: primary_expression
	| and_expression
	;
or_expression
	: simple_expression ( OR | "|" ) contains_query_string
	;
primary_expression
	: basic_expression
	| and_not_expression
	;
and_expression
	: primary_expression ( AND | "&" ) simple_expression
	;
and_not_expression
	: primary_expression ( AND | "&" ) ( NOT | "-" )? basic_expression
	;
basic_expression
	: term
	| phrase
	| "(" contains_query_string ")"
	| proximity_expression
	;
term
	: simple_term
	| prefix_term
	;
simple_term: [^\s]+;
prefix_term: simple_term "*";
phrase
	: "\"" phrase_string "\""
	;
proximity_expression
	: term ( BEFORE | NEAR) "[" maximum_distance=INTEGER "]" term
	| term ( BEFORE | NEAR | "~" ) term
	;
phrase_string
	: term
	| phrase_string term
	;

BEFORE is like NEAR, except that the order of terms matters.
~ is the operator for NEAR

As far as I understand, this syntax is very near to what the various DBs support, so translation should be mostly 1:1.

Translation for PostgreSQL should work out mostly after skimming through https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSQUERY. It seems though that PG is doing stemming for all terms and there is also no way to search for a phrase string from within the tsquery.

We will have to write some tests to understand how other databases do the matching exactly, but AFAIU phrase string matches will mostly work for words that can't be stemmed, so this should not be a big problem on PG.

Something that is still missing in this proposal but definitely important is affecting scoring/weighting/ranking, but all DBs have different ways to affect this.

SQL Server has ISABOUT with explicit WEIGHT (float)
MySQL has > and < operators to indicate relevance increase or decrease of options
PostgreSQL has categories A, B, C, D and can assign weights at rank time
DB2 has the ^ operator to specify the weight
Oracle has the * operator to affect the score

Also see HHH-11252

Replies: 4 comments 7 replies

yrodiere
Jan 5, 2023
Maintainer

Hey, thanks for bringing that up.

First, let's remember that CONTAINS is specific to SQL Server. Other RDBMS most likely expose FTS in different ways, which might not be easy to convert except for the simplest cases. That's fine though: a function that only works for SQL Server would already be useful to lots of people.

Determine if there is overlap with Hibernate Search here

To some extent, yes. Though Hibernate Search also provides a way to offload your FTS to another service, as well as various other features such as scoring or faceting. There are reasons for solutions like Elasticsearch/OpenSearch/Lucene to exist even though relational databases are better in non-FTS areas: they provide advanced features that relational databases don't.

if we can translate contains() on ElasticSearch/OpenSearch and Lucene.

I'm not a fan of the idea, mainly because CONTAINS is not standard SQL, so it feels a bit weird to tie ourselves to SQL-Server-specific syntax.

But even if we ignore that, CONTAINS is only part of the solution. Equally as important is the definition of the index: https://learn.microsoft.com/en-us/sql/t-sql/statements/create-fulltext-index-transact-sql?view=sql-server-ver15
Defining the index involves defining how text will be turned into tokens, which really is at the core of FTS, and customization here will dramatically change the result of a CONTAINS predicate.

Index definition involves things like tokenizing (wi-fi -> wi + fi or just wifi?) removing diacritics (résumé -> resume), stemming (running/runs -> run, bacteria/bacterium -> bacter), etc. Lots of which are use-case-dependent. And, I suspect, lots of which are only available on Elasticsearch/OpenSearch/Lucene, with no equivalent in SQL Server (or a fundamentally different equivalent, such as the synonym token filter which is apparently the FREETEXT predicate on SQL Server).

So, while we could probably translate CONTAINS predicates to Elasticsearch/OpenSearch/Lucene predicates to some extent, that wouldn't be so useful in itself. We would also have to, either:

translate the index definition from SQL Server to Elasticsearch/OpenSearch/Lucene. I doubt that's possible in a reliable way, and it would be limiting compared to what Elasticsearch/OpenSearch/Lucene can do.
OR let users define it directly in native Elasticsearch/OpenSearch/Lucene... leading to a bastard solution where you're using MS-SQL syntax when querying and Elasticsearch/OpenSearch/Lucene syntax when configuring, most likely leading to some inconsistencies.

Another thing to note is that Elasticsearch/OpenSearch/Lucene don't support cross-index joins. So if you were to translate SQL from to Elasticsearch/OpenSearch/Lucene, either you wouldn't support joins at all (limiting queries to a single table) or you would have to define joins at indexing time, like we do in Hibernate Search with @IndexedEmbedded.

And then you have concepts that only exist on Elasticsearch/OpenSearch/Lucene, which you would miss out on, because you're using Hibernate ORM and its SQL-specific APIs:

scoring and everything that goes with it, like the bool predicate which will allows an OR where having more than one true operands does matter, because it affects the score.
hit counts and faceting, where the same query execution will return not only tuples, but also extra data such as the total number of hits (beyond the limit you set for the number of returned tuples) and facets (a kind of aggregations, but not exactly in the same meaning as SQL aggregations).

To sum up:

Supporting CONTAINS on MS-SQL is great.
Trying to make it work on Elasticsearch/OpenSearch/Lucene would be possible, but so limited that I'm not sure that's worth the effort. Better stay on MS-SQL if you're going to use only CONTAINS.

3 replies

@beikov

beikov Jan 5, 2023
Maintainer Author

Actually, contains is not MSSQL specific, it is how the standard describes it. See the last section of https://en.wikibooks.org/wiki/Structured_Query_Language/Like_Predicate

The SQL standard is complemented by ISO/IEC 13249 SQL Multimedia and Application Packages

The idea is, that we expose the HQL and Criteria parts based on the standard and map that to the database specific variants if needed.

I'm not a fan of the idea, mainly because CONTAINS is not standard SQL, so it feels a bit weird to tie ourselves to SQL-Server-specific syntax.

Maybe you reconsider now that you know it is standard.

Equally as important is the definition of the index

That's a fair point. My initial idea was to just translate to the native DB predicates/functions and leave the index definition to the user, but maybe we can dig a bit into the specific FTS index definitions of the various DBs to see if we can find enough commonality that we can expose.

To be clear, this is not about replacing parts of Hibernate Search or moving to ORM core, but to make it easier for people to start out with predicates that run on the DB and then reuse the same predicates and queries if they decide to switch to Hibernate Search for FTS.

Scoring and faceting are in fact things that I haven't thought about yet too deeply. For scoring I can see FREETEXTTABLE on SQL Server and other databases have similar ways of doing this. Faceting will require a different kind of query though on SQL, so maybe that's a feature for which we direct people to Hibernate Search and Elasticsearch/OpenSearch/Lucene then.

@yrodiere

yrodiere Jan 5, 2023
Maintainer

Maybe you reconsider now that you know it is standard.

It's definitely better, but being SQL-Server-specific was not the only pain point, unfortunately :)

To be clear, this is not about replacing parts of Hibernate Search or moving to ORM core

Oh but that's fair; if some users only require contains, they definitely should use it. No need to complicate their infra with an Elasticsearch cluster if they don't use it. That's probably a relatively small amount of Hibernate Search users, though.

IMO where the contains predicate will really shine is for users who didn't have enough of a need for advanced FTS and thus couldn't/wouldn't use Hibernate Search; those will now have the option of using contains, and that's great.

to make it easier for people to start out with predicates that run on the DB and then reuse the same predicates and queries if they decide to switch to Hibernate Search for FTS.

So, contains would be a gateway drug, and users wanting more will head to Elasticsearch/OpenSearch/Lucene through Hibernate Search :) I see what you mean.

Not sure if it's better to expose this "SQL to Hibernate Search" feature in Hibernate ORM (session.createNativeQuery(HibernateSearch.extension(), "SELECT ...)) or in Hibernate Search (searchSession.searchFromSQL("SELECT ...)), though. Exposing it in Hibernate ORM would be a smoother experience at first, but would limit users to what contains provides (no way to get the total hit count, facets ...). Exposing it in Hibernate Search would mean a higher barrier to entry, but then as I said users would have to convert their index definition anyway, and that's arguably just an even higher barrier....

Anyway... making contains work in Hibernate ORM on SQL Server is a good idea regardless of what we do on the Hibernate Search front.

@beikov

beikov Jan 5, 2023
Maintainer Author

Thanks for the clarifications (also on different channels). Apparently I was one step ahead as I though Hibernate Search already has a HQL API, but apparently it doesn't. In case it gets one, it would be nice to have a contains function implementation that can be translated to Elasticsearch/OpenSearch/Lucene though.

My next step would be to come up with a grammar for the full text query string and discuss translations to native FTS query syntax. We can consider mappings to Hibernate Search stores in that discussion then as well.

yrodiere
Jan 6, 2023
Maintainer

contains_function: CONTAINS "(" (path ",")+ contains_query_string, analyzer_config=string_literal ")";

Do I understand correctly that you intend to accept multiple paths? Do all DB dialects support that? I'm not sure about postgres...
You probably know that already, but I'll say it just in case: CONTAINS(foo, bar, "word1 word2") might not be equivalent to CONTAINS(foo, "word1 word2") OR CONTAINS(bar, "word1 word2"), so support for targeting multiple paths at once is not exactly something we can emulate.

Also, I think ideally the analyzer config would be optional? I expect the string literal has a specific syntax (surrounded by quotes), so this might not lead to ambiguity?

simple_term: [^\s]+;

Just so you know, it's technically possible (at least with ES/OS/Lucene) to tokenize on something else than whitespace. While that's certainly an exotic use case, we can keep in mind that a way to escape spaces could be useful in a future version of the grammar. Probably not something for V1 though.

As far as I can see, the features exposed by this grammar all have a relatively direct equivalent in ES/OS/Lucene, so we should be fine if we want to implement it one day in Hibernate Search. I only have a doubt about the "NEAR" operator; ES/OS/Lucene have a phrase query with a "slop" option which seems equivalent if there are only two words in the phrase, but I'm not 100% sure. Tests would help for sure.

4 replies

@beikov

beikov Jan 6, 2023
Maintainer Author

Do I understand correctly that you intend to accept multiple paths? Do all DB dialects support that? I'm not sure about postgres...
so support for targeting multiple paths at once is not exactly something we can emulate.

How about contains(foo || ' ' || bar, "word1 word2") then on PG? Most DBs support this.

Also, I think ideally the analyzer config would be optional? I expect the string literal has a specific syntax (surrounded by quotes), so this might not lead to ambiguity?

Yes, forgot to add the ? To avoid ambiguity it seems that most DBs require some sort of wrapper parenthesis to construct a tuple. I'll think about it.

Just so you know, it's technically possible (at least with ES/OS/Lucene) to tokenize on something else than whitespace. While that's certainly an exotic use case, we can keep in mind that a way to escape spaces could be useful in a future version of the grammar. Probably not something for V1 though.

I know that, but in most FTS implementations on databases there is no way to include a whitespace into a token AFAIU.

As far as I can see, the features exposed by this grammar all have a relatively direct equivalent in ES/OS/Lucene, so we should be fine if we want to implement it one day in Hibernate Search.

That's assuring. Thanks for looking into it.

I only have a doubt about the "NEAR" operator; ES/OS/Lucene have a phrase query with a "slop" option which seems equivalent if there are only two words in the phrase, but I'm not 100% sure. Tests would help for sure.

That seems indeed to be the mapping for that :)

@yrodiere

yrodiere Jan 6, 2023
Maintainer

How about contains(foo || ' ' || bar, "word1 word2") then on PG? Most DBs support this.

I'm not sure I understand, PG doesn't support contains AFAICS? How would you translate that with PG?

Though I suppose that in the worst case, we can always reject calls to contains with multiple paths on DBs that don't support it.

@beikov

beikov Jan 6, 2023
Maintainer Author

That would translate to PG like to_tsvector(foo||' '||bar) @@ to_tsquery("word1 & word2")

@yrodiere

yrodiere Jan 6, 2023
Maintainer

Right, I didn't see that. Well then, it works on postgres, that's good to know :)

beikov
May 8, 2024
Maintainer Author

Adding some notes based on a discussions with @gavinking and @marko-bekhta about this:

We're thinking that it might be nice to have a minilanguage constructor function e.g. fts(..) to use like attr contains fts('...') or contains(attr, fts(...)), which allows using contains(attr, 'abc')/attr contains 'abc' to mean locate(attr, 'abc') > 0 and open the door for other minilanguages e.g. regex
The content inside fts could be parsed by the HQL parser already
Semantics of the NEAR function (that e.g. SQL Server defines) match what slop means in Lucene etc., so ditch the operator syntax for the function syntax
Terms and phrases are always analyzed i.e. queries for a term are not exact lookups e.g. the term Filming might match film, depending on the underlying DB analyzers.

A few open questions that came up:

Is the minilanguage really sufficiently useful? Compare with Lucene query language

0 replies

yrodiere
May 14, 2024
Maintainer

Is the minilanguage really sufficiently useful? Compare with Lucene query language

Note that even the Lucene query language exposes only a small part of what Lucene can do. Which is why Solr/Elasticsearch have their own language - using JSON for Elasticsearch, and I think XML for Solr.

It's really a whole world, so reducing it to a "standardized" predicate or query language is likely to only address the most common use cases. It may be enough, though.

0 replies

Uh oh!

Uh oh!

Fulltext-Search (FTS) contains function #5899

Uh oh!

Uh oh!

beikov Jan 4, 2023 Maintainer

Replies: 4 comments · 7 replies

Uh oh!

yrodiere Jan 5, 2023 Maintainer

Uh oh!

beikov Jan 5, 2023 Maintainer Author

Uh oh!

Uh oh!

yrodiere Jan 5, 2023 Maintainer

Uh oh!

beikov Jan 5, 2023 Maintainer Author

Uh oh!

yrodiere Jan 6, 2023 Maintainer

Uh oh!

beikov Jan 6, 2023 Maintainer Author

Uh oh!

yrodiere Jan 6, 2023 Maintainer

Uh oh!

beikov Jan 6, 2023 Maintainer Author

Uh oh!

yrodiere Jan 6, 2023 Maintainer

Uh oh!

beikov May 8, 2024 Maintainer Author

Uh oh!

yrodiere May 14, 2024 Maintainer

beikov
Jan 4, 2023
Maintainer

Replies: 4 comments 7 replies

yrodiere
Jan 5, 2023
Maintainer

beikov Jan 5, 2023
Maintainer Author

yrodiere Jan 5, 2023
Maintainer

beikov Jan 5, 2023
Maintainer Author

yrodiere
Jan 6, 2023
Maintainer

beikov Jan 6, 2023
Maintainer Author

yrodiere Jan 6, 2023
Maintainer

beikov Jan 6, 2023
Maintainer Author

yrodiere Jan 6, 2023
Maintainer

beikov
May 8, 2024
Maintainer Author

yrodiere
May 14, 2024
Maintainer