-
-
Couldn't load subscription status.
- Fork 3.7k
Fulltext-Search (FTS) contains function #5899
-
Add support for the full-text function contains which is documented nicely for SQL Server: https://docs.microsoft.com/en-us/sql/t-sql/queries/contains-transact-sql?view=sql-server-ver15
Also see https://en.wikibooks.org/wiki/Structured_Query_Language/Like_Predicate
Determine if there is overlap with Hibernate Search here and if we can translate contains() on ElasticSearch/OpenSearch and Lucene.
Some links:
- SQL Server: https://docs.microsoft.com/en-us/sql/t-sql/queries/contains-transact-sql?view=sql-server-ver15
- Oracle: https://docs.oracle.com/database/121/CCAPP/GUID-7070C025-54E1-4FBB-8585-1FCBD525927D.htm#CCAPP9172 and https://docs.oracle.com/cd/B13789_01/text.101/b10730/cqoper.htm
- DB2: https://www.ibm.com/docs/en/db2/10.5?topic=functions-contains
- PostgreSQL: https://www.postgresql.org/docs/current/textsearch.html
- MySQL: https://dev.mysql.com/doc/refman/8.0/en/fulltext-search.html
- MariaDB: https://mariadb.com/kb/en/full-text-index-overview/
- HANA: https://help.sap.com/docs/SAP_HANA_PLATFORM/691cb949c1034198800afde3e5be6570/9a4da8f6f23b4ebf95c98f1104a56db4.html
- H2: http://www.h2database.com/html/tutorial.html#fulltext
Possible grammar for the query syntax based on an adapted form of https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc01268.1520/html/iquda/BABDFAJI.htm:
contains_function: CONTAINS "(" (path ",")+ contains_query_string, analyzer_config=string_literal? ")"; contains_query_string : simple_expression | or_expression ; simple_expression : primary_expression | and_expression ; or_expression : simple_expression ( OR | "|" ) contains_query_string ; primary_expression : basic_expression | and_not_expression ; and_expression : primary_expression ( AND | "&" ) simple_expression ; and_not_expression : primary_expression ( AND | "&" ) ( NOT | "-" )? basic_expression ; basic_expression : term | phrase | "(" contains_query_string ")" | proximity_expression ; term : simple_term | prefix_term ; simple_term: [^\s]+; prefix_term: simple_term "*"; phrase : "\"" phrase_string "\"" ; proximity_expression : term ( BEFORE | NEAR) "[" maximum_distance=INTEGER "]" term | term ( BEFORE | NEAR | "~" ) term ; phrase_string : term | phrase_string term ;
BEFORE is like NEAR, except that the order of terms matters.
~ is the operator for NEAR
As far as I understand, this syntax is very near to what the various DBs support, so translation should be mostly 1:1.
Translation for PostgreSQL should work out mostly after skimming through https://www.postgresql.org/docs/current/datatype-textsearch.html#DATATYPE-TSQUERY. It seems though that PG is doing stemming for all terms and there is also no way to search for a phrase string from within the tsquery.
We will have to write some tests to understand how other databases do the matching exactly, but AFAIU phrase string matches will mostly work for words that can't be stemmed, so this should not be a big problem on PG.
Something that is still missing in this proposal but definitely important is affecting scoring/weighting/ranking, but all DBs have different ways to affect this.
- SQL Server has
ISABOUTwith explicitWEIGHT (float) - MySQL has
>and<operators to indicate relevance increase or decrease of options - PostgreSQL has categories A, B, C, D and can assign weights at rank time
- DB2 has the
^operator to specify the weight - Oracle has the
*operator to affect the score
Also see HHH-11252
Beta Was this translation helpful? Give feedback.
All reactions
Replies: 4 comments 7 replies
-
Hey, thanks for bringing that up.
First, let's remember that CONTAINS is specific to SQL Server. Other RDBMS most likely expose FTS in different ways, which might not be easy to convert except for the simplest cases. That's fine though: a function that only works for SQL Server would already be useful to lots of people.
Determine if there is overlap with Hibernate Search here
To some extent, yes. Though Hibernate Search also provides a way to offload your FTS to another service, as well as various other features such as scoring or faceting. There are reasons for solutions like Elasticsearch/OpenSearch/Lucene to exist even though relational databases are better in non-FTS areas: they provide advanced features that relational databases don't.
if we can translate
contains()on ElasticSearch/OpenSearch and Lucene.
I'm not a fan of the idea, mainly because CONTAINS is not standard SQL, so it feels a bit weird to tie ourselves to SQL-Server-specific syntax.
But even if we ignore that, CONTAINS is only part of the solution. Equally as important is the definition of the index: https://learn.microsoft.com/en-us/sql/t-sql/statements/create-fulltext-index-transact-sql?view=sql-server-ver15
Defining the index involves defining how text will be turned into tokens, which really is at the core of FTS, and customization here will dramatically change the result of a CONTAINS predicate.
Index definition involves things like tokenizing (wi-fi -> wi + fi or just wifi?) removing diacritics (résumé -> resume), stemming (running/runs -> run, bacteria/bacterium -> bacter), etc. Lots of which are use-case-dependent. And, I suspect, lots of which are only available on Elasticsearch/OpenSearch/Lucene, with no equivalent in SQL Server (or a fundamentally different equivalent, such as the synonym token filter which is apparently the FREETEXT predicate on SQL Server).
So, while we could probably translate CONTAINS predicates to Elasticsearch/OpenSearch/Lucene predicates to some extent, that wouldn't be so useful in itself. We would also have to, either:
- translate the index definition from SQL Server to Elasticsearch/OpenSearch/Lucene. I doubt that's possible in a reliable way, and it would be limiting compared to what Elasticsearch/OpenSearch/Lucene can do.
- OR let users define it directly in native Elasticsearch/OpenSearch/Lucene... leading to a bastard solution where you're using MS-SQL syntax when querying and Elasticsearch/OpenSearch/Lucene syntax when configuring, most likely leading to some inconsistencies.
Another thing to note is that Elasticsearch/OpenSearch/Lucene don't support cross-index joins. So if you were to translate SQL from to Elasticsearch/OpenSearch/Lucene, either you wouldn't support joins at all (limiting queries to a single table) or you would have to define joins at indexing time, like we do in Hibernate Search with @IndexedEmbedded.
And then you have concepts that only exist on Elasticsearch/OpenSearch/Lucene, which you would miss out on, because you're using Hibernate ORM and its SQL-specific APIs:
- scoring and everything that goes with it, like the
boolpredicate which will allows anORwhere having more than onetrueoperands does matter, because it affects the score. - hit counts and faceting, where the same query execution will return not only tuples, but also extra data such as the total number of hits (beyond the limit you set for the number of returned tuples) and facets (a kind of aggregations, but not exactly in the same meaning as SQL aggregations).
To sum up:
- Supporting
CONTAINSon MS-SQL is great. - Trying to make it work on Elasticsearch/OpenSearch/Lucene would be possible, but so limited that I'm not sure that's worth the effort. Better stay on MS-SQL if you're going to use only
CONTAINS.
Beta Was this translation helpful? Give feedback.
All reactions
-
Actually, contains is not MSSQL specific, it is how the standard describes it. See the last section of https://en.wikibooks.org/wiki/Structured_Query_Language/Like_Predicate
The SQL standard is complemented by ISO/IEC 13249 SQL Multimedia and Application Packages
The idea is, that we expose the HQL and Criteria parts based on the standard and map that to the database specific variants if needed.
I'm not a fan of the idea, mainly because CONTAINS is not standard SQL, so it feels a bit weird to tie ourselves to SQL-Server-specific syntax.
Maybe you reconsider now that you know it is standard.
Equally as important is the definition of the index
That's a fair point. My initial idea was to just translate to the native DB predicates/functions and leave the index definition to the user, but maybe we can dig a bit into the specific FTS index definitions of the various DBs to see if we can find enough commonality that we can expose.
To be clear, this is not about replacing parts of Hibernate Search or moving to ORM core, but to make it easier for people to start out with predicates that run on the DB and then reuse the same predicates and queries if they decide to switch to Hibernate Search for FTS.
Scoring and faceting are in fact things that I haven't thought about yet too deeply. For scoring I can see FREETEXTTABLE on SQL Server and other databases have similar ways of doing this. Faceting will require a different kind of query though on SQL, so maybe that's a feature for which we direct people to Hibernate Search and Elasticsearch/OpenSearch/Lucene then.
Beta Was this translation helpful? Give feedback.
All reactions
-
Maybe you reconsider now that you know it is standard.
It's definitely better, but being SQL-Server-specific was not the only pain point, unfortunately :)
To be clear, this is not about replacing parts of Hibernate Search or moving to ORM core
Oh but that's fair; if some users only require contains, they definitely should use it. No need to complicate their infra with an Elasticsearch cluster if they don't use it. That's probably a relatively small amount of Hibernate Search users, though.
IMO where the contains predicate will really shine is for users who didn't have enough of a need for advanced FTS and thus couldn't/wouldn't use Hibernate Search; those will now have the option of using contains, and that's great.
to make it easier for people to start out with predicates that run on the DB and then reuse the same predicates and queries if they decide to switch to Hibernate Search for FTS.
So, contains would be a gateway drug, and users wanting more will head to Elasticsearch/OpenSearch/Lucene through Hibernate Search :) I see what you mean.
Not sure if it's better to expose this "SQL to Hibernate Search" feature in Hibernate ORM (session.createNativeQuery(HibernateSearch.extension(), "SELECT ...)) or in Hibernate Search (searchSession.searchFromSQL("SELECT ...)), though. Exposing it in Hibernate ORM would be a smoother experience at first, but would limit users to what contains provides (no way to get the total hit count, facets ...). Exposing it in Hibernate Search would mean a higher barrier to entry, but then as I said users would have to convert their index definition anyway, and that's arguably just an even higher barrier....
Anyway... making contains work in Hibernate ORM on SQL Server is a good idea regardless of what we do on the Hibernate Search front.
Beta Was this translation helpful? Give feedback.
All reactions
-
Thanks for the clarifications (also on different channels). Apparently I was one step ahead as I though Hibernate Search already has a HQL API, but apparently it doesn't. In case it gets one, it would be nice to have a contains function implementation that can be translated to Elasticsearch/OpenSearch/Lucene though.
My next step would be to come up with a grammar for the full text query string and discuss translations to native FTS query syntax. We can consider mappings to Hibernate Search stores in that discussion then as well.
Beta Was this translation helpful? Give feedback.
All reactions
-
contains_function: CONTAINS "(" (path ",")+ contains_query_string, analyzer_config=string_literal ")";
Do I understand correctly that you intend to accept multiple paths? Do all DB dialects support that? I'm not sure about postgres...
You probably know that already, but I'll say it just in case: CONTAINS(foo, bar, "word1 word2") might not be equivalent to CONTAINS(foo, "word1 word2") OR CONTAINS(bar, "word1 word2"), so support for targeting multiple paths at once is not exactly something we can emulate.
Also, I think ideally the analyzer config would be optional? I expect the string literal has a specific syntax (surrounded by quotes), so this might not lead to ambiguity?
simple_term: [^\s]+;
Just so you know, it's technically possible (at least with ES/OS/Lucene) to tokenize on something else than whitespace. While that's certainly an exotic use case, we can keep in mind that a way to escape spaces could be useful in a future version of the grammar. Probably not something for V1 though.
As far as I can see, the features exposed by this grammar all have a relatively direct equivalent in ES/OS/Lucene, so we should be fine if we want to implement it one day in Hibernate Search. I only have a doubt about the "NEAR" operator; ES/OS/Lucene have a phrase query with a "slop" option which seems equivalent if there are only two words in the phrase, but I'm not 100% sure. Tests would help for sure.
Beta Was this translation helpful? Give feedback.
All reactions
-
Do I understand correctly that you intend to accept multiple paths? Do all DB dialects support that? I'm not sure about postgres...
so support for targeting multiple paths at once is not exactly something we can emulate.
How about contains(foo || ' ' || bar, "word1 word2") then on PG? Most DBs support this.
Also, I think ideally the analyzer config would be optional? I expect the string literal has a specific syntax (surrounded by quotes), so this might not lead to ambiguity?
Yes, forgot to add the ? To avoid ambiguity it seems that most DBs require some sort of wrapper parenthesis to construct a tuple. I'll think about it.
Just so you know, it's technically possible (at least with ES/OS/Lucene) to tokenize on something else than whitespace. While that's certainly an exotic use case, we can keep in mind that a way to escape spaces could be useful in a future version of the grammar. Probably not something for V1 though.
I know that, but in most FTS implementations on databases there is no way to include a whitespace into a token AFAIU.
As far as I can see, the features exposed by this grammar all have a relatively direct equivalent in ES/OS/Lucene, so we should be fine if we want to implement it one day in Hibernate Search.
That's assuring. Thanks for looking into it.
I only have a doubt about the "NEAR" operator; ES/OS/Lucene have a phrase query with a "slop" option which seems equivalent if there are only two words in the phrase, but I'm not 100% sure. Tests would help for sure.
That seems indeed to be the mapping for that :)
Beta Was this translation helpful? Give feedback.
All reactions
-
How about
contains(foo || ' ' || bar, "word1 word2")then on PG? Most DBs support this.
I'm not sure I understand, PG doesn't support contains AFAICS? How would you translate that with PG?
Though I suppose that in the worst case, we can always reject calls to contains with multiple paths on DBs that don't support it.
Beta Was this translation helpful? Give feedback.
All reactions
-
That would translate to PG like to_tsvector(foo||' '||bar) @@ to_tsquery("word1 & word2")
Beta Was this translation helpful? Give feedback.
All reactions
-
Right, I didn't see that. Well then, it works on postgres, that's good to know :)
Beta Was this translation helpful? Give feedback.
All reactions
-
Adding some notes based on a discussions with @gavinking and @marko-bekhta about this:
- We're thinking that it might be nice to have a minilanguage constructor function e.g.
fts(..)to use likeattr contains fts('...')orcontains(attr, fts(...)), which allows usingcontains(attr, 'abc')/attr contains 'abc'to meanlocate(attr, 'abc') > 0and open the door for other minilanguages e.g.regex - The content inside
ftscould be parsed by the HQL parser already - Semantics of the
NEARfunction (that e.g. SQL Server defines) match what slop means in Lucene etc., so ditch the operator syntax for the function syntax - Terms and phrases are always analyzed i.e. queries for a term are not exact lookups e.g. the term
Filmingmight matchfilm, depending on the underlying DB analyzers.
A few open questions that came up:
- Is the minilanguage really sufficiently useful? Compare with Lucene query language
Beta Was this translation helpful? Give feedback.
All reactions
-
Is the minilanguage really sufficiently useful? Compare with Lucene query language
Note that even the Lucene query language exposes only a small part of what Lucene can do. Which is why Solr/Elasticsearch have their own language - using JSON for Elasticsearch, and I think XML for Solr.
It's really a whole world, so reducing it to a "standardized" predicate or query language is likely to only address the most common use cases. It may be enough, though.
Beta Was this translation helpful? Give feedback.