5

Given this query:

declare
 @aboutPredicateOpt nvarchar(4000) = 
 N'IsAbout( 
 PICCO weight(0.1),
 IC228 weight(0.1) 
 )'; 
 
 select RowId, BrandId, ObjectId, IsActive, OrderNumber, SearchableDescription, Rank
 from QuickSearchProducts_Temp qsp
 join Containstable(QuickSearchProducts_Temp, SearchableDescription, @aboutPredicateOpt) as ct
 on ct.[key] = qsp.RowId
 where qsp.IsActive = 1 
 order by iif(OrderNumber in ( '6403072'), 0, 1), ct.rank desc

A number of rows are returned that contain both terms. Note the row of interest has artificially been moved to top (using "order by" even though its rank is same).

Two non-rare terms

Now add a very rare term to the aboutPredicateOpt:

declare
 @aboutPredicateOpt nvarchar(4000) = 
 N'IsAbout( 
 15301530 weight(0.1),
 PICCO weight(0.1),
 IC228 weight(0.1) 
 )'; 

And the rank for the row of interest drops dramatically even though it is the only row (of 50k) that contains that (rare) string.

Add Rare Term

And for just the rare word:

declare
 @aboutPredicateOpt nvarchar(4000) = 
 N'IsAbout( 
 15301530 weight(0.1) 
 )'; 

Only that row is returned which is great. Only the rare

Chats with AI said the FTS formula likely involves overall rarity across all rows and concluded that high rarity should increase the rank when a match is found. Seems to be the opposite though.

Note that each row's SearchableDescription contains a distinct set of terms so that the frequency per row of any given term matches should always be one (ignoring wildcards searches for now).

Now a few ways to compensate for this are being considered:

  • Increase weights for "longer" or more complex terms or rarity (from histogram).
  • Calculate a ranking that counts matches using like and combine it into the main ranking.

But mainly I want to understand why the Rank drops by 50% from the top tier it was formally in. (Sql Server 2022)

Table Definition with FTS index:

CREATE TABLE [QuickSearchProducts_Temp] (
 RowId int IDENTITY(1, 1) NOT NULL,
 ObjectId bigint NOT NULL,
 BrandId smallint NOT NULL DEFAULT ((0)),
 OrderNumber nvarchar(200) NOT NULL,
 IsActive bit NOT NULL,
 SearchableDescription nvarchar(1000) NULL
);
ALTER TABLE QuickSearchProducts_Temp add constraint PK_QSP primary key (RowId);
CREATE FULLTEXT CATALOG QuickSearchFullTextCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON QuickSearchProducts_Temp ( SearchableDescription )
KEY INDEX PK_QSP ON QuickSearchFullTextCatalog WITH CHANGE_TRACKING AUTO;
set identity_insert QuickSearchProducts_Temp ON;
-- The following inserts only 10k rows of the original 50k 
-- but still illustrates the point with similar results.
--
insert into QuickSearchProducts_Temp
 (RowId, ObjectId, BrandId, OrderNumber, IsActive, SearchableDescription)
select Number as RowId, 
 Number+10000 as ObjectId, 
 Number % 5 as BrandId, 
 iif(Number = 0, '6403072', convert(varchar(30), 10000-Number)) as OrderNumber, 
 1 as IsActive, 
 SearchableDescription = 
 case
 when Number = 0
 then '15301530 Picco Ic228 bing bang boom'
 when Number <= 410
 then 'Picco Ic228 bing bang boom'
 when Number <= 959
 then 'Picco bing bang boom'
 else 'lotta other rows ' + convert(varchar(10), Number)
 end 
 -- for this use numbers table or adjust https://www.sqlservercentral.com/steps/bones-of-sql-the-calendar-table
 from Common.NumberSequence_0_10000_View
 where Number < 10000
asked Aug 20 at 6:45
4
  • 1
    Please add table definitions to complete an minimal reproducible example Commented Aug 20 at 16:45
  • @PeterVandivier, added table and index definitions. Commented Aug 20 at 17:06
  • I cannot repro this locally using only the sample data you provide (that is, my repro attempt using only the data from your screencaps ranks the 3rd example as you expect). I suspect if you are able to find a subset of rows in an mcve that tips the balance in ranking, it will help clarify your question. Commented Aug 20 at 17:35
  • 1
    Tallied the counts for the 3 terms and added a population script. Commented Aug 20 at 23:46

1 Answer 1

6

A match to a statistically rare words deflates the score of the row versus a competing row which has equal match value absent the rare word.


To plagiarize from this StackOverflow answer:

The algorithm for the full-text index being used by SQL Server is very heavily weighted to give more occurrences of a text match a better rank (bumping them up higher in your results).

From MSDN:

CONTAINSTABLE ranking uses the following algorithm:

StatisticalWeight = Log2( ( 2 + IndexedRowCount ) / KeyRowCount )
Rank = min( MaxQueryRank, HitCount * 16 * StatisticalWeight / MaxOccurrence )

In your data seeding repro, the SearchableDescription expression is:

case
 when Number = 0
 then '15301530 Picco Ic228 bing bang boom'
 when Number <= 410
 then 'Picco Ic228 bing bang boom'
 when Number <= 959
 then 'Picco bing bang boom'
 else 'lotta other rows ' + convert(varchar(10), Number)
end 

The effect of this expression is that the word 15301530 becomes more rare relative to Picco & Ic228 as you increase the rowcount up until 960 rows. After 960, the lotta other rows rows start to flatten the relative rarity these words.

Running your same @aboutPredicateOpt tests while incrementing the rowcount, we observe a tipping point at about 3,500 rows. Here the word 15301530 becomes so rare that it has net negative ranking effect in the presence of popular sibling words Picco & Ic228 when given equal weight in search.

My own local testing reveals the following trends for your 3rd test:

IsAbout(
 15301530 weight(0.1),
 PICCO weight(0.1),
 IC228 weight(0.1) 
)

scatter plot with logarithmic x axis showing weak negative trend from 500% in the top left to 0% in the bottom right

Source data:
  • rare_word_rank is the rank for OrderNumber 6403072
  • base_rank is the next highest rank as seen in OP screencaps
rowcount rare_word_rank base_rank relative_rank
7 517 117 442%
10 517 117 442%
100 464 117 397%
1,000 379 181 209%
2,000 378 315 120%
3,000 345 315 110%
3,500 356 376 95%
4,000 369 444 83%
5,000 337 444 76%
10,000 371 523 71%
50,000 297 671 44%
100,000 283 687 41%
150,000 269 681 40%
200,000 269 681 40%

If you want to dynamically re-weight your search terms, you can inspect the sys.dm_fts_index_keywords DMV; presumably interpolating the relative values through some expression into your desired @aboutPredicateOpt.

select 
 ik.keyword,
 ik.display_term,
 ik.column_id,
 ik.document_count 
from sys.dm_fts_index_keywords(
 db_id(),
 object_id(N'QuickSearchProducts_Temp')
) as ik
where -- ik.document_count > 1 or 
 ik.display_term in (
 N'15301530',
 N'picco',
 N'ic228',
 N'END OF FILE' -- fts index rowcount
 )
;

For our 3,500 rowcount example this would show:

display_term document_count
15301530 1
ic228 411
picco 960
END OF FILE 3,500

Presumably there exists a transformation on these relative values to deterministically flatten the impact of relative scarcity, but if so it is beyond my ability to parse it out of the ranking algorithm given above.

answered Aug 21 at 15:54
4
  • 1
    Great info! So rarity can actually reduce the rank, that is counterintuitive and would like someone to chime in on why this is beneficial for more typical FTS cases in general. Then in the formula is it possible describe what the values of each term would be for this example data? Then when specifying the weights, say I knew in advance the row counts for each individual term, is there a formula to specify weights that can "back out" the effect so that the row(s) with the most matching terms (count) is much more likely to be to top tier? Commented Aug 21 at 16:16
  • "...say I knew in advance the row counts for each individual term..." - you're talking about materializing the results of your ContainsTable() expression & running follow-on evaluations. You can of course do this, but not within the scope of the ContainsTable() function nor its ranking algorithm. Dynamic re-calculation of the weights after having already scanned the complete resultset seems possibly redundant to me, but perhaps you have a use case for it. Commented Aug 21 at 16:27
  • I'm saying that 1 do a one time generation of a histogram for every distinct word in every row (to be shared across all future queries). So before assigning the weights in the real query the histogram is providing this info: '15301530' = 1, IC228 = 411, picco = 960, bing = xxx, bang, yyy, etc. Commented Aug 21 at 16:44
  • See updated answer. That data can be found in sys.dm_fts_index_keywords() but I do not presume to speculate on what the optimal rebalancing expressions would be for you in the context of this comment. Commented Aug 21 at 17:27

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.