Why does SQL Server Full Text Search (FTS) rank go down when match count goes up due to highly rare word

Question 1

Given this query:

declare
 @aboutPredicateOpt nvarchar(4000) = 
 N'IsAbout( 
 PICCO weight(0.1),
 IC228 weight(0.1) 
 )'; 
 
 select RowId, BrandId, ObjectId, IsActive, OrderNumber, SearchableDescription, Rank
 from QuickSearchProducts_Temp qsp
 join Containstable(QuickSearchProducts_Temp, SearchableDescription, @aboutPredicateOpt) as ct
 on ct.[key] = qsp.RowId
 where qsp.IsActive = 1 
 order by iif(OrderNumber in ( '6403072'), 0, 1), ct.rank desc

A number of rows are returned that contain both terms. Note the row of interest has artificially been moved to top (using "order by" even though its rank is same).

Two non-rare terms

Now add a very rare term to the aboutPredicateOpt:

declare
 @aboutPredicateOpt nvarchar(4000) = 
 N'IsAbout( 
 15301530 weight(0.1),
 PICCO weight(0.1),
 IC228 weight(0.1) 
 )';

And the rank for the row of interest drops dramatically even though it is the only row (of 50k) that contains that (rare) string.

Add Rare Term

And for just the rare word:

declare
 @aboutPredicateOpt nvarchar(4000) = 
 N'IsAbout( 
 15301530 weight(0.1) 
 )';

Only that row is returned which is great. Only the rare

Chats with AI said the FTS formula likely involves overall rarity across all rows and concluded that high rarity should increase the rank when a match is found. Seems to be the opposite though.

Note that each row's SearchableDescription contains a distinct set of terms so that the frequency per row of any given term matches should always be one (ignoring wildcards searches for now).

Now a few ways to compensate for this are being considered:

Increase weights for "longer" or more complex terms or rarity (from histogram).
Calculate a ranking that counts matches using like and combine it into the main ranking.

But mainly I want to understand why the Rank drops by 50% from the top tier it was formally in. (Sql Server 2022)

Table Definition with FTS index:

CREATE TABLE [QuickSearchProducts_Temp] (
 RowId int IDENTITY(1, 1) NOT NULL,
 ObjectId bigint NOT NULL,
 BrandId smallint NOT NULL DEFAULT ((0)),
 OrderNumber nvarchar(200) NOT NULL,
 IsActive bit NOT NULL,
 SearchableDescription nvarchar(1000) NULL
);
ALTER TABLE QuickSearchProducts_Temp add constraint PK_QSP primary key (RowId);
CREATE FULLTEXT CATALOG QuickSearchFullTextCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON QuickSearchProducts_Temp ( SearchableDescription )
KEY INDEX PK_QSP ON QuickSearchFullTextCatalog WITH CHANGE_TRACKING AUTO;
set identity_insert QuickSearchProducts_Temp ON;
-- The following inserts only 10k rows of the original 50k 
-- but still illustrates the point with similar results.
--
insert into QuickSearchProducts_Temp
 (RowId, ObjectId, BrandId, OrderNumber, IsActive, SearchableDescription)
select Number as RowId, 
 Number+10000 as ObjectId, 
 Number % 5 as BrandId, 
 iif(Number = 0, '6403072', convert(varchar(30), 10000-Number)) as OrderNumber, 
 1 as IsActive, 
 SearchableDescription = 
 case
 when Number = 0
 then '15301530 Picco Ic228 bing bang boom'
 when Number <= 410
 then 'Picco Ic228 bing bang boom'
 when Number <= 959
 then 'Picco bing bang boom'
 else 'lotta other rows ' + convert(varchar(10), Number)
 end 
 -- for this use numbers table or adjust https://www.sqlservercentral.com/steps/bones-of-sql-the-calendar-table
 from Common.NumberSequence_0_10000_View
 where Number < 10000

Question 2

Please add table definitions to complete an minimal reproducible example

Question 3

@PeterVandivier, added table and index definitions.

Question 4

I cannot repro this locally using only the sample data you provide (that is, my repro attempt using only the data from your screencaps ranks the 3rd example as you expect). I suspect if you are able to find a subset of rows in an mcve that tips the balance in ranking, it will help clarify your question.

Question 5

Tallied the counts for the 3 terms and added a population script.

Question 6

A match to a statistically rare words deflates the score of the row versus a competing row which has equal match value absent the rare word.

To plagiarize from this StackOverflow answer:

The algorithm for the full-text index being used by SQL Server is very heavily weighted to give more occurrences of a text match a better rank (bumping them up higher in your results).

From MSDN:
CONTAINSTABLE ranking uses the following algorithm:
StatisticalWeight = Log2( ( 2 + IndexedRowCount ) / KeyRowCount )
Rank = min( MaxQueryRank, HitCount * 16 * StatisticalWeight / MaxOccurrence )

In your data seeding repro, the SearchableDescription expression is:

case
 when Number = 0
 then '15301530 Picco Ic228 bing bang boom'
 when Number <= 410
 then 'Picco Ic228 bing bang boom'
 when Number <= 959
 then 'Picco bing bang boom'
 else 'lotta other rows ' + convert(varchar(10), Number)
end

The effect of this expression is that the word 15301530 becomes more rare relative to Picco & Ic228 as you increase the rowcount up until 960 rows. After 960, the lotta other rows rows start to flatten the relative rarity these words.

Running your same @aboutPredicateOpt tests while incrementing the rowcount, we observe a tipping point at about 3,500 rows. Here the word 15301530 becomes so rare that it has net negative ranking effect in the presence of popular sibling words Picco & Ic228 when given equal weight in search.

My own local testing reveals the following trends for your 3rd test:

IsAbout(
 15301530 weight(0.1),
 PICCO weight(0.1),
 IC228 weight(0.1) 
)

scatter plot with logarithmic x axis showing weak negative trend from 500% in the top left to 0% in the bottom right

Source data:

rare_word_rank is the rank for OrderNumber 6403072
base_rank is the next highest rank as seen in OP screencaps

rowcount	rare_word_rank	base_rank	relative_rank
7	517	117	442%
10	517	117	442%
100	464	117	397%
1,000	379	181	209%
2,000	378	315	120%
3,000	345	315	110%
3,500	356	376	95%
4,000	369	444	83%
5,000	337	444	76%
10,000	371	523	71%
50,000	297	671	44%
100,000	283	687	41%
150,000	269	681	40%
200,000	269	681	40%

If you want to dynamically re-weight your search terms, you can inspect the sys.dm_fts_index_keywords DMV; presumably interpolating the relative values through some expression into your desired @aboutPredicateOpt.

select 
 ik.keyword,
 ik.display_term,
 ik.column_id,
 ik.document_count 
from sys.dm_fts_index_keywords(
 db_id(),
 object_id(N'QuickSearchProducts_Temp')
) as ik
where -- ik.document_count > 1 or 
 ik.display_term in (
 N'15301530',
 N'picco',
 N'ic228',
 N'END OF FILE' -- fts index rowcount
 )
;

For our 3,500 rowcount example this would show:

display_term	document_count
15301530	1
ic228	411
picco	960
END OF FILE	3,500

Presumably there exists a transformation on these relative values to deterministically flatten the impact of relative scarcity, but if so it is beyond my ability to parse it out of the ranking algorithm given above.

Question 7

Great info! So rarity can actually reduce the rank, that is counterintuitive and would like someone to chime in on why this is beneficial for more typical FTS cases in general. Then in the formula is it possible describe what the values of each term would be for this example data? Then when specifying the weights, say I knew in advance the row counts for each individual term, is there a formula to specify weights that can "back out" the effect so that the row(s) with the most matching terms (count) is much more likely to be to top tier?

Question 8

"...say I knew in advance the row counts for each individual term..." - you're talking about materializing the results of your ContainsTable() expression & running follow-on evaluations. You can of course do this, but not within the scope of the ContainsTable() function nor its ranking algorithm. Dynamic re-calculation of the weights after having already scanned the complete resultset seems possibly redundant to me, but perhaps you have a use case for it.

Question 9

I'm saying that 1 do a one time generation of a histogram for every distinct word in every row (to be shared across all future queries). So before assigning the weights in the real query the histogram is providing this info: '15301530' = 1, IC228 = 411, picco = 960, bing = xxx, bang, yyy, etc.

Question 10

See updated answer. That data can be found in sys.dm_fts_index_keywords() but I do not presume to speculate on what the optimal rebalancing expressions would be for you in the context of this comment.

score 6 · Accepted Answer · 2025-08-21 15:54:46Z

A match to a statistically rare words deflates the score of the row versus a competing row which has equal match value absent the rare word.

To plagiarize from this StackOverflow answer:

The algorithm for the full-text index being used by SQL Server is very heavily weighted to give more occurrences of a text match a better rank (bumping them up higher in your results).

From MSDN:
CONTAINSTABLE ranking uses the following algorithm:
StatisticalWeight = Log2( ( 2 + IndexedRowCount ) / KeyRowCount )
Rank = min( MaxQueryRank, HitCount * 16 * StatisticalWeight / MaxOccurrence )

In your data seeding repro, the SearchableDescription expression is:

case
 when Number = 0
 then '15301530 Picco Ic228 bing bang boom'
 when Number <= 410
 then 'Picco Ic228 bing bang boom'
 when Number <= 959
 then 'Picco bing bang boom'
 else 'lotta other rows ' + convert(varchar(10), Number)
end

The effect of this expression is that the word 15301530 becomes more rare relative to Picco & Ic228 as you increase the rowcount up until 960 rows. After 960, the lotta other rows rows start to flatten the relative rarity these words.

Running your same @aboutPredicateOpt tests while incrementing the rowcount, we observe a tipping point at about 3,500 rows. Here the word 15301530 becomes so rare that it has net negative ranking effect in the presence of popular sibling words Picco & Ic228 when given equal weight in search.

My own local testing reveals the following trends for your 3rd test:

IsAbout(
 15301530 weight(0.1),
 PICCO weight(0.1),
 IC228 weight(0.1) 
)

scatter plot with logarithmic x axis showing weak negative trend from 500% in the top left to 0% in the bottom right

Source data:

rare_word_rank is the rank for OrderNumber 6403072
base_rank is the next highest rank as seen in OP screencaps

rowcount	rare_word_rank	base_rank	relative_rank
7	517	117	442%
10	517	117	442%
100	464	117	397%
1,000	379	181	209%
2,000	378	315	120%
3,000	345	315	110%
3,500	356	376	95%
4,000	369	444	83%
5,000	337	444	76%
10,000	371	523	71%
50,000	297	671	44%
100,000	283	687	41%
150,000	269	681	40%
200,000	269	681	40%

If you want to dynamically re-weight your search terms, you can inspect the sys.dm_fts_index_keywords DMV; presumably interpolating the relative values through some expression into your desired @aboutPredicateOpt.

select 
 ik.keyword,
 ik.display_term,
 ik.column_id,
 ik.document_count 
from sys.dm_fts_index_keywords(
 db_id(),
 object_id(N'QuickSearchProducts_Temp')
) as ik
where -- ik.document_count > 1 or 
 ik.display_term in (
 N'15301530',
 N'picco',
 N'ic228',
 N'END OF FILE' -- fts index rowcount
 )
;

For our 3,500 rowcount example this would show:

display_term	document_count
15301530	1
ic228	411
picco	960
END OF FILE	3,500

Presumably there exists a transformation on these relative values to deterministically flatten the impact of relative scarcity, but if so it is beyond my ability to parse it out of the ranking algorithm given above.

Great info! So rarity can actually reduce the rank, that is counterintuitive and would like someone to chime in on why this is beneficial for more typical FTS cases in general. Then in the formula is it possible describe what the values of each term would be for this example data? Then when specifying the weights, say I knew in advance the row counts for each individual term, is there a formula to specify weights that can "back out" the effect so that the row(s) with the most matching terms (count) is much more likely to be to top tier?
"...say I knew in advance the row counts for each individual term..." - you're talking about materializing the results of your ContainsTable() expression & running follow-on evaluations. You can of course do this, but not within the scope of the ContainsTable() function nor its ranking algorithm. Dynamic re-calculation of the weights after having already scanned the complete resultset seems possibly redundant to me, but perhaps you have a use case for it.
I'm saying that 1 do a one time generation of a histogram for every distinct word in every row (to be shared across all future queries). So before assigning the weights in the real query the histogram is providing this info: '15301530' = 1, IC228 = 411, picco = 960, bing = xxx, bang, yyy, etc.
See updated answer. That data can be found in sys.dm_fts_index_keywords() but I do not presume to speculate on what the optimal rebalancing expressions would be for you in the context of this comment.

Stack Exchange Network

Why does SQL Server Full Text Search (FTS) rank go down when match count goes up due to highly rare word

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Why does SQL Server Full Text Search (FTS) rank go down when match count goes up due to highly rare word

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions