Given this query:
declare
@aboutPredicateOpt nvarchar(4000) =
N'IsAbout(
PICCO weight(0.1),
IC228 weight(0.1)
)';
select RowId, BrandId, ObjectId, IsActive, OrderNumber, SearchableDescription, Rank
from QuickSearchProducts_Temp qsp
join Containstable(QuickSearchProducts_Temp, SearchableDescription, @aboutPredicateOpt) as ct
on ct.[key] = qsp.RowId
where qsp.IsActive = 1
order by iif(OrderNumber in ( '6403072'), 0, 1), ct.rank desc
A number of rows are returned that contain both terms. Note the row of interest has artificially been moved to top (using "order by" even though its rank is same).
Now add a very rare term to the aboutPredicateOpt:
declare
@aboutPredicateOpt nvarchar(4000) =
N'IsAbout(
15301530 weight(0.1),
PICCO weight(0.1),
IC228 weight(0.1)
)';
And the rank for the row of interest drops dramatically even though it is the only row (of 50k) that contains that (rare) string.
And for just the rare word:
declare
@aboutPredicateOpt nvarchar(4000) =
N'IsAbout(
15301530 weight(0.1)
)';
Only that row is returned which is great. Only the rare
Chats with AI said the FTS formula likely involves overall rarity across all rows and concluded that high rarity should increase the rank when a match is found. Seems to be the opposite though.
Note that each row's SearchableDescription contains a distinct set of terms so that the frequency per row of any given term matches should always be one (ignoring wildcards searches for now).
Now a few ways to compensate for this are being considered:
- Increase weights for "longer" or more complex terms or rarity (from histogram).
- Calculate a ranking that counts matches using like and combine it into the main ranking.
But mainly I want to understand why the Rank drops by 50% from the top tier it was formally in. (Sql Server 2022)
Table Definition with FTS index:
CREATE TABLE [QuickSearchProducts_Temp] (
RowId int IDENTITY(1, 1) NOT NULL,
ObjectId bigint NOT NULL,
BrandId smallint NOT NULL DEFAULT ((0)),
OrderNumber nvarchar(200) NOT NULL,
IsActive bit NOT NULL,
SearchableDescription nvarchar(1000) NULL
);
ALTER TABLE QuickSearchProducts_Temp add constraint PK_QSP primary key (RowId);
CREATE FULLTEXT CATALOG QuickSearchFullTextCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON QuickSearchProducts_Temp ( SearchableDescription )
KEY INDEX PK_QSP ON QuickSearchFullTextCatalog WITH CHANGE_TRACKING AUTO;
set identity_insert QuickSearchProducts_Temp ON;
-- The following inserts only 10k rows of the original 50k
-- but still illustrates the point with similar results.
--
insert into QuickSearchProducts_Temp
(RowId, ObjectId, BrandId, OrderNumber, IsActive, SearchableDescription)
select Number as RowId,
Number+10000 as ObjectId,
Number % 5 as BrandId,
iif(Number = 0, '6403072', convert(varchar(30), 10000-Number)) as OrderNumber,
1 as IsActive,
SearchableDescription =
case
when Number = 0
then '15301530 Picco Ic228 bing bang boom'
when Number <= 410
then 'Picco Ic228 bing bang boom'
when Number <= 959
then 'Picco bing bang boom'
else 'lotta other rows ' + convert(varchar(10), Number)
end
-- for this use numbers table or adjust https://www.sqlservercentral.com/steps/bones-of-sql-the-calendar-table
from Common.NumberSequence_0_10000_View
where Number < 10000
-
1Please add table definitions to complete an minimal reproducible examplePeter Vandivier– Peter Vandivier08/20/2025 16:45:22Commented Aug 20 at 16:45
-
@PeterVandivier, added table and index definitions.crokusek– crokusek08/20/2025 17:06:27Commented Aug 20 at 17:06
-
I cannot repro this locally using only the sample data you provide (that is, my repro attempt using only the data from your screencaps ranks the 3rd example as you expect). I suspect if you are able to find a subset of rows in an mcve that tips the balance in ranking, it will help clarify your question.Peter Vandivier– Peter Vandivier08/20/2025 17:35:25Commented Aug 20 at 17:35
-
1Tallied the counts for the 3 terms and added a population script.crokusek– crokusek08/20/2025 23:46:10Commented Aug 20 at 23:46
1 Answer 1
A match to a statistically rare words deflates the score of the row versus a competing row which has equal match value absent the rare word.
To plagiarize from this StackOverflow answer:
The algorithm for the full-text index being used by SQL Server is very heavily weighted to give more occurrences of a text match a better rank (bumping them up higher in your results).
From MSDN:
CONTAINSTABLE ranking uses the following algorithm:
StatisticalWeight = Log2( ( 2 + IndexedRowCount ) / KeyRowCount ) Rank = min( MaxQueryRank, HitCount * 16 * StatisticalWeight / MaxOccurrence )
In your data seeding repro, the SearchableDescription
expression is:
case
when Number = 0
then '15301530 Picco Ic228 bing bang boom'
when Number <= 410
then 'Picco Ic228 bing bang boom'
when Number <= 959
then 'Picco bing bang boom'
else 'lotta other rows ' + convert(varchar(10), Number)
end
The effect of this expression is that the word 15301530
becomes more rare relative to Picco
& Ic228
as you increase the rowcount up until 960 rows. After 960, the lotta other rows
rows start to flatten the relative rarity these words.
Running your same @aboutPredicateOpt
tests while incrementing the rowcount, we observe a tipping point at about 3,500 rows. Here the word 15301530
becomes so rare that it has net negative ranking effect in the presence of popular sibling words Picco
& Ic228
when given equal weight in search.
My own local testing reveals the following trends for your 3rd test:
IsAbout(
15301530 weight(0.1),
PICCO weight(0.1),
IC228 weight(0.1)
)
scatter plot with logarithmic x axis showing weak negative trend from 500% in the top left to 0% in the bottom right
Source data:rare_word_rank
is the rank for OrderNumber 6403072base_rank
is the next highest rank as seen in OP screencaps
rowcount | rare_word_rank | base_rank | relative_rank |
---|---|---|---|
7 | 517 | 117 | 442% |
10 | 517 | 117 | 442% |
100 | 464 | 117 | 397% |
1,000 | 379 | 181 | 209% |
2,000 | 378 | 315 | 120% |
3,000 | 345 | 315 | 110% |
3,500 | 356 | 376 | 95% |
4,000 | 369 | 444 | 83% |
5,000 | 337 | 444 | 76% |
10,000 | 371 | 523 | 71% |
50,000 | 297 | 671 | 44% |
100,000 | 283 | 687 | 41% |
150,000 | 269 | 681 | 40% |
200,000 | 269 | 681 | 40% |
If you want to dynamically re-weight your search terms, you can inspect the sys.dm_fts_index_keywords
DMV; presumably interpolating the relative values through some expression into your desired @aboutPredicateOpt
.
select
ik.keyword,
ik.display_term,
ik.column_id,
ik.document_count
from sys.dm_fts_index_keywords(
db_id(),
object_id(N'QuickSearchProducts_Temp')
) as ik
where -- ik.document_count > 1 or
ik.display_term in (
N'15301530',
N'picco',
N'ic228',
N'END OF FILE' -- fts index rowcount
)
;
For our 3,500 rowcount example this would show:
display_term | document_count |
---|---|
15301530 | 1 |
ic228 | 411 |
picco | 960 |
END OF FILE | 3,500 |
Presumably there exists a transformation on these relative values to deterministically flatten the impact of relative scarcity, but if so it is beyond my ability to parse it out of the ranking algorithm given above.
-
1Great info! So rarity can actually reduce the rank, that is counterintuitive and would like someone to chime in on why this is beneficial for more typical FTS cases in general. Then in the formula is it possible describe what the values of each term would be for this example data? Then when specifying the weights, say I knew in advance the row counts for each individual term, is there a formula to specify weights that can "back out" the effect so that the row(s) with the most matching terms (count) is much more likely to be to top tier?crokusek– crokusek08/21/2025 16:16:57Commented Aug 21 at 16:16
-
"...say I knew in advance the row counts for each individual term..." - you're talking about materializing the results of your
ContainsTable()
expression & running follow-on evaluations. You can of course do this, but not within the scope of theContainsTable()
function nor its ranking algorithm. Dynamic re-calculation of the weights after having already scanned the complete resultset seems possibly redundant to me, but perhaps you have a use case for it.Peter Vandivier– Peter Vandivier08/21/2025 16:27:13Commented Aug 21 at 16:27 -
I'm saying that 1 do a one time generation of a histogram for every distinct word in every row (to be shared across all future queries). So before assigning the weights in the real query the histogram is providing this info: '15301530' = 1, IC228 = 411, picco = 960, bing = xxx, bang, yyy, etc.crokusek– crokusek08/21/2025 16:44:10Commented Aug 21 at 16:44
-
See updated answer. That data can be found in
sys.dm_fts_index_keywords()
but I do not presume to speculate on what the optimal rebalancing expressions would be for you in the context of this comment.Peter Vandivier– Peter Vandivier08/21/2025 17:27:42Commented Aug 21 at 17:27
Explore related questions
See similar questions with these tags.