1

As I'm working on a graduate project (Textmining with SQL Server 2012 Semantic Search) I run into a situation where I need to post a question on this website, hoping someone can help me.

This question is about stoplists and stopwords in SQL Server 2012. I have set up a Proof of Concept where I'm trying to index documents and list statistically relevant key phrases with the new Semantic Search functionality. Because I don't want some words to be indexed, and therefore statistically relevant key phrases, I'm creating a stoplist to exclude those words.

Stoplist/stopwords for the English language (lcid 1033):

/* Create stoplist and add words */ 
CREATE FULLTEXT STOPLIST [naam van de stoplist];
 ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'beeten' LANGUAGE 'English';
 ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'centimeter' LANGUAGE 'English';
 ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'info' LANGUAGE 'English';
 ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'ruud' LANGUAGE 'English';
GO

Creating a Full-Text Catalog, Full-Text Index with the custom stoplist and semantics:

/* Full-Text catalog */
CREATE FULLTEXT CATALOG [ft] WITH ACCENT_SENSITIVITY = ON AS DEFAULT;
GO
/* Full-Text Index */
CREATE FULLTEXT INDEX ON [dbo].[Documents]
 ( file_stream Language 1033 STATISTICAL_SEMANTICS )
 KEY INDEX DocumentsFt
 WITH STOPLIST = [naam van de stoplist];
GO

I tried everything I can think of to check if i missed something:

 /*Select all words in the stoplist, with some debug information*/
 SELECT sys.fulltext_stoplists.stoplist_id AS [Stoplist id]
 , sys.fulltext_stoplists.name AS [Stoplist]
 , sys.database_principals.name AS [Owner]
 , sys.fulltext_languages.lcid AS [LCID]
 , sys.fulltext_languages.name AS [Taal]
 , sys.fulltext_stopwords.stopword AS [Stopwoord] 
 FROM sys.fulltext_languages
 INNER JOIN sys.fulltext_stopwords 
 ON sys.fulltext_stopwords.language_id = sys.fulltext_languages.lcid
 INNER JOIN sys.fulltext_stoplists 
 ON sys.fulltext_stopwords.stoplist_id = sys.fulltext_stoplists.stoplist_id
 INNER JOIN sys.database_principals ON sys.database_principals.principal_id = sys.fulltext_stoplists.principal_id
 WHERE sys.fulltext_stoplists.name = 'naam van de stoplist';
/* List with all Full-Text Indexes (with statistical_semantics) */
SELECT sys.fulltext_catalogs.name [Full-Text catalog] 
 , sys.indexes.name AS [Index] 
 , sys.indexes.type_desc AS [Index type]
 , sys.fulltext_indexes.is_enabled AS [Index in use]
 , sys.fulltext_stoplists.name AS [Stoplist]
 , sys.tables.name AS [Table]
 , sys.columns.name AS [Column]
 , sys.fulltext_index_columns.language_id AS [LCID]
 , sys.fulltext_languages.name AS [Language]
 , sys.fulltext_index_columns.statistical_semantics [Semantic]
FROM sys.fulltext_catalogs
INNER JOIN sys.fulltext_indexes 
 ON sys.fulltext_catalogs.fulltext_catalog_id = sys.fulltext_indexes.fulltext_catalog_id
INNER JOIN sys.fulltext_index_columns 
 ON sys.fulltext_indexes.object_id = sys.fulltext_index_columns.object_id
INNER JOIN sys.indexes 
 ON sys.fulltext_indexes.object_id = sys.indexes.object_id 
 AND sys.fulltext_indexes.unique_index_id = sys.indexes.index_id
INNER JOIN sys.index_columns 
 ON sys.indexes.object_id = sys.index_columns.object_id 
 AND sys.indexes.index_id = sys.index_columns.index_id
INNER JOIN sys.columns 
 ON sys.index_columns.object_id = sys.columns.object_id 
 AND sys.index_columns.column_id = sys.columns.column_id
INNER JOIN sys.tables 
 ON sys.fulltext_indexes.object_id = sys.tables.object_id
INNER JOIN sys.fulltext_languages 
 ON sys.fulltext_index_columns.language_id = sys.fulltext_languages.lcid
LEFT JOIN sys.fulltext_stoplists 
 ON sys.fulltext_indexes.stoplist_id = sys.fulltext_stoplists.stoplist_id 
WHERE sys.fulltext_index_columns.statistical_semantics = 1
ORDER BY sys.fulltext_catalogs.name
 ,sys.indexes.name
 ,sys.index_columns.key_ordinal;
/* Rebuild catalog */
ALTER FULLTEXT CATALOG [ft] REBUILD;
GO
/* Check status of the catalog rebuild */
/* 0 = Idle.
1 = Full population is in progress.
2 = Incremental population is in progress.
3 = Propagation of tracked changes is in progress.
4 = Background update index is in progress, such as automatic change tracking.
5 = Full-text indexing is throttled or pause 
*/
SELECT FULLTEXTCATALOGPROPERTY('ft', 'PopulateStatus') AS Status;
GO
/* Repopulate Full-Text Index */
ALTER FULLTEXT INDEX ON dbo.Documents START UPDATE POPULATION;
GO

All the commands above indicate things are set up correct.

When I look in the indexed words I still see the words in the stoplist, for example 'beeten'.

SELECT * 
FROM sys.dm_fts_index_keywords(DB_ID('SQLServerArticles'), OBJECT_ID('Documents'))
WHERE display_term = 'beeten';

I've even tried if the Full-Text parser is not working properly with the following statement.

SELECT special_term, display_term
FROM sys.dm_fts_parser
(' "testing for fruit and nuts centimeter, any type of Beeten" ', 1033, 8, 0)

This statement returns the following result:

Exact Match testing
Exact Match for
Exact Match fruit
Exact Match and
Exact Match nuts
Noise Word centimeter
Exact Match any
Exact Match type
Exact Match of
Noise Word beeten

This result indicates that the word 'beeten' is an Noise Word. Should this word be skipped while indexing? What did I miss?

Again: Because I don't want some words to be indexed, and therefore statistically relevant key phrases, I'm creating a stoplist to exclude those words.

asked Dec 6, 2012 at 13:52
2
  • Isn't it just acknowledging that there are stopwords/"noise words" in the string? The documentation notes that the FT engine still needs to note its position. My understanding of noise words is that they add no significance to the weight of the result. Searching for "ruud loves a centimeter" would return the same level of significance as "he loves it." I could be wrong, though. Commented Dec 7, 2012 at 15:51
  • @swasheck my problem is that the words in the stoplist are still used/ranked in the Semantic Search functions like 'SEMANTICKEYPHRASETABLE'. Commented Dec 7, 2012 at 16:59

1 Answer 1

2

If your System Locale setting is different from English, then there is a known bug ( Microsoft Connect Item 753596 ) where the system locale stopwords are used instead of the fulltext index stopwords for documents stored in filetables.

answered Dec 7, 2012 at 17:54
2
  • 1
    The connect item don't mention whether this would affect filestreams as well as filetables. I've got a hunch that yes, it would. If you can confirm that this does affect you, then I'd recommend voting for the connect item, and attempting the work around. Commented Dec 7, 2012 at 17:56
  • Thank you for pointing out the bug report. I've tested this and I ran in the same problems. After changing my system locale to English (US) the stopwords for the English languages still didn't work. Commented Dec 10, 2012 at 10:40

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.