I am in the process of moving a pet project from PostgreSQL (9.2.2) to SQL Server (2012 Standard).
I've noticed an interesting phenomenon when querying unicode words. Given the definition:
CREATE TABLE [word](
[id] [int] IDENTITY(0,1) NOT NULL,
[value] [nvarchar](255) NULL
);
and the data:
insert into word (value) values (N'ῥύπῳ');
insert into word (value) values (N'ἀπὸ');
insert into word (value) values (N'ἀπό');
insert into word (value) values (N'ἐπὶ');
insert into word (value) values (N'ἐπί');
insert into word (value) values (N'ὑπὸ');
insert into word (value) values (N'ὑπό');
insert into word (value) values (N'πίῃ');
insert into word (value) values (N'λόγους');
insert into word (value) values (N'λόγχῃ');
insert into word (value) values (N'λόγων');
insert into word (value) values (N'ἀλόης');
a query for a particular word will return near matches. For example:
select * from word where value = N'ἀπὸ'
returns:
id value
102137 ῥύπῳ
102141 ἀπὸ
102142 ἀπό
102143 ἐπὶ
102144 ἐπί
102145 ὑπὸ
102146 ὑπό
102147 πίῃ
http://sqlfiddle.com/#!6/1ab66/1
However, the same pattern in PostgreSQL only returns the exact match. How can I get SQL Server to do the same?
(PostgreSQL fiddle link): http://sqlfiddle.com/#!12/c57a6/1
I have the distinct feeling that I'm missing something, but I'm not quite able to figure out what it is.
The database collation is SQL_Latin1_General_CP1_CI_AS
(which is also the server's collation) on a local installation.
1 Answer 1
The collation determines the comparison semantics.
If I try
CREATE TABLE [word](
[id] [int] IDENTITY(0,1) NOT NULL,
[value] [nvarchar](255) COLLATE Latin1_General_100_CI_AS NULL
);
It only returns ἀπὸ
.
Changing the suffix to AI
for accent insensitive returns ἀπό
also.
On my install I have tried every collation and 1526
return 1
(presumably AS
and BIN
collations), 1264
return 2 rows (presumably AI
) and 1095
return 8
.
From a quick look through this last group looks to include all the SQL
collations and 90
collations whereas all the 100
ones are in the first 2 groups so I presume this is some issue that has been fixed in the 2008 batch of collations. (See What's New in SQL Server 2008 Collations)
Script to try this yourself
DECLARE @Results TABLE
(
Count INT,
Collation SYSNAME
)
SET NOCOUNT ON;
DECLARE @N SYSNAME;
DECLARE @C1 AS CURSOR;
SET @C1 = CURSOR FAST_FORWARD FOR
SELECT name
FROM sys.fn_helpcollations();
OPEN @C1;
FETCH NEXT FROM @C1 INTO @N ;
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT @Results
EXEC('SELECT COUNT(*), ''' + @N + ''' from word where value = N''ἀπὸ'' COLLATE ' + @N)
FETCH NEXT FROM @C1 INTO @N ;
END
SELECT *
FROM @Results
ORDER BY Count DESC