8

I am in the process of moving a pet project from PostgreSQL (9.2.2) to SQL Server (2012 Standard).

I've noticed an interesting phenomenon when querying unicode words. Given the definition:

CREATE TABLE [word](
 [id] [int] IDENTITY(0,1) NOT NULL,
 [value] [nvarchar](255) NULL 
 );

and the data:

insert into word (value) values (N'ῥύπῳ');
insert into word (value) values (N'ἀπὸ');
insert into word (value) values (N'ἀπό');
insert into word (value) values (N'ἐπὶ');
insert into word (value) values (N'ἐπί');
insert into word (value) values (N'ὑπὸ');
insert into word (value) values (N'ὑπό');
insert into word (value) values (N'πίῃ');
insert into word (value) values (N'λόγους');
insert into word (value) values (N'λόγχῃ');
insert into word (value) values (N'λόγων');
insert into word (value) values (N'ἀλόης');

a query for a particular word will return near matches. For example:

select * from word where value = N'ἀπὸ'

returns:

id value
102137 ῥύπῳ
102141 ἀπὸ
102142 ἀπό
102143 ἐπὶ
102144 ἐπί
102145 ὑπὸ
102146 ὑπό
102147 πίῃ

http://sqlfiddle.com/#!6/1ab66/1

However, the same pattern in PostgreSQL only returns the exact match. How can I get SQL Server to do the same?

(PostgreSQL fiddle link): http://sqlfiddle.com/#!12/c57a6/1

I have the distinct feeling that I'm missing something, but I'm not quite able to figure out what it is.

The database collation is SQL_Latin1_General_CP1_CI_AS (which is also the server's collation) on a local installation.

asked Jan 3, 2013 at 23:07
0

1 Answer 1

8

The collation determines the comparison semantics.

If I try

CREATE TABLE [word](
 [id] [int] IDENTITY(0,1) NOT NULL,
 [value] [nvarchar](255) COLLATE Latin1_General_100_CI_AS NULL 
 );

It only returns ἀπὸ.

Changing the suffix to AI for accent insensitive returns ἀπό also.

On my install I have tried every collation and 1526 return 1 (presumably AS and BIN collations), 1264 return 2 rows (presumably AI) and 1095 return 8.

From a quick look through this last group looks to include all the SQL collations and 90 collations whereas all the 100 ones are in the first 2 groups so I presume this is some issue that has been fixed in the 2008 batch of collations. (See What's New in SQL Server 2008 Collations)

Script to try this yourself

DECLARE @Results TABLE
(
Count INT,
Collation SYSNAME
)
SET NOCOUNT ON;
DECLARE @N SYSNAME;
DECLARE @C1 AS CURSOR;
SET @C1 = CURSOR FAST_FORWARD FOR 
SELECT name
FROM sys.fn_helpcollations();
OPEN @C1;
FETCH NEXT FROM @C1 INTO @N ;
WHILE @@FETCH_STATUS = 0
BEGIN
 INSERT @Results
 EXEC('SELECT COUNT(*), ''' + @N + ''' from word where value = N''ἀπὸ'' COLLATE ' + @N)
 FETCH NEXT FROM @C1 INTO @N ;
END
SELECT *
FROM @Results
ORDER BY Count DESC
answered Jan 4, 2013 at 0:33
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.