python collation sort "shift-trimmed"

How would I make this test pass?

import icu  # PyICU

names = [
 "cote",
 "coté",
 "côte",
 "côté",
 "ReasonE",
 "Reason1",
 "ReasonĔ",
 "Reason Super",
 "ReasonÅ",
 "ReasonA",
 "Reasona",
 "Reasone",
 "death",
 "deluge",
 "de luge",
 "disílva John",
 "diSilva John",
 "di Silva Fred",
 "diSilva Fred",
 "disílva Fred",
 "di Silva John",
]
loc = icu.Locale("und-u-ka-shifted-kb-true")
c = icu.Collator.createInstance(loc)
assert sorted(names, key=c.getSortKey) == [
 "cote",
 "côte",
 "coté",
 "côté",
 "death",
 "deluge",
 "de luge",
 "di Silva Fred",
 "diSilva Fred",
 "disílva Fred",
 "di Silva John",
 "diSilva John",
 "disílva John",
 "Reason1",
 "Reasona",
 "ReasonA",
 "ReasonÅ",
 "Reasone",
 "ReasonE",
 "ReasonĔ",
 "Reason Super",
]

Background:

I'm trying to replicate the sorting behavior of a PostgreSQL database. As best I can tell, it uses custom rules for spaces and punctuation based on the 'shift-trimmed' option, along with 'backwards accent' ordering (kb-true or [backwards 2]).

ICU doesn't appear to support shift-trimmed and I'm not sure how else I could get these to sort "properly".

See collation settings and contextual sensitivity for more explanation.
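
To illustrate what "backwards accent" ordering means for the first four names, here is a toy sketch in pure Python. It is not ICU's algorithm: the `accent_key` helper is hypothetical, it treats each base letter's combining marks as that letter's secondary weight, and it ignores case and variable characters entirely.

```python
import unicodedata


def accent_key(word, backwards=False):
    """Toy two-level key: base letters first, then accents (hypothetical helper)."""
    primary = []
    secondary = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            # Attach the combining mark to the preceding base letter.
            if secondary:
                secondary[-1] += (ord(ch),)
        else:
            primary.append(ch.lower())
            secondary.append(())
    if backwards:
        # "Backwards accent": compare accents starting from the end of the word.
        secondary.reverse()
    return (tuple(primary), tuple(secondary))


words = ["cote", "coté", "côte", "côté"]
forward = sorted(words, key=accent_key)
backward = sorted(words, key=lambda w: accent_key(w, backwards=True))
print(forward)   # accents compared left to right
print(backward)  # accents compared right to left
```

With the accents compared from the end of the word, "côte" sorts before "coté", which is the order the test above expects.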

edited Oct 14, 2024 at 16:02

asked Oct 14, 2024 at 15:50

Marcel Wilson


What operating system are you using? The collation you are describing looks like a French shift-trimmed style sort, like the one used on glibc-based systems before glibc changed its collation to a shifted-style one.

– Andj
Commented Oct 23, 2024 at 20:06

In theory, the easiest cross-platform approach would be to use shifted rather than shift-trimmed. It sounds like you need compatibility with AWS postgres. The big change came with glibc 2.28, if memory serves. I suspect there are two approaches: 1) modify the collation elements in allkeys_CLDR.txt (not the DUCET version) and use it with pyuca, or 2) try to parse and modify the ICU sort keys. I'll have a look. The sort keys are byte arrays rather than collation elements, so I will see if I can backtrack.

– Andj
Commented Oct 25, 2024 at 2:40

May I ask why you need this in the first place?

– Olivier
Commented Mar 8, 2025 at 16:34

@MarcelWilson the ICU user guide has the following note: "Shift-Trimmed is more complicated to implement than all of the other options: When comparing strings, a lookahead (or equivalent) is needed to determine whether a non-variable character gets a zero quaternary weight (if no variables follow) or a high quaternary weight (if at least one variable follows). When building sort keys, trailing high/common quaternary weights are trimmed (backed out) at the end of the quaternary level."

– Andj
Commented Mar 12, 2025 at 1:17

@MarcelWilson, the key issue seems to be the quaternary weight can change for non-variable characters depending on whether a variable character follows it in the string. Additionally, other attributes can have an effect on how variable characters are processed.

– Andj
Commented Mar 12, 2025 at 8:19
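
The quaternary behavior described in the comments above can be sketched with a toy key builder. This is only an illustration under loud assumptions: `toy_key` and the `VARIABLE` table are hypothetical, the weights are made-up placeholders, secondary and tertiary levels are omitted, and neither ICU nor pyuca is involved. It follows the rule that variable characters are zeroed at the primary level and moved to a fourth level, other characters get a high fourth-level weight, and shift-trimmed then drops the trailing high weights.

```python
# Toy primary weights for variable characters (placeholder values, not real data).
VARIABLE = {" ": 0x0209, "-": 0x020D}
HIGH = 0xFFFF  # quaternary weight given to non-variable characters


def toy_key(text, mode="shifted"):
    """Build a (primary, quaternary) toy sort key; levels 2 and 3 are skipped."""
    primary = []
    quaternary = []
    for ch in text:
        if ch in VARIABLE:
            # Variable: ignored at the primary level, weight moves to level 4.
            quaternary.append(VARIABLE[ch])
        else:
            primary.append(ord(ch.lower()))
            quaternary.append(HIGH)
    if mode == "shift-trimmed":
        # Trim trailing high weights, so a string without variables
        # ends up with an empty quaternary level.
        while quaternary and quaternary[-1] == HIGH:
            quaternary.pop()
    return (tuple(primary), tuple(quaternary))


pair = ["de luge", "deluge"]
shifted = sorted(pair, key=lambda s: toy_key(s, "shifted"))
trimmed = sorted(pair, key=lambda s: toy_key(s, "shift-trimmed"))
print(shifted)  # ['de luge', 'deluge']
print(trimmed)  # ['deluge', 'de luge']
```

In this toy model the only difference between the two modes is the trimming step, and that alone is enough to flip "deluge" and "de luge" into the order the question's test expects. A real implementation would still need the secondary and tertiary levels in between.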
