How would I make this test pass?
import icu  # PyICU

names = [
"cote",
"coté",
"côte",
"côté",
"ReasonE",
"Reason1",
"ReasonĔ",
"Reason Super",
"ReasonÅ",
"ReasonA",
"Reasona",
"Reasone",
"death",
"deluge",
"de luge",
"disílva John",
"diSilva John",
"di Silva Fred",
"diSilva Fred",
"disílva Fred",
"di Silva John",
]
loc = icu.Locale("und-u-ka-shifted-kb-true")
c = icu.Collator.createInstance(loc)
assert sorted(names, key=c.getSortKey) == [
"cote",
"côte",
"coté",
"côté",
"death",
"deluge",
"de luge",
"di Silva Fred",
"diSilva Fred",
"disílva Fred",
"di Silva John",
"diSilva John",
"disílva John",
"Reason1",
"Reasona",
"ReasonA",
"ReasonÅ",
"Reasone",
"ReasonE",
"ReasonĔ",
"Reason Super",
]
Background:
I'm trying to replicate the sort order of a PostgreSQL database. As best I can tell, it uses custom handling of spaces and punctuation based on the 'shift-trimmed' option, along with 'backwards accent' ordering (kb-true, or [backwards 2]).
ICU doesn't appear to support shift-trimmed, and I'm not sure how else I could get these to sort "properly".
See collation settings and contextual sensitivity for more explanation.
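To illustrate what I mean by 'backwards accent', here is a toy sketch using only the standard library. It is not how ICU computes secondary weights: the helper `accent_weights` is hypothetical, and using the code points of combining marks as stand-in weights is an assumption made purely for illustration. It only shows why comparing accents from the end of the string puts "côte" before "coté".

```python
import unicodedata


def accent_weights(text):
    """Toy secondary level: one tuple of combining marks per base character."""
    weights = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and weights:
            # Attach the accent to the preceding base character.
            weights[-1] += (ord(ch),)
        else:
            # Base character with no accent seen yet.
            weights.append(())
    return weights


words = ["cote", "coté", "côte", "côté"]

# Forward accents: the first accent difference from the left decides.
forward = sorted(words, key=accent_weights)

# Backwards accents: the first accent difference from the right decides.
backward = sorted(words, key=lambda w: accent_weights(w)[::-1])

print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

The four words are identical at the primary level, so comparing only this toy secondary level is enough to reproduce the order I expect for them.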
asked Oct 14, 2024 at 15:50
Marcel Wilson
-
What operating system are you using? The collation you describe looks like a French-style, shift-trimmed sort, like the one used on glibc-based systems before glibc changed its collation to a shifted style. – Andj, Oct 23, 2024 at 20:06
-
In theory, the easiest cross-platform approach would be to use shifted rather than shift-trimmed. It sounds like you need compatibility with AWS postgres. The big change came with glibc 2.28, if memory serves. I suspect there are two approaches: 1) modify the collation elements in allkeys_CLDR.txt (not the DUCET version) and use it with pyuca, or 2) try to parse and modify the ICU sort keys. I'll have a look. The sort keys are byte arrays rather than collation elements, so I will see if I can backtrack. – Andj, Oct 25, 2024 at 2:40
-
May I ask why you need this in the first place? – Olivier, Mar 8, 2025 at 16:34
-
@MarcelWilson The ICU User Guide has the following note: "Shift-Trimmed is more complicated to implement than all of the other options: When comparing strings, a lookahead (or equivalent) is needed to determine whether a non-variable character gets a zero quaternary weight (if no variables follow) or a high quaternary weight (if at least one variable follows). When building sort keys, trailing high/common quaternary weights are trimmed (backed out) at the end of the quaternary level." – Andj, Mar 12, 2025 at 1:17
-
@MarcelWilson, the key issue seems to be that the quaternary weight of a non-variable character can change depending on whether a variable character follows it in the string. Additionally, other attributes can affect how variable characters are processed. – Andj, Mar 12, 2025 at 8:19
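The quaternary behaviour described in the comments above can be sketched with a toy model. This is not ICU's implementation: the `VARIABLE` set, the `quaternary` helper, and the use of `ord()` as a stand-in weight are all assumptions for illustration. It compares only the quaternary level, which is valid here because "deluge" and "de luge" are identical at the first three levels once the space is ignored.

```python
HIGH = 0xFFFF  # toy "high" quaternary weight for non-variable characters

# Toy stand-in for the variable set (space and some punctuation);
# the real set comes from the collation data, not from this list.
VARIABLE = set(" -_.,;:!?'")


def quaternary(text, trimmed):
    """Toy quaternary level: variables keep a low weight, others get HIGH."""
    weights = [ord(ch) if ch in VARIABLE else HIGH for ch in text]
    if trimmed:
        # Shift-trimmed: back out the trailing high weights.
        while weights and weights[-1] == HIGH:
            weights.pop()
    return weights


pair = ["deluge", "de luge"]
shifted = sorted(pair, key=lambda s: quaternary(s, trimmed=False))
shift_trimmed = sorted(pair, key=lambda s: quaternary(s, trimmed=True))

print(shifted)        # ['de luge', 'deluge']
print(shift_trimmed)  # ['deluge', 'de luge']
```

With trimming, "deluge" has an empty quaternary level and therefore sorts first, which is the order the test in the question expects. Without trimming, the low weight of the space wins at the third position and "de luge" comes first.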