How would I make this test pass?

import icu

names = [
 "cote",
 "coté",
 "côte",
 "côté",
 "ReasonE",
 "Reason1",
 "ReasonĔ",
 "Reason Super",
 "ReasonÅ",
 "ReasonA",
 "Reasona",
 "Reasone",
 "death",
 "deluge",
 "de luge",
 "disílva John",
 "diSilva John",
 "di Silva Fred",
 "diSilva Fred",
 "disílva Fred",
 "di Silva John",
]
loc = icu.Locale("und-u-ka-shifted-kb-true")
c = icu.Collator.createInstance(loc)
assert sorted(names, key=c.getSortKey) == [
 "cote",
 "côte",
 "coté",
 "côté",
 "death",
 "deluge",
 "de luge",
 "di Silva Fred",
 "diSilva Fred",
 "disílva Fred",
 "di Silva John",
 "diSilva John",
 "disílva John",
 "Reason1",
 "Reasona",
 "ReasonA",
 "ReasonÅ",
 "Reasone",
 "ReasonE",
 "ReasonĔ",
 "Reason Super",
]

Background:

I'm trying to replicate the sorting behavior of a PostgreSQL database. As best I can figure, it uses custom rules for spaces and punctuation based on the 'shift-trimmed' option, along with 'backwards accent' ordering (kb-true, or [backwards 2] in rule syntax).

ICU doesn't appear to support shift-trimmed and I'm not sure how else I could get these to sort "properly".

See collation settings and contextual sensitivity for more explanation.
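
To illustrate what the 'backwards accent' setting changes, here is a toy sketch in plain Python. It is not ICU's implementation: `base_letters` and `toy_secondary` are hypothetical helpers that reduce each word to its unaccented letters plus one made-up accent weight per letter (0 for no accent, 1 for any accent). The only thing that differs between the two sorts is whether the accent weights are compared from the start of the word or from the end.

```python
import unicodedata

def base_letters(text):
    """Toy primary key: the text with combining marks stripped."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def toy_secondary(text):
    """Toy secondary key: 1 for each base letter that carries an accent, else 0."""
    weights = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and weights:
            weights[-1] = 1   # mark the preceding base letter as accented
        else:
            weights.append(0)
    return weights

words = ["cote", "coté", "côte", "côté"]

# Forward accents: the first accent difference from the left decides.
forward = sorted(words, key=lambda w: (base_letters(w), toy_secondary(w)))

# Backwards accents: the accent weights are compared from the end of the word.
backward = sorted(words, key=lambda w: (base_letters(w), toy_secondary(w)[::-1]))

print(forward)   # ['cote', 'coté', 'côte', 'côté']
print(backward)  # ['cote', 'côte', 'coté', 'côté']
```

The backwards result matches the order of the first four names in the test above.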

asked Oct 14, 2024 at 15:50
  • What operating system are you using? The collation you are describing looks like a French shift-trimmed sort, like the one used on glibc-based systems before glibc changed its collation to a shifted style. Commented Oct 23, 2024 at 20:06
  • In theory, the easiest cross-platform approach would be to use shifted rather than shift-trimmed. It sounds like you need compatibility with AWS postgres. The big change came with 2.28, if memory serves. I suspect there are two approaches: 1) modify the collation elements in allkeys_CLDR.txt (not the DUCET version) and use it with pyuca, or 2) try to parse and modify the ICU sort keys. I'll have a look. The sort keys are byte arrays rather than collation elements, so I will see if I can backtrack. Commented Oct 25, 2024 at 2:40
  • May I ask why you need this in the first place? Commented Mar 8, 2025 at 16:34
  • @MarcelWilson the ICU User Guide has the following note: "Shift-Trimmed is more complicated to implement than all of the other options: When comparing strings, a lookahead (or equivalent) is needed to determine whether a non-variable character gets a zero quaternary weight (if no variables follow) or a high quaternary weight (if at least one variable follows). When building sort keys, trailing high/common quaternary weights are trimmed (backed out) at the end of the quaternary level." Commented Mar 12, 2025 at 1:17
  • @MarcelWilson, the key issue seems to be that the quaternary weight of a non-variable character can change depending on whether a variable character follows it in the string. Additionally, other attributes can affect how variable characters are processed. Commented Mar 12, 2025 at 8:19
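
The quaternary-weight rule quoted in the comments above can be sketched in plain Python. This is a toy model, not ICU or PostgreSQL code, and it covers only the quaternary level; a real comparison would first go through the primary, secondary and tertiary levels. The helper `toy_elements`, the constant `SPACE_PRIMARY`, and the assumption that the space is the only variable character are all made up for illustration.

```python
HIGH = 0xFFFF           # quaternary weight of a non-variable element under "shifted"
SPACE_PRIMARY = 0x0209  # illustrative weight for the space; assumed, not read from ICU data

def toy_elements(text):
    """Hypothetical helper: (weight, is_variable) per character; only ' ' is variable."""
    return [(SPACE_PRIMARY, True) if ch == " " else (ord(ch), False) for ch in text]

def shifted_quaternary(elements):
    """Shifted: variables keep their own weight, everything else gets HIGH."""
    return [weight if is_variable else HIGH for weight, is_variable in elements]

def shift_trimmed_quaternary(elements):
    """Shift-trimmed: same as shifted, then trailing HIGH weights are dropped."""
    weights = shifted_quaternary(elements)
    while weights and weights[-1] == HIGH:
        weights.pop()
    return weights

a, b = toy_elements("deluge"), toy_elements("de luge")

# Shifted: "de luge" sorts first, because the space weight is lower than HIGH.
print(shifted_quaternary(b) < shifted_quaternary(a))              # True

# Shift-trimmed: "deluge" has no weights left, so it sorts first.
print(shift_trimmed_quaternary(a))                                # []
print(shift_trimmed_quaternary(a) < shift_trimmed_quaternary(b))  # True
```

In this toy model the two options disagree on the "deluge" / "de luge" pair, and only shift-trimmed reproduces the order the test expects for those two names.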
