I am preparing to upgrade a PostgreSQL server to Debian 10 (Buster), keeping in mind the warning in the Debian release notes about glibc changes that affect PostgreSQL. The PostgreSQL wiki gives two specific strings that sort differently under LC_COLLATE=en_US.UTF-8 with glibc versions before 2.28 (up to Debian 9) and from 2.28 onward (Debian 10):
( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.UTF-8 sort # 11 1-1 under Debian 9
( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.UTF-8 sort # 1-1 11 under Debian 10
From this warning I expected similar problems to occur with many string patterns. To understand this better, I created a table test with 1 million rows of random samples: each primary key is a random permutation of lowercase letters, digits, hyphens and German umlauts. Two additional rows hold the two special strings 11 and 1-1.
When I dump this table with SELECT * FROM test ORDER BY key \g test.out from a database with encoding UTF8 and collation en_US.UTF-8, once under Debian 9 and once under Debian 10, the sort order differs only for the two special strings. None of the other (random) strings seem to be affected.
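The comparison of the two dumps can be sketched roughly as follows. This is only a minimal sketch under assumptions: the file names passed on the command line are hypothetical, and each dump is assumed to contain exactly one key per line (for example unaligned, tuples-only psql output, so no header or row-count footer). It reports every pair of neighbouring keys from the old dump whose relative order is reversed in the new dump.

```python
import sys


def read_keys(path):
    """Read one key per line from a dump file (assumed UTF-8, no header/footer)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def adjacent_inversions(old_order, new_order):
    """Return neighbouring pairs of old_order that appear reversed in new_order."""
    # Map each key to its position in the new dump for O(1) lookups.
    position = {key: i for i, key in enumerate(new_order)}
    return [
        (a, b)
        for a, b in zip(old_order, old_order[1:])
        if position[a] > position[b]
    ]


# Usage (hypothetical file names): python compare_dumps.py old.out new.out
if __name__ == "__main__" and len(sys.argv) == 3:
    old_keys = read_keys(sys.argv[1])
    new_keys = read_keys(sys.argv[2])
    for a, b in adjacent_inversions(old_keys, new_keys):
        print(f"{a!r} came before {b!r} in the old dump but after it in the new one")
```

Checking only neighbouring pairs keeps the comparison linear in the number of rows. It shows where the order changes, but it does not list every pair of keys that swapped places.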
This makes me wonder what exactly changed in the locale en_US (en_US.UTF-8) in glibc 2.28. Does the change only concern the relative order of - and digits, and nothing else? Where in the package source code can the exact difference be seen (a URL into GitHub or similar)?
2 Answers
For reference, there's a project to collect the glibc collation into a library that one can link PostgreSQL against: https://github.com/awslabs/compat-collation-for-glibc
You can read Joe Conway's slides from PGCon 2023 at https://www.joeconway.com/presentations/glibc_issues-PGCon-2023.pdf or watch the video at https://www.youtube.com/watch?v=0E6O-V8Jato
Comment: While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review - Rohit Gupta, Jan 24, 2024 at 20:50
I'm not sure what the exact changes were, but even a minor change in sort order matters, because an index stores its entries on disk in that order. Using an index that was built with the old sort order can lead to collisions and/or corruption: data could be looked up and/or stored in the wrong place.
This requires dropping and rebuilding the affected indexes. The glibc source code can be found here:
https://packages.debian.org/buster/glibc-source
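Why a changed sort order breaks an existing index can be illustrated with a toy model. The two key functions below are assumptions chosen only to reproduce the 11 / 1-1 orders quoted in the question; they are not glibc's real collation rules. The sorted list stands in for a B-tree index, and the binary search stands in for an index lookup.

```python
from bisect import bisect_left


def old_key(s):
    # Toy "old" order (assumption): hyphens are ignored on the first pass,
    # and the shorter string wins ties, so "11" sorts before "1-1".
    return (s.replace("-", ""), len(s), s)


def new_key(s):
    # Toy "new" order (assumption): the hyphen is compared like any other
    # character, so "1-1" sorts before "11".
    return s


def index_lookup(index, value, key):
    """Binary search over the stored order, comparing entries with `key`."""
    keys = [key(entry) for entry in index]
    pos = bisect_left(keys, key(value))
    return pos < len(index) and index[pos] == value


rows = ["12", "1-1", "10", "1-2", "11"]
index = sorted(rows, key=old_key)  # "index" built before the upgrade

found_before = index_lookup(index, "1-1", old_key)  # comparison matches the stored order
found_after = index_lookup(index, "1-1", new_key)   # comparison no longer matches it
print(index, found_before, found_after)
```

With these five keys, the lookup for 1-1 succeeds with the old comparison and fails with the new one, even though the entry is still stored. Rebuilding the index under the new order, here by sorting again with new_key, makes the lookup work again.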
PostgreSQL added ICU support, I think in version 10, which allows the creation of custom collations:
https://www.2ndquadrant.com/en/blog/icu-support-postgresql-10/
This could be used to avoid these problems in the future.