I am a little confused about the difference between encoding and locale as it pertains to the PostgreSQL database.
When initializing the database, you can specify both an encoding and a locale. Correct me if I am wrong, but I assume the encoding defines how the database is actually stored on the computer. So if I specify UTF8, all the characters in the UTF8 character set would be valid characters. If I specified WIN1252, all those characters would be valid characters. What I don't get is where the locale comes into play. If I specify my encoding as UTF8 and then specify my locale as English_United States.1252, what exactly does that mean?
I think the WIN1252 character set is a subset of UTF8, so is the locale just specifying the subset of characters to be used from the UTF8 character set?
From what I have read, UTF8 can be used with ANY locale, so what is the point in specifying different encodings if UTF8 is so ubiquitous and the locale is really the one specifying the character set to be used?
Also, on Linux, the locale can be specified like so: en_US.utf8. So the database encoding is specified in the locale? If the encoding is specified in the locale, why even have an --encoding flag when initializing the database?
1 Answer
The encoding defines the basic rules for how characters are represented in binary format (as @a_horse explains in his comment). The server encoding does not have to match the client encoding: Postgres translates between the two where a conversion exists, and the dedicated setting client_encoding controls this.
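To make "represented in binary format" concrete, here is a minimal sketch in Python (not Postgres itself). It assumes that Python's cp1252 codec corresponds to what Postgres calls WIN1252; the helper names encoded_bytes and is_representable are made up for this illustration.

```python
# Encode the same text under two encodings and compare the raw bytes.
# Assumption: Python's "cp1252" codec stands in for Postgres' WIN1252.

def encoded_bytes(text: str, encoding: str) -> bytes:
    """Return the byte sequence that stores `text` under `encoding`."""
    return text.encode(encoding)


def is_representable(text: str, encoding: str) -> bool:
    """True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Same characters, different bytes depending on the encoding.
    for ch in ("A", "é", "€"):
        utf8_hex = encoded_bytes(ch, "utf-8").hex()
        win_hex = encoded_bytes(ch, "cp1252").hex()
        print(ch, "utf-8:", utf8_hex, "cp1252:", win_hex)

    # Characters outside the WIN1252 repertoire cannot be stored at all.
    print(is_representable("日本", "utf-8"))
    print(is_representable("日本", "cp1252"))
```

For example, é is the single byte 0xE9 in WIN1252 but the two bytes 0xC3 0xA9 in UTF8, and Japanese characters have no WIN1252 representation at all. So the encoding decides both which characters can be stored and which bytes store them; the locale plays no part in that.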
The locale is a set of settings, which PostgreSQL splits into LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC and LC_TIME.
The settings of particular interest for you are LC_COLLATE (defines how strings are sorted) and LC_CTYPE (defines character classification: which characters count as letters, and what their upper- and lower-case forms are).
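The effect of these two settings can be imitated outside the database. The following is only a rough Python simulation, not how Postgres or the operating system's locale library actually works: codepoint_order approximates the "C" locale, while linguistic_order and ascii_only_upper are hypothetical helpers written for this example.

```python
import unicodedata


def codepoint_order(words):
    """Sort by raw code point, roughly what the "C" locale does."""
    return sorted(words)


def _base_letters(word):
    # Decompose accented letters (NFD), drop the combining marks and
    # case-fold, so that "Häh" compares like "hah".
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def linguistic_order(words):
    """Simplified stand-in for a language-aware LC_COLLATE (assumption)."""
    # Compare base letters first; the original word only breaks ties.
    return sorted(words, key=lambda w: (_base_letters(w), w))


def ascii_only_upper(text):
    """Uppercase only a-z, imitating character classification under "C"."""
    return "".join(c.upper() if "a" <= c <= "z" else c for c in text)


if __name__ == "__main__":
    words = ["Häh", "Hai", "Hah"]
    print(codepoint_order(words))    # ä sorts after every ASCII letter
    print(linguistic_order(words))   # ä sorts next to a
    print(ascii_only_upper("häh"))   # ä is not treated as a letter
    print("häh".upper())             # full Unicode case mapping
```

In both cases the stored bytes are identical. Only the ordering and the classification of the characters change, which is why the locale is configured separately from the encoding.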
In older versions, these two settings could not be changed after a database had been initialized. Since Postgres 9.1 you can at least override the collation setting when needed.
Thanks! That helped a lot. So just to make it clear, the LC_CTYPE has to be a subset of the characters represented in the encoding? So the LC_CTYPE tells you which characters in the encoding you're allowed to use and which ones you cannot use? – user972276, Aug 7, 2013 at 13:48
Häh sorted before or after Hah)