Confused over encoding/locale in PostgreSQL

Question 1

I am a little confused about the difference between encoding and locale as it pertains to a PostgreSQL database.

When initializing the database, you can specify both an encoding and a locale. Correct me if I am wrong but I assume the encoding defines what the database is actually stored as on the computer.

So if I specify UTF8, all the characters in the UTF8 character set would be valid characters. If I specified WIN1252, all those characters would be valid characters. What I don't get is where the locale comes into play. If I specify my encoding as UTF8 and then specify my locale as English_United States.1252, what exactly does that mean?

I think the WIN1252 character set is a subset of UTF8, so is the locale just specifying the subset of characters to be used from the UTF8 character set?

From what I have read, UTF8 can be used with ANY locale, so what is the point of specifying different encodings if UTF8 is so ubiquitous and the locale is really what specifies the character set to be used?

Also, on Linux, the locale can be specified like so: en_US.utf8. So is the database encoding specified in the locale? If so, why even have an --encoding flag when initializing the database?

Comment

Encoding is how characters are physically stored (one byte, two bytes, three bytes, ...); the locale determines how characters are sorted and compared (e.g. is Häh sorted before or after Hah?).
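
To make that distinction concrete, here is a small Python sketch (Python is used only for illustration; PostgreSQL is not involved). It encodes the same string in two encodings and then sorts a short list in two ways. The accent_insensitive_key helper is a hypothetical stand-in for a locale's collation rules and does not reproduce what any real locale does exactly.

```python
import unicodedata

word = "H\u00e4h"  # "Häh"

# Encoding: the same characters become different byte sequences.
utf8_bytes = word.encode("utf-8")      # "ä" takes two bytes in UTF-8
win1252_bytes = word.encode("cp1252")  # "ä" takes one byte in WIN1252


def accent_insensitive_key(s):
    """Hypothetical collation key: compare base letters first, then the raw string."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, s)


words = ["Hbh", "H\u00e4h", "Hah"]

# Sorting by raw code points puts "ä" (U+00E4) after every ASCII letter.
codepoint_order = sorted(words)
# A language-aware ordering typically keeps "ä" next to "a".
locale_like_order = sorted(words, key=accent_insensitive_key)

print(len(utf8_bytes), len(win1252_bytes))
print(codepoint_order)
print(locale_like_order)
```

The byte lengths differ because of the encoding alone; the two orderings differ because of the comparison rule alone. The two concerns are independent of each other.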

score 4 · Accepted Answer · 2013-08-07 12:56:44Z

The encoding defines the basic rules for how characters are represented in binary form (as @a_horse explains in his comment). The client and the server do not have to use the same encoding: when the client's encoding differs from the server encoding, Postgres converts between the two automatically. The dedicated setting client_encoding controls this.
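
The translation step can be pictured with a short Python sketch. It only mimics the idea of converting between a server encoding and a client encoding; it is not how Postgres implements it, and the function name convert_for_client is made up for illustration. In a real session the client side would be set with something like SET client_encoding TO 'WIN1252';.

```python
def convert_for_client(stored: bytes, server_encoding: str, client_encoding: str) -> bytes:
    """Re-encode bytes stored in server_encoding into client_encoding."""
    return stored.decode(server_encoding).encode(client_encoding)


# Text stored as UTF-8 on the "server", requested by a WIN1252 "client".
stored = "H\u00e4h".encode("utf-8")
sent = convert_for_client(stored, "utf-8", "cp1252")

# A character with no WIN1252 equivalent cannot be converted.
try:
    convert_for_client("\u65e5".encode("utf-8"), "utf-8", "cp1252")
    conversion_failed = False
except UnicodeEncodeError:
    conversion_failed = True

print(sent, conversion_failed)
```

The second call fails because the target encoding has no byte for that character. Postgres likewise reports an error when a stored character has no equivalent in the client encoding.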

The locale is a set of settings, which PostgreSQL splits into the following categories:

LC_COLLATE
LC_CTYPE
LC_MESSAGES
LC_MONETARY
LC_NUMERIC
LC_TIME
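
These category names come from the standard C/POSIX locale facilities of the operating system, so they can be inspected outside of Postgres too. As a rough illustration, this Python sketch prints the value of each category for the current process (not for any database); the variable names are arbitrary.

```python
import locale

# LC_MESSAGES is not defined on every platform, so look each name up defensively.
names = ["LC_COLLATE", "LC_CTYPE", "LC_MESSAGES", "LC_MONETARY", "LC_NUMERIC", "LC_TIME"]

current = {}
for name in names:
    category = getattr(locale, name, None)
    if category is not None:
        # Calling setlocale() without a locale argument only queries the setting.
        current[name] = locale.setlocale(category)

for name, value in current.items():
    print(f"{name} = {value}")
```

Each category can hold a different value, which is why the collation and the character classification can be chosen separately.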

The settings of particular interest for you are LC_COLLATE (defines how strings are sorted) and LC_CTYPE (defines character classification, such as what counts as a letter and what its upper-case equivalent is).
In older versions, these two settings could not be changed after a database had been initialized. Since Postgres 9.1 you can at least override the collation where needed, per column or per expression, with a COLLATE clause.
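
What LC_CTYPE governs in practice is classification: which characters count as letters, and what their upper-case and lower-case forms are. Here is a rough Python analogy, assuming that ASCII-only rules behave like the "C" locale and Unicode-aware rules behave like a language-specific one. It is only an analogy, not Postgres code.

```python
letter = "\u00e4"  # "ä"

# Unicode-aware classification, comparable to a language-specific LC_CTYPE.
aware_is_letter = letter.isalpha()
aware_upper = letter.upper()

# ASCII-only classification, comparable to the "C" locale: the bytes methods
# in Python only recognise ASCII letters.
raw = letter.encode("cp1252")
ascii_is_letter = raw.isalpha()
ascii_upper = raw.upper()

print(aware_is_letter, aware_upper)
print(ascii_is_letter, ascii_upper)
```

In both cases the character is stored and retrieved without trouble. Only the answer to "is this a letter, and what is its upper-case form?" changes.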

Thanks! That helped a lot. Just to make it clear: does LC_CTYPE have to be a subset of the characters represented in the encoding? In other words, does LC_CTYPE tell you which characters in the encoding you're allowed to use and which ones you cannot?
