Postgres encoding issue - find values that have no equivalent in (client) encoding

Question 1

I have Postgresql database that internally uses UTF-8 encoding. Some clients that connect to the database use LATIN 2 (ISO 8859-2) client encoding when they connect to database. I can't change the encoding on clients. The clients are selecting values from a table that contains text. The problem is that some (not all but a small percentage) rows in this table contain characters that can not be converted to LATIN 2 so there is an error when clients try to select this rows.

I can simulate this with psql:

database=> set client_encoding = 'LATIN2';
database=> select text_field from some_table;
ERROR: character with byte sequence 0xc4 0x9f in encoding "UTF8" has no equivalent in encoding "LATIN2"

Since only a small percentage of the rows contain 'non convertible' characters I can select rows that have 'convertible' if i put the limit 10 because first 10 rows have supported characters.

database=> set client_encoding = 'LATIN2';
database=> select text_field from some_table limit 10;
..... works fine, I get 10 rows with data since all of them can be converted.

Can I somehow select the problematic rows that are causing errors with UTF-8 to LATIN2 conversion? I would like to isolate and modify (update or delete, whatever...) this rows but I can't find a way to select only this rows.

I tried the following that is not working:

With '^[[:ascii:]]*$' - not OK because this select non ascii characters (hex value > 127) that have equivalents in LATIN2 so the conversion for them works:

select * from some_table where text_field !~ '^[[:ascii:]]*$';

Comparing char_length and octet_length - not OK since it also selects rows with characters that have equivalents in LATIN2.

select * from some_table where WHERE char_length(text_field)!=octet_length(text_field);

Question 2

I'd use this SQL function:

create function is_latin2(text) returns boolean as
$$
 select 1ドル ~ '^[ !"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Ą ̆Ł¤ĽŚ§ ̈ŠŞŤŹŽŻ°ą ̨ł ́ľśˇ ̧šşťź ×ばつŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ ̇]*$'
$$ language sql strict immutable;

How it's built

According to https://en.wikipedia.org/wiki/ISO/IEC_8859-2, characters in LATIN2 have codes from 32 to 126 and 160 to 255, so a string containing all the valid characters can be generated with the following query (that should be run either in an UTF-8 or LATIN2-encoded database):

select convert_from(string_agg(set_byte('000円',0,i),'' order by i), 'LATIN2')
 from 
(select i from generate_series(32, 126) as i
 union select i from generate_series(160,255) as i) s;

It produces this result :

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Ą ̆Ł¤ĽŚ§ ̈ŠŞŤŹŽŻ°ą ̨ł ́ľśˇ ̧šşťź ×ばつŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ ̇

This can be injected into a regexp like your '^[[:ascii:]]*$' except that it has all the characters explicitly listed in the bracket expression. You might also compress the series of ascii letters into ranges.

4 characters must be quoted:

the single quote must be doubled
[ ] \ must be quoted with a backslash

score 2 · Accepted Answer · 2022-12-14 20:58:56Z

I'd use this SQL function:

create function is_latin2(text) returns boolean as
$$
 select 1ドル ~ '^[ !"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Ą ̆Ł¤ĽŚ§ ̈ŠŞŤŹŽŻ°ą ̨ł ́ľśˇ ̧šşťź ×ばつŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ ̇]*$'
$$ language sql strict immutable;

How it's built

According to https://en.wikipedia.org/wiki/ISO/IEC_8859-2, characters in LATIN2 have codes from 32 to 126 and 160 to 255, so a string containing all the valid characters can be generated with the following query (that should be run either in an UTF-8 or LATIN2-encoded database):

select convert_from(string_agg(set_byte('000円',0,i),'' order by i), 'LATIN2')
 from 
(select i from generate_series(32, 126) as i
 union select i from generate_series(160,255) as i) s;

It produces this result :

 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Ą ̆Ł¤ĽŚ§ ̈ŠŞŤŹŽŻ°ą ̨ł ́ľśˇ ̧šşťź ×ばつŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ ̇

This can be injected into a regexp like your '^[[:ascii:]]*$' except that it has all the characters explicitly listed in the bracket expression. You might also compress the series of ascii letters into ranges.

4 characters must be quoted:

the single quote must be doubled
[ ] \ must be quoted with a backslash

Stack Exchange Network

Postgres encoding issue - find values that have no equivalent in (client) encoding

1 Answer 1

How it's built

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Postgres encoding issue - find values that have no equivalent in (client) encoding

1 Answer 1

How it's built

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions