I have Postgresql database that internally uses UTF-8 encoding. Some clients that connect to the database use LATIN 2 (ISO 8859-2) client encoding when they connect to database. I can't change the encoding on clients. The clients are selecting values from a table that contains text. The problem is that some (not all but a small percentage) rows in this table contain characters that can not be converted to LATIN 2 so there is an error when clients try to select this rows.
I can simulate this with psql:
database=> set client_encoding = 'LATIN2';
database=> select text_field from some_table;
ERROR: character with byte sequence 0xc4 0x9f in encoding "UTF8" has no equivalent in encoding "LATIN2"
Since only a small percentage of the rows contain 'non convertible' characters I can select rows that have 'convertible' if i put the limit 10 because first 10 rows have supported characters.
database=> set client_encoding = 'LATIN2';
database=> select text_field from some_table limit 10;
..... works fine, I get 10 rows with data since all of them can be converted.
Can I somehow select the problematic rows that are causing errors with UTF-8 to LATIN2 conversion? I would like to isolate and modify (update or delete, whatever...) this rows but I can't find a way to select only this rows.
I tried the following that is not working:
With '^[[:ascii:]]*$' - not OK because this select non ascii characters (hex value > 127) that have equivalents in LATIN2 so the conversion for them works:
select * from some_table where text_field !~ '^[[:ascii:]]*$';
Comparing char_length and octet_length - not OK since it also selects rows with characters that have equivalents in LATIN2.
select * from some_table where WHERE char_length(text_field)!=octet_length(text_field);
1 Answer 1
I'd use this SQL function:
create function is_latin2(text) returns boolean as
$$
select 1ドル ~ '^[ !"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Ą ̆٤ĽŚ§ ̈ŠŞŤŹŽŻ°ą ̨ł ́ľśˇ ̧šşťź ×ばつŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ ̇]*$'
$$ language sql strict immutable;
How it's built
According to https://en.wikipedia.org/wiki/ISO/IEC_8859-2, characters in LATIN2 have codes from 32 to 126 and 160 to 255, so a string containing all the valid characters can be generated with the following query (that should be run either in an UTF-8 or LATIN2-encoded database):
select convert_from(string_agg(set_byte('000円',0,i),'' order by i), 'LATIN2')
from
(select i from generate_series(32, 126) as i
union select i from generate_series(160,255) as i) s;
It produces this result :
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ Ą ̆٤ĽŚ§ ̈ŠŞŤŹŽŻ°ą ̨ł ́ľśˇ ̧šşťź ×ばつŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ ̇
This can be injected into a regexp like your '^[[:ascii:]]*$'
except that it has all the characters explicitly listed in the bracket expression. You might also compress the series of ascii letters into ranges.
4 characters must be quoted:
- the single quote must be doubled
[ ] \
must be quoted with a backslash