I am trying to convert an ISO 8859-1 encoded string to UTF-8.
The following function works with my test data, which contains German umlauts, but I'm not quite sure what source encoding the rune(b) conversion assumes. Does it assume some default encoding, e.g. ISO 8859-1, or is there any way to tell it what encoding to use?
func toUtf8(iso8859_1_buf []byte) string {
	// length 0, capacity len*4: passing a non-empty slice to NewBuffer
	// would leave its zero bytes in front of the written runes
	var buf = bytes.NewBuffer(make([]byte, 0, len(iso8859_1_buf)*4))
	for _, b := range iso8859_1_buf {
		r := rune(b)
		buf.WriteRune(r)
	}
	return buf.String()
}
Comments:
- By the way, you do mean iso8859-1, right? – ANisus, Nov 22, 2012
- Yes, sorry about the confusion, I've edited it. – zeroc8, Nov 22, 2012
- Explicit conversion from ISO-8859 encoded strings can be made by using golang.org/x/text/encoding/charmap: play.golang.org/p/D_lccJAqWtf – Matthias Wiedemann, May 15, 2020
3 Answers
rune is an alias for int32, and when it comes to encoding, a rune is assumed to hold a Unicode code point. So the value b in rune(b) should be a Unicode value. For 0x00 - 0xFF this value is identical to Latin-1, so you don't have to worry about it.
Then you need to encode the runes into UTF-8. But this encoding is done simply by converting a []rune to string.
This is an example of your function without using the bytes package:
func toUtf8(iso8859_1_buf []byte) string {
buf := make([]rune, len(iso8859_1_buf))
for i, b := range iso8859_1_buf {
buf[i] = rune(b)
}
return string(buf)
}
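A quick check of this version (a sketch; the byte values are "Grüße" in ISO 8859-1, where ü is 0xFC and ß is 0xDF):

```go
package main

import "fmt"

func toUtf8(iso8859_1_buf []byte) string {
	buf := make([]rune, len(iso8859_1_buf))
	for i, b := range iso8859_1_buf {
		buf[i] = rune(b)
	}
	return string(buf)
}

func main() {
	latin1 := []byte{0x47, 0x72, 0xFC, 0xDF, 0x65} // "Grüße" in ISO 8859-1
	s := toUtf8(latin1)
	fmt.Println(s)      // Grüße
	fmt.Println(len(s)) // 7: ü and ß each take two bytes in UTF-8
}
```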
4 Comments
0x41) while Unicode uses 4 bytes (eg. 0x00000041). What might confuse is the UTF-8 encoding where only 0x00 - 0x7F are encoded in the same way as Latin-1, using a single byte.? placeholder character). You wanted something like this: play.golang.org/p/dBrx_ZmrsMN The effect of
r := rune(expression)
is:
- Declare variable r with type rune (alias for int32).
- Initialize variable r with the value of expression.
No (re)encoding is involved, and the only way to apply one is to explicitly write/handle some re-encoding in code. Luckily, in this case no re-encoding is necessary: Unicode incorporated the code points of ISO 8859-1 directly, in the same way it incorporated ASCII. (If I checked correctly here.)
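That can be seen directly (a minimal sketch):

```go
package main

import "fmt"

func main() {
	b := byte(0xE4)        // 'ä' in ISO 8859-1
	r := rune(b)           // plain value conversion, no re-encoding: r == 0xE4
	fmt.Println(r == 'ä')  // true: U+00E4 has the same numeric value
	fmt.Println(string(r)) // "ä": rune-to-string conversion emits UTF-8 (0xC3 0xA4)
}
```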
3 Comments
latin1 = []byte{0x52, 0xE4, 0x76}, it will not convert well to string. (It says Räv in Latin-1)ä, not ö in ISO 8859-1: en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout. Check it here: play.golang.org/p/s4TfzJUa7m To convert between any of the ISO-8859 variants (and other popular legacy code pages) and UTF-8 use golang.org/x/text/encoding/charmap.
To decode this latin1 encoding:
// rivière, è latin1-encoded as 232 (0xe8)
bLatin1 := []byte{114, 105, 118, 105, 232, 114, 101}
The Charmap type has a NewDecoder method that returns a *encoding.Decoder:
dec8859_1 := charmap.ISO8859_1.NewDecoder()
This decoder can decode bytes directly:
bUTF8, _ := dec8859_1.Bytes(bLatin1)
fmt.Printf("% #x\n", bLatin1) // 0x72 0x69 0x76 0x69 0xe8 0x72 0x65
fmt.Printf("% #x\n", bUTF8) // 0x72 0x69 0x76 0x69 0xc3 0xa8 0x72 0x65
If you have a file with a legacy encoding:
f, _ := os.Create("foo.txt")
f.Write(bLatin1)
f.Write([]byte("\n"))
f.Write([]byte("Seine"))
use the decoder to wrap your file's Reader:
f, _ = os.Open("foo.txt")
rLatin1 := dec8859_1.Reader(f)
and pass the new decoder-Reader:
scanner := bufio.NewScanner(rLatin1)
for i := 1; scanner.Scan(); i++ {
fmt.Printf("line %d: %s\n", i, scanner.Text())
}
// line 1: rivière
// line 2: Seine