I am trying to convert an ISO 8859-1 encoded string to UTF-8.
The following function works with my test data, which contains German umlauts, but I'm not quite sure what source encoding the rune(b) conversion assumes. Does it assume some default encoding, e.g. ISO 8859-1, or is there any way to tell it what encoding to use?
func toUtf8(iso8859_1_buf []byte) string {
	// length 0, capacity len*4: passing a non-empty slice to NewBuffer
	// would leave its zero bytes in front of the written runes
	var buf = bytes.NewBuffer(make([]byte, 0, len(iso8859_1_buf)*4))
	for _, b := range iso8859_1_buf {
		r := rune(b)
		buf.WriteRune(r)
	}
	return buf.String()
}
Comments:
- By the way, you do mean iso8859-1, right? – ANisus, Nov 22, 2012
- Yes, sorry about the confusion, I've edited it. – zeroc8, Nov 22, 2012
- Explicit conversion from ISO-8859 encoded strings can be made by using golang.org/x/text/encoding/charmap: play.golang.org/p/D_lccJAqWtf – Matthias Wiedemann, May 15, 2020
3 Answers
rune is an alias for int32, and when it comes to encoding, a rune is assumed to hold a Unicode code point. So the value b in rune(b) should be a Unicode value. For 0x00 - 0xFF this value is identical to Latin-1, so you don't have to worry about it.
Then you need to encode the runes into UTF-8. But this encoding is done simply by converting a []rune to string.
This is an example of your function without using the bytes package:
func toUtf8(iso8859_1_buf []byte) string {
buf := make([]rune, len(iso8859_1_buf))
for i, b := range iso8859_1_buf {
buf[i] = rune(b)
}
return string(buf)
}
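A quick check of this version (a sketch; the byte values are "Grüße" in ISO 8859-1, where ü is 0xFC and ß is 0xDF):

```go
package main

import "fmt"

func toUtf8(iso8859_1_buf []byte) string {
	buf := make([]rune, len(iso8859_1_buf))
	for i, b := range iso8859_1_buf {
		buf[i] = rune(b)
	}
	return string(buf)
}

func main() {
	latin1 := []byte{0x47, 0x72, 0xFC, 0xDF, 0x65} // "Grüße" in ISO 8859-1
	s := toUtf8(latin1)
	fmt.Println(s)      // Grüße
	fmt.Println(len(s)) // 7: ü and ß each take two bytes in UTF-8
}
```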
4 Comments
0x41) while Unicode uses 4 bytes (eg. 0x00000041). What might confuse is the UTF-8 encoding where only 0x00 - 0x7F are encoded in the same way as Latin-1, using a single byte.? placeholder character). You wanted something like this: play.golang.org/p/dBrx_ZmrsMN The effect of
r := rune(expression)
is:
- Declare variable r with type rune (alias for int32).
- Initialize variable r with the value of expression.
No (re)encoding is involved, and the only way to apply one is to explicitly write/handle some re-encoding in code. Luckily, in this case no re-encoding is necessary: Unicode incorporated the code points of ISO 8859-1 directly, in the same way it incorporated ASCII. (If I checked correctly here.)
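That can be seen directly (a minimal sketch):

```go
package main

import "fmt"

func main() {
	b := byte(0xE4)        // 'ä' in ISO 8859-1
	r := rune(b)           // plain value conversion, no re-encoding: r == 0xE4
	fmt.Println(r == 'ä')  // true: U+00E4 has the same numeric value
	fmt.Println(string(r)) // "ä": rune-to-string conversion emits UTF-8 (0xC3 0xA4)
}
```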
3 Comments
latin1 = []byte{0x52, 0xE4, 0x76}, it will not convert well to string. (It says Räv in Latin-1)ä, not ö in ISO 8859-1: en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout. Check it here: play.golang.org/p/s4TfzJUa7m To convert between any of the ISO-8859 variants (and other popular legacy code pages) and UTF-8 use golang.org/x/text/encoding/charmap.
To decode this latin1 encoding:
// rivière, è latin1-encoded as 232 (0xe8)
bLatin1 := []byte{114, 105, 118, 105, 232, 114, 101}
The Charmap type has a NewDecoder method that returns a *encoding.Decoder:
dec8859_1 := charmap.ISO8859_1.NewDecoder()
This decoder can decode bytes directly:
bUTF8, _ := dec8859_1.Bytes(bLatin1)
fmt.Printf("% #x\n", bLatin1) // 0x72 0x69 0x76 0x69 0xe8 0x72 0x65
fmt.Printf("% #x\n", bUTF8) // 0x72 0x69 0x76 0x69 0xc3 0xa8 0x72 0x65
If you have a file with a legacy encoding:
f, _ := os.Create("foo.txt")
f.Write(bLatin1)
f.Write([]byte("\n"))
f.Write([]byte("Seine"))
use the decoder to wrap your file's Reader:
f, _ = os.Open("foo.txt")
rLatin1 := dec8859_1.Reader(f)
and pass the new decoder-Reader:
scanner := bufio.NewScanner(rLatin1)
for i := 1; scanner.Scan(); i++ {
fmt.Printf("line %d: %s\n", i, scanner.Text())
}
// line 1: rivière
// line 2: Seine