16

I am trying to convert an ISO 8859-1 encoded string to UTF-8.

The following function works with my test data, which contains German umlauts, but I'm not quite sure what source encoding the rune(b) cast assumes. Is it assuming some kind of default encoding, e.g. ISO 8859-1, or is there any way to tell it what encoding to use?

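func toUtf8(iso8859_1_buf []byte) string {
	// Length 0 with spare capacity; a non-zero length would leave
	// leading zero bytes in front of the converted text.
	var buf = bytes.NewBuffer(make([]byte, 0, len(iso8859_1_buf)*4))
	for _, b := range iso8859_1_buf {
		r := rune(b)
		buf.WriteRune(r)
	}
	return buf.String()
}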
asked Nov 22, 2012 at 10:18
  • By the way, you do mean iso8859-1, right? Commented Nov 22, 2012 at 11:23
  • Yes, sorry about the confusion, I've edited it. Commented Nov 22, 2012 at 12:11
  • Explicit conversion from ISO-8859 encoded strings can be done with golang.org/x/text/encoding/charmap: play.golang.org/p/D_lccJAqWtf Commented May 15, 2020 at 9:23

3 Answers

21

rune is an alias for int32, and when it comes to encoding, a rune is assumed to hold a Unicode code point. So the value b in rune(b) is interpreted as a Unicode code point. For 0x00 - 0xFF the code points are identical to Latin-1, so you don't have to worry about it.

Then you need to encode the runes into UTF-8. That encoding is done simply by converting a []rune to string.

This is an example of your function without using the bytes package:

func toUtf8(iso8859_1_buf []byte) string {
	buf := make([]rune, len(iso8859_1_buf))
	for i, b := range iso8859_1_buf {
		buf[i] = rune(b)
	}
	return string(buf)
}
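
As a quick sanity check, here is a minimal sketch that runs the same conversion on the Latin-1 bytes for "Räv" and prints the resulting UTF-8 bytes. The name latin1ToUTF8 and the sample input are only illustrative assumptions, not part of any library.

```go
package main

import "fmt"

// latin1ToUTF8 is a hypothetical name for the conversion described above:
// each ISO 8859-1 byte is widened to the rune with the same numeric value,
// and the []rune -> string conversion performs the UTF-8 encoding.
func latin1ToUTF8(latin1 []byte) string {
	runes := make([]rune, len(latin1))
	for i, b := range latin1 {
		runes[i] = rune(b)
	}
	return string(runes)
}

func main() {
	in := []byte{0x52, 0xE4, 0x76} // "Räv" in ISO 8859-1 (3 bytes)
	out := latin1ToUTF8(in)
	fmt.Printf("%s\n", out)          // Räv
	fmt.Printf("% x\n", []byte(out)) // 52 c3 a4 76 (ä now takes 2 bytes)
}
```

The single byte 0xE4 becomes the two-byte sequence 0xC3 0xA4, which is why only the 0x00 - 0x7F range looks unchanged after the conversion.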
answered Nov 22, 2012 at 11:11

4 Comments

I thought only values up to 0x7f were identical, thanks for pointing that out.
The values in Unicode and Latin-1 are identical (Latin-1 can be considered the 0x00 - 0xFF subset of Unicode). But when you store the value, Latin-1 uses only 1 byte (e.g. 0x41) while a Go rune, being an int32, uses 4 bytes (e.g. 0x00000041). What might be confusing is the UTF-8 encoding, where only 0x00 - 0x7F are encoded in the same way as Latin-1, using a single byte.
@AdrienParrochia You have a 1) utf8-encoded string 2) decoded as latin1 3) copied to a Go byte slice where the faulty text is encoded as utf8 4) passed into a function that tries to reencode it as utf8. No. Just no :) . And during this, you've lost some data (shown by the unicode ? placeholder character). You wanted something like this: play.golang.org/p/dBrx_ZmrsMN
2

The effect of

r := rune(expression)

is:

  • Declare variable r with type rune (alias for int32).
  • Initialize variable r with the value of expression.

No (re)encoding is involved, and there is no way to tell the conversion which encoding to use; that is possible only by explicitly writing some re-encoding in code. Luckily, in this case no (re)encoding is necessary: Unicode incorporated the codes of ISO 8859-1 in the same way as it did those of ASCII, so the first 256 code points match. (If I checked correctly here)
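
To illustrate, here is a small sketch showing that the conversion only widens the number. The helper name widen is made up for this example; it does nothing beyond rune(b).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// widen is a hypothetical helper that does exactly what rune(b) does:
// it changes the type from byte to rune and keeps the numeric value.
func widen(b byte) rune {
	return rune(b)
}

func main() {
	b := byte(0xE4) // 'ä' in ISO 8859-1
	r := widen(b)

	fmt.Println(r == 'ä', r)     // true 228
	fmt.Printf("%U\n", r)        // U+00E4
	fmt.Println(utf8.RuneLen(r)) // 2 (bytes needed once encoded as UTF-8)

	// Putting the raw Latin-1 byte into a string without going through
	// a rune does not produce valid UTF-8.
	fmt.Println(utf8.ValidString(string([]byte{b}))) // false
}
```

The value is already the right code point after the conversion; the UTF-8 encoding only happens later, when the rune is written out or converted to a string.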

answered Nov 22, 2012 at 11:16

3 Comments

Reencoding is needed. Letters like ö are not encoded in the same way. If you have the byte slice latin1 = []byte{0x52, 0xE4, 0x76}, it will not convert well to string. (It says Räv in Latin-1)
But 0xE4 really is ä, not ö in ISO 8859-1: en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout. Check it here: play.golang.org/p/s4TfzJUa7m
Ah, I think I misunderstood. True that no reencoding is needed between Latin-1 and Unicode. Yes, the byte sequence says Räv
1

To convert between any of the ISO-8859 variants (and other popular legacy code pages) and UTF-8 use golang.org/x/text/encoding/charmap.

To decode this latin1 encoding:

// rivière, è latin1-encoded as 232 (0xe8)
bLatin1 := []byte{114, 105, 118, 105, 232, 114, 101}

the Charmap type has a NewDecoder method that returns a *encoding.Decoder:

dec8859_1 := charmap.ISO8859_1.NewDecoder()

This decoder can decode bytes directly:

bUTF8, _ := dec8859_1.Bytes(bLatin1)
fmt.Printf("% #x\n", bLatin1) // 0x72 0x69 0x76 0x69 0xe8 0x72 0x65
fmt.Printf("% #x\n", bUTF8) // 0x72 0x69 0x76 0x69 0xc3 0xa8 0x72 0x65

If you have a file with a legacy encoding:

f, _ := os.Create("foo.txt")
f.Write(bLatin1)
f.Write([]byte("\n"))
f.Write([]byte("Seine"))
f.Close() // close the file before reopening it for reading

use the decoder to wrap your file's Reader:

f, _ = os.Open("foo.txt")
rLatin1 := dec8859_1.Reader(f)

and pass the new decoding Reader to a scanner:

scanner := bufio.NewScanner(rLatin1)
for i := 1; scanner.Scan(); i++ {
	fmt.Printf("line %d: %s\n", i, scanner.Text())
}
// line 1: rivière
// line 2: Seine
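
For ISO 8859-1 specifically, the idea of a decoding Reader can also be sketched with the standard library alone. This is not how charmap is implemented; latin1Reader and decodeLines are hypothetical names, and the shortcut only works because every Latin-1 byte value equals its Unicode code point. For any other code page, use charmap as shown above.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// latin1Reader is a hypothetical, stdlib-only stand-in for a decoding
// Reader: it reads ISO 8859-1 bytes from src and yields UTF-8 bytes.
type latin1Reader struct {
	src     io.Reader
	pending []byte // UTF-8 bytes decoded but not yet returned
	err     error  // sticky error from src (usually io.EOF)
}

func (r *latin1Reader) Read(p []byte) (int, error) {
	if len(r.pending) == 0 && r.err == nil {
		raw := make([]byte, 512)
		var n int
		n, r.err = r.src.Read(raw)
		for _, b := range raw[:n] {
			// string(rune(b)) UTF-8-encodes the code point equal to b.
			r.pending = append(r.pending, string(rune(b))...)
		}
	}
	if len(r.pending) == 0 {
		return 0, r.err
	}
	n := copy(p, r.pending)
	r.pending = r.pending[n:]
	return n, nil
}

// decodeLines returns the lines of Latin-1 input as UTF-8 strings.
func decodeLines(latin1 []byte) []string {
	var lines []string
	scanner := bufio.NewScanner(&latin1Reader{src: bytes.NewReader(latin1)})
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines
}

func main() {
	// "rivière\nSeine" with è stored as the single byte 232 (0xe8)
	in := []byte{114, 105, 118, 105, 232, 114, 101, '\n', 'S', 'e', 'i', 'n', 'e'}
	for i, line := range decodeLines(in) {
		fmt.Printf("line %d: %s\n", i+1, line)
	}
}
```

The wrapper buffers decoded output because one input byte can expand to two UTF-8 bytes, so the caller's buffer may be too small to take everything from a single underlying read.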
answered Jul 27, 2023 at 23:31
