Post Timeline

added 114 characters in body
Remy Lebeau

Java's String is UTF-16 encoded. You are using StandardCharsets.UTF_8 to convert a UTF-16 String into a UTF-8 byte[] array. That part is fine. If you print out the byte values, you will see that you end up with 41 42 43 c3 89 20 c3 94 70 71 72 as expected for ABCÉ Ôpqr (É is c3 89 and Ô is c3 94).
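To check those byte values yourself, something like the following should work. This is only a minimal sketch: the class name Utf8HexDump and the helper toHex are made up for illustration, and the accented characters are written as Unicode escapes so the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.StandardCharsets;

class Utf8HexDump {
    // Formats each byte as two lowercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so negative byte values are not sign-extended.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00c9 is É and \u00d4 is Ô
        String input = "ABC\u00c9 \u00d4pqr";
        byte[] utf8Bytes = input.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8Bytes));
        // prints: 41 42 43 c3 89 20 c3 94 70 71 72
    }
}
```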

Then, you are constructing a new String from that byte[] array, but you are not specifying the encoding of the array. This is where you are going wrong. Java needs to convert the bytes back to UTF-16, so it will use its default charset, which is clearly not UTF-8 in your case, so the byte[] array will be misinterpreted. This is why you are getting the wrong output.

For example, in ISO-8859-1, byte c3 is character Ã, but bytes 89 and 94 fall in the C1 control range and have no printable characters defined. Whereas in Windows-1252, byte c3 is also character Ã, byte 89 is character ‰, and byte 94 is character ”.
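The misinterpretation can be reproduced without touching the JVM's default charset by naming the wrong charset explicitly. A rough sketch, assuming the JVM provides the windows-1252 charset (it is not one of the charsets every Java platform is required to support); MisdecodeDemo and decodeAs are illustrative names only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Decodes the bytes with an explicit charset. new String(byte[]) behaves
    // the same way, except that it picks the platform default charset.
    static String decodeAs(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "ABC\u00c9 \u00d4pqr".getBytes(StandardCharsets.UTF_8);

        // Assumption: this JVM ships the windows-1252 charset.
        Charset cp1252 = Charset.forName("windows-1252");

        // Wrong charsets: every byte becomes its own character (11 chars, not 9).
        String asCp1252 = decodeAs(utf8Bytes, cp1252);
        String asLatin1 = decodeAs(utf8Bytes, StandardCharsets.ISO_8859_1);
        // Right charset: the original 9-character text comes back.
        String asUtf8 = decodeAs(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println(asCp1252.length());
        System.out.println(asLatin1.length());
        System.out.println(asUtf8.length());
    }
}
```

With windows-1252 the two-byte sequences come back as Ã‰ and Ã”, which is the typical mojibake pattern for UTF-8 data read as a single-byte Windows code page.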

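If you want the byte[] array to be interpreted as UTF-8, then you have to specify that to the String constructor. Be aware that the array backing the ByteBuffer returned by encode() can be longer than the encoded data, so pass only the valid range, e.g.:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static void test() {
    String input = "ABCÉ Ôpqr";
    Charset utf8 = StandardCharsets.UTF_8;
    ByteBuffer buf = utf8.encode(input);
    // decode only the bytes between the buffer's position and limit, as UTF-8
    String decoded = new String(buf.array(), buf.arrayOffset() + buf.position(), buf.remaining(), utf8);
    // do something with decoded string...
}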
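One caveat with calling array() on the result of Charset.encode(): the backing array is allowed to be larger than the encoded data, in which case the unused tail would be decoded as extra NUL characters. A possible way around that is to copy out exactly the bytes between the buffer's position and limit before handing them to the String constructor. This is just a sketch; EncodeRoundTrip, encodedBytes and roundTrip are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeRoundTrip {
    // Encodes the string and copies only the bytes between the buffer's
    // position and limit, ignoring any unused capacity in the backing array.
    static byte[] encodedBytes(String s, Charset cs) {
        ByteBuffer buf = cs.encode(s);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Encodes to UTF-8 and decodes again, naming the charset on both sides.
    static String roundTrip(String s) {
        Charset utf8 = StandardCharsets.UTF_8;
        return new String(encodedBytes(s, utf8), utf8);
    }

    public static void main(String[] args) {
        String input = "ABC\u00c9 \u00d4pqr";
        System.out.println(encodedBytes(input, StandardCharsets.UTF_8).length); // prints: 11
        System.out.println(roundTrip(input).equals(input));                     // prints: true
    }
}
```

If a ByteBuffer is not needed at all, input.getBytes(StandardCharsets.UTF_8) returns an exactly sized array directly.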

Otherwise, you can use Charset.decode() instead, to match your Charset.encode():

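import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public static void test() {
    String input = "ABCÉ Ôpqr";
    Charset utf8 = StandardCharsets.UTF_8;
    // encode() and decode() use the same charset, and decode() honors the buffer's limit
    ByteBuffer buf = utf8.encode(input);
    String decoded = utf8.decode(buf).toString();
    // do something with decoded string...
}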
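Note that both new String(bytes, charset) and Charset.decode() silently replace malformed input with U+FFFD rather than failing. If you would rather detect that a byte[] array is not valid UTF-8, a CharsetDecoder configured to report errors is one option. A minimal sketch, with StrictUtf8 and isValidUtf8 as made-up names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Returns true only if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "ABC\u00c9 \u00d4pqr".getBytes(StandardCharsets.UTF_8);
        // The same text in ISO-8859-1: the lone bytes c9 and d4 are not valid UTF-8.
        byte[] bad = "ABC\u00c9 \u00d4pqr".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(good)); // prints: true
        System.out.println(isValidUtf8(bad));  // prints: false
    }
}
```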
