4

I am trying to decode UTF-8 byte by byte with a CharsetDecoder. Is this possible?

The following code

public static void main(String[] args) {
 Charset cs = Charset.forName("utf8");
 CharsetDecoder decoder = cs.newDecoder();
 CoderResult res;
 byte[] source = new byte[] {(byte)0xc3, (byte)0xa6}; // LATIN SMALL LETTER AE in UTF8
 byte[] b = new byte[1];
 ByteBuffer bb = ByteBuffer.wrap(b);
 char[] c = new char[1];
 CharBuffer cb = CharBuffer.wrap(c);
 decoder.reset();
 b[0] = source[0];
 bb.rewind();
 cb.rewind();
 res = decoder.decode(bb, cb, false);
 System.out.println(res);
 System.out.println(cb.remaining());
 b[0] = source[1];
 bb.rewind();
 cb.rewind();
 res = decoder.decode(bb, cb, false);
 System.out.println(res);
 System.out.println(cb.remaining());
}

gives the following output.

UNDERFLOW
1
MALFORMED[1]
1

Why?

asked Feb 9, 2013 at 23:06
1
  • @jlordo these reasons are offtopic in this question Commented Feb 9, 2013 at 23:21

3 Answers

4

My theory is that when decode(...) returns UNDERFLOW on an incomplete sequence, the decoder leaves the unconsumed bytes in the input buffer instead of remembering them internally. At least, that is my reading of the javadoc.

Note this sentence in the javadoc:

"In any case, if this method is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation. "

But you are clobbering the (presumably) unread byte when you overwrite b[0] with the second byte and rewind the buffer.

You should be able to check whether my theory / interpretation is correct by looking at how many bytes remain unconsumed in bb after the first decode(...) call.
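A minimal sketch of that check, assuming a standard JDK. The class name `UnderflowCheck` and the method `unreadAfterLeadByte` are made up for illustration. It feeds only the lead byte 0xC3 and reports how many input bytes the decoder left unread. The javadoc does not promise a specific count, but on the OpenJDK UTF-8 decoder I would expect the result to be UNDERFLOW with one byte still remaining.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the question's code.
class UnderflowCheck {
    // Feeds only the lead byte of a 2-byte sequence and returns
    // the number of input bytes the decoder did not consume.
    static int unreadAfterLeadByte() {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {(byte) 0xc3});
        CharBuffer cb = CharBuffer.allocate(1);
        CoderResult res = decoder.decode(bb, cb, false);
        System.out.println(res);
        return bb.remaining();
    }

    public static void main(String[] args) {
        System.out.println("unread bytes: " + unreadAfterLeadByte());
    }
}
```

If the count is non-zero, the next decode(...) call has to see that byte again, followed by the new one.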


If my theory is correct then the answer is that you cannot decode UTF-8 by providing the decoder with byte buffers containing exactly one byte. But you could implement byte-by-byte decoding by starting with a ByteBuffer containing one byte and adding extra bytes until the decoder succeeds in outputting a character. Just make sure that you don't clobber input bytes that haven't been consumed yet.
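The approach described above might look like the following sketch. The class `IncrementalUtf8` and its methods are hypothetical names, and the buffer sizes rest on two assumptions: a UTF-8 sequence is at most 4 bytes long, and one byte of input completes at most one code point (at most 2 chars). The important call is compact(), which moves any unconsumed bytes to the front of the buffer so that the next byte is appended after them rather than over them.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the question's code.
class IncrementalUtf8 {
    static String decodeByteByByte(byte[] source) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(4);  // pending bytes of one character (assumed max 4)
        CharBuffer out = CharBuffer.allocate(2); // one code point is at most 2 chars
        StringBuilder result = new StringBuilder();
        for (byte b : source) {
            in.put(b);     // append the new byte after any pending ones
            in.flip();     // switch to read mode
            CoderResult r = decoder.decode(in, out, false);
            if (r.isError()) {
                throw new IllegalArgumentException("Bad input: " + r);
            }
            in.compact();  // keep unconsumed bytes and switch back to write mode
            drain(out, result);
        }
        in.flip();
        CoderResult r = decoder.decode(in, out, true); // signal end of input
        if (r.isError()) {
            throw new IllegalArgumentException("Truncated or bad input: " + r);
        }
        decoder.flush(out);
        drain(out, result);
        return result.toString();
    }

    // Moves whatever the decoder produced into the result and empties the buffer.
    private static void drain(CharBuffer out, StringBuilder result) {
        out.flip();
        result.append(out);
        out.clear();
    }

    public static void main(String[] args) {
        byte[] source = {(byte) 0xc3, (byte) 0xa6}; // LATIN SMALL LETTER AE in UTF-8
        System.out.println(decodeByteByByte(source));
    }
}
```

The final decode(...) call with endOfInput set to true is what reports a sequence that was cut off at the end of the input.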

Note that decoding like this is not efficient. The API design is optimized for decoding a large number of bytes in one go.

answered Feb 9, 2013 at 23:59
4
  • Yes I also noticed this now. But it is strange that this implementation relies on me to copy unconsumed bytes to the new buffer. Also this means that buffer can't be shorter than the longest character decoded. Particularly this means that it is IMPOSSIBLE to decode byte by byte. Commented Feb 10, 2013 at 0:04
  • @SuzanCioc - not impossible. You just have to do it slightly differently. Commented Feb 10, 2013 at 0:06
  • but how? Decoder won't accept one byte and won't remember it. So I am obliged to feed it with 2 bytes (in current case). So I need at least 2-byte buffer. No way to feed by byte! Commented Feb 10, 2013 at 0:12
  • @SuzanCioc - Yes you need a buffer with a capacity of up to 6 bytes. But you can still keep adding bytes one by one ... which should satisfy your higher-level requirement of byte-by-byte decoding. Think outside the box! Commented Feb 10, 2013 at 0:16
3

As has been said, UTF-8 uses 1-6 bytes per character. You need to add all the bytes of a character to the ByteBuffer before you decode. Try this:

public static void main(String[] args) {
 Charset cs = Charset.forName("utf8");
 CharsetDecoder decoder = cs.newDecoder();
 CoderResult res;
 byte[] source = new byte[] {(byte)0xc3, (byte)0xa6}; // LATIN SMALL LETTER AE in UTF8
 byte[] b = new byte[2]; //two bytes for this char
 ByteBuffer bb = ByteBuffer.wrap(b);
 char[] c = new char[1];
 CharBuffer cb = CharBuffer.wrap(c);
 decoder.reset();
 b[0] = source[0];
 b[1] = source[1];
 bb.rewind();
 cb.rewind();
 res = decoder.decode(bb, cb, false); //translates 2 bytes to 1 char
 System.out.println(cb.remaining()); //prints 0
 System.out.println(cb.get(0)); //prints latin ae
}
answered Feb 9, 2013 at 23:29
4
  • 2
    UTF-8 has anywhere from 1 to 6 bytes per character Commented Feb 9, 2013 at 23:34
  • How can I know in advance how many bytes I should allocate? Suppose I add one more byte, but the input still turns out to be malformed. Commented Feb 9, 2013 at 23:39
  • 1
    Allocate for six bytes. As long as the CharsetDecoder can read at least one full character at a time, it'll be happy; it'll just leave the extra bytes in the ByteBuffer, where you should compact them. Commented Feb 9, 2013 at 23:42
  • 3
    @LouisWasserman @SimonG. You're both wrong. UTF-8 uses at most 4 bytes per character. See this SO question or my blog post on this topic. Commented Aug 8, 2014 at 22:52
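For anyone who wants to check the 4-byte limit, here is a small sketch (the class `Utf8Lengths` and method `utf8Length` are hypothetical names) that encodes a few code points and reports how many bytes each one takes. The original UTF-8 design allowed sequences of up to 6 bytes, but RFC 3629 restricts UTF-8 to code points up to U+10FFFF, which fit in 4 bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class Utf8Lengths {
    // Returns the number of bytes Java's UTF-8 encoder produces for one code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII letter, Latin ae, a CJK ideograph, an emoji, the highest code point
        int[] samples = {0x41, 0xE6, 0x4F60, 0x1F602, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```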
0

Here is my solution. The following decodes a UTF-8 byte sequence byte by byte.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class Utf8ByteByByte {
    public static void main(String[] args) throws Exception {
        // The UTF-8 byte sequence that we'll decode
        ByteBuffer byteSequence = ByteBuffer.wrap(
            "Привет Hello 你好 こんにちは 안녕하세요,😂".getBytes(StandardCharsets.UTF_8)
        );
        StringBuilder decodeResult = new StringBuilder();
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer decodeBufIn = ByteBuffer.allocate(4);
        CharBuffer decodeBufOut = CharBuffer.allocate(2);
        // Due to the awful design of ByteBuffer, we need to maintain the write position ourselves
        int writePosition = 0;
        // Decode byte by byte
        while (byteSequence.remaining() > 0) {
            decodeBufIn.put(writePosition++, byteSequence.get());
            // Switch to read mode
            decodeBufIn.limit(writePosition);
            CoderResult r = decoder.decode(decodeBufIn, decodeBufOut, false);
            // Once the decoder produces output, consume it
            if (r.isUnderflow() || r.isOverflow()) {
                if (decodeBufOut.position() > 0) {
                    decodeBufOut.flip();
                    decodeResult.append(decodeBufOut);
                    decodeBufOut.clear();
                    decodeBufIn.clear();
                    writePosition = 0;
                }
            } else {
                // Malformed or unmappable input
                r.throwException();
            }
            // Switch to write mode
            decodeBufIn.limit(decodeBufIn.capacity());
            if (writePosition >= decodeBufIn.capacity()) {
                throw new IllegalStateException("This should never occur!");
            }
        }
        System.out.println(decodeResult);
    }
}
answered Aug 19, 2023 at 11:59
