I am trying to decode UTF-8 byte by byte with a CharsetDecoder. Is this possible?
The following code
```java
public static void main(String[] args) {
    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;
    byte[] source = new byte[] {(byte) 0xc3, (byte) 0xa6}; // LATIN SMALL LETTER AE in UTF-8
    byte[] b = new byte[1];
    ByteBuffer bb = ByteBuffer.wrap(b);
    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);
    decoder.reset();
    b[0] = source[0];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false);
    System.out.println(res);
    System.out.println(cb.remaining());
    b[0] = source[1];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false);
    System.out.println(res);
    System.out.println(cb.remaining());
}
```
gives the following output:

```
UNDERFLOW
1
MALFORMED[1]
1
```
Why?
- @jlordo these reasons are offtopic in this question – Suzan Cioc, Feb 9, 2013
3 Answers
My theory is that the problem with the way you are doing it is that in the "underflow" condition, the decoder leaves any unconsumed bytes in the input buffer. At least, that is my reading.
Note this sentence in the javadoc:
"In any case, if this method is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation. "
But you are clobbering the (presumably) unread byte.
You should be able to check whether my theory / interpretation is correct by looking at how many bytes remain unconsumed in `bb` after the first `decode(...)` call.
If my theory is correct then the answer is that you cannot decode UTF-8 by providing the decoder with byte buffers containing exactly one byte. But you could implement byte-by-byte decoding by starting with a ByteBuffer containing one byte and adding extra bytes until the decoder succeeds in outputting a character. Just make sure that you don't clobber input bytes that haven't been consumed yet.
Note that decoding like this is not efficient. The API design is optimized for decoding a large number of bytes in one go.
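The theory is easy to check in isolation: hand the decoder a lone UTF-8 lead byte and inspect the buffers afterwards. A minimal sketch (the class name `UnderflowDemo` is mine, not from the question):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class UnderflowDemo {
    public static void main(String[] args) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // 0xC3 is the lead byte of a two-byte sequence; the continuation byte is missing
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {(byte) 0xc3});
        CharBuffer cb = CharBuffer.allocate(1);
        CoderResult res = decoder.decode(bb, cb, false);
        System.out.println(res);            // UNDERFLOW
        System.out.println(bb.remaining()); // 1 -- the lead byte was NOT consumed
        System.out.println(cb.position());  // 0 -- no character was produced yet
    }
}
```

So on underflow the incomplete sequence really does stay in the input buffer, and overwriting `b[0]` destroys it.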
- Yes, I also noticed this now. But it is strange that this implementation relies on me to copy unconsumed bytes to the new buffer. It also means that the buffer can't be shorter than the longest character decoded. In particular, this means that it is IMPOSSIBLE to decode byte by byte. – Suzan Cioc, Feb 10, 2013
- @SuzanCioc - not impossible. You just have to do it slightly differently. – Stephen C, Feb 10, 2013
- But how? The decoder won't accept one byte and won't remember it. So I am obliged to feed it 2 bytes (in the current case), which means I need at least a 2-byte buffer. There is no way to feed it byte by byte! – Suzan Cioc, Feb 10, 2013
- @SuzanCioc - Yes, you need a buffer with a capacity of up to 6 bytes. But you can still keep adding bytes one by one ... which should satisfy your higher-level requirement of byte-by-byte decoding. Think outside the box! – Stephen C, Feb 10, 2013
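The approach suggested in the comments can be sketched like this: keep a small ByteBuffer, append one input byte per iteration, and call `compact()` after each `decode` so that unconsumed bytes are preserved for the next call. This is my illustration, not code from any answerer; the 4-byte capacity assumes the modern UTF-8 maximum.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ByteByByteDecode {
    public static void main(String[] args) throws Exception {
        byte[] source = "æ€😂".getBytes(StandardCharsets.UTF_8);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bb = ByteBuffer.allocate(4); // longest UTF-8 sequence is 4 bytes
        CharBuffer cb = CharBuffer.allocate(2); // room for a surrogate pair
        StringBuilder out = new StringBuilder();
        for (byte value : source) {
            bb.put(value);   // write mode: append the next input byte
            bb.flip();       // switch to read mode for the decoder
            CoderResult r = decoder.decode(bb, cb, false);
            if (r.isError()) {
                r.throwException();
            }
            bb.compact();    // preserve unconsumed bytes, back to write mode
            cb.flip();
            out.append(cb);  // drain whatever characters were produced
            cb.clear();
        }
        System.out.println(out); // prints æ€😂
    }
}
```

The `flip()` / `compact()` pair is the key: `compact()` moves the incomplete tail of a multi-byte sequence to the front of the buffer instead of clobbering it.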
As has been said, UTF-8 uses a variable number of bytes per character (up to 4 under the current spec, though the original definition allowed up to 6). You need to add all the bytes of a character to the ByteBuffer before you decode. Try this:
```java
public static void main(String[] args) {
    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;
    byte[] source = new byte[] {(byte) 0xc3, (byte) 0xa6}; // LATIN SMALL LETTER AE in UTF-8
    byte[] b = new byte[2]; // two bytes for this char
    ByteBuffer bb = ByteBuffer.wrap(b);
    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);
    decoder.reset();
    b[0] = source[0];
    b[1] = source[1];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false); // translates 2 bytes to 1 char
    System.out.println(cb.remaining());  // prints 0
    System.out.println(cb.get(0));       // prints latin ae
}
```
- UTF-8 has anywhere from 1 to 6 bytes per character – Simon G., Feb 9, 2013
- How can I know in advance how many bytes I should allocate? Suppose I add one more byte, but it also turns out to be malformed. – Suzan Cioc, Feb 9, 2013
- Allocate for six bytes. As long as the CharsetDecoder can read at least one full character at a time, it'll be happy; it'll just leave the extra bytes in the ByteBuffer, where you should `compact` them. – Louis Wasserman, Feb 9, 2013
- @LouisWasserman @SimonG. You're both wrong. UTF-8 can contain max. 4 bytes per character. See this SO question or my blog post on this topic. – Stijn de Witt, Aug 8, 2014
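The 4-byte maximum mentioned in the last comment is easy to verify: since RFC 3629, UTF-8 sequences are capped at 4 bytes, and such a sequence decodes to a UTF-16 surrogate pair in Java. A quick check using only the standard library:

```java
import java.nio.charset.StandardCharsets;

public class MaxLengthCheck {
    public static void main(String[] args) {
        // U+1F602 (FACE WITH TEARS OF JOY) sits in a supplementary plane
        String s = "😂";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 4 -- the longest possible UTF-8 sequence
        System.out.println(s.length());  // 2 -- stored as a UTF-16 surrogate pair
    }
}
```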
Here is my solution. The following decodes a UTF-8 byte sequence byte by byte.
```java
public static void main(String[] args) throws CharacterCodingException {
    // The UTF-8 byte sequence that we'll decode
    ByteBuffer byteSequence = ByteBuffer.wrap(
            "Привет Hello 你好 こんにちは 안녕하세요,😂".getBytes(StandardCharsets.UTF_8)
    );
    StringBuilder decodeResult = new StringBuilder();
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    ByteBuffer decodeBufIn = ByteBuffer.allocate(4);
    CharBuffer decodeBufOut = CharBuffer.allocate(2);
    // Due to the awkward design of ByteBuffer, we maintain the write position ourselves
    int writePosition = 0;
    // Decode byte by byte
    while (byteSequence.remaining() > 0) {
        decodeBufIn.put(writePosition++, byteSequence.get());
        // Switch to read mode
        decodeBufIn.limit(writePosition);
        CoderResult r = decoder.decode(decodeBufIn, decodeBufOut, false);
        // Once the decoder produces output, consume it
        if (r.isUnderflow() || r.isOverflow()) {
            if (decodeBufOut.position() > 0) {
                decodeBufOut.flip();
                decodeResult.append(decodeBufOut);
                decodeBufOut.clear();
                decodeBufIn.clear();
                writePosition = 0;
            }
        } else {
            r.throwException();
        }
        // Switch back to write mode
        decodeBufIn.limit(decodeBufIn.capacity());
        if (writePosition >= decodeBufIn.capacity()) {
            throw new IllegalStateException("This should never occur!");
        }
    }
    System.out.println(decodeResult);
}
```
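One detail the snippets above gloss over: they always call `decode(..., false)`, which tells the decoder that more input may follow. To finish a decoding operation cleanly, you should make a final call with `endOfInput = true` (and then `flush()` on a normal underflow), so that a truncated trailing sequence is reported instead of being silently left pending. A minimal sketch of my own, not taken from any answer:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class FinishDecode {
    public static void main(String[] args) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // A lone lead byte: an incomplete sequence at the end of the input
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {(byte) 0xc3});
        CharBuffer cb = CharBuffer.allocate(1);
        CoderResult res = decoder.decode(bb, cb, true); // endOfInput = true
        if (res.isUnderflow()) {
            decoder.flush(cb); // drain any internal decoder state on normal completion
        }
        System.out.println(res); // MALFORMED[1] -- the truncation is now reported
    }
}
```

With `endOfInput = false` the same input would just report UNDERFLOW forever; the `true` on the final call is what turns a dangling partial character into an error you can handle.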