I am trying to decode UTF-8 byte by byte with a CharsetDecoder. Is this possible?
The following code
```java
public static void main(String[] args) {
    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;
    byte[] source = new byte[] {(byte) 0xc3, (byte) 0xa6}; // LATIN SMALL LETTER AE in UTF-8
    byte[] b = new byte[1];
    ByteBuffer bb = ByteBuffer.wrap(b);
    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);
    decoder.reset();
    b[0] = source[0];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false);
    System.out.println(res);
    System.out.println(cb.remaining());
    b[0] = source[1];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false);
    System.out.println(res);
    System.out.println(cb.remaining());
}
```
gives the following output:

```
UNDERFLOW
1
MALFORMED[1]
1
```
Why?
- @jlordo these reasons are offtopic in this question – Suzan Cioc, Feb 9, 2013
3 Answers
My theory is that the problem with the way you are doing it is that in the "underflow" condition, the decoder leaves any unconsumed bytes in the input buffer. At least, that is my reading.
Note this sentence in the javadoc:
"In any case, if this method is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation. "
But you are clobbering the (presumably) unread byte.
You should be able to check whether my theory / interpretation is correct by looking at how many bytes remain unconsumed in `bb` after the first `decode(...)` call.
If my theory is correct then the answer is that you cannot decode UTF-8 by providing the decoder with byte buffers containing exactly one byte. But you could implement byte-by-byte decoding by starting with a ByteBuffer containing one byte and adding extra bytes until the decoder succeeds in outputting a character. Just make sure that you don't clobber input bytes that haven't been consumed yet.
Note that decoding like this is not efficient. The API design is optimized for decoding a large number of bytes in one go.
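The theory is easy to check in isolation: hand the decoder a lone UTF-8 lead byte and inspect the buffers afterwards. A minimal sketch (the class name `UnderflowDemo` is mine, not from the question):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class UnderflowDemo {
    public static void main(String[] args) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // 0xC3 is the lead byte of a two-byte sequence; the continuation byte is missing
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {(byte) 0xc3});
        CharBuffer cb = CharBuffer.allocate(1);
        CoderResult res = decoder.decode(bb, cb, false);
        System.out.println(res);            // UNDERFLOW
        System.out.println(bb.remaining()); // 1 -- the lead byte was NOT consumed
        System.out.println(cb.position());  // 0 -- no character was produced yet
    }
}
```

So on underflow the incomplete sequence really does stay in the input buffer, and overwriting `b[0]` destroys it.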
- Yes, I also noticed this now. But it is strange that this implementation relies on me to copy unconsumed bytes to the new buffer. It also means that the buffer can't be shorter than the longest character decoded. In particular, this means that it is IMPOSSIBLE to decode byte by byte. – Suzan Cioc, Feb 10, 2013
- @SuzanCioc - not impossible. You just have to do it slightly differently. – Stephen C, Feb 10, 2013
- But how? The decoder won't accept one byte and won't remember it. So I am obliged to feed it 2 bytes (in the current case), which means I need at least a 2-byte buffer. There is no way to feed it byte by byte! – Suzan Cioc, Feb 10, 2013
- @SuzanCioc - Yes, you need a buffer with a capacity of up to 6 bytes. But you can still keep adding bytes one by one ... which should satisfy your higher-level requirement of byte-by-byte decoding. Think outside the box! – Stephen C, Feb 10, 2013
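The approach suggested in the comments can be sketched like this: keep a small ByteBuffer, append one input byte per iteration, and call `compact()` after each `decode` so that unconsumed bytes are preserved for the next call. This is my illustration, not code from any answerer; the 4-byte capacity assumes the modern UTF-8 maximum.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ByteByByteDecode {
    public static void main(String[] args) throws Exception {
        byte[] source = "æ€😂".getBytes(StandardCharsets.UTF_8);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bb = ByteBuffer.allocate(4); // longest UTF-8 sequence is 4 bytes
        CharBuffer cb = CharBuffer.allocate(2); // room for a surrogate pair
        StringBuilder out = new StringBuilder();
        for (byte value : source) {
            bb.put(value);   // write mode: append the next input byte
            bb.flip();       // switch to read mode for the decoder
            CoderResult r = decoder.decode(bb, cb, false);
            if (r.isError()) {
                r.throwException();
            }
            bb.compact();    // preserve unconsumed bytes, back to write mode
            cb.flip();
            out.append(cb);  // drain whatever characters were produced
            cb.clear();
        }
        System.out.println(out); // prints æ€😂
    }
}
```

The `flip()` / `compact()` pair is the key: `compact()` moves the incomplete tail of a multi-byte sequence to the front of the buffer instead of clobbering it.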
As has been said, UTF-8 uses a variable number of bytes per character (up to 4 under the current spec, though the original definition allowed up to 6). You need to add all the bytes of a character to the ByteBuffer before you decode. Try this:
```java
public static void main(String[] args) {
    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;
    byte[] source = new byte[] {(byte) 0xc3, (byte) 0xa6}; // LATIN SMALL LETTER AE in UTF-8
    byte[] b = new byte[2]; // two bytes for this char
    ByteBuffer bb = ByteBuffer.wrap(b);
    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);
    decoder.reset();
    b[0] = source[0];
    b[1] = source[1];
    bb.rewind();
    cb.rewind();
    res = decoder.decode(bb, cb, false); // translates 2 bytes to 1 char
    System.out.println(cb.remaining());  // prints 0
    System.out.println(cb.get(0));       // prints latin ae
}
```
- UTF-8 has anywhere from 1 to 6 bytes per character – Simon G., Feb 9, 2013
- How can I know in advance how many bytes I should allocate? Suppose I add one more byte, but it also turns out to be malformed. – Suzan Cioc, Feb 9, 2013
- Allocate for six bytes. As long as the CharsetDecoder can read at least one full character at a time, it'll be happy; it'll just leave the extra bytes in the ByteBuffer, where you should `compact` them. – Louis Wasserman, Feb 9, 2013
- @LouisWasserman @SimonG. You're both wrong. UTF-8 can contain max. 4 bytes per character. See this SO question or my blog post on this topic. – Stijn de Witt, Aug 8, 2014
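The 4-byte maximum mentioned in the last comment is easy to verify: since RFC 3629, UTF-8 sequences are capped at 4 bytes, and such a sequence decodes to a UTF-16 surrogate pair in Java. A quick check using only the standard library:

```java
import java.nio.charset.StandardCharsets;

public class MaxLengthCheck {
    public static void main(String[] args) {
        // U+1F602 (FACE WITH TEARS OF JOY) sits in a supplementary plane
        String s = "😂";
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 4 -- the longest possible UTF-8 sequence
        System.out.println(s.length());  // 2 -- stored as a UTF-16 surrogate pair
    }
}
```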
Here is my solution. The following decodes a UTF-8 byte sequence byte by byte.
```java
public static void main(String[] args) throws CharacterCodingException {
    // The UTF-8 byte sequence that we'll decode
    ByteBuffer byteSequence = ByteBuffer.wrap(
            "Привет Hello 你好 こんにちは 안녕하세요,😂".getBytes(StandardCharsets.UTF_8)
    );
    StringBuilder decodeResult = new StringBuilder();
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    ByteBuffer decodeBufIn = ByteBuffer.allocate(4);
    CharBuffer decodeBufOut = CharBuffer.allocate(2);
    // Due to the awkward design of ByteBuffer, we maintain the write position ourselves
    int writePosition = 0;
    // Decode byte by byte
    while (byteSequence.remaining() > 0) {
        decodeBufIn.put(writePosition++, byteSequence.get());
        // Switch to read mode
        decodeBufIn.limit(writePosition);
        CoderResult r = decoder.decode(decodeBufIn, decodeBufOut, false);
        // Once the decoder produces output, consume it
        if (r.isUnderflow() || r.isOverflow()) {
            if (decodeBufOut.position() > 0) {
                decodeBufOut.flip();
                decodeResult.append(decodeBufOut);
                decodeBufOut.clear();
                decodeBufIn.clear();
                writePosition = 0;
            }
        } else {
            r.throwException();
        }
        // Switch back to write mode
        decodeBufIn.limit(decodeBufIn.capacity());
        if (writePosition >= decodeBufIn.capacity()) {
            throw new IllegalStateException("This should never occur!");
        }
    }
    System.out.println(decodeResult);
}
```
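One detail the snippets above gloss over: they always call `decode(..., false)`, which tells the decoder that more input may follow. To finish a decoding operation cleanly, you should make a final call with `endOfInput = true` (and then `flush()` on a normal underflow), so that a truncated trailing sequence is reported instead of being silently left pending. A minimal sketch of my own, not taken from any answer:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class FinishDecode {
    public static void main(String[] args) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        // A lone lead byte: an incomplete sequence at the end of the input
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {(byte) 0xc3});
        CharBuffer cb = CharBuffer.allocate(1);
        CoderResult res = decoder.decode(bb, cb, true); // endOfInput = true
        if (res.isUnderflow()) {
            decoder.flush(cb); // drain any internal decoder state on normal completion
        }
        System.out.println(res); // MALFORMED[1] -- the truncation is now reported
    }
}
```

With `endOfInput = false` the same input would just report UNDERFLOW forever; the `true` on the final call is what turns a dangling partial character into an error you can handle.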