What is the maximum number of bytes for a single UTF-8 encoded character?
I'll be encrypting the bytes of a String encoded in UTF-8 and therefore need to be able to work out the maximum number of bytes for a UTF-8 encoded String.
Could someone confirm the maximum number of bytes for a single UTF-8 encoded character, please?
- You did look at common resources, such as Wikipedia's UTF-8 article, first ... right? – user166390, Mar 2, 2012 at 12:38
- I read several articles which gave mixed answers... I actually got the impression the answer was 3, so I'm very glad I asked. – Edd, Mar 2, 2012 at 12:43
- I will leave a YouTube link here, featuring Tom Scott's "Characters, Symbols, the Unicode Miracle": youtube.com/watch?v=MijmeoH9LT4. You get to hear and see how everything evolved from ASCII character encoding to UTF-8. – Roy Lee, Dec 24, 2015 at 11:36
- See also "Calculating length in UTF-8 of Java String without actually encoding it" for a length-computing code example. – Vadzim, May 16, 2019 at 17:57
6 Answers
The maximum number of bytes per character is 4, according to RFC 3629, which limited the character table to U+10FFFF:

"In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets."

(The original specification allowed for up to six-byte character codes, for code points past U+10FFFF.)
Characters with a code less than 128 will require 1 byte only, and the next 1920 character codes require 2 bytes only. Unless you are working with an esoteric language, multiplying the character count by 4 will be a significant overestimation.
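If you just need a safe buffer size, multiplying the code point count by 4 always suffices; encoding the string is the way to get the exact figure. A minimal Java sketch (the sample string is only illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    public static void main(String[] args) {
        String s = "héllo \uD83D\uDE00"; // mixes 1-, 2- and 4-byte UTF-8 sequences

        // Worst case: 4 bytes per Unicode code point is always enough.
        int codePoints = s.codePointCount(0, s.length());
        int upperBound = codePoints * 4;

        // Exact size, obtained by actually encoding the string.
        int exact = s.getBytes(StandardCharsets.UTF_8).length;

        System.out.println("code points = " + codePoints
                + ", worst case = " + upperBound
                + ", exact UTF-8 bytes = " + exact);
    }
}
```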
- What is an "esoteric language" for you? Any language that exists in the real world, or a text that switches between different languages of the world? Should the developer of a UTF-8-to-String function choose 2, 3 or 4 as the multiplier when over-allocating and then downsizing the result after the actual conversion? – Daniel Marschall, Jun 6, 2014 at 7:35
- @rinntech By "esoteric language" he means a language that has a lot of high-value Unicode chars (something from near the bottom of this list: unicode-table.com/en/sections). If you must over-allocate, choose 4. You could do a double pass, one to see how many bytes you'll need and allocate, then another to do the encoding; that may be better than allocating ~4 times the RAM needed. – matiu, Sep 10, 2014 at 19:36
- Always try to handle the worst case: hacker9.com/single-message-can-crash-whatsapp.html – Evgen Bodunov, Dec 23, 2015 at 7:51
- CJKV characters mostly take 3 bytes (with some rare/archaic characters taking 4 bytes), and calling them esoteric is a bit of a stretch (China alone is almost 20% of the world's population...). – Tgr, Feb 8, 2016 at 18:23
- Why was it limited to 4 when it was previously 6? What stops us from continuing the standard and having a lead byte of 11111111 and a 2^(6*7)-bit space for characters? – Aaron Franke, Oct 31, 2019 at 16:37
Without further context, I would say that the maximum number of bytes for a character in UTF-8 is
answer: 6 bytes
The author of the accepted answer correctly pointed this out as the "original specification". That was valid through RFC 2279. As J. Cocoe pointed out in the comments below, this changed in 2003 with RFC 3629, which limits UTF-8 to encoding 21 bits, which can be handled with an encoding scheme using four bytes.
answer if covering all unicode: 4 bytes
But in Java <= v7, they talk about a 3-byte maximum for representing Unicode with UTF-8? That's because the original Unicode specification only defined the Basic Multilingual Plane (BMP), i.e. it is an older version of Unicode, or a subset of modern Unicode. So
answer if representing only original unicode, the BMP: 3 bytes
But the OP talks about going the other way: not from characters to UTF-8 bytes, but from UTF-8 bytes to a "String" of bytes representation. Perhaps the author of the accepted answer got that from the context of the question, but it is not necessarily obvious, so it may confuse the casual reader of this question.
Going from UTF-8 to the native encoding, we have to look at how the "String" is implemented. Some languages, like Python >= 3, represent each character with integer code points, which allows for 4 bytes per character = 32 bits to cover the 21 we need for Unicode, with some waste. Why not exactly 21 bits? Because things are faster when they are byte-aligned. Some languages, like Python <= 2 and Java, represent characters using a UTF-16 encoding, which means that they have to use surrogate pairs to represent extended Unicode (outside the BMP). Either way, that's still 4 bytes maximum.
answer if going UTF-8 -> native encoding: 4 bytes
So, the final conclusion: 4 is the most common right answer, so we got it right. But mileage may vary.
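To make the surrogate-pair point concrete, here is a small Java sketch (the code point U+1F600 is just an arbitrary non-BMP example):

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F600)); // a code point outside the BMP

        System.out.println("code points : " + s.codePointCount(0, s.length())); // 1
        System.out.println("UTF-16 units: " + s.length());                      // 2 (surrogate pair)
        System.out.println("UTF-8 bytes : "
                + s.getBytes(StandardCharsets.UTF_8).length);                   // 4
    }
}
```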
- "this is still the current and correct specification, per wikipedia" -- not any more. Shortly after you wrote this (April 2nd edit), Wikipedia's UTF-8 article was changed to clarify that the 6-octet version isn't part of the current (2003) UTF-8 spec. – J. Cocoe, Aug 27, 2016 at 1:50
- "But, in Java <= v7, they talk about a 3-byte maximum for representing unicode with UTF-8? That's because the original unicode specification only defined the basic multi-lingual plane" -- That is probably the original reason, but it's not the whole story. Java uses "modified UTF-8", and one of the modifications is that it "uses its own two-times-three-byte format" instead of "the four-byte format of standard UTF-8" (their words). – J. Cocoe, Aug 27, 2016 at 1:52
- There are no code points allocated above the U+10FFFF (just over a million) limit, and many UTF-8 implementations never implemented sequences longer than 4 bytes (and some only 3, e.g. MySQL), so I would consider it safe to hard-limit to 4 bytes per code point even when considering compatibility with older implementations. You would just need to ensure you discard anything invalid on the way in. Note that matiu's recommendation of allocating after calculating the exact byte length is a good one where possible. – thomasrutter, May 25, 2017 at 4:53
- "... [U]nicode can represent up to x10FFFF code points. So, including 0, that means we can do it with these bytes: F FF FF, i.e. two-and-a-half bytes, or 20 bits." I believe this is a bit incorrect. The number of code points from 0x0 through 0x10FFFF would be 0x110000, which could be represented in 1F FF FF, or 21 bits. The 0x110000 number corresponds to the 17 planes of 0x10000 code points each. – neuralmer, Jan 24, 2018 at 19:08
- PSA: Wikipedia is not a real source. Look at the article's actual references. – Nyerguds, Mar 8, 2019 at 11:57
It depends on what you mean by "character":
- if you mean "code point", then the answer is 4 bytes.
- if you mean "grapheme" (which is what most people think of when they say "character"), then the answer is that there is no maximum.
Indeed, consider the family emoji '👨👩👧👦': this grapheme is represented in Unicode using 7 code points: U+1F468 (Man), U+200D (Zero Width Joiner), U+1F469 (Woman), U+200D + U+1F467 (Girl), U+200D + U+1F466 (Boy).
Each code point is encoded to UTF-8 using 1 to 4 bytes, and in this particular example they add up to 25 bytes! But it could be much larger, since you can combine arbitrarily many code points to form complex characters. This is the wonderful world of "Extended Grapheme Clusters".
English text uses slightly over 1 byte per grapheme on average, while Hindi uses just under 4 bytes per grapheme on average. And emojis typically use 4 to 12 bytes. In short: it depends!
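A quick Java sketch of that arithmetic (note that the grapheme count reported by BreakIterator depends on the JDK's Unicode segmentation rules, so older JDKs may split the ZWJ sequence into several clusters):

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;

public class GraphemeDemo {
    public static void main(String[] args) {
        // Family emoji: MAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY (7 code points)
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";

        System.out.println("code points: " + family.codePointCount(0, family.length()));     // 7
        System.out.println("UTF-8 bytes: " + family.getBytes(StandardCharsets.UTF_8).length); // 25

        // Count grapheme boundaries.
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(family);
        int graphemes = 0;
        while (it.next() != BreakIterator.DONE) {
            graphemes++;
        }
        System.out.println("graphemes  : " + graphemes); // ideally 1
    }
}
```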
So if you limit each character to a fixed number of bytes, you're infringing on people's freedom of speech, you're not family friendly 👨👩👧👦, you're not pirate friendly 🏴☠️ , you're vexillophobic 🇯🇵🇰🇷🇩🇪🇨🇳🇺🇸🇫🇷🇪🇸🇮🇹🇷🇺🇬🇧, and worst of all, you're anti-Zalgo: H̸͍͖̖̎̂́͠e̷̩̻̦̽̆̐͑̊́l̷̛̖̜̇̌̚͠l̷͖̮̮̞͂͊͋̃͆͜͝o̸̞̻̗͋͂̍̂͝.
- The word "grapheme" is used in several libraries and languages to designate a glyph represented in UTF-8 by several code points, or actually any glyph (an extended grapheme cluster like "🇬🇧", a single byte like "a", or a multi-byte sequence like "é"). Some documentation nowadays tends to avoid the word "character" in that context, as it's too ambiguous: it uses bytes / code points / graphemes instead. – Dereckson, Nov 11, 2024 at 23:18
- Thanks @Dereckson, I clarified my answer. – MiniQuark, Feb 9, 2025 at 10:23
The maximum number of bytes to support US-ASCII, a standard English alphabet encoding, is 1. But limiting text to English is becoming less desirable or practical as time goes by.
Unicode was designed to represent the glyphs of all human languages, as well as many kinds of symbols, with a variety of rendering characteristics. UTF-8 is an efficient encoding for Unicode, although still biased toward English. UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction.
While the maximum number of bytes per UTF-8 character is 3 for supporting just the 2-byte address space of Plane 0, the Basic Multilingual Plane (BMP), which can be accepted as minimal support in some applications, it is 4 for supporting all 17 current planes of Unicode (as of 2019). It should be noted that many popular "emoji" characters sit outside the BMP (mostly in Plane 1, the Supplementary Multilingual Plane), and so require 4 bytes.
However, this is just for basic character glyphs. There are also various modifiers, such as making accents appear over the previous character, and it is also possible to link together an arbitrary number of code points to construct one complex "grapheme". In real world programming, therefore, the use or assumption of a fixed maximum number of bytes per character will likely eventually result in a problem for your application.
These considerations imply that UTF-8 character strings should not be "expanded" into arrays of fixed length prior to processing, as has sometimes been done. Instead, programming should be done directly, using string functions specifically designed for UTF-8.
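As one illustration of that approach, a minimal Java sketch that walks a string code point by code point instead of expanding it into a fixed-width array (the utf8Length helper is only for demonstration):

```java
public class CodePointIteration {
    public static void main(String[] args) {
        String s = "héllo 👍";
        // Process code points directly; no fixed-width expansion needed.
        s.codePoints().forEach(cp ->
                System.out.printf("U+%04X needs %d UTF-8 byte(s)%n", cp, utf8Length(cp)));
    }

    // UTF-8 length of a single code point (1..4 bytes).
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }
}
```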
- Note: the paragraph about not using a fixed-width array of characters is my own opinion. I'm willing to edit this answer in response to comments. – David Spector, Jan 10, 2020 at 14:29
- Also note that Klingon is in Unicode too, so it's not just all human languages. As for your recommendation, it will all come down to what you're optimizing for and what benchmarks tell you. Sometimes it's faster to rip through a known number of bytes without conditional logic or branching; branching can harm performance severely. If you preprocessed it, you'd still have to do the branching, but at least the heavier computation would be ripping through contiguous memory with zero branches. If you want to optimize for space, it's not a good idea, though. – user904963, Mar 22, 2022 at 5:20
- Klingon is a human language, meaning that it was designed by Marc Okrand and other humans to achieve human purposes. Klingon is not an extraterrestrial language, since the planet Klingon does not exist. As to your apparent defense of the common practice of using six-byte arrays for internal handling of characters, we will have to agree to disagree. Such limits are bugs. – David Spector, Mar 23, 2022 at 11:15
- With UTF-8 encoding, the max number of bytes is 4. Depending on the symbols used, you can get away with 1 byte (e.g. English with punctuation) or 2 bytes (if you know there aren't emoji, Chinese, Japanese, etc.). The advantage of preprocessing comes into play more strongly if you run algorithms on the text multiple times. Otherwise, you will have a bunch of branching each time you run an algorithm (although your CPU's branch predictor will help a lot if the symbols used result in predictable branching). I didn't say preprocessing is better, only that it can be, and testing is needed. – user904963, Mar 24, 2022 at 4:23
- The minimum number of bytes needed when using a fixed-length array is 6 if you wish to encode emoji, which are quite popular these days. In my own coding, I have found that there is no need to program using fixed-length arrays at all. Whatever you are trying to do can probably be achieved using either byte-oriented programming or by obtaining the actual character length by scanning the UTF-8 bytes. – David Spector, Mar 25, 2022 at 13:00
The max value in Unicode is U+10FFFF which, encoded into UTF-8, outputs 4 bytes:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
But according to the original UTF-8 specification (RFC 2279), UTF-8 could support values up to U+7FFFFFFF, which outputs 6 bytes:
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
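In Java you can verify the 4-byte pattern for the highest code point directly (a quick sketch):

```java
import java.nio.charset.StandardCharsets;

public class MaxCodePoint {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10FFFF)); // highest valid code point
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            String bits = String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
            System.out.print(bits + " "); // prints 11110100 10001111 10111111 10111111
        }
        System.out.println("(" + utf8.length + " bytes)");
    }
}
```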
Considering just technical limitations, it's possible to have up to 7 bytes following the current UTF-8 encoding scheme. According to it, if the first byte is not a self-sufficient ASCII character, then it should have the pattern 1(n)0X(7-n), where n <= 7.
Theoretically it could even be 8, but then the first byte would have no zero bit at all. Other aspects, like continuation bytes differing from leading ones, would still be there (allowing error detection), but I have heard that the byte 11111111 could be invalid; I can't be sure about that.
The limitation to a maximum of 4 bytes is most likely there for compatibility with UTF-16, which I tend to consider legacy, because the only quality in which it excels is processing speed, and only if the string byte order matches (i.e. we read 0xFEFF in the BOM).
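To illustrate the lead-byte pattern described above, a tiny hypothetical sketch (a bit-counting illustration only, not a validating decoder; lengths above 4 are outside the current RFC 3629 limit):

```java
public class LeadByte {
    // Sequence length implied by a UTF-8 lead byte: count its leading 1-bits
    // (no leading ones -> single-byte ASCII; n >= 2 -> an n-byte sequence).
    static int sequenceLength(int leadByte) {
        int n = 0;
        for (int mask = 0x80; mask != 0 && (leadByte & mask) != 0; mask >>= 1) {
            n++;
        }
        return n == 0 ? 1 : n;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0x41)); // 'A' -> 1
        System.out.println(sequenceLength(0xF4)); // 11110100 -> 4 (current maximum)
        System.out.println(sequenceLength(0xFE)); // 11111110 -> 7 (hypothetical extension)
    }
}
```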