I am looking for guidance and optimization pointers for my custom JavaScript function, which counts the bytes in a string rather than just the characters. The website uses UTF-8, and I need to maintain IE8 compatibility.
/**
* Count bytes in string
*
* Count and return the number of bytes in a given string
*
* @access public
* @param string
* @return int
*/
function getByteLen(normal_val)
{
// Force string type
normal_val = String(normal_val);
// Split original string into array
var normal_pieces = normal_val.split('');
// Get length of original array
var normal_length = normal_pieces.length;
// Declare array for encoded normal array
var encoded_pieces = new Array();
// Declare array for individual byte pieces
var byte_pieces = new Array();
// Loop through normal pieces and convert to URL friendly format
for(var i = 0; i <= normal_length; i++)
{
if(normal_pieces[i] && normal_pieces[i] != '')
{
encoded_pieces[i] = encodeURI(normal_pieces[i]);
}
}
// Get length of encoded array
var encoded_length = encoded_pieces.length;
// Loop through encoded array
// Scan individual items for a %
// Split on % and add to byte array
// If no % exists then add to byte array
for(var i = 0; i <= encoded_length; i++)
{
if(encoded_pieces[i] && encoded_pieces[i] != '')
{
// % exists
if(encoded_pieces[i].indexOf('%') != -1)
{
// Split on %
var split_code = encoded_pieces[i].split('%');
// Get length
var split_length = split_code.length;
// Loop through pieces
for(var j = 0; j <= split_length; j++)
{
if(split_code[j] && split_code[j] != '')
{
// Push to byte array
byte_pieces.push(split_code[j]);
}
}
}
else
{
// No percent
// Push to byte array
byte_pieces.push(encoded_pieces[i]);
}
}
}
// Array length is the number of bytes in string
var byte_length = byte_pieces.length;
return byte_length;
}
- Here is an independent and efficient method to count UTF-8 bytes of a string. Note that the method may throw an error if the input string is malformed UCS-2. – fuweichin, Jan 21, 2016 at 10:14
3 Answers
It would be a lot simpler to work out the length yourself rather than to interpret the results of encodeURI().
/**
* Count bytes in a string's UTF-8 representation.
*
* @param string
* @return int
*/
function getByteLen(normal_val) {
// Force string type
normal_val = String(normal_val);
var byteLen = 0;
for (var i = 0; i < normal_val.length; i++) {
var c = normal_val.charCodeAt(i);
byteLen += (c & 0xf800) == 0xd800 ? 2 : // Code unit is half of a surrogate pair
c < (1 << 7) ? 1 :
c < (1 << 11) ? 2 : 3;
}
return byteLen;
}
JavaScript implementations may use either UCS-2 or UTF-16 to represent strings.
UCS-2 only supports Unicode code points up to U+FFFF, and such Unicode characters occupy 1, 2, or 3 bytes in their UTF-8 representation. This is not too tricky to handle.
However, as @Mac points out, UTF-16 surrogate pairs are a tricky special case. UTF-16 extends UCS-2 by adding support for code points U+10000 to U+10FFFF, each of which it encodes using a pair of 16-bit code units. The first code unit of such a pair (called the "high surrogate") is in the range D800 to DBFF; it should always be followed by another code unit (called the "low surrogate") in the range DC00 to DFFF. Observe that the UTF-8 representation of any character in the range U+10000 to U+10FFFF takes 4 bytes. Therefore, any surrogate pair in UTF-16 translates to a 4-byte UTF-8 representation. Put another way, whenever we encounter half of a surrogate pair (i.e., a code unit in the range D800 to DFFF), we can just add two bytes to the UTF-8 length.
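As a rough illustration of that reasoning, the sketch below consumes a well-formed surrogate pair as a single unit and counts it as 4 bytes. The name utf8LengthPaired is hypothetical and not part of the code above. It also assumes that an unpaired surrogate should count as 3 bytes, which is what encoders that substitute U+FFFD would produce; for well-formed strings the total is the same as adding 2 bytes per surrogate half.

```javascript
// Hypothetical helper: counts UTF-8 bytes by pairing surrogates explicitly.
function utf8LengthPaired(value) {
  var s = String(value);
  var byteLen = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    // High surrogate followed by a low surrogate: one code point above U+FFFF
    if (c >= 0xd800 && c <= 0xdbff && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        byteLen += 4;
        i++; // skip the low surrogate, it has been counted already
        continue;
      }
    }
    // Everything else, including an unpaired surrogate (assumed to cost 3 bytes)
    byteLen += c < 0x80 ? 1 : c < 0x800 ? 2 : 3;
  }
  return byteLen;
}
```

With this sketch, utf8LengthPaired('😀') gives 4, matching the expected UTF-8 length of U+1F600.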
- Nice, I saw similar but less clean code on the site I linked to, +1 – konijn, Dec 17, 2013 at 13:23
- Good question! The code at forrst.com is bogus. Although ceil(log_256(charCode)) tells you the number of bytes it would take to represent charCode, there's nothing about UTF-8 in their byteLength() function. UTF-8 is a variable-length encoding scheme, and the few most-significant bits of every byte are necessary to indicate how many bytes form each character. Since any variable-length encoding scheme will have such padding, their byteLength() function gives a wrong answer for any encoding, including UTF-8. – 200_success, Dec 17, 2013 at 22:44
- The 4-byte limit for UTF-8 derives from the decision to cap Unicode code points to U+10FFFF. However, it takes no additional effort to add two more cases, so I would code defensively. – 200_success, Dec 18, 2013 at 17:22
- getByteLength( '😀' ) returns 6, but should be 4. – Mac, May 15, 2017 at 16:21
- @Mac Addressed your bug report in Rev 2! – 200_success, Nov 10, 2020 at 11:48
My 2 cents
- Please do not abbreviate words, choose short words or acronyms ( Len -> Length )
- Please lower camel case ( normal_val -> normalValue )
- Consider using spartan conventions ( s -> generic string )
- new Array() is considered old skool, consider var byte_pieces = []
- You are using byte_pieces to track the bytes just to get the length, you could have just kept track of the length, this would be more efficient
- I am not sure what abnormal pieces would be here:
if(normal_pieces[i] && normal_pieces[i] != '')
- You check again for these here, probably not needed:
if(encoded_pieces[i] && encoded_pieces[i] != '')
- You could just do
return byte_pieces.length
instead of
// Array length is the number of bytes in string
var byte_length = byte_pieces.length;
return byte_length;
All that together, I would counter propose something like this:
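function getByteCount( s )
{
var count = 0, stringLength, i;
// Coerce to a string before reading its length, so null or undefined input does not throw
s = String( s || "" );
stringLength = s.length;
for( i = 0 ; i < stringLength ; i++ )
{
// For an escaped character, encodeURI() yields one %XX group per UTF-8 byte
var partCount = encodeURI( s[i] ).split("%").length;
count += partCount==1?1:partCount-1;
}
return count;
}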
getByteCount("i ♥ js");
getByteCount("abc def");
You could get the sum by using .reduce(), I leave that as an exercise to the reader.
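One possible shape for that exercise is sketched below. The name getByteCountReduce is made up for illustration, and the sketch assumes Array.prototype.reduce is available, which is not the case in IE8 without a polyfill. Like the loop version, it encodes one UTF-16 code unit at a time, so encodeURI() will throw a URIError for characters outside the Basic Multilingual Plane.

```javascript
// Hypothetical reduce-based variant of getByteCount (needs ES5 reduce).
function getByteCountReduce(s) {
  s = String(s || "");
  return s.split("").reduce(function (count, ch) {
    // An unescaped character stays as-is (1 part); an escaped one
    // becomes "%XX" groups, giving (bytes + 1) parts after the split.
    var partCount = encodeURI(ch).split("%").length;
    return count + (partCount === 1 ? 1 : partCount - 1);
  }, 0);
}
```

Calling getByteCountReduce("i ♥ js") gives 8, and getByteCountReduce("abc def") gives 7.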
Finally, if you are truly concerned about performance, there are some very fancy performant js libraries out there.
- Thank you so much, looks like a lot of good stuff in your post. I will give them a go and see if I can get better performance numbers. I am not overly concerned about performance, but my original code took ~6 seconds for 1200 iterations of 2400 Euro signs, reduced by one char per iteration until I hit 1200, for my enforceMaxByteLength script, and this code took ~3.8, so hopefully I can shave off a bit more. – MonkeyZeus, Dec 16, 2013 at 18:51
- Your counter proposition is genius, it shaved another .6 seconds off my benchmark, thank you. – MonkeyZeus, Dec 16, 2013 at 19:14
You can try this:
var b = str.match(/[^\x00-\xff]/g);
return (str.length + (!b ? 0: b.length));
It worked for me.
- This only works for strings that consist solely of code points up to U+03FF. It fails to account for any Unicode characters whose UTF-8 representation requires 3 or more bytes. – 200_success, Nov 10, 2020 at 11:54