Count byte length of string

Question 1

I am looking for some guidance and optimization pointers for my custom JavaScript function which counts the bytes in a string rather than just chars. The website uses UTF-8 and I am looking to maintain IE8 compatibility.

/**
 * Count bytes in string
 *
 * Count and return the number of bytes in a given string
 *
 * @access public
 * @param string
 * @return int
 */
function getByteLen(normal_val)
{
    // Force string type
    normal_val = String(normal_val);
    // Split original string into array
    var normal_pieces = normal_val.split('');
    // Get length of original array
    var normal_length = normal_pieces.length;
    // Declare array for encoded normal array
    var encoded_pieces = new Array();
    // Declare array for individual byte pieces
    var byte_pieces = new Array();
    // Loop through normal pieces and convert to URL friendly format
    for(var i = 0; i <= normal_length; i++)
    {
        if(normal_pieces[i] && normal_pieces[i] != '')
        {
            encoded_pieces[i] = encodeURI(normal_pieces[i]);
        }
    }
    // Get length of encoded array
    var encoded_length = encoded_pieces.length;
    // Loop through encoded array
    // Scan individual items for a %
    // Split on % and add to byte array
    // If no % exists then add to byte array
    for(var i = 0; i <= encoded_length; i++)
    {
        if(encoded_pieces[i] && encoded_pieces[i] != '')
        {
            // % exists
            if(encoded_pieces[i].indexOf('%') != -1)
            {
                // Split on %
                var split_code = encoded_pieces[i].split('%');
                // Get length
                var split_length = split_code.length;
                // Loop through pieces
                for(var j = 0; j <= split_length; j++)
                {
                    if(split_code[j] && split_code[j] != '')
                    {
                        // Push to byte array
                        byte_pieces.push(split_code[j]);
                    }
                }
            }
            else
            {
                // No percent
                // Push to byte array
                byte_pieces.push(encoded_pieces[i]);
            }
        }
    }
    // Array length is the number of bytes in string
    var byte_length = byte_pieces.length;
    return byte_length;
}

Question 2

Here is an independent and efficient method to count the UTF-8 bytes of a string. Note that the method may throw an error if the input string is malformed UCS-2.

Question 3

It would be a lot simpler to work out the length yourself rather than to interpret the results of encodeURI().

/**
 * Count bytes in a string's UTF-8 representation.
 *
 * @param string
 * @return int
 */
function getByteLen(normal_val) {
    // Force string type
    normal_val = String(normal_val);
    var byteLen = 0;
    for (var i = 0; i < normal_val.length; i++) {
        var c = normal_val.charCodeAt(i);
        byteLen += (c & 0xf800) == 0xd800 ? 2 :  // Code unit is half of a surrogate pair
                   c < (1 << 7) ? 1 :
                   c < (1 << 11) ? 2 : 3;
    }
    return byteLen;
}

JavaScript implementations may use either UCS-2 or UTF-16 to represent strings.

UCS-2 only supports Unicode code points up to U+FFFF, and such Unicode characters occupy 1, 2, or 3 bytes in their UTF-8 representation. This is not too tricky to handle.

However, as @Mac points out, UTF-16 surrogate pairs are a tricky special case. UTF-16 extends UCS-2 by adding support for code points U+10000 to U+10FFFF, which UTF-16 encodes using a pair of 16-bit code units. The first code unit of such a pair (called the "high surrogate") is in the range D800 to DBFF; it should always be followed by another code unit (called the "low surrogate") in the range DC00 to DFFF. Observe that the UTF-8 representation of any character in the range U+10000 to U+10FFFF takes 4 bytes. Therefore, any surrogate pair in UTF-16 translates to a 4-byte UTF-8 representation. Put another way, whenever we encounter half of a surrogate pair (i.e., a code unit in the range D800 to DFFF), we just add two bytes to the UTF-8 length.
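
To make the surrogate rule concrete, here is a small sketch that is not part of the original answer. The helper name describeCodeUnits is made up for illustration; it lists each UTF-16 code unit of a string in hex together with the number of UTF-8 bytes the counting rule attributes to it.

```javascript
// Hypothetical helper (not from the answer): report, for each UTF-16 code
// unit, the number of UTF-8 bytes that the counting rule assigns to it.
function describeCodeUnits(s) {
    var result = [];
    for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        var bytes = (c & 0xf800) == 0xd800 ? 2 :  // half of a surrogate pair
                    c < (1 << 7) ? 1 :
                    c < (1 << 11) ? 2 : 3;
        result.push(c.toString(16).toUpperCase() + ':' + bytes);
    }
    return result.join(' ');
}

// U+1F600 is stored as the surrogate pair D83D DE00, so it counts 2 + 2 bytes.
describeCodeUnits('a\u20AC\uD83D\uDE00');  // "61:1 20AC:3 D83D:2 DE00:2"
```

Note that a lone surrogate is still counted as 2 bytes under this rule, whereas an encoder such as TextEncoder substitutes U+FFFD (3 bytes) for it, so the two can disagree on malformed input.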

Question 4

Nice, I saw similar but less clean code on the site I linked to, +1

Question 5

Good question! The code at forrst.com is bogus. Although ceil(log_256(charCode)) tells you the number of bytes it would take to represent charCode, there's nothing about UTF-8 in their byteLength() function. UTF-8 is a variable-length encoding scheme, and the few most-significant bits of every byte are necessary to indicate how many bytes form each character. Since any variable-length encoding scheme will have such padding, their byteLength() function gives a wrong answer for any encoding, including UTF-8.
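
To illustrate the difference, here is a hedged sketch. rawByteCount is only my reconstruction of the ceil(log_256(charCode)) formula named above, not the actual forrst.com code, and both function names are made up for this example.

```javascript
// Hypothetical reconstruction of a ceil(log_256(charCode)) byte count;
// this is NOT the code from forrst.com, only the formula named above.
function rawByteCount(charCode) {
    return charCode < 2 ? 1 : Math.ceil(Math.log(charCode) / Math.log(256));
}

// UTF-8 length of a single code point in the Basic Multilingual Plane.
function utf8ByteCount(charCode) {
    return charCode < 0x80 ? 1 :
           charCode < 0x800 ? 2 : 3;
}

var euro = 0x20AC;       // U+20AC EURO SIGN
rawByteCount(euro);      // 2: the value 0x20AC fits in two raw bytes
utf8ByteCount(euro);     // 3: UTF-8 spends bits in every byte on length markers
```

A 2-byte UTF-8 sequence carries only 11 payload bits and a 3-byte sequence only 16, which is why the raw count comes out too low.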

Question 6

The 4-byte limit for UTF-8 derives from the decision to cap Unicode code points to U+10FFFF. However, it takes no additional effort to add two more cases, so I would code defensively.
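
As a sketch of what the two extra cases could look like (the function name utf8LengthOfCodePoint is illustrative, and it takes a full code point rather than a UTF-16 code unit):

```javascript
// Sketch: UTF-8 length of one code point, including the 5- and 6-byte
// forms from the original UTF-8 definition (no longer valid for Unicode).
function utf8LengthOfCodePoint(cp) {
    return cp < 0x80      ? 1 :
           cp < 0x800     ? 2 :
           cp < 0x10000   ? 3 :
           cp < 0x200000  ? 4 :
           cp < 0x4000000 ? 5 : 6;
}

utf8LengthOfCodePoint(0x10FFFF);   // 4: the highest Unicode code point
utf8LengthOfCodePoint(0x200000);   // 5: beyond the Unicode range
```

Since a surrogate pair cannot encode anything above U+10FFFF, the last two branches are unreachable for values derived from a JavaScript string; they only matter if code points arrive from elsewhere.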

Question 7

getByteLength( '😀' ) returns 6, but should be 4.
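
For anyone wondering where the 6 comes from, here is a sketch that assumes the earlier revision simply lacked a surrogate-pair branch (the earlier revision itself is not shown here, so this is a guess, and the function name is made up):

```javascript
// Assumed shape of the earlier revision: no surrogate-pair branch.
function getByteLenNoSurrogates(s) {
    s = String(s);
    var byteLen = 0;
    for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        byteLen += c < (1 << 7) ? 1 :
                   c < (1 << 11) ? 2 : 3;
    }
    return byteLen;
}

// U+1F600 is two code units; each is >= 0x800, so each is counted as 3 bytes.
getByteLenNoSurrogates('\uD83D\uDE00');  // 6
```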

Question 8

@Mac Addressed your bug report in Rev 2!

Question 9

My 2 cents

Please do not abbreviate words; choose short words or acronyms instead ( Len -> Length )
Please use lowerCamelCase ( normal_val -> normalValue )
Consider using Spartan conventions ( s -> generic string )
new Array() is considered old skool; consider var byte_pieces = []
You are using byte_pieces to track the bytes just to get the length; you could have just kept track of the length, which would be more efficient
I am not sure what abnormal pieces would be here:

if(normal_pieces[i] && normal_pieces[i] != '')

You check again for these here, probably not needed:

if(encoded_pieces[i] && encoded_pieces[i] != '')

You could just do return byte_pieces.length instead of

// Array length is the number of bytes in string
var byte_length = byte_pieces.length;
return byte_length;

All that together, I would counter propose something like this:

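function getByteCount( s )
{
    // Coerce to a string before reading its length
    s = String( s || "" );
    var count = 0, stringLength = s.length, i;
    for( i = 0 ; i < stringLength ; i++ )
    {
        // charAt() rather than s[i], which is unreliable in old IE
        // Note: encodeURI() throws a URIError on half of a surrogate pair
        var partCount = encodeURI( s.charAt( i ) ).split("%").length;
        count += partCount==1?1:partCount-1;
    }
    return count;
}
getByteCount("i ♥ js"); // 8
getByteCount("abc def"); // 7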

You could get the sum by using .reduce(), I leave that as an exercise to the reader.
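
One possible take on that exercise, as a sketch only (the name getByteCountReduce is illustrative). Array.prototype.reduce is an ES5 method, so IE8 would need a polyfill for this version.

```javascript
// Sketch of the reduce() variant. Array.prototype.reduce is ES5, so IE8
// would need a polyfill.
function getByteCountReduce(s) {
    return String(s || "").split("").reduce(function (count, ch) {
        // "%E2%99%A5".split("%") has one more part than there are bytes
        var partCount = encodeURI(ch).split("%").length;
        return count + (partCount == 1 ? 1 : partCount - 1);
    }, 0);
}

getByteCountReduce("i \u2665 js");  // 8
getByteCountReduce("abc def");      // 7
```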

Finally, if you are truly concerned about performance, there are some very fancy performant js libraries out there.

Question 10

Thank you so much, looks like a lot of good stuff in your post. I will give them a go and see if I can get better performance numbers. I am not overly concerned about performance, but for my enforceMaxByteLength script my original code took ~6 seconds for 1200 iterations over 2400 Euro signs, reduced by one char per iteration until I hit 1200. This code took ~3.8 seconds, so hopefully I can shave off a bit more.
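
A rough timing harness along those lines might look like the sketch below. It only assumes the benchmark shape described in this comment; the actual enforceMaxByteLength script is not shown, and runBenchmark and countBytes are made-up names.

```javascript
// Rough timing harness, assuming the benchmark shape described above.
// countBytes stands in for whichever byte counter is being measured.
function runBenchmark(countBytes) {
    var s = new Array(2400 + 1).join('\u20AC');  // 2400 Euro signs
    var calls = 0;
    var start = new Date().getTime();            // Date.now() is missing in IE8
    while (s.length > 1200) {
        countBytes(s);
        s = s.substring(0, s.length - 1);        // drop one character per iteration
        calls++;
    }
    return { calls: calls, ms: new Date().getTime() - start };
}

// Example: time a trivial counter that assumes every character is a Euro sign.
var result = runBenchmark(function (s) { return s.length * 3; });
```

Here result.calls is the number of iterations and result.ms the elapsed wall-clock time, which will vary by browser and machine.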

Question 11

Your counter proposition is genius, it shaved another .6 seconds off my benchmark, thank you.

Question 12

You can try this:

function getByteLen(str) {
    var b = str.match(/[^\x00-\xff]/g);
    return (str.length + (!b ? 0: b.length));
}

It worked for me.

Question 13

This is only correct for code points up to U+007F and from U+0100 to U+07FF. It undercounts U+0080 to U+00FF, which need 2 bytes in UTF-8, and U+0800 to U+FFFF, which need 3 bytes.
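
A quick way to see the problem, with the two lines from the answer wrapped in a function (the name doubleByteLen is only for illustration):

```javascript
// The answer's two lines wrapped in a function (name is illustrative).
function doubleByteLen(str) {
    var b = str.match(/[^\x00-\xff]/g);
    return str.length + (!b ? 0 : b.length);
}

doubleByteLen('abc');     // 3, matches UTF-8
doubleByteLen('\u03A9');  // 2, matches UTF-8 (U+03A9 needs 2 bytes)
doubleByteLen('\u20AC');  // 2, but UTF-8 needs 3 bytes for U+20AC
```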

Accepted answer (the UTF-8 length calculation under Question 3) by 200_success, 2013-12-16 23:05:20Z

