17
\$\begingroup\$

I am looking for some guidance and optimization pointers for my custom JavaScript function which counts the bytes in a string rather than just chars. The website uses UTF-8 and I am looking to maintain IE8 compatibility.

/**
 * Count bytes in string
 *
 * Count and return the number of bytes in a given string
 *
 * @access public
 * @param string
 * @return int
 */
function getByteLen(normal_val)
{
 // Force string type
 normal_val = String(normal_val);
 // Split original string into array
 var normal_pieces = normal_val.split('');
 // Get length of original array
 var normal_length = normal_pieces.length;
 // Declare array for encoded normal array
 var encoded_pieces = new Array();
 // Declare array for individual byte pieces
 var byte_pieces = new Array();
 // Loop through normal pieces and convert to URL friendly format
 for(var i = 0; i <= normal_length; i++)
 {
 if(normal_pieces[i] && normal_pieces[i] != '')
 {
 encoded_pieces[i] = encodeURI(normal_pieces[i]);
 }
 }
 // Get length of encoded array
 var encoded_length = encoded_pieces.length;
 // Loop through encoded array
 // Scan individual items for a %
 // Split on % and add to byte array
 // If no % exists then add to byte array
 for(var i = 0; i <= encoded_length; i++)
 {
 if(encoded_pieces[i] && encoded_pieces[i] != '')
 {
 // % exists
 if(encoded_pieces[i].indexOf('%') != -1)
 {
 // Split on %
 var split_code = encoded_pieces[i].split('%');
 // Get length
 var split_length = split_code.length;
 // Loop through pieces
 for(var j = 0; j <= split_length; j++)
 {
 if(split_code[j] && split_code[j] != '')
 {
 // Push to byte array
 byte_pieces.push(split_code[j]);
 }
 }
 }
 else
 {
 // No percent
 // Push to byte array
 byte_pieces.push(encoded_pieces[i]);
 }
 }
 }
 // Array length is the number of bytes in string
 var byte_length = byte_pieces.length;
 return byte_length;
}
asked Dec 16, 2013 at 16:22
\$\endgroup\$
1
  • \$\begingroup\$ Here is an independent and efficient method to count the UTF-8 bytes of a string. Note that the method may throw an error if the input string is malformed UCS-2. \$\endgroup\$ Commented Jan 21, 2016 at 10:14

3 Answers

15
\$\begingroup\$

It would be a lot simpler to work out the length yourself rather than to interpret the results of encodeURI().

/**
 * Count bytes in a string's UTF-8 representation.
 *
 * @param string
 * @return int
 */
function getByteLen(normal_val) {
 // Force string type
 normal_val = String(normal_val);
 var byteLen = 0;
 for (var i = 0; i < normal_val.length; i++) {
 var c = normal_val.charCodeAt(i);
 byteLen += (c & 0xf800) == 0xd800 ? 2 : // Code point is half of a surrogate pair
 c < (1 << 7) ? 1 :
 c < (1 << 11) ? 2 : 3;
 }
 return byteLen;
}

JavaScript implementations may use either UCS-2 or UTF-16 to represent strings.

UCS-2 only supports Unicode code points up to U+FFFF, and such Unicode characters occupy 1, 2, or 3 bytes in their UTF-8 representation. This is not too tricky to handle.

However, as @Mac points out, UTF-16 surrogate pairs are a tricky special case. UTF-16 extends UCS-2 by adding support for code points U+10000 to U+10FFFF, which it encodes using a pair of 16-bit code units. The first code unit of such a pair (called the "high surrogate") is in the range D800 to DBFF; it should always be followed by another code unit (called the "low surrogate") in the range DC00 to DFFF. Observe that the UTF-8 representation of any character in the range U+10000 to U+10FFFF takes 4 bytes. Therefore, any surrogate pair in UTF-16 translates to a 4-byte UTF-8 representation. Put another way, whenever we encounter half of a surrogate pair (i.e., a code unit in the range D800 to DFFF), we just add two bytes to the UTF-8 length.
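
As a hedged illustration of the surrogate-pair reasoning above, here is a sketch that walks the string by code point instead of adding two bytes per surrogate half. The name `utf8LengthByCodePoint` is hypothetical. Counting an unpaired surrogate as 3 bytes (the size of the replacement character U+FFFD) is an assumption that is not part of this answer; it mirrors what encoders such as `TextEncoder` do with malformed input.

```javascript
// Sketch only: utf8LengthByCodePoint is a hypothetical name, not from the answer.
function utf8LengthByCodePoint(value) {
  var str = String(value);
  var byteLen = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    // High surrogate followed by a low surrogate: one code point above U+FFFF
    if (c >= 0xd800 && c <= 0xdbff && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        byteLen += 4; // U+10000..U+10FFFF takes 4 bytes in UTF-8
        i++;          // skip the low surrogate that was just consumed
        continue;
      }
    }
    if (c < 0x80) {
      byteLen += 1;   // ASCII
    } else if (c < 0x800) {
      byteLen += 2;   // U+0080..U+07FF
    } else {
      byteLen += 3;   // rest of the BMP; unpaired surrogates too (assumption)
    }
  }
  return byteLen;
}
```

For well-formed strings this should agree with the two-bytes-per-half shortcut; the two only differ when a surrogate appears without its partner.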

answered Dec 16, 2013 at 23:05
\$\endgroup\$
9
  • \$\begingroup\$ Nice, I saw similar but less clean code on the site I linked to, +1 \$\endgroup\$ Commented Dec 17, 2013 at 13:23
  • 2
    \$\begingroup\$ Good question! The code at forrst.com is bogus. Although ceil(log_256(charCode)) tells you the number of bytes it would take to represent charCode, there's nothing about UTF-8 in their byteLength() function. UTF-8 is a variable-length encoding scheme, and the few most-significant bits of every byte are necessary to indicate how many bytes form each character. Since any variable-length encoding scheme will have such padding, their byteLength() function gives a wrong answer for any encoding, including UTF-8. \$\endgroup\$ Commented Dec 17, 2013 at 22:44
  • 1
    \$\begingroup\$ The 4-byte limit for UTF-8 derives from the decision to cap Unicode code points to U+10FFFF. However, it takes no additional effort to add two more cases, so I would code defensively. \$\endgroup\$ Commented Dec 18, 2013 at 17:22
  • 2
    \$\begingroup\$ getByteLength( '😀' ) returns 6, but should be 4. \$\endgroup\$ Commented May 15, 2017 at 16:21
  • 2
    \$\begingroup\$ @Mac Addressed your bug report in Rev 2! \$\endgroup\$ Commented Nov 10, 2020 at 11:48
8
\$\begingroup\$

My 2 cents

  • Please do not abbreviate words; choose short words or acronyms instead ( Len -> Length )
  • Please use lowerCamelCase ( normal_val -> normalValue )
  • Consider using spartan conventions ( s -> generic string )
  • new Array() is considered old skool, consider var byte_pieces = []
  • You are using byte_pieces to track the bytes just to get the length, you could have just kept track of the length, this would be more efficient
  • I am not sure what abnormal pieces would be here:

if(normal_pieces[i] && normal_pieces[i] != '')

  • You check again for these here, probably not needed:

if(encoded_pieces[i] && encoded_pieces[i] != '')

  • You could just do return byte_pieces.length instead of
// Array length is the number of bytes in string
var byte_length = byte_pieces.length;
return byte_length;

Putting all that together, I would counter-propose something like this:

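function getByteCount( s )
{
 var count = 0, stringLength, i;
 // Coerce to a string before reading its length
 s = String( s || "" );
 stringLength = s.length;
 for( i = 0 ; i < stringLength ; i++ )
 {
 // charAt() rather than s[i], which IE7 and IE8 compatibility mode do not support
 var partCount = encodeURI( s.charAt(i) ).split("%").length;
 count += partCount==1?1:partCount-1;
 }
 return count;
}
getByteCount("i ♥ js"); // 8
getByteCount("abc def"); // 7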

You could get the sum by using .reduce(), I leave that as an exercise to the reader.
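
One possible take on that exercise is sketched below. The name `getByteCountReduce` is hypothetical. Note two assumptions: `Array.prototype.reduce` is an ES5 method, so IE8 would need a polyfill, and like the loop version this calls `encodeURI()` on one UTF-16 code unit at a time, which throws a `URIError` for either half of a surrogate pair.

```javascript
// Sketch only: getByteCountReduce is a hypothetical name.
// Assumes Array.prototype.reduce is available (ES5; not native in IE8).
function getByteCountReduce(s) {
  var chars = String(s).split('');
  return chars.reduce(function (count, ch) {
    // encodeURI() emits one "%XX" group per UTF-8 byte for escaped characters
    var partCount = encodeURI(ch).split('%').length;
    // No "%" means an unescaped single-byte character
    return count + (partCount === 1 ? 1 : partCount - 1);
  }, 0);
}
```

The initial value `0` passed to `reduce()` keeps the empty-string case from throwing.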

Finally, if you are truly concerned about performance, there are some very fancy performant js libraries out there.

answered Dec 16, 2013 at 18:35
\$\endgroup\$
2
  • 1
    \$\begingroup\$ Thank you so much, looks like a lot of good stuff in your post. I will give them a go and see if I can get better performance numbers. I am not overly concerned about performance, but my original code took ~6 seconds for 1200 iterations over 2400 Euro signs, reduced by one char per iteration until I hit 1200, for my enforceMaxByteLength script. This code took ~3.8 seconds, so hopefully I can shave off a bit more. \$\endgroup\$ Commented Dec 16, 2013 at 18:51
  • 1
    \$\begingroup\$ Your counter proposition is genius, it shaved another .6 seconds off my benchmark, thank you. \$\endgroup\$ Commented Dec 16, 2013 at 19:14
0
\$\begingroup\$

You can try this:

var b = str.match(/[^\x00-\xff]/g);
return (str.length + (!b ? 0: b.length));

It worked for me.

answered Jun 11, 2014 at 2:54
\$\endgroup\$
1
  • \$\begingroup\$ This undercounts code points U+0080 to U+00FF (2 bytes in UTF-8, counted as 1) and every character whose UTF-8 representation requires 3 bytes (U+0800 to U+FFFF, counted as 2). \$\endgroup\$ Commented Nov 10, 2020 at 11:54
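
A regex-based count can be made to cover the cases noted in the comment above by matching two separate ranges. This is only a sketch: the name `getByteLenRegex` is hypothetical, and it follows the two-bytes-per-surrogate-half convention, so an unpaired surrogate counts as 2 bytes here rather than the 3 an actual encoder would emit.

```javascript
// Sketch only: getByteLenRegex is a hypothetical name.
function getByteLenRegex(value) {
  var str = String(value);
  // Code units worth 2 bytes each: U+0080..U+07FF and surrogate halves
  var twoByte = str.match(/[\u0080-\u07ff\ud800-\udfff]/g);
  // Code units worth 3 bytes each: the rest of the BMP
  var threeByte = str.match(/[\u0800-\ud7ff\ue000-\uffff]/g);
  // Every code unit contributes at least 1 byte; add the extras on top
  return str.length +
    (twoByte ? twoByte.length : 0) +
    (threeByte ? 2 * threeByte.length : 0);
}
```

The regular expressions deliberately omit the `u` flag so that they match individual UTF-16 code units, which also keeps them usable in older engines.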
