Byte counter snippet

Question 1

Some people have been linking to off-site resources for counting the length of an answer (mostly https://mothereff.in/byte-counter). Some people also pointed out that it only counts in UTF-8.

Well, I was bored so I created a snippet. It counts UTF-8 bytes if you paste in code, and counts the raw bytes of the file if you either drag and drop a file or select one from a dialog. The raw, non-minified source is here and a full-page version can be found here.

It requires a modern browser supporting the HTML5 File API.
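For anyone who wants the textbox counting without the page around it, here is a rough standalone sketch. It is not the snippet's actual code: the snippet measures the size of a `Blob`, while this uses the standard `TextEncoder` API instead. The names `utf8ByteLength` and `countText` are made up for illustration.

```javascript
// Number of bytes the string occupies when encoded as UTF-8.
// TextEncoder always encodes to UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Mirrors the two numbers the snippet shows for pasted text:
// "bytes" is the UTF-8 size, "chars" is the JavaScript string length
// (which counts UTF-16 code units, not Unicode characters).
function countText(text) {
  return { bytes: utf8ByteLength(text), chars: text.length };
}

console.log(countText("abc")); // ASCII: one byte per character
console.log(countText("π"));   // U+03C0 takes two bytes in UTF-8
```

The file path is simpler still: the snippet reports the file's `size` property directly, so no encoding is involved there.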

Byte counter

<!DOCTYPE html><html><head><style type=text/css>html,body{margin:0;height:100%;overflow-y:hidden;font-family:'Helvetica Neue',Helvetica,Arial,sans-serif}#wrapper{overflow-y:hidden;margin:0;min-height:100%;padding:10px}#fileinput{display:none}#bytes,#chars{white-space:nowrap;font-weight:bold;font-size:20px;padding-right:10px}td{vertical-align:middle}table{margin-bottom:10px;margin-right:80px}#textinput{width:100%;box-sizing:border-box}</style><!--[if lte IE 6]><style type=text/css>#container{height:100%}</style><![endif]--></head><body><div id=wrapper><table><tr><td id=bytes>0 bytes</td><td rowspan=2>Drag and drop a file anywhere on this snippet, <a href=# id=fileselect>select a file using a dialog</a>, or enter UTF-8 code in the textbox.</td></tr><tr><td id=chars>0 chars</td></tr></table><input type=file id=fileinput onchange=handle_file(this.files)><textarea id=textinput onkeyup=textbox(this.value) onchange=textbox(this.value)></textarea></div><script src=https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js></script><script type=text/javascript>function nodefault(a){a.stopPropagation();a.preventDefault()}function handle_file(b){var a=new FileReader();a.onload=function(c){$("#chars").text(a.result.length+" chars")};a.readAsText(b[0],"UTF-8");$("#bytes").text(b[0].size+" bytes")}function textbox(a){$("#bytes").text((new Blob([a],{encoding:"UTF-8",type:"text/plain;charset=UTF-8"})).size+" bytes");$("#chars").text(a.length+" chars")}function drop(a){nodefault(a);handle_file(a.dataTransfer.files)}function click(a){nodefault(a);$("#fileinput")[0].click()}$(document).ready(function(){var a=function(){$("#textinput").height($(window).height()-$("#textinput").offset().top-20)};$(window).resize(a);a()});document.body.addEventListener("dragenter",nodefault,false);document.body.addEventListener("dragover",nodefault,false);document.body.addEventListener("drop",drop,false);$("#fileselect").on("click",click);</script></body></html>

Question 2

There was already a discussion about how to present Stack Snippets, with no clear consensus. I suppose if nobody has a problem with this question, it's fine to have one snippet per meta question.

Question 3

@Doorknob I feel that a format of "one question per snippet" is far superior, because comments lack threading, unlimited editing, downvoting, deleting, etc. Also, seeing that there are currently a whopping 2 questions presenting Stack Snippets (at least tagged stack-snippets), I don't see it being a problem either.

Question 4

WOW! This thing works with files as big as 600MB!!! WOW!!!

Question 5

Do you know what might be causing this?

Question 6

@Calvin'sHobbies Not yet, I'll look into it this week (sorry, busy schedule, might be today, might be Saturday).

Question 7

Feature Request: Indicate if all characters are in ISO-8859-1

If all characters have code points less than 256, one can encode the file in ISO-8859-1, which simply uses one byte for each character. It would be neat if the snippet checked whether that is possible, such that the character count (see other feature request) can be used directly as the byte count.
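A minimal sketch of what such a check could look like, assuming it runs on the same string the textbox handler already receives. `fitsLatin1` and `latin1ByteLength` are hypothetical names, not part of the snippet. Testing UTF-16 code units is enough here, because surrogate code units are all above 0xFF.

```javascript
// True if every character has a code point below 256, i.e. the text
// can be encoded in ISO-8859-1 using exactly one byte per character.
function fitsLatin1(text) {
  return /^[\u0000-\u00ff]*$/.test(text);
}

// If the text fits, its ISO-8859-1 byte count equals its length;
// otherwise return null to signal that the encoding does not apply.
function latin1ByteLength(text) {
  return fitsLatin1(text) ? text.length : null;
}

console.log(fitsLatin1("é ÿ")); // all code points below 256
console.log(fitsLatin1("π∫√")); // all three are above 255
```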

Question 8

Would you mind giving me some test strings, and expected outputs for them compared to what it does now?

Question 9

@orlp Essentially, if the input only contains characters generated by this script then there should be some indication that all code points are less than 256. If the input contains something else, like one of π∫√, then it shouldn't.

Question 10

What's so special about ISO-8859-1? There are other codepages, like Windows-1252, ISO-8859-2, ...

Question 11

@anatolyg What's "special" about ISO-8859-1 is that it corresponds to the first 256 codepoints of Unicode. So in most languages, if you have 256 characters, they will likely be those from this encoding. Of course, there might be some specific situations where you want to use a different encoding, but in those cases you've probably selected it carefully enough that you know about it yourself.

Question 12

This is a downvote right away. ISO-8859-1 may be the USA standard, but it breaks a lot with the € (euro) symbol. Or even windows-1252. I hate utf8_encode and utf8_decode in PHP due to that. What were they thinking??? Why choose a setting that only works well for English-speaking countries? ISO-8859-15 is the best solution.

Question 13

@IsmaelMiguel Please re-read my previous comment. The only reason I mentioned this is that by default (i.e. for golfing purposes) most languages will simply give you ISO-8859-1 for the first 256 code points, because those are the first 256 code points defined in Unicode. If you need your specific language's page, then you will probably know about this and don't need to be informed by the snippet.

Question 14

@MartinBüttner Then also add windows-1252 and ISO-8859-15.

Question 15

Feature Request: Correctly count characters from supplementary planes

This is minor, but could still be looked at if it's easy to change.

When using some characters, such as emojis, the UTF-16 encoding that JavaScript strings use needs two "characters" (code units) for each of them.

[Image: emoji taking 2 chars in a JS string]

However, they are defined to be single characters in Unicode. It would be nice to correctly count them, as they may be significant in challenges counting by characters.
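To illustrate the difference: `.length` counts UTF-16 code units, while the ES2015 string iterator walks the string one code point at a time. A small sketch, with `codePointCount` as a hypothetical helper name:

```javascript
// Counts Unicode code points by iterating the string; the string
// iterator yields a surrogate pair as a single element.
function codePointCount(text) {
  let count = 0;
  for (const _ch of text) count++;
  return count;
}

const emoji = "\u{1F600}"; // a character from a supplementary plane
console.log(emoji.length, codePointCount(emoji)); // 2 1
```

This needs a browser with ES2015 support, which is a stricter requirement than the File API alone.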

Question 16

Oh, I thought JavaScript used UCS-4, or at least that .length returned the number of Unicode code points. I'll look into it.

Question 17

@orlp: Here is a lazy implementation: function cpCount(s) { return s.replace(/[\ud800-\udbff][\udc00-\udfff]/g, '0').length } Basically, it replaces every valid surrogate pair with a character that is one UTF-16 code unit and then counts the length. An unpaired surrogate counts as one character.
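Laid out as a block, the function from the comment above looks like this (same regex, only reformatted and commented), followed by a few inputs that show how it treats paired and unpaired surrogates:

```javascript
// Replace each valid surrogate pair (high surrogate followed by low
// surrogate) with a single placeholder character, then take the length.
function cpCount(s) {
  return s.replace(/[\ud800-\udbff][\udc00-\udfff]/g, '0').length;
}

console.log(cpCount("a\u{1F600}b")); // pair collapses to one unit
console.log(cpCount("\ud83d"));      // lone high surrogate stays as one
console.log(cpCount("\udc00\ud800")); // wrong order, so not a pair
```

This works in older browsers too, since it relies only on `String.prototype.replace` and a plain regex.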

Martin Ender · Answer 1 · 2015-03-25 14:00:54Z

Feature Request: Indicate if all characters are in ISO-8859-1

If all characters have code points less than 256, one can encode the file in ISO-8859-1, which simply uses one byte for each character. It would be neat if the snippet checked whether that is possible, such that the character count (see other feature request) can be used directly as the byte count.

Edited Mar 25, 2015 at 14:20

Would you mind giving me some test strings, and expected outputs for them compared to what it does now?
– orlp, Mar 25, 2015 at 14:40

@orlp Essentially, if the input only contains characters generated by this script then there should be some indication that all code points are less than 256. If the input contains something else, like one of π∫√, then it shouldn't.
– Martin Ender, Mar 25, 2015 at 16:14

What's so special about ISO-8859-1? There are other codepages, like Windows-1252, ISO-8859-2, ...
– anatolyg, Mar 30, 2015 at 12:47

@anatolyg What's "special" about ISO-8859-1 is that it corresponds to the first 256 codepoints of Unicode. So in most languages, if you have 256 characters, they will likely be those from this encoding. Of course, there might be some specific situations where you want to use a different encoding, but in those cases you've probably selected it carefully enough that you know about it yourself.
– Martin Ender, Mar 30, 2015 at 13:08

This is a downvote right away. ISO-8859-1 may be the USA standard, but it breaks a lot with the € (euro) symbol. Or even windows-1252. I hate utf8_encode and utf8_decode in PHP due to that. What were they thinking??? Why choose a setting that only works well for English-speaking countries? ISO-8859-15 is the best solution.
– Ismael Miguel, Apr 1, 2015 at 15:22

@IsmaelMiguel Please re-read my previous comment. The only reason I mentioned this is that by default (i.e. for golfing purposes) most languages will simply give you ISO-8859-1 for the first 256 code points, because those are the first 256 code points defined in Unicode. If you need your specific language's page, then you will probably know about this and don't need to be informed by the snippet.
– Martin Ender, Apr 1, 2015 at 17:21

@MartinBüttner Then also add windows-1252 and ISO-8859-15.
– Ismael Miguel, Apr 1, 2015 at 17:35

PurkkaKoodari · Answer 2 · 2015-04-03 21:06:24Z

Feature Request: Correctly count characters from supplementary planes

This is minor, but could still be looked at if it's easy to change.

When using some characters, such as emojis, the UTF-16 encoding that JavaScript strings use needs two "characters" (code units) for each of them.

[Image: emoji taking 2 chars in a JS string]

However, they are defined to be single characters in Unicode. It would be nice to correctly count them, as they may be significant in challenges counting by characters.

Edited Jun 17, 2020 at 9:03 by Community Bot

Oh, I thought JavaScript used UCS-4, or at least that .length returned the number of Unicode code points. I'll look into it.
– orlp, Apr 3, 2015 at 21:12

@orlp: Here is a lazy implementation: function cpCount(s) { return s.replace(/[\ud800-\udbff][\udc00-\udfff]/g, '0').length } Basically, it replaces every valid surrogate pair with a character that is one UTF-16 code unit and then counts the length. An unpaired surrogate counts as one character.
– n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳, Apr 6, 2015 at 10:35
