Some people have been linking to off-site resources for counting the length of an answer (mostly https://mothereff.in/byte-counter). Some people also pointed out that it only counts in UTF-8.
Well, I was bored, so I created a snippet. It counts in UTF-8 if you paste in code, and counts raw bytes if you either drag and drop a file or select one from a dialog. The raw non-minified source is here and a full-page version can be found here.
It requires a modern browser supporting the HTML5 File API.
Byte counter
<!DOCTYPE html><html><head><style type=text/css>html,body{margin:0;height:100%;overflow-y:hidden;font-family:'Helvetica Neue',Helvetica,Arial,sans-serif}#wrapper{overflow-y:hidden;margin:0;min-height:100%;padding:10px}#fileinput{display:none}#bytes,#chars{white-space:nowrap;font-weight:bold;font-size:20px;padding-right:10px}td{vertical-align:middle}table{margin-bottom:10px;margin-right:80px}#textinput{width:100%;box-sizing:border-box}</style><!--[if lte IE 6]><style type=text/css>#container{height:100%}</style><![endif]--></head><body><div id=wrapper><table><tr><td id=bytes>0 bytes</td><td rowspan=2>Drag and drop a file anywhere on this snippet, <a href=# id=fileselect>select a file using a dialog</a>, or enter UTF-8 code in the textbox.</td></tr><tr><td id=chars>0 chars</td></tr></table><input type=file id=fileinput onchange=handle_file(this.files)><textarea id=textinput onkeyup=textbox(this.value) onchange=textbox(this.value)></textarea></div><script src=https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js></script><script type=text/javascript>function nodefault(a){a.stopPropagation();a.preventDefault()}function handle_file(b){var a=new FileReader();a.onload=function(c){$("#chars").text(a.result.length+" chars")};a.readAsText(b[0],"UTF-8");$("#bytes").text(b[0].size+" bytes")}function textbox(a){$("#bytes").text((new Blob([a],{encoding:"UTF-8",type:"text/plain;charset=UTF-8"})).size+" bytes");$("#chars").text(a.length+" chars")}function drop(a){nodefault(a);handle_file(a.dataTransfer.files)}function click(a){nodefault(a);$("#fileinput")[0].click()}$(document).ready(function(){var a=function(){$("#textinput").height($(window).height()-$("#textinput").offset().top-20)};$(window).resize(a);a()});document.body.addEventListener("dragenter",nodefault,false);document.body.addEventListener("dragover",nodefault,false);document.body.addEventListener("drop",drop,false);$("#fileselect").on("click",click);</script></body></html>
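For reference, the text-box half of the snippet boils down to measuring the UTF-8 encoding of a string. The following is only a minimal sketch, assuming an environment that provides `TextEncoder` (modern browsers and Node.js). The name `utf8ByteLength` is hypothetical, and the snippet itself measures a `Blob` rather than making this call.

```javascript
// Hypothetical helper, not part of the snippet above.
// Encodes the string as UTF-8 and returns the number of bytes produced.
// Note: TextEncoder replaces unpaired surrogates with U+FFFD (3 bytes).
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Mirrors the two figures the snippet displays for pasted text.
function describeText(text) {
  return utf8ByteLength(text) + " bytes, " + text.length + " chars";
}

console.log(describeText("abc"));  // 3 bytes, 3 chars
console.log(describeText("π∫√"));  // 8 bytes, 3 chars
```

For dropped or selected files no encoding is involved at all: the snippet simply reports the `size` property of the `File` object.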
- There was already a discussion about how to present Stack Snippets, with no clear consensus. I suppose if nobody has a problem with this question, it's fine to have one snippet per meta question. – Doorknob (Mod), Mar 25, 2015 at 13:56
- @Doorknob I feel that a format of "one question per snippet" is highly superior, because comments lack threading, unlimited editing, downvoting, deleting, etc. Also, seeing there are currently a whopping 2 questions presenting Stack Snippets (at least tagged stack-snippets), I don't see it being a problem either. – orlp, Mar 25, 2015 at 14:00
- WOW! This thing works with files as big as 600MB!!! WOW!!! – Ismael Miguel, Apr 1, 2015 at 15:24
- Do you know what might be causing this? – Calvin's Hobbies, Apr 8, 2015 at 2:57
- @Calvin'sHobbies Not yet, I'll look into it this week (sorry, busy schedule, might be today, might be Saturday). – orlp, Apr 8, 2015 at 10:31
2 Answers
Feature Request: Indicate if all characters are in ISO-8859-1
If all characters have code points less than 256, one can encode the file in ISO-8859-1, which simply uses one byte for each character. It would be neat if the snippet checked whether that is possible, such that the character count (see other feature request) can be used directly as the byte count.
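A check along these lines might look like the following. It is only a sketch of the idea, not something the snippet currently does, and `fitsInLatin1` is a hypothetical name. It relies on the fact that every character outside ISO-8859-1 has at least one UTF-16 code unit above 0xFF (surrogates included).

```javascript
// Hypothetical helper: true if every character has a code point below 256,
// i.e. the text can be stored in ISO-8859-1 with one byte per character.
function fitsInLatin1(text) {
  for (let i = 0; i < text.length; i++) {
    // charCodeAt returns a UTF-16 code unit; surrogates are >= 0xD800,
    // so supplementary-plane characters are rejected as well.
    if (text.charCodeAt(i) > 0xFF) {
      return false;
    }
  }
  return true;
}

console.log(fitsInLatin1("café"));  // true
console.log(fitsInLatin1("π∫√"));   // false
```

When the check passes, the character count and the ISO-8859-1 byte count are the same number.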
- Would you mind giving me some test strings, and expected outputs for them compared to what it does now? – orlp, Mar 25, 2015 at 14:40
- @orlp Essentially, if the input only contains characters generated by this script then there should be some indication that all code points are less than 256. If the input contains something else, like one of π∫√, then it shouldn't. – Martin Ender, Mar 25, 2015 at 16:14
- What's so special about ISO-8859-1? There are other codepages, like Windows-1252, ISO-8859-2, ... – anatolyg, Mar 30, 2015 at 12:47
- @anatolyg What's "special" about ISO-8859-1 is that it corresponds to the first 256 code points of Unicode. So in most languages if you use 256 characters they will likely be those from this encoding. Of course, there might be some specific situations where you want to use a different encoding, but in those cases you've probably selected it carefully enough that you know about it yourself. – Martin Ender, Mar 30, 2015 at 13:08
- This is a downvote right away. ISO-8859-1 may be the USA standard, but it breaks a lot with the € (euro) symbol. Or even windows-1252. I hate utf8_encode and utf8_decode in PHP due to that. What were they thinking??? Why choose a setting that only works well for English-speaking countries? ISO-8859-15 is the best solution. – Ismael Miguel, Apr 1, 2015 at 15:22
- @IsmaelMiguel Please re-read my previous comment. The only reason I mentioned this is that by default (i.e. for golfing purposes) most languages will simply give you ISO-8859-1 for the first 256 code points, because those are the first 256 code points defined in Unicode. If you need your specific language's code page, then you will probably know about this and won't need to be informed by the snippet. – Martin Ender, Apr 1, 2015 at 17:21
- @MartinBüttner Then also add windows-1252 and ISO-8859-15. – Ismael Miguel, Apr 1, 2015 at 17:35
Feature Request: Correctly count characters from supplementary planes
This is minor, but could still be looked at if it's easy to change.
Some characters, such as emoji, take two "characters" (code units) in UTF-16 or UCS-2, which JavaScript uses.
[Image: emoji taking 2 chars in a JS string]
However, they are defined to be single characters in Unicode. It would be nice to correctly count them, as they may be significant in challenges counting by characters.
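One possible way to get that count, assuming an ES2015-capable browser, is to lean on the string iterator, which walks a string by code point rather than by UTF-16 code unit. This is a sketch only, and `codePointCount` is a hypothetical name rather than anything in the snippet.

```javascript
// Hypothetical helper: counts Unicode code points instead of UTF-16 code units.
// The string iterator yields each surrogate pair as one item; an unpaired
// surrogate is yielded on its own and therefore counts as one.
function codePointCount(text) {
  let count = 0;
  for (const _ of text) {
    count++;
  }
  return count;
}

const sample = "a\u{1F600}b";            // "a", an emoji from a supplementary plane, "b"
console.log(sample.length);              // 4
console.log(codePointCount(sample));     // 3
```

Combining marks are still counted separately, since each one is its own code point.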
- Oh, I thought JavaScript used UCS-4, or at least that .length returned the number of Unicode code points. I'll look into it. – orlp, Apr 3, 2015 at 21:12
- @orlp: Here is a lazy implementation:
function cpCount(s) { return s.replace(/[\ud800-\udbff][\udc00-\udfff]/g, '0').length }
Basically, it replaces every valid surrogate pair with a character that is one UTF-16 code unit and then takes the length. An unpaired surrogate counts as one character. – n̴̖̋h̷͉̃a̷̭̿h̸̡̅ẗ̵̨́d̷̰̀ĥ̷̳, Apr 6, 2015 at 10:35