HTML cleaner in JavaScript

Question 1

Please review code quality and give me shortcuts or optimization tips as well as corrections.

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Clean HTML Code</title>
</head>
<body>
<p><textarea id="code" cols="80" rows="15" autofocus></textarea></p>
<p><input id="comments" type="checkbox" checked> <label for="comments">Remove All HTML Comments</label></p>
<p><input id="head" type="checkbox"> <label for="head">Remove Head Tags</label></p>
<p><input id="style" type="checkbox"> <label for="style">Remove Style Tags</label></p>
<p><input id="script" type="checkbox"> <label for="script">Remove All Script Tags (also removes NoScript Tags)</label></p>
<p><button id="btn">Clean Up HTML</button></p>
<script>
 var code = $("code"),
 comments = $("comments"),
 head = $("head"),
 style = $("style"),
 script = $("script"),
 btn = $("btn");
 function $(id) {
 return document.getElementById(id);
 }
 function cleanUpHtml() {
 var html = code.value;
 // http://blog.gotux.net/tutorial/clean-html-code-using-regex-in-ruby/
 html = html.replace(/(\n|\t|\r)/g, ' '); // Remove Newlines, Tabs and Carriage Returns
 html = html.replace(/>\s*</g, '><'); // Remove Spaces Between Tags
 if (comments.checked) html = html.replace(/<!--.*?-->/g, ''); // Remove All HTML Comments (g flag: every match, not just the first)
 if (head.checked) html = html.replace(/<head.*?<\/head>/gi, ''); // Remove Head Tags
 if (style.checked) html = html.replace(/<style.*?<\/style>/gi, ''); // Remove Style Tags
 if (script.checked) {
 html = html.replace(/<script.*?<\/script>/gi, ''); // Remove All Script Tags
 html = html.replace(/<noscript.*?<\/noscript>/gi, ''); // Remove All NoScript Tags
 html = html.replace(/\son\w+=".*?"/gi, ''); // Remove All Event Handler Attributes (eg. onclick)
 }
 html = html.replace(/<form.*?<\/form>/gi, ''); // Remove All Form Elements
 // html.squeeze!(' ') // Remove Spaces Between Strings
 html = html.replace(/\s{2,}/g, " "); // finally, remove extra spaces
 code.value = html;
 }
 btn.addEventListener("click", cleanUpHtml, false);
</script>
</body>
</html>

Question 2

I don’t know what this would be used for, but if it were for cleaning input by untrusted users, it is woefully inadequate. An example, and this is just the start: <script>alert('xss');</script > (note inserted space). For other uses this might do its job, but in most cases you’ll want something that actually understands HTML rather than sifting through it with regular expressions.
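To make the point concrete, here is a small sketch that runs the question's script-removal pattern against an ordinary closing tag and against the `</script >` variant. The names `scriptPattern` and `stripScripts` are made up for this illustration; only the regular expression comes from the question.

```javascript
// The pattern the question uses to remove script tags.
var scriptPattern = /<script.*?<\/script>/im;

// Hypothetical helper for this demonstration only.
function stripScripts(html) {
  return html.replace(scriptPattern, '');
}

var ordinary = "<p>hi</p><script>alert('xss');</script>";
var evasive = "<p>hi</p><script>alert('xss');</script >";

// The ordinary closing tag is matched, so the script is removed.
console.log(stripScripts(ordinary));

// The pattern requires a literal "</script>", so "</script >" never matches
// and the input comes back unchanged.
console.log(stripScripts(evasive) === evasive);
```

Browsers accept whitespace before the `>` of an end tag, so the second input still runs as a script even though the pattern lets it through.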

Question 3

This can be used to compress HTML files.

Question 4

Could you perhaps explain why you would want to do this in JS on the client side? You would have to do it all over again on the server, as you can never trust input from the client. Why bother then? I use htmlpurifier.org for stuff like this (no need to reinvent the wheel), and if need be, you can make an Ajax call to run the cleaning on the server.

Question 5

First of all, you can't parse HTML with regex. RegEx is totally the wrong tool for the job. Your code there assumes that the input HTML is properly formatted. If given broken HTML, your code won't even stand a chance. I suggest you use a parser instead. Creating one isn't trivial, so you might want to look for one out there.

Now, I see you use $ and I thought it was jQuery until I saw the code define it. Avoid using $. Name your function verbosely instead. That way, you don't confuse people into thinking it is jQuery when it isn't.
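As a rough illustration of the naming advice, the lookup helper could be given a descriptive name. The name `elementById` and the optional `doc` parameter are assumptions made for this sketch; the parameter is there only so the helper can be tried outside a browser.

```javascript
// A descriptive name makes it obvious this is a plain DOM lookup, not jQuery.
// "doc" is optional; in a browser the global document is used.
function elementById(id, doc) {
  var root = doc || document;
  return root.getElementById(id);
}

// Stand-in for a real document, used only to show the call shape.
var fakeDocument = {
  getElementById: function (id) {
    return id === "code" ? { id: "code", value: "" } : null;
  }
};

var codeArea = elementById("code", fakeDocument);
console.log(codeArea.id);
```

In the question's page the lookups would then read `elementById("code")`, `elementById("comments")`, and so on.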

Now, speaking of jQuery, you could theoretically use jQuery to strip out HTML since it has a DOM parser and handy DOM functions:

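// Get the HTML and parse it into a detached wrapper (keepScripts = true so scripts can be found)
var html = $('#code').val();
var DOM = $('<div>').append($.parseHTML(html, null, true));
// Collect selectors to remove
var toRemove = ['form']; // the question's code always removes forms; there is no form checkbox
// jQuery parses the string as a fragment, so a <head> wrapper may already be dropped at this point
if(head.checked) toRemove.push('head');
if(style.checked) toRemove.push('style');
if(script.checked) toRemove.push('script', 'noscript');
// .remove() has to run on the matched descendants, not on the wrapper itself
DOM.find(toRemove.join(',')).remove();
// Comments are nodes with nodeType 8; .contents() includes them, selectors do not
if(comments.checked) DOM.find('*').addBack().contents().filter(function() {
  return this.nodeType === 8;
}).remove();
var cleanHTML = DOM.html();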

Just a theory. Dunno how it would really perform.
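For comparison, the same idea can be sketched without jQuery using the browser's built-in `DOMParser`, which parses a full document and so keeps `<head>` available for removal. This is a sketch under assumptions, not a tested sanitizer: it needs a browser environment, and `cleanHtmlWithParser` and its option names are hypothetical.

```javascript
// Sketch only: assumes a browser, where DOMParser and NodeFilter exist.
function cleanHtmlWithParser(html, options) {
  // Scripts inside a document created by DOMParser are not executed.
  var doc = new DOMParser().parseFromString(html, "text/html");
  var toArray = function (list) { return Array.prototype.slice.call(list); };

  var selectors = ["form"]; // the question's code always removes forms
  if (options.head) selectors.push("head");
  if (options.style) selectors.push("style");
  if (options.script) selectors.push("script", "noscript");

  toArray(doc.querySelectorAll(selectors.join(","))).forEach(function (element) {
    element.parentNode.removeChild(element);
  });

  if (options.script) {
    // Drop inline event handler attributes such as onclick.
    toArray(doc.querySelectorAll("*")).forEach(function (element) {
      toArray(element.attributes).forEach(function (attribute) {
        if (attribute.name.indexOf("on") === 0) {
          element.removeAttribute(attribute.name);
        }
      });
    });
  }

  if (options.comments) {
    // Collect comment nodes first, then remove them, so the walk is not disturbed.
    var walker = doc.createTreeWalker(doc, NodeFilter.SHOW_COMMENT);
    var commentNodes = [];
    while (walker.nextNode()) commentNodes.push(walker.currentNode);
    commentNodes.forEach(function (node) {
      node.parentNode.removeChild(node);
    });
  }

  return doc.documentElement.outerHTML;
}

// Guarded so the snippet loads without error outside a browser.
if (typeof DOMParser !== "undefined") {
  console.log(cleanHtmlWithParser(
    "<p onclick=\"x()\">hi</p><!-- note --><script>alert(1)</script >",
    { script: true, comments: true }
  ));
}
```

Because the parser decides where elements start and end, variants such as `</script >` are handled the same way the browser would handle them. For untrusted input this would still have to be repeated on the server.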

Joseph · Answer 1 · 2014-05-14 07:59:30Z
