Please review code quality and give me shortcuts or optimization tips as well as corrections.
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Clean HTML Code</title>
</head>
<body>
<p><textarea id="code" cols="80" rows="15" autofocus></textarea></p>
<p><input id="comments" type="checkbox" checked> <label for="comments">Remove All HTML Comments</label></p>
<p><input id="head" type="checkbox"> <label for="head">Remove Head Tags</label></p>
<p><input id="style" type="checkbox"> <label for="style">Remove Style Tags</label></p>
<p><input id="script" type="checkbox"> <label for="script">Remove All Script Tags (also removes NoScript Tags)</label></p>
<p><button id="btn">Clean Up HTML</button></p>
<script>
var code = $("code"),
comments = $("comments"),
head = $("head"),
style = $("style"),
script = $("script"),
btn = $("btn");
function $(id) {
return document.getElementById(id);
}
function cleanUpHtml() {
var html = code.value;
// http://blog.gotux.net/tutorial/clean-html-code-using-regex-in-ruby/
html = html.replace(/(\n|\t|\r)/g, ' '); // Remove Newlines, Tabs and Carriage Returns
html = html.replace(/>\s*</g, '><'); // Remove Spaces Between Tags
if (comments.checked) html = html.replace(/<!--.*?-->/im, ''); // Remove All HTML Comments
if (head.checked) html = html.replace(/<head.*?<\/head>/im, ''); // Remove Head Tags
if (style.checked) html = html.replace(/<style.*?<\/style>/im, ''); // Remove Style Tags
if (script.checked) {
html = html.replace(/<script.*?<\/script>/im, ''); // Remove All Script Tags
html = html.replace(/<noscript.*?<\/noscript>/im, ''); // Remove All NoScript Tags
html = html.replace(/on.=".*?"/ig, ''); // Remove All Event Handler Attributes (eg. onclick)
}
html = html.replace(/<form.*?<\/form>/im, ''); // Remove All Form Elements
// html.squeeze!(' ') // Remove Spaces Between Strings
html = html.replace(/\s{2,}/g, " ") // finally, remove extra spaces
code.value = html;
}
btn.addEventListener("click", cleanUpHtml, false);
</script>
</body>
</html>
1 Answer
First of all, you can't reliably parse HTML with regex. Regex is the wrong tool for the job. Your code assumes that the input HTML is properly formatted; given broken HTML, it won't stand a chance. I suggest you use a parser instead. Writing one isn't trivial, so you might want to look for an existing one.
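To make the fragility concrete, here is a small sketch that runs two of the question's patterns against slightly unusual input. The patterns are copied from the question; the sample strings and variable names are made up for illustration.

```javascript
// The script-stripping pattern from the question, unchanged.
var scriptPattern = /<script.*?<\/script>/im;

// Well-formed input: the script element is removed as intended.
var wellFormed = '<p>a</p><script>alert(1);</script><p>b</p>';
var cleanedWellFormed = wellFormed.replace(scriptPattern, '');

// A space inside the closing tag is still an end tag to a browser,
// but the pattern no longer matches, so the script survives.
var spaced = '<p>a</p><script>alert(1);</script ><p>b</p>';
var cleanedSpaced = spaced.replace(scriptPattern, '');

// The comment pattern has no g flag, so only the first comment is removed.
var twoComments = '<!-- one --><p>a</p><!-- two -->';
var cleanedComments = twoComments.replace(/<!--.*?-->/im, '');

console.log(cleanedWellFormed); // <p>a</p><p>b</p>
console.log(cleanedSpaced);     // unchanged, script still present
console.log(cleanedComments);   // <p>a</p><!-- two -->
```

Adding the g flag would fix the second case, but the first kind of problem keeps coming back in new forms, which is why a parser is the safer route.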
Next, I see you use $, and I thought it was jQuery until I saw the code define it. Avoid naming your own function $. Give it a descriptive name instead, so readers aren't misled into thinking it is jQuery when it isn't.
Speaking of jQuery, you could in theory use it to strip out the unwanted elements, since it has an HTML parser and handy DOM functions:
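// Assumes real jQuery is loaded as $ and that comments, head, style and script
// are the checkbox elements from the question.
// Get the HTML and parse it into a detached wrapper so descendants can be searched
var html = $('#code').val();
var DOM = $('<div>').append($.parseHTML(html, document, true)); // true keeps script nodes
// Collect selectors to remove
var toRemove = ['form']; // the question always strips forms; there is no form checkbox
// Fragment parsing discards the <head> tag itself, so 'head' may match nothing here
if (head.checked) toRemove.push('head');
if (style.checked) toRemove.push('style');
if (script.checked) toRemove.push('script', 'noscript');
// .remove(selector) only filters the current set, so select descendants first
DOM.find(toRemove.join(',')).remove();
// Comment nodes have nodeType 8; .filter('*') would only drop top-level ones
if (comments.checked) {
  DOM.find('*').addBack().contents().filter(function () {
    return this.nodeType === 8;
  }).remove();
}
var cleanHTML = DOM.html(); // inner HTML of the wrapper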
Just a theory. Dunno how it would really perform.
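The same idea also works without jQuery, using the browser's built-in DOMParser. The following is only a sketch under a few assumptions: the function name cleanWithParser and its option names are hypothetical, it needs a browser environment, and it returns the serialized html element, so the doctype is not preserved.

```javascript
// Hypothetical helper: the name and the option keys are illustrative only.
// Requires a browser environment (DOMParser, NodeFilter).
function cleanWithParser(html, options) {
  // parseFromString builds a separate document; scripts in it are not executed.
  var doc = new DOMParser().parseFromString(html, 'text/html');

  // Elements to drop. Forms are always removed, as in the question.
  var selectors = ['form'];
  if (options.head) selectors.push('head');
  if (options.style) selectors.push('style');
  if (options.script) selectors.push('script', 'noscript');
  Array.prototype.forEach.call(doc.querySelectorAll(selectors.join(',')), function (element) {
    element.parentNode.removeChild(element);
  });

  if (options.script) {
    // Strip attributes whose name starts with "on" (onclick, onload, ...).
    Array.prototype.forEach.call(doc.querySelectorAll('*'), function (element) {
      Array.prototype.slice.call(element.attributes).forEach(function (attribute) {
        if (/^on/i.test(attribute.name)) element.removeAttribute(attribute.name);
      });
    });
  }

  if (options.comments) {
    // Collect first, then remove, so the walker is not disturbed mid-traversal.
    var walker = doc.createTreeWalker(doc, NodeFilter.SHOW_COMMENT);
    var commentNodes = [];
    while (walker.nextNode()) commentNodes.push(walker.currentNode);
    commentNodes.forEach(function (node) {
      node.parentNode.removeChild(node);
    });
  }

  // Serialize the cleaned tree (without the doctype).
  return doc.documentElement.outerHTML;
}
```

With the question's variables, the click handler could then do something like code.value = cleanWithParser(code.value, { comments: comments.checked, head: head.checked, style: style.checked, script: script.checked }). Because the parser handles tag boundaries, input such as a closing tag with a stray space is treated the way a browser would treat it.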
<script>alert('xss');</script > (note inserted space). For other uses this might do its job, but in most cases you’ll want something that actually understands HTML rather than sifting through it with regular expressions.