I need to remove inline javascript for a given string. Examples:
If user typed: <img onload="something" />
I should need to convert into <img />
I created this PHP code and it works(apparently without issues):
http://writecodeonline.com/php/
function test_input($input){
//I have a list with all events but for this example I used two
$html_events = 'onload|onclick';
$pattern = "/(<[A-Z][A-Z0-9]*[^>]*)($html_events)([\s]*=[\s]*)('[^>]*'|\"[^>]*\")([^>]*>)/i";
$replacement = '1ドル5ドル';
while( preg_match($pattern, $input) ){
$input = preg_replace($pattern, $replacement, $input);
}
return htmlentities($input);
}
echo test_input('<img onload="alert(\'hello world\');" onclick="alert(\'hello world\');" />'). '<br />';
echo test_input('<img onload="alert(\'hello world\');"/>'). '<br />';
echo test_input('<div onload="alert(\'hello world\');" onclick="alert(\'hello world\');">hello buddies</div>'). '<br />';
I'm just looking for improvements or use cases that I did not supporting or that break my regex. I would appreciate if you tell me:
This: test_input('something bad');
breaks your regex.
Or if found an improvement that in a benchmark demonstrate better performance I should be happy to apply it as long as it does not break use cases already supported.
Thank You!
Update I finally used htmlpurifier
2 Answers 2
Parsing markup with regex is like building your house using lego... it's not the right tool for the job. HTML is not a regular language, therefore regular expressions don't cut the mustard. More than that: You're actively working to bring the world as we know it to an end, which drives people insane
What you need is a DOM parser, and as luck would have it, PHP has the DOMDocument
object, which is just that:
$dom = new DOMDocument;
$dom->loadHTML('<img onload="alert(\'hello world\');" onclick="alert(\'hello world\');" />');
$nodes = $dom->getElementsByTagName('*');//just get all nodes,
//$dom->getElementsByTagName('img'); would work, too
foreach($nodes as $node)
{
if ($node->hasAttribute('onload'))
{
$node->removeAttribute('onload');
}
if ($node->hasAttribute('onclick'))
{
$node->removeAttribute('onclick');
}
}
echo $dom->saveHTML();//will include html, head, body tags and doctype
Tadaa... both onload
and onclick
have been removed from the markup, without the pain of writing a reliable and stable regex, that can deal with in-line JS... As an added bonus, this code will be far more maintainable (and expandable) in the future. I'd much prefer maintaining this code, than having to rework a regular expression somebody wrote a couple of months ago...
If you want, you can echo only the tags you've changed, like so:
$changed = array();
$attributesOfDeath = array('onload', 'onclick');
foreach($nodes as $node)
{
$current = null;
foreach($attributesOfDeath as $attr)
{
if ($node->hasAttribute($attr))
{
$node->removeAttribute($attr);
$current = $node;
}
}
if ($current)
{
$changed[] = $current;//add to changed array
}
}
$changed = array_map(array($dom, 'saveXML'), $changed);
echo implode(PHP_EOL, $changed);
As Jan said, for maintainability it's best to use an array of "forbidden attributes". That's what the $attributesOfDeath
array is for. If you want to, later on, check for a third or fourth attribute, you can simply add that to the array, and nothing else in your code need change. It'll just keep on working as before.
-
\$\begingroup\$ for the purposes of maintainability, you should put
onload
andonclick
in a single array and iterate over it. Otherwise, I agree with the answer. \$\endgroup\$John Dvorak– John Dvorak2013年08月21日 20:33:23 +00:00Commented Aug 21, 2013 at 20:33 -
\$\begingroup\$ @JanDvorak: Not only that, though I did think about doing that in my code example: when using this code, it might be worth while using a class (assigning the
DOMDocument
instance to a property, so as to not create a new instance on each function call). Anyway, I've edited my answer, and now use an$attributesOfDeath
array in my second example \$\endgroup\$Elias Van Ootegem– Elias Van Ootegem2013年08月21日 20:37:18 +00:00Commented Aug 21, 2013 at 20:37
Regex to Strip all Inline JS
You can use the following Regex to Strip off Inline JS
/\bon\w+=\S+(?=.*>)/g
onload
oronlclick
attribute is found remove it \$\endgroup\$