I have a place where I have to include an HTML template. The HTML is written by employees only, but I don't want to be careless and include it without escaping or checks :)
It should allow only bare HTML tags, without any attributes.
So no <a href... links and no <div style=... divs or whatever.
My test script:
$string = '
<p>
<strong>Foo</strong>
</p>
<p>Bar</p>
<p>
<strong>Baz</strong> Mmmpf
</p>
<ul>
<li>someting</li>
<li>someting more</li>
<li>even more</li>
</ul>
<p>
<strong>Foo</strong>
</p>
<i>Foo <u>Bar</u></i>Baz
<!-- xss -->
<script>alert(1)</script>
<p onmouseover="alert(1)"></p>
<!-- ... -->
';
$htmlWhitelist = [
'u',
'i',
'p',
'strong',
'ul',
'li',
];
// replace allowed tags with placeholders
// that are not changed by htmlspecialchars()
foreach ($htmlWhitelist as $tag) {
$string = str_replace(
["<{$tag}>", "</{$tag}>"],
["{OPEN}{$tag}{OPEN}", "{CLOSE}{$tag}{CLOSE}"],
$string
);
}
// htmlspecialchars() on everything
$string = htmlspecialchars($string);
// put back the allowed tags
foreach ($htmlWhitelist as $tag) {
$string = str_replace(
["{OPEN}{$tag}{OPEN}", "{CLOSE}{$tag}{CLOSE}"],
["<{$tag}>", "</{$tag}>"],
$string
);
}
I cannot imagine anything going wrong with this, but I would like to ask whether I missed something.
1 Answer
I did some testing and research on this; as far as I know, your script is safe.
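A quick way to reproduce this kind of check is to wrap the question's script in a function and feed it the hostile lines from the test string. The sketch below assumes the name cleanHtmlTagsOnly() (borrowed from the comments further down) and a whitelist passed as a parameter, which is an addition for illustration and not the original layout.

```php
<?php
// Sketch: the question's three steps wrapped in a function.
// The $whitelist parameter is an assumption added for illustration.
function cleanHtmlTagsOnly(string $string, array $whitelist = ['u', 'i', 'p', 'strong', 'ul', 'li']): string
{
    // 1. Swap bare whitelisted tags for placeholders that htmlspecialchars() leaves alone.
    foreach ($whitelist as $tag) {
        $string = str_replace(
            ["<{$tag}>", "</{$tag}>"],
            ["{OPEN}{$tag}{OPEN}", "{CLOSE}{$tag}{CLOSE}"],
            $string
        );
    }

    // 2. Escape everything that is left, including tags that carry attributes.
    $string = htmlspecialchars($string);

    // 3. Turn the placeholders back into real tags.
    foreach ($whitelist as $tag) {
        $string = str_replace(
            ["{OPEN}{$tag}{OPEN}", "{CLOSE}{$tag}{CLOSE}"],
            ["<{$tag}>", "</{$tag}>"],
            $string
        );
    }

    return $string;
}

echo cleanHtmlTagsOnly('<p>ok</p><script>alert(1)</script>'), "\n";
echo cleanHtmlTagsOnly('<p onmouseover="alert(1)"></p>'), "\n";
```

In the second call the opening tag is escaped while the bare </p> is restored, so the output contains a stray closing tag. That is not an injection vector, but it can leave the markup unbalanced.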
However, you should also be aware of how you retrieve the input. For example, what if the string contained a " followed by executable PHP code? That would be an even worse vulnerability than malicious client-side code.
This post states old versions of IE may be vulnerable if your char-set is UTF-7, which it probably isn't.
If <div style...> is entered, it will be escaped, which is what you want. The bare <div> and <span> tags, however, should be included in the whitelist.
I believe any tag can carry an event-handler attribute such as onload=script (e.g. <u onload="script">), so allowing only bare tags is a good idea, as you are already doing.
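The same "bare tags only" rule can also be written without placeholders: escape everything first, then restore only the exact escaped form of each allowed tag. This is a minimal sketch of that alternative, not the code from the question; the function name allowBareTags() is hypothetical.

```php
<?php
// Hypothetical helper: escape first, then un-escape exact bare tags only.
function allowBareTags(string $html, array $whitelist): string
{
    // After this call every "<", ">", "&" and '"' in the input is an entity.
    $escaped = htmlspecialchars($html);

    foreach ($whitelist as $tag) {
        // Only the exact sequences "&lt;tag&gt;" and "&lt;/tag&gt;" are restored,
        // so a tag with any attribute never matches and stays escaped.
        $escaped = str_replace(
            ["&lt;{$tag}&gt;", "&lt;/{$tag}&gt;"],
            ["<{$tag}>", "</{$tag}>"],
            $escaped
        );
    }

    return $escaped;
}

echo allowBareTags('<u onload="script">x</u>', ['u']), "\n";
echo allowBareTags('<u>x</u>', ['u']), "\n";
```

Because the input is escaped before any tag is restored, text that already contains entities such as &lt;u&gt; is double-encoded and cannot be mistaken for an allowed tag.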
I suggest adding div, span, b, br and many other tags to the whitelist.
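One caveat when extending the whitelist: br is a void element, so there is no closing tag to allow, and an exact-match replacement only recognises <br>, not <br/> or <br />. The sketch below handles void tags in a separate list. It uses an escape-then-restore approach instead of the question's placeholders, and the function name and the $voidTags parameter are assumptions made for illustration.

```php
<?php
// Hypothetical helper: paired tags and void tags are whitelisted separately.
function allowBareTagsWithVoid(string $html, array $pairedTags, array $voidTags): string
{
    $escaped = htmlspecialchars($html);

    foreach ($pairedTags as $tag) {
        // Restore bare opening and closing tags, e.g. <div> and </div>.
        $escaped = str_replace(
            ["&lt;{$tag}&gt;", "&lt;/{$tag}&gt;"],
            ["<{$tag}>", "</{$tag}>"],
            $escaped
        );
    }

    foreach ($voidTags as $tag) {
        // Accept <br>, <br/> and <br /> and normalise all of them to <br>.
        // A closing form such as </br> is deliberately left escaped.
        $escaped = str_replace(
            ["&lt;{$tag}&gt;", "&lt;{$tag}/&gt;", "&lt;{$tag} /&gt;"],
            "<{$tag}>",
            $escaped
        );
    }

    return $escaped;
}

echo allowBareTagsWithVoid('a<br />b<br>c</br>', ['b', 'div', 'span'], ['br']), "\n";
```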
Edit: Since you mentioned you use PHP to echo the output, I also suggest testing with PHP functions or variables in the input, such as $_SERVER['REMOTE_ADDR'].
Will the output show the value of that variable, or the literal text?
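That question can be checked with a small sketch. As far as I know, PHP interpolates variables only while parsing a double-quoted (or heredoc) literal in source code, never in string data that arrives at runtime. The names $var, $fromSource and $template are made up for illustration.

```php
<?php
// $var, $fromSource and $template are made-up names for illustration.
$var = 'secret';

// Double-quoted literal in source code: PHP interpolates $var while parsing.
$fromSource = "<span>some html...</span> $var";

// String data as it would arrive at runtime (simulated here with single
// quotes, which do not interpolate): "$var" and "{$var}" stay literal text.
$template = '<span>some html...</span> $var {$var}';

echo $fromSource, "\n"; // contains the value of $var
echo $template, "\n";   // contains the characters "$var" as-is
```

Only something like eval() would make PHP treat the runtime string as code again.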
Indeed it outputs a $var. $string = "<span>some html...</span> $var <div>..."; echo $string; of course executes/includes the variable, I didn't think about that. So somebody who has access to the templates could add </span> ... {$_SERVER['...']} ... to get it printed/executed. How do I prevent that? – cottton, Aug 15, 2019 at 15:45
No wait, forget it. I receive and handle a string. I was confused for a moment. Nothing in the string gets interpreted. I would have to use eval to do so, but of course I don't :D – cottton, Aug 15, 2019 at 15:50
$string? I'd like to see how the output is used.
$string is really retrieved? I'd like to see how you get the user input.
<?php echo cleanHtmlTagsOnly($string); ?> to write it into the output, where cleanHtmlTagsOnly() is the code in the question.