Comments

Url: optimized unescape() performance#32

Closed

JanTvrdik wants to merge 1 commit intonette:master from

JanTvrdik:url_unescape_performance

Closed

Url: optimized unescape() performance #32
JanTvrdik wants to merge 1 commit intonette:master from
JanTvrdik:url_unescape_performance

Conversation

@JanTvrdik

Copy link

Contributor

@JanTvrdik JanTvrdik commented Dec 18, 2014

Currently unescape() can be used to perform DoS on Nette application if HTTP server does not restrict URL length.

Benchmark

const MAX_LENGTH = 256 * 1024;
$datasets = [
 'a' => ['%66%6F%6F', '', 'foo'],
 'b' => ['%66%6F%6F', 'o', 'f%6F%6F'],
 'c' => ['%66%6F%6F', 'f', '%66oo'],
 'd' => ['%66%6F%6F', 'fo', '%66%6F%6F'],
 'e' => ['%66%6f%6f', 'fo', '%66%6f%6f'],
 'no-escapes' => [str_repeat('a', MAX_LENGTH), '', str_repeat('a', MAX_LENGTH)],
 'escapes' => [str_repeat('%40', MAX_LENGTH / 3), '', str_repeat('@', MAX_LENGTH / 3)],
 'escapes2' => [str_repeat('%40', MAX_LENGTH / 6 - 3) . '%20' . str_repeat('%40', MAX_LENGTH / 6 - 3), '', str_repeat('@', MAX_LENGTH / 6 - 3) . '%20' . str_repeat('@', MAX_LENGTH / 6 - 3)],
 'escapes3' => [str_repeat('%40', MAX_LENGTH / 3), '@', str_repeat('%40', MAX_LENGTH / 3)],
 'escapes4' => [str_repeat('%40X', MAX_LENGTH / 4), '@', str_repeat('%40X', MAX_LENGTH / 4)],
 'escapes5' => [str_repeat('%40X%41Y', MAX_LENGTH / 8), '@', str_repeat("%40XAY", MAX_LENGTH / 8)],
];
$tests = [
 'old' => function ($testCount, $s, $reserved, $expected) {
 for ($i = 0; $i < $testCount; $i++) {
 $actual = $s;
 preg_match_all('#(?<=%)[a-f0-9][a-f0-9]#i', $s, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
 foreach (array_reverse($matches) as $match) {
 $ch = chr(hexdec($match[0][0]));
 if (strpos($reserved, $ch) === FALSE) {
 $actual = substr_replace($actual, $ch, $match[0][1] - 1, 3);
 }
 }
 }
 Assert::same($expected, $actual);
 },
 'new' => function ($testCount, $s, $reserved, $expected) {
 for ($i = 0; $i < $testCount; $i++) {
 $actual = $s;
 preg_match_all('#%[a-f0-9][a-f0-9]#i', $s, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER);
 foreach (array_reverse($matches) as $match) {
 $ch = chr(hexdec(substr($match[0][0], 1)));
 if (strpos($reserved, $ch) === FALSE) {
 $actual = substr_replace($actual, $ch, $match[0][1], 3);
 }
 }
 }
 Assert::same($expected, $actual);
 },
 'split' => function ($testCount, $s, $reserved, $expected) {
 for ($i = 0; $i < $testCount; $i++) {
 if ($reserved === '') {
 $actual = rawurldecode($s);
 } else {
 $pattern = '#((?:%(?:' . implode('|', str_split(bin2hex($reserved), 2)) . '))++)#i';
 $parts = preg_split($pattern, $s, -1, PREG_SPLIT_DELIM_CAPTURE);
 $max = count($parts);
 for ($j = 0; $j < $max; $j += 2) {
 $parts[$j] = rawurldecode($parts[$j]);
 }
 $actual = implode('', $parts);
 }
 }
 Assert::same($expected, $actual);
 },
 'split-limited' => function ($testCount, $s, $reserved, $expected) {
 for ($i = 0; $i < $testCount; $i++) {
 if ($reserved === '') {
 $actual = rawurldecode($s);
 } else {
 $actual = $s;
 $pattern = '#((?:%(?:' . implode('|', str_split(bin2hex($reserved), 2)) . '))++)#i';
 $res = array();
 do {
 $parts = preg_split($pattern, $actual, 1024, PREG_SPLIT_DELIM_CAPTURE);
 $actual = isset($parts[2046]) ? $parts[2046] : '';
 unset($parts[2046]);
 $max = count($parts);
 for ($j = 0; $j < $max; $j += 2) {
 $parts[$j] = rawurldecode($parts[$j]);
 }
 $res[] = implode('', $parts);
 } while ($actual !== '');
 $actual = implode('', $res);
 }
 }
 Assert::same($expected, $actual);
 },
 'replace_callback' => function ($testCount, $s, $reserved, $expected) {
 for ($i = 0; $i < $testCount; $i++) {
 if ($reserved === '') {
 $actual = rawurldecode($s);
 } else {
 $pattern = '#(?:%(?!' . implode('|', str_split(bin2hex($reserved), 2)) . ')[0-9a-f]{2})++#i';
 $actual = preg_replace_callback(
 $pattern,
 function ($m) { return rawurldecode($m[0]); },
 $s
 );
 }
 }
 Assert::same($expected, $actual);
 },
];
$testCount = 10;
$padLength = max(array_map('strlen', array_keys($datasets))) + 3;
set_time_limit(0);
ini_set('memory_limit', '512M');
foreach ($tests as $testName => $testCallback) {
 printf("Test %s:\n", $testName);
 foreach ($datasets as $datasetName => $dataset) {
 $time = -microtime(TRUE);
 $testCallback($testCount, ...$dataset);
 $time += microtime(TRUE);
 printf(" %s%5.0f ms\n", str_pad($datasetName, $padLength), $time * 1e3);
 }
 printf("\n");
}

Results:

Test old:
 a 0 ms
 b 0 ms
 c 0 ms
 d 0 ms
 e 0 ms
 no-escapes 84 ms
 escapes 11424 ms
 escapes2 12724 ms
 escapes3 2238 ms
 escapes4 1685 ms
 escapes5 8341 ms
Test new:
 a 0 ms
 b 0 ms
 c 0 ms
 d 0 ms
 e 0 ms
 no-escapes 1 ms
 escapes 14472 ms
 escapes2 15174 ms
 escapes3 2349 ms
 escapes4 1748 ms
 escapes5 13632 ms
Test split:
 a 0 ms
 b 0 ms
 c 1 ms
 d 0 ms
 e 0 ms
 no-escapes 4 ms
 escapes 71 ms
 escapes2 118 ms
 escapes3 29 ms
 escapes4 670 ms
 escapes5 375 ms
Test split-limited:
 a 0 ms
 b 0 ms
 c 0 ms
 d 0 ms
 e 0 ms
 no-escapes 3 ms
 escapes 71 ms
 escapes2 117 ms
 escapes3 30 ms
 escapes4 524 ms
 escapes5 314 ms
Test replace_callback:
 a 0 ms
 b 0 ms
 c 0 ms
 d 1 ms
 e 0 ms
 no-escapes 3 ms
 escapes 72 ms
 escapes2 107 ms
 escapes3 50 ms
 escapes4 36 ms
 escapes5 309 ms

@JanTvrdik JanTvrdik force-pushed the url_unescape_performance branch 5 times, most recently from 5692048 to 60209fc Compare

December 18, 2014 12:16

@fprochazka

Copy link

Contributor

fprochazka commented Dec 18, 2014

Could you also put those cases from the description in the test file?

@JanTvrdik

Copy link

Contributor Author

JanTvrdik commented Dec 18, 2014

You mean the ugly long ones?

Another thing to consider: Url::unescape is used by Url::canonicalize however currently Url::canonicalize does not unify %aa and %AA. We may modify Url::unescape to this for us, e.g.

for ($i = 1; $i < count($parts); $i += 2) {
 $parts[$i] = strtoupper($parts[$i]);
}

The performance impact if we choose to do this:

Test split with strtoupper:
 a 0 ms
 b 0 ms
 c 0 ms
 d 0 ms
 no-escapes (a) 1 ms
 no-escapes (z) 1 ms
 escapes 9 ms
 escapes2 13 ms
 escapes3 11 ms
 escapes4 156 ms

@fprochazka

Copy link

Contributor

fprochazka commented Dec 18, 2014

Yeah, I think the extreme cases should also be there.

It's still a win :) Much better than before

@JanTvrdik

Copy link

Contributor Author

JanTvrdik commented Dec 18, 2014

Btw. I still don't understand the time complexity of those implementations. If I change in the escapes2 dataset the length from 1e5 / 2 to 1e5 (ie. make it twice longer), the new and old tests would became 10 times slower. Maybe I exceed some internal buffer size or I don't know.

It also means that if your web server does not refuse 600KB long URL, you are currently pretty much screwed.

Edit: 1e5 * 2 takes over 120s and a lot over of memory as well (I had to increase the limit from 128M to 512M).

Note that my implementation (split) is still vulnerable to escape4 input. For 2MB URL it takes about 3 second and 400 MB of memory. We should consider setting a hard limit on URL length to avoid this vulnerability because I don't know how to make it any more efficient.

I though about it a bit more and the memory could be optimized by using $limit parameter of preg_split, e.g.

if ($reserved === '') {
 $s = rawurldecode($s);
} else {
 $pattern = '#((?:%(?:' . implode('|', str_split(bin2hex($reserved), 2)) . '))++)#i';
 $res = '';
 do {
 $parts = preg_split($pattern, $s, 1024, PREG_SPLIT_DELIM_CAPTURE);
 $s = isset($parts[2046]) ? $parts[2046] : '';
 unset($parts[2046]);
 for ($j = 0; $j < count($parts); $j += 2) {
 $parts[$j] = rawurldecode($parts[$j]);
 }
 $res .= implode('', $parts);
 } while ($s !== '');
 $s = $res;
}

This makes it for the 2MB worst-case input (escape4) about 2 times slower but the memory consumption is reduces to from 400 MB to 15 MB. Still hard limit seems like the best option. Any tips where to draw the line?

I'll continue talking to myself. I don't think that we should ever exceed 1 second runtime and 10 MB memory usage on this. Therefore the hard limit on URL length should be 256 KiB (which takes 71 ms and 10 MB of memory).

@JanTvrdik JanTvrdik force-pushed the url_unescape_performance branch 2 times, most recently from 9426286 to cab8785 Compare

December 18, 2014 14:28

@JanTvrdik

Copy link

Contributor Author

JanTvrdik commented Dec 18, 2014

What I love about improving performance is that after you finished one fast implementation – you got an idea how to make it even faster

for ($i = 0; $i < $testCount; $i++) {
 if ($reserved === '') {
 $s = rawurldecode($s);
 } else {
 $pattern = '#(?:%(?!' . implode('|', str_split(bin2hex($reserved), 2)) . ')[0-9a-f][0-9a-f])++#i';
 $s = preg_replace_callback(
 $pattern,
 function ($m) { return rawurldecode($m[0]); },
 $s
 );
 }
}

@xificurk

Copy link

xificurk commented Dec 18, 2014

I don't think it should be Nette's responsibility to limit the URL length. All of the major webservers (apache, nginx, IIS) have by default a limit on URL length, all in the range of couple of kilobytes, therefore your suggested hard limit of 256 KiB is much higher than the defaults. And I assume that if someone changes these defaults then he knows what he's doing.

@JanTvrdik JanTvrdik force-pushed the url_unescape_performance branch from cab8785 to f4c06de Compare

December 18, 2014 15:15

@JanTvrdik

Copy link

Contributor Author

JanTvrdik commented Dec 18, 2014

hard limit of 256 KiB is much higher than the defaults

That is intentionally, this limit should be never reached. The question is whether we should rely server configuration or whether Nette should enforce its own limit (large enough to not affect any non-dos requests) as a security fallback.

@JanTvrdik JanTvrdik force-pushed the url_unescape_performance branch from f4c06de to 8805236 Compare

December 18, 2014 15:42

@JanTvrdik


 Url: optimized unescape() performance

b05aa50

@JanTvrdik JanTvrdik force-pushed the url_unescape_performance branch from 8805236 to b05aa50 Compare

December 18, 2014 15:48

@fprochazka

Copy link

Contributor

fprochazka commented Dec 18, 2014

If I'm not mistaken, making it 256KiB means having the limit ~128x larger than what a browser can handle. I have no problem with such a limit.

@JanTvrdik

Copy link

Contributor Author

JanTvrdik commented Dec 18, 2014

@fprochazka See http://technomanor.wordpress.com/2012/04/03/maximum-url-size/, every browser except for IE can handle URL with 100 KiB in size.

@fprochazka

Copy link

Contributor

fprochazka commented Dec 18, 2014

@JanTvrdik good to know, In that case, making it 256KiB is still a good limit. I would make it a static property (I can't believe I'm suggesting this) or an argument in that function, so it can be changed if neccesary.

@JanTvrdik

Copy link

Contributor Author

JanTvrdik commented Dec 18, 2014

@fprochazka If we choose to not trust the web servers and enforce the limit, it should certainly be in RequestFactory and handle the same way as any other invalid input, i.e. the same way as #30.

Therefore this can be merged regardless of whether we choose to impose a limit on the URL length. cc @dg

@fprochazka

Copy link

Contributor

fprochazka commented Dec 18, 2014

In that case, this optimalization should be solved by it's own and merged. And new issue for the limit should be created. And after the issues we're discussing in #30 are resolved, then we can add the limit.

@dg

Copy link

Member

dg commented Dec 18, 2014

Maybe micro-optimization for empty $reserved is unnecessary (is not used by Nette) and can be replaced with another micro-optimization: elimination of "slow" ternary operator in PHP 5.3.

@JanTvrdik

Copy link

Contributor Author

JanTvrdik commented Dec 18, 2014

Its not so much a micro-optimization as that the code can not handle empty $reserved. If you see a way how to make the code work with empty $reserved we can remove the condition.

Regarding framework currently not using empty $reserved – I was thinking about changing the second line in canonicalize() to use unescape to improve consistency with the other two lines.

@dg

Copy link

Member

dg commented Dec 18, 2014

I didn't realize that it will not work with empty $reserved.

What about unify %aa and %AA?

(It can be maybe done by replacing all reserved chars %aa with %25AA and converting whole result with rawurldecode)

@JanTvrdik

Copy link

Contributor Author

JanTvrdik commented Dec 18, 2014

Regarding unification – That could be easily done with the split algorithm but it is much more memory intensive than replace_callback and split-limited is slower and still uses a lot more memory than replace_callback.

The only way I currently see with replace_callback is a second call to preg_replace_callback.

It could be done with a single call but it is still slow:

$xx = implode('|', str_split(bin2hex($reserved), 2));
$actual = preg_replace_callback(
 '#(?:(?:%(?!' . $xx . ')[0-9a-f]{2})++)|((?:%(?:' . $xx . '))++)#i',
 function ($m) {
 return !empty($m[1]) ? strtoupper($m[0]) : rawurldecode($m[0]);
 },
 $s
);

@dg

Copy link

Member

dg commented Dec 21, 2014

@JanTvrdik this way (#32 (comment))

 if ($reserved !== '') {
 $s = preg_replace_callback(
 '#%(' . substr(chunk_split(bin2hex($reserved), 2, '|'), 0, -1) . ')#i', 
 function($m) { return '%25' . strtoupper($m[1]); }, 
 $s
 );
 }
 $s = rawurldecode($s);

@dg dg mentioned this pull request

Dec 21, 2014

unescape optimized #38

Merged

@JanTvrdik

Copy link

Contributor Author

JanTvrdik commented Dec 21, 2014

Good job! I didn't thought about double escaping.

@dg dg closed this in #38

Dec 21, 2014

@JanTvrdik JanTvrdik deleted the url_unescape_performance branch

December 21, 2014 22:31

Uh oh!

Comments

Conversation

@JanTvrdik JanTvrdik commented Dec 18, 2014

Uh oh!

fprochazka commented Dec 18, 2014

Uh oh!

JanTvrdik commented Dec 18, 2014

Uh oh!

fprochazka commented Dec 18, 2014

Uh oh!

JanTvrdik commented Dec 18, 2014

Uh oh!

JanTvrdik commented Dec 18, 2014

Uh oh!

xificurk commented Dec 18, 2014

Uh oh!

JanTvrdik commented Dec 18, 2014

Uh oh!

fprochazka commented Dec 18, 2014

Uh oh!

JanTvrdik commented Dec 18, 2014

Uh oh!

fprochazka commented Dec 18, 2014

Uh oh!

JanTvrdik commented Dec 18, 2014

Uh oh!

fprochazka commented Dec 18, 2014

Uh oh!

dg commented Dec 18, 2014

Uh oh!

JanTvrdik commented Dec 18, 2014

Uh oh!

dg commented Dec 18, 2014

Uh oh!

JanTvrdik commented Dec 18, 2014

Uh oh!

dg commented Dec 21, 2014

Uh oh!

JanTvrdik commented Dec 21, 2014

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants