-
I've finally done some very basic benching of regex-filtered against FilteredRE2: ua-parser/uap-rust@1af15c1
After fixing a dumb mistake, it's basically on par in terms of runtime (and that's not even its final form).
However there's another standout issue: the RSS / memory footprint of the regex-filtered version is much higher than that of the re2 version. Simple tracking of the RSS over phases using memory-stats shows it is largely attributable to regex; this was confirmed by just loading all the regexes into a Vec<Regex> and finding that the RSS climbs from ~2MB (after loading the arguments and reading the 633 regex strings from the file) to ~100MB, and then a bit further to ~140MB when matching, which I assume is because caches come into play then.
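For context, the phase tracking is nothing fancier than something along these lines (a rough sketch rather than the actual bench code; it assumes the memory-stats crate's memory_stats() accessor, and the regexes.txt path and probe string are illustrative):

```rust
// Rough sketch of phase-by-phase RSS tracking, assuming the memory-stats
// crate. File name and probe string are illustrative, not the bench harness.
use memory_stats::memory_stats;
use regex::Regex;

fn rss(label: &str) {
    if let Some(stats) = memory_stats() {
        println!("{label}: RSS = {} MB", stats.physical_mem / (1024 * 1024));
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    rss("startup");

    // One pattern per line, e.g. the 633 regexes mentioned above.
    let patterns: Vec<String> = std::fs::read_to_string("regexes.txt")?
        .lines()
        .map(str::to_owned)
        .collect();
    rss("patterns read");

    // Compile everything up front, as the current code does.
    let regexes: Vec<Regex> = patterns
        .iter()
        .map(|p| Regex::new(p))
        .collect::<Result<_, _>>()?;
    rss("regexes compiled");

    // Run each regex once so the lazy-DFA caches get populated.
    let hits = regexes.iter().filter(|re| re.is_match("some probe string")).count();
    rss(&format!("after matching ({hits} hits)"));
    Ok(())
}
```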
I looked at the different features of regex, and the only knob which seemed to influence RSS is the dfa feature: disabling it brings RSS down to ~50MB before matching, which is similar to re2 (though caches then tack on ~20MB when matching occurs). Sadly, it's also very much load-bearing, and disabling it makes the runtime shoot up from 46 seconds to 157 seconds on my machine (for 100 iterations over 75158 strings, using 633 regexes).
And so I wanted to know: is there a knob I missed? Did I do something wrong with regex::bytes or something? Or is that just the tradeoff? I assume regex-lite would also be an option, but wouldn't really be competitive with re2?
In fairness I can't say it's of major importance (the runtime seems a lot more relevant), but it's useful information to stash away, given that the use case for regex-filtered would routinely be to load hundreds to thousands of regexes, and it might inform e.g. lazy-compiling regexes (under the assumption that most of them will never actually be needed).
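By lazy compilation I mean something along these lines (just a sketch of the idea, not something regex-filtered does today; the type and its names are made up for illustration):

```rust
// Sketch of the lazy-compilation idea: keep the pattern as a string and
// only pay the compilation (and memory) cost the first time an entry is
// actually used. Names are made up for illustration.
use std::sync::OnceLock;
use regex::Regex;

struct LazyRegex {
    pattern: String,
    compiled: OnceLock<Regex>,
}

impl LazyRegex {
    fn new(pattern: impl Into<String>) -> Self {
        Self { pattern: pattern.into(), compiled: OnceLock::new() }
    }

    fn is_match(&self, haystack: &str) -> bool {
        // Compilation happens on first use only; entries the prefilter
        // never routes a haystack to are never compiled at all.
        self.compiled
            .get_or_init(|| Regex::new(&self.pattern).expect("invalid pattern"))
            .is_match(haystack)
    }
}
```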
-
> After fixing a dumb mistake, it's basically on par in terms of runtime (and that's not even its final form).
Out of curiosity, why not just implement the sparse set yourself? It's not a lot of code and it's pretty simple.
As to your question... It's hard to say. If you could show an MRE Rust program that uses a lot more memory than an equivalent C++ program using RE2, then I could likely give you a better answer.
Otherwise, the main knob to turn is probably to disable Unicode mode. The regex crate and RE2 have different support for Unicode. While there are a lot of differences in the details, probably the biggest one is that the regexes \w, \d and \s are all Unicode-aware in Rust but specifically ASCII-only in RE2. This means that things like \w{100} are a lot bigger in Rust than in RE2.
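If you want to see that concretely, one way is to compare the compiled NFA sizes of the two spellings with the lower-level regex-automata crate (a sketch, assuming its thompson::NFA::new and memory_usage APIs):

```rust
// Sketch: compare the compiled NFA sizes of the Unicode-aware and the
// ASCII-only versions of \w{100}, using the lower-level regex-automata crate.
use regex_automata::nfa::thompson::NFA;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Unicode-aware \w has to encode thousands of codepoints as UTF-8
    // byte ranges, repeated 100 times...
    let unicode = NFA::new(r"\w{100}")?;
    // ...while the ASCII-only version (what RE2 uses) stays small.
    let ascii = NFA::new(r"(?-u)\w{100}")?;

    println!("unicode \\w{{100}}: {} bytes", unicode.memory_usage());
    println!("ascii   \\w{{100}}: {} bytes", ascii.memory_usage());
    Ok(())
}
```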
Also, in a multi-threaded environment, I would generally expect the regex crate to use more memory than RE2. RE2's lazy DFA makes use of one shared cache across all threads for a single lazy DFA, whereas the regex crate uses one cache per thread. But if you aren't measuring memory in a multi-threaded benchmark, this difference shouldn't come up.
There could also be memory usage related bugs. For example, I haven't looked at #1116 yet.
Otherwise it's very difficult to answer your question in the abstract. I would really want an MRE to give a more precise answer. (For example, maybe there's something different about how the NFAs are represented? Or maybe there's something causing a prefilter somewhere to be bigger than one might expect?) In general, there is a lot of pressure on runtime performance in the form of benchmarks, but very little on memory usage. So there could very well be gains to be had here.
-
I saw a reply by email but don't see it here.
Yeah, sorry: I posted an early version of the comment, but realised I'd forgotten to take note of the old "accumulative" version of the bench script, so I went to rewrite it before posting again.
I'm on mobile, but I would suggest wrapping the global allocator if you want more precise measurements. Otherwise you're probably just going to be doing battle with the optimizations performed by the allocator itself. See how dtolnay did it here: #1059 (comment)
It's my understanding that that's what jemalloc stats are supposed to surface. But I can always try using a simpler proxy allocator, given dtolnay posted the entire thing. I'll give that a shot.
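For reference, the kind of wrapper being suggested looks roughly like this (a from-memory sketch in the spirit of the linked comment, not dtolnay's exact code):

```rust
// Counting allocator sketch: wrap the system allocator and track live /
// cumulative allocation sizes with atomics. realloc and alloc_zeroed use
// GlobalAlloc's default implementations, which route through alloc/dealloc.
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

static LIVE: AtomicUsize = AtomicUsize::new(0);
static TOTAL: AtomicUsize = AtomicUsize::new(0);

struct Counting;

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        LIVE.fetch_add(layout.size(), Relaxed);
        TOTAL.fetch_add(layout.size(), Relaxed);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        LIVE.fetch_sub(layout.size(), Relaxed);
        System.dealloc(ptr, layout);
    }
}

#[global_allocator]
static ALLOC: Counting = Counting;

fn main() {
    let _re = regex::Regex::new(r"\w{100}").unwrap();
    println!(
        "live: {} bytes, cumulative: {} bytes",
        LIVE.load(Relaxed),
        TOTAL.load(Relaxed),
    );
}
```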
-
Dammit, from the documentation (and friends' suggestions) I assumed jemalloc's stats.allocated would be... the allocated stuff. But apparently it's doing weird things behind the scenes indeed; custom allocators do seem to be a lot more reliable. I put up https://github.com/masklinn/remem which provides the regexes dataset and test scripts. I tried to get the evaluation / testing programs similar enough that comparisons mostly hold, but I'll readily admit that my C++ is really far from good, and some of the behaviours I don't understand.
Anyway, setting aside the unicode stuff: [per-pattern measurements omitted here; they covered two of the initial entries and a simpler case ignoring the capture memory].
On that simpler case specifically, regex actually does quite well on repetitions, except for the repetition + capture unicode version; however, since its baseline is 10x that of re2, it's still quite a ways behind.
-
For what it's worth, although the upstream of what I'm using regex-filtered for does not specify its regex semantics, the reference implementation is in javascript, which uses strict ASCII for its perl-style character classes, and as you wrote in your initial answer re2 also documents strict ASCII semantics for the same. So, while not touching regex-filtered itself, I added a downgrade step to the sub-project which converts \d, \D, \w, and \W to the ASCII sets documented by MDN / re2 (\s / \S doesn't seem significantly affected by the issue). Thanks, incidentally, for regex supporting nested enumerated character classes; that made the conversion a lot less fraught, as I need to care a lot less about the context in which it happens.
Through log-diving I also confirmed that the extensive use of bounded repetitions is not actually there to bound repetitions but to mitigate catastrophic backtracking concerns, so, while that's a lot iffier, the downgrade filter also converts "large" bounded repetitions to unbounded ones (as it is my understanding that finite-automata engines like regex are not affected).
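In spirit the two conversions amount to no more than the following (a simplified sketch rather than the actual code: it ignores escaped backslashes, and the ASCII sets and the repetition threshold of 100 are illustrative choices of mine):

```rust
use regex::{Captures, Regex};

/// Rewrite Unicode-aware Perl classes to the ASCII sets documented by
/// MDN / re2. Because regex supports nested character classes, the same
/// replacement works both inside and outside bracketed classes.
/// (Naive: does not handle escaped backslashes such as a literal \\d.)
fn downgrade_classes(pattern: &str) -> String {
    pattern
        .replace(r"\d", "[0-9]")
        .replace(r"\D", "[^0-9]")
        .replace(r"\w", "[A-Za-z0-9_]")
        .replace(r"\W", "[^A-Za-z0-9_]")
}

/// Turn "large" bounded repetitions (kept upstream only to appease
/// backtracking engines) back into unbounded ones.
fn unbound_large_repetitions(pattern: &str) -> String {
    let rep = Regex::new(r"\{(\d+),(\d+)\}").unwrap();
    rep.replace_all(pattern, |caps: &Captures| {
        let (min, max): (u32, u32) = (caps[1].parse().unwrap(), caps[2].parse().unwrap());
        if max >= 100 {
            format!("{{{min},}}")
        } else {
            caps[0].to_string()
        }
    })
    .into_owned()
}
```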
This results in a ~50% reduction in peak memory footprint on processing a sample file, and despite the additional prefiltering (and allocations) a ~10% reduction in runtime.
Feel free to close this discussion if you don't see a point to keeping it open.
PS: no idea what could be done about the amplification caused by bounded repetitions and captures, but maybe RegexBuilder could grow a flag to make perl-style character classes work off of ASCII semantics? That would make equivalent conversions (and honest but misguided comparisons) from engines with lesser unicode support slightly easier not to screw up. I know that regex supports ASCII-only classes via the POSIX-style classes, but Perl-style classes seem better supported (e.g. neither JS nor Python's re supports POSIX-style). For prior art, Python has re.ASCII, which gimps perl-style character classes down to ASCII semantics.
I know that regex::bytes can be used for some of it, but because it only works on bytes it's a bit of a downgrade in textual contexts.
-
With respect to Perl style classes being ASCII, that's already supported: just disable Unicode mode. Either globally, or locally within the regex: (?-u:\w) is equivalent to [[:word:]]. Perhaps what you want is a mode that only downgrades the Perl style classes to ASCII and leaves everything else be. I'm unsure about that personally.
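A quick illustration of the two spellings, in case it helps:

```rust
// The Unicode-aware and ASCII-only word classes disagree on a non-ASCII
// letter, and (?-u:\w) matches the same set as [[:word:]].
use regex::Regex;

fn main() {
    let unicode = Regex::new(r"^\w$").unwrap();
    let ascii = Regex::new(r"^(?-u:\w)$").unwrap();
    let posix = Regex::new(r"^[[:word:]]$").unwrap();

    assert!(unicode.is_match("é"));
    assert!(!ascii.is_match("é"));
    assert!(!posix.is_match("é"));
    assert!(ascii.is_match("a") && posix.is_match("a"));
}
```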
The simplifications you made to your regexes sound good to me. I had no idea that the bounds were added not for semantics, but to deal with backtrackers. That is interesting!
I do think a more careful analysis of the memory footprint differences between this crate and RE2 is warranted, but I don't currently consider it a high priority.
-
> With respect to Perl style classes being ASCII, that's already supported: just disable Unicode mode. Either globally, or locally within the regex: (?-u:\w) is equivalent to [[:word:]]. Perhaps what you want is a mode that only downgrades the Perl style classes to ASCII and leaves everything else be.
Yes, that is indeed what I meant, as the upstream in this case uses the . metacharacter a fair bit (and in general I assume people would commonly find that in existing regexes for projects they're trying to move over to regex). I don't know if there are other metacharacters with that issue in the set, but I know that one has tripped me up pretty much every time I've tried to use -u with the string version of the API.
> I had no idea that the bounds were added not for semantics, but to deal with backtrackers. That is interesting!
Yeah, I had no idea either; I only found out when I went log-diving to see if I could find explanations of the semantics requirements and stumbled on a commit converting a bunch of unbounded repetitions to bounded ones.