Why does iteration with bytes::Regex yield empty matches that can split a codepoint, even when Unicode mode is enabled?
#1276
-
What version of regex are you using?
1.11.1
Describe the bug at a high level.
When using regex::bytes with Unicode mode enabled (https://docs.rs/regex/latest/regex/bytes/struct.RegexBuilder.html#method.unicode), iterating over matches does not respect Unicode character boundaries; it iterates over the raw bytes instead.
What are the steps to reproduce the behavior?
let re = regex::bytes::RegexBuilder::new(r"").unicode(true).build().unwrap();
let subject = "😃".as_bytes(); // I.e. U+1F603
assert_eq!(subject, b"\xF0\x9F\x98\x83"); // 4 UTF-8 bytes
for m in re.find_iter(subject) {
    println!("{:?}", m);
}
let res = re.replace_all(subject, b"<$0>");
println!("{}", String::from_utf8_lossy(&res));
What is the actual behavior?
The above prints
Match { start: 0, end: 0, bytes: "" }
Match { start: 1, end: 1, bytes: "" }
Match { start: 2, end: 2, bytes: "" }
Match { start: 3, end: 3, bytes: "" }
Match { start: 4, end: 4, bytes: "" }
<>�<>�<>�<>�<>
In other words, both find_iter and replace_all are operating on the individual byte level and not the UTF-8 character level.
What is the expected behavior?
I expected the above to print:
Match { start: 0, end: 0, string: "" }
Match { start: 4, end: 4, string: "" }
<>😃<>
which is exactly what happens when I use a regex::RegexBuilder instead of a regex::bytes::RegexBuilder.
If I change the regex from "" to ".", both work properly:
Match { start: 0, end: 4, string: "😃" }
<😃>
You may just tell me "don't use regex::bytes", but that is not a solution if what I'm matching over has mixed valid and invalid UTF-8, whereas "." works correctly there.
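To make the mixed valid/invalid case concrete, here is a minimal standard-library sketch of one possible way to split such a haystack into units. It does not use the regex crate and is not a description of how regex behaves. The helper name `unit_boundaries` is hypothetical, and the rule it applies (each valid codepoint is one unit, each invalid sequence reported by `std::str::from_utf8` is one unit) is an assumption chosen for illustration.

```rust
use std::str;

/// Hypothetical helper: returns the start offset of every unit in `haystack`,
/// followed by the end offset. A unit is either one valid codepoint or one
/// invalid byte sequence as reported by `Utf8Error::error_len`.
fn unit_boundaries(haystack: &[u8]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut base = 0; // offset of `rest` within the original haystack
    let mut rest = haystack;
    loop {
        match str::from_utf8(rest) {
            Ok(s) => {
                // The remainder is valid: record each codepoint start, then the end.
                out.extend(s.char_indices().map(|(i, _)| base + i));
                out.push(base + rest.len());
                return out;
            }
            Err(e) => {
                let valid = e.valid_up_to();
                // The prefix up to `valid` is guaranteed to be valid UTF-8.
                let s = str::from_utf8(&rest[..valid]).unwrap();
                out.extend(s.char_indices().map(|(i, _)| base + i));
                // The invalid sequence starts a unit of its own.
                out.push(base + valid);
                // `None` means the input ended in the middle of a sequence.
                let skip = e.error_len().unwrap_or(rest.len() - valid);
                base += valid + skip;
                rest = &rest[valid + skip..];
            }
        }
    }
}

fn main() {
    // 'a', one invalid byte, then U+1F603 encoded as four bytes.
    let haystack = b"a\xFF\xF0\x9F\x98\x83";
    println!("{:?}", unit_boundaries(haystack)); // [0, 1, 2, 6]
}
```

Under this particular rule, empty matches would be expected at offsets 0, 1, 2 and 6 for the haystack above. Other rules for the invalid parts are equally defensible.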
Replies: 2 comments 2 replies
-
I noticed this issue was fixed in the similar rust-pcre2 crate by the pull request BurntSushi/rust-pcre2#36. So maybe you can do something similar?
-
I'm not convinced that PR is correct. It at the very least has unspecified behavior when UTF mode is enabled and the haystack isn't valid UTF-8.
-
This is very intentional behavior, although I note that it isn't documented. A great deal of care and attention is paid to this in the implementation inside of regex-automata. Notably, the semantics you want are called "UTF-8 mode," which is distinct and orthogonal from Unicode mode. UTF-8 mode has to do with guaranteeing that match spans always fall on UTF-8 boundaries. Unicode mode has to do with whether the regex pattern has Unicode features available. For example, see:
And special handling of empty matches when UTF-8 mode is enabled versus not:
regex/regex-automata/src/hybrid/dfa.rs
Lines 588 to 618 in 1a069b9
And then finally, you can read this module dedicated to handling this case for all of the engines inside of regex-automata:
If you read the above, you'll notice that enabling UTF-8 mode while providing a haystack that is invalid UTF-8 results in unspecified behavior. Specifically, the reason for unspecified behavior is that the "is char boundary" predicate has unspecified behavior:
regex/regex-automata/src/util/utf8.rs
Lines 117 to 137 in 1a069b9
I don't know if you require UTF-8 mode on a haystack that is invalid UTF-8, but if so, it requires reckoning with what it means to be a codepoint boundary on arbitrary byte sequences. The unspecified behavior in regex-automata may be acceptable to you. In which case, you can use meta::Regex directly.
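As a rough illustration of the distinction, the following standard-library sketch filters the empty-match offsets through a boundary predicate. It is not the regex-automata code. The helper names are hypothetical, and the predicate (an offset is a boundary if it is the end of the haystack or the byte there is not a UTF-8 continuation byte) is an assumption chosen for the sketch.

```rust
/// Hypothetical predicate: offset `i` counts as a boundary when it is the end
/// of the haystack or the byte at `i` is not a continuation byte (0b10xx_xxxx).
fn is_boundary(haystack: &[u8], i: usize) -> bool {
    match haystack.get(i) {
        None => i == haystack.len(),
        Some(&b) => (b & 0b1100_0000) != 0b1000_0000,
    }
}

/// Offsets where the empty pattern matches when every byte offset is allowed.
fn empty_matches_bytewise(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len()).collect()
}

/// Offsets that remain when empty matches must fall on a boundary.
fn empty_matches_on_boundaries(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&i| is_boundary(haystack, i))
        .collect()
}

fn main() {
    let emoji = "😃".as_bytes(); // F0 9F 98 83
    println!("{:?}", empty_matches_bytewise(emoji)); // [0, 1, 2, 3, 4]
    println!("{:?}", empty_matches_on_boundaries(emoji)); // [0, 4]

    // Invalid UTF-8: a stray continuation byte between two ASCII bytes.
    let stray = b"a\x80b";
    println!("{:?}", empty_matches_on_boundaries(stray)); // [0, 2, 3]
}
```

On valid UTF-8 the filtered offsets are the codepoint boundaries. On the invalid haystack the result is an artifact of this particular predicate, which is why the answer for arbitrary bytes has to be chosen rather than assumed.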
-
Wow, thanks for the detailed reply.
For anyone else who comes across this problem, I found more discussion in #484 (which basically makes my initial issue report a duplicate).
The documentation should really be updated to clarify this.
To quote https://docs.rs/regex/latest/regex/index.html#unicode:
- The top-level Regex runs searches as if iterating over each of the codepoints in the haystack. That is, the fundamental atom of matching is a single codepoint.
- bytes::Regex, in contrast, permits disabling Unicode mode for part of all of your pattern in all cases. When Unicode mode is disabled, then a search is run as if iterating over each byte in the haystack. That is, the fundamental atom of matching is a single byte. (A top-level Regex also permits disabling Unicode and thus matching as if it were one byte at a time, but only when doing so wouldn’t permit matching invalid UTF-8.)
Emphasis added. To me that says that when Unicode mode is enabled, it will iterate over Unicode characters, not single bytes.