Why does iteration with bytes::Regex yield empty matches that can split a codepoint, even when Unicode mode is enabled? #1276

Answered by BurntSushi
IsaacOscar asked this question in Q&A

What version of regex are you using?

1.11.1

Describe the bug at a high level.

When using regex::bytes with unicode mode enabled (https://docs.rs/regex/latest/regex/bytes/struct.RegexBuilder.html#method.unicode), iterating over matches does not respect unicode character boundaries, but instead iterates over the raw bytes.

What are the steps to reproduce the behavior?

let re = regex::bytes::RegexBuilder::new(r"").unicode(true).build().unwrap();
let subject = "😃".as_bytes(); // I.e. U+1F603
assert_eq!(subject, b"\xF0\x9F\x98\x83"); // 4 UTF-8 bytes
for m in re.find_iter(subject) {
    println!("{:?}", m);
}
let res = re.replace_all(subject, b"<$0>");
println!("{}", String::from_utf8_lossy(&res));

What is the actual behavior?

The above prints

Match { start: 0, end: 0, bytes: "" }
Match { start: 1, end: 1, bytes: "" }
Match { start: 2, end: 2, bytes: "" }
Match { start: 3, end: 3, bytes: "" }
Match { start: 4, end: 4, bytes: "" }
<>�<>�<>�<>�<>

In other words, both find_iter and replace_all are operating on the individual byte level and not the UTF-8 character level.

What is the expected behavior?

I expected the above to print:

Match { start: 0, end: 0, string: "" }
Match { start: 4, end: 4, string: "" }
<>😃<>

which is exactly what happens when I use a regex::RegexBuilder instead of a regex::bytes::RegexBuilder.

If I change the regex from "" to ".", both work properly:

Match { start: 0, end: 4, string: "😃" }
<😃>

You may just tell me "don't use regex::bytes", but that is not a solution if what I'm matching over has mixed valid and invalid UTF-8, whereas "." works correctly there.

I noticed this issue was fixed in the similar rust-pcre2 crate by pull request BurntSushi/rust-pcre2#36. So maybe you can do something similar?

I'm not convinced that PR is correct. It at the very least has unspecified behavior when UTF mode is enabled and the haystack isn't valid UTF-8.

This is very intentional behavior, although I note that it isn't documented. A great deal of care and attention is paid to this in the implementation inside of regex-automata. Notably, the semantics you want are called "UTF-8 mode," which is distinct and orthogonal from Unicode mode. UTF-8 mode has to do with guaranteeing that match spans always fall on UTF-8 boundaries. Unicode mode has to do with whether the regex pattern has Unicode features available. For example, see:

And special handling of empty matches when UTF-8 mode is enabled versus not:

#[inline]
pub fn try_search_fwd(
    &self,
    cache: &mut Cache,
    input: &Input<'_>,
) -> Result<Option<HalfMatch>, MatchError> {
    let utf8empty = self.get_nfa().has_empty() && self.get_nfa().is_utf8();
    let hm = match search::find_fwd(self, cache, input)? {
        None => return Ok(None),
        Some(hm) if !utf8empty => return Ok(Some(hm)),
        Some(hm) => hm,
    };
    // We get to this point when we know our DFA can match the empty string
    // AND when UTF-8 mode is enabled. In this case, we skip any matches
    // whose offset splits a codepoint. Such a match is necessarily a
    // zero-width match, because UTF-8 mode requires the underlying NFA
    // to be built such that all non-empty matches span valid UTF-8.
    // Therefore, any match that ends in the middle of a codepoint cannot
    // be part of a span of valid UTF-8 and thus must be an empty match.
    // In such cases, we skip it, so as not to report matches that split a
    // codepoint.
    //
    // Note that this is not a checked assumption. Callers *can* provide an
    // NFA with UTF-8 mode enabled but produces non-empty matches that span
    // invalid UTF-8. But doing so is documented to result in unspecified
    // behavior.
    empty::skip_splits_fwd(input, hm, hm.offset(), |input| {
        let got = search::find_fwd(self, cache, input)?;
        Ok(got.map(|hm| (hm, hm.offset())))
    })
}
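
To make the skipping step concrete, here is a minimal std-only sketch of the idea. It is not the regex-automata implementation: `skip_splits_fwd_sketch` and `find_empty` are hypothetical names invented for illustration, `find_empty` stands in for a search with the empty pattern, and `is_boundary` is a local copy of the predicate's logic rather than a call into the crate.

```rust
/// Local copy of the boundary check's logic (an assumption for this sketch,
/// not a call into regex-automata, where the function is crate-private).
fn is_boundary(bytes: &[u8], i: usize) -> bool {
    match bytes.get(i) {
        None => i == bytes.len(),
        Some(&b) => b <= 0b0111_1111 || b >= 0b1100_0000,
    }
}

/// Hypothetical stand-in for a search with the empty pattern: it matches
/// at every offset, so the next match always ends at `start` itself.
fn find_empty(haystack: &[u8], start: usize) -> Option<usize> {
    if start <= haystack.len() {
        Some(start)
    } else {
        None
    }
}

/// Simplified analogue of the skipping loop: re-run the search one byte
/// further along until the reported match offset falls on a boundary.
fn skip_splits_fwd_sketch(
    haystack: &[u8],
    start: usize,
    mut find: impl FnMut(&[u8], usize) -> Option<usize>,
) -> Option<usize> {
    let mut at = start;
    while at <= haystack.len() {
        let end = find(haystack, at)?;
        if is_boundary(haystack, end) {
            return Some(end);
        }
        // The match splits a codepoint: advance by one byte and retry.
        at += 1;
    }
    None
}
```

With the haystack from the question (`b"\xF0\x9F\x98\x83"`), a search starting at offset 1 skips the splitting offsets 1, 2 and 3 and reports 4, which is why UTF-8 mode yields only the empty matches at 0 and 4.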

And then finally, you can read this module dedicated to handling this case for all of the engines inside of regex-automata:

https://github.com/rust-lang/regex/blob/1a069b9232c607b34c4937122361aa075ef573fa/regex-automata/src/util/empty.rs

If you read the above, you'll notice that enabling UTF-8 mode while providing a haystack that is invalid UTF-8 results in unspecified behavior. Specifically, the reason for unspecified behavior is that the "is char boundary" predicate has unspecified behavior:

/// Returns true if and only if the given offset in the given bytes falls on a
/// valid UTF-8 encoded codepoint boundary.
///
/// If `bytes` is not valid UTF-8, then the behavior of this routine is
/// unspecified.
#[cfg_attr(feature = "perf-inline", inline(always))]
pub(crate) fn is_boundary(bytes: &[u8], i: usize) -> bool {
    match bytes.get(i) {
        // The position at the end of the bytes always represents an empty
        // string, which is a valid boundary. But anything after that doesn't
        // make much sense to call valid a boundary.
        None => i == bytes.len(),
        // Other than ASCII (where the most significant bit is never set),
        // valid starting bytes always have their most significant two bits
        // set, where as continuation bytes never have their second most
        // significant bit set. Therefore, this only returns true when bytes[i]
        // corresponds to a byte that begins a valid UTF-8 encoding of a
        // Unicode scalar value.
        Some(&b) => b <= 0b0111_1111 || b >= 0b1100_0000,
    }
}
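
To see why the behavior is called unspecified on invalid input, the following std-only sketch copies the predicate's logic into a free function and lists every offset it accepts. The helper name `boundaries` is hypothetical, and what the copy returns for invalid UTF-8 illustrates the current logic only; it is not behavior the crate promises.

```rust
/// Local copy of the predicate's logic, for experimentation only.
fn is_boundary(bytes: &[u8], i: usize) -> bool {
    match bytes.get(i) {
        None => i == bytes.len(),
        Some(&b) => b <= 0b0111_1111 || b >= 0b1100_0000,
    }
}

/// Hypothetical helper: every offset in `0..=len` that the predicate accepts.
fn boundaries(bytes: &[u8]) -> Vec<usize> {
    (0..=bytes.len())
        .filter(|&i| is_boundary(bytes, i))
        .collect()
}
```

For the valid haystack `b"\xF0\x9F\x98\x83"` this gives `[0, 4]`. For a lone continuation byte `b"\x80"` it gives only `[1]`, so even offset 0 is rejected, while for `b"\xFFa"` it gives `[0, 1, 2]` even though `0xFF` never begins a valid UTF-8 sequence. The check inspects one byte at a time and never validates the sequence around it.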

I don't know if you require UTF-8 mode on a haystack that is invalid UTF-8, but if so, it requires reckoning with what it means to be a codepoint boundary on arbitrary byte sequences. The unspecified behavior in regex-automata may be acceptable to you. In which case, you can use meta::Regex directly.
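
As an illustration of what such a reckoning could look like, here is one possible policy sketched with the standard library only. It is an assumption made for this example, not something regex or regex-automata implements: each valid scalar value is one unit, and each invalid sequence reported by `std::str::from_utf8` is one unit. The function names are hypothetical.

```rust
/// Push the end offset of every char in `s`, shifted by `base`.
fn push_char_ends(out: &mut Vec<usize>, base: usize, s: &str) {
    out.extend(s.char_indices().map(|(i, c)| base + i + c.len_utf8()));
}

/// One possible (hypothetical) definition of "boundaries" on arbitrary bytes:
/// offsets between valid scalar values and between invalid sequences as
/// delimited by `Utf8Error::valid_up_to` and `Utf8Error::error_len`.
fn boundaries_lossy(bytes: &[u8]) -> Vec<usize> {
    let mut out = vec![0];
    let mut pos = 0;
    while pos < bytes.len() {
        match std::str::from_utf8(&bytes[pos..]) {
            Ok(valid) => {
                push_char_ends(&mut out, pos, valid);
                pos = bytes.len();
            }
            Err(err) => {
                // Everything before `valid_up_to` is known to be valid.
                let good = err.valid_up_to();
                let valid = std::str::from_utf8(&bytes[pos..pos + good])
                    .expect("prefix was already validated");
                push_char_ends(&mut out, pos, valid);
                pos += good;
                // `None` means a truncated sequence at the end of the input;
                // treat the remainder as a single invalid unit.
                let bad = err.error_len().unwrap_or(bytes.len() - pos);
                pos += bad;
                out.push(pos);
            }
        }
    }
    out
}
```

Under this policy `b"a\xFF\xF0\x9F\x98\x83"` has boundaries `[0, 1, 2, 6]` and a lone `b"\x80"` has `[0, 1]`, which differs from what the single-byte check above accepts. Other policies are equally defensible, which is part of why the question has no single answer.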

Wow thanks for the detailed reply.
For anyone else who comes across this problem, I found more discussion here #484
(which basically makes my initial issue report a duplicate).

The documentation should really be updated to clarify this.
To quote https://docs.rs/regex/latest/regex/index.html#unicode:

  • The top-level Regex runs searches as if iterating over each of the codepoints in the haystack. That is, the fundamental atom of matching is a single codepoint.
  • bytes::Regex, in contrast, permits disabling Unicode mode for part of all of your pattern in all cases. When Unicode mode is disabled, then a search is run as if iterating over each byte in the haystack. That is, the fundamental atom of matching is a single byte. (A top-level Regex also permits disabling Unicode and thus matching as if it were one byte at a time, but only when doing so wouldn’t permit matching invalid UTF-8.)

Emphasis added. To me that says that when unicode mode is enabled, it will iterate over unicode characters, not single bytes.

Answer selected by BurntSushi

This discussion was converted from issue #1275 on August 05, 2025 14:04.