How can I build a regex pattern with sequences of non utf-8 bytes? · rust-lang/regex · Discussion #1253

johomo
Feb 13, 2025

First of all, thank you all for maintaining such a great crate 🫶

Now, straight to the point.

The signature of regex::bytes::Regex::new takes a regular expression as a &str, and a str must be valid UTF-8. This effectively means that Regex cannot be used to find a sequence of utf8-invalid bytes in a haystack of bytes.

As an example, the following snippet attempts to find a needle b"\x9c\xe5" (which is not valid utf8) in a haystack of bytes.

use regex::bytes::RegexBuilder;

fn main() {
    let hay: &[u8] = b"The following bytes are not UTF8 valid: \x9c\xe5";

    let pattern_as_bytes: &[u8] = b"\x9c\xe5";

    let re = RegexBuilder::new(pattern_as_bytes).unicode(false).build().expect("Invalid regex pattern");
    //       ----------------- ^^^^^^^^^^^^^^^^ expected `&str`, found `&[u8]`
    //       |
    //       arguments to this function are incorrect

    // Find all occurrences of the pattern in the text
    for mat in re.find_iter(hay) {
        println!("Found match at: {:?}", mat);
    }
}
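
As a sanity check on the premise, the needle really is not valid UTF-8, which is why it cannot be turned into a &str as-is. A minimal sketch using only the standard library; the helper name `is_valid_utf8` is made up for illustration:

```rust
/// Hypothetical helper: reports whether a byte slice is valid UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // 0x9c is a UTF-8 continuation byte, so it cannot start a sequence.
    println!("needle valid UTF-8? {}", is_valid_utf8(b"\x9c\xe5"));
    // Plain ASCII is always valid UTF-8.
    println!("bar valid UTF-8?    {}", is_valid_utf8(b"bar"));
}
```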

For the sake of completeness, this exact example can be accomplished with memchr::memmem::find_iter.
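
To make the literal-search alternative concrete without pulling in a crate, here is a rough standard-library stand-in. It is a naive scan over `windows`, not the memchr implementation, and the function name `find_all` is an assumption made for this sketch:

```rust
/// Returns the start offset of every (possibly overlapping) occurrence of
/// `needle` in `hay`. This is a naive O(n * m) scan for illustration only.
fn find_all(hay: &[u8], needle: &[u8]) -> Vec<usize> {
    if needle.is_empty() {
        // `windows(0)` would panic, so treat an empty needle as "no matches".
        return Vec::new();
    }
    hay.windows(needle.len())
        .enumerate()
        .filter(|(_, window)| *window == needle)
        .map(|(start, _)| start)
        .collect()
}

fn main() {
    let hay: &[u8] = b"The following bytes are not UTF8 valid: \x9c\xe5";
    // Byte slices need no UTF-8 validity, so the needle can be used directly.
    println!("{:?}", find_all(hay, b"\x9c\xe5"));
}
```

For real workloads a dedicated substring searcher such as memchr::memmem is the better choice; the point here is only that a plain byte needle needs no pattern syntax at all.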

My question is: What is the motivation to only allow patterns as instances of &str?

To be honest, I'm not sure it makes sense to allow patterns as bytes. For instance, how would you write a pattern meaning "match either the bytes \x9c\xe5 or the bytes bar"? Would it be br"(\x9c\xe5|bar)"? Does this even make sense?

I feel I am missing something big. That's why I'm really interested in your point of view as experts in this topic.

Thanks a lot for your time.

Answered by BurntSushi

Feb 13, 2025

Replies: 2 comments 3 replies

BurntSushi
Feb 13, 2025
Maintainer

This effectively means that Regex cannot be used to find a sequence of utf8-invalid bytes in a haystack of bytes.

That's incorrect. Here's a counter-example:

use regex::bytes::Regex;
fn main() {
 let haystack = b"foo bar\xFF baz";
 let re = Regex::new(r"(?-u)\xFF").unwrap();
 assert_eq!(re.find(haystack).map(|r| r.range()), Some(7..8));
}

Playground link.

I feel I am missing something big.

You might have missed this section in the docs: https://docs.rs/regex/latest/regex/bytes/index.html#syntax

The relevant sections there are:

The u flag can be disabled even when disabling it might cause the regex to match invalid UTF-8. When the u flag is disabled, the regex is said to be in "ASCII compatible" mode.

and

Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints. For example, in ASCII compatible mode, \xFF matches the literal byte \xFF, while in Unicode mode, \xFF is the Unicode codepoint U+00FF that matches its UTF-8 encoding of \xC3\xBF. Similarly for octal notation when enabled.

While the pattern itself has to be valid UTF-8, you can match arbitrary byte sequences using hex escapes when Unicode mode is disabled.
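
The difference between the byte \xFF and the codepoint U+00FF described in the docs can be checked with the standard library alone. This is a small sketch; `utf8_bytes` is a hypothetical helper name, not part of the regex crate:

```rust
/// Hypothetical helper: the UTF-8 encoding of one code point as raw bytes.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // In Unicode mode, `\xFF` means U+00FF, which is two bytes in UTF-8.
    println!("U+00FF encodes as {:02X?}", utf8_bytes('\u{FF}'));
    // The lone byte 0xFF is not valid UTF-8, so only `(?-u)` can match it.
    let lone = [0xFFu8];
    println!("lone 0xFF valid UTF-8? {}", std::str::from_utf8(&lone).is_ok());
}
```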

But yeah this is a good question! This design was absolutely intentional and it is very much an important point that arbitrary bytes can be matched even though a regex pattern itself has to be valid UTF-8.

In theory, the implementation could be changed to support patterns that aren't valid UTF-8. Or even changed to parse straight from a &[u8]. But in practice this is usually not advantageous and it's better for comprehensibility reasons to just require that the pattern be valid UTF-8.

2 replies

@johomo

johomo Feb 13, 2025
Author

Thanks for your quick reply, that was really helpful.

So, at the end of the day I was trying to build a regex pattern with sequences of bytes that may not be utf8 valid.
The point I was missing is that literal bytes (not necessarily valid UTF-8) in a pattern written as a string literal such as r"" must be expressed in hexadecimal notation.

So, if you have a sequence of bytes like let bytes: Vec<u8> = vec![0x9c, 0xe5];, it must be converted into r"\x9c\xe5". You can convert a Vec<u8> into a valid pattern String with let bytes_as_hex = bytes.iter().map(|byte| format!(r"\x{:02x}", byte)).collect::<String>();.

For example:

let bytes: Vec<u8> = vec![0x9c, 0xe5];
let bytes_as_string: String = bytes.iter().map(|byte| format!(r"\x{:02x}", byte)).collect();
// Matches either bytes `\x9c\xe5` (not utf-8 valid) or bytes `b"bar"`
let pattern = format!(r"(?-u)({}|bar)", bytes_as_string);
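
The conversion can also be packaged as a small function. The sketch below stops at building the pattern string so that it runs without any dependencies; `bytes_to_hex_escapes` is a name invented here, not an API of the regex crate:

```rust
/// Hypothetical helper: renders every byte as a `\xNN` escape. The result is
/// plain ASCII, and therefore valid UTF-8, whatever the input bytes are.
fn bytes_to_hex_escapes(bytes: &[u8]) -> String {
    bytes.iter().map(|byte| format!(r"\x{:02x}", byte)).collect()
}

fn main() {
    let needle: Vec<u8> = vec![0x9c, 0xe5];
    // `(?-u)` disables Unicode mode so that `\x9c` means the byte 0x9C
    // rather than the codepoint U+009C.
    let pattern = format!(r"(?-u)({}|bar)", bytes_to_hex_escapes(&needle));
    println!("{}", pattern);
}
```

The resulting string can then be handed to regex::bytes::Regex::new. Escaping every byte this way, including ASCII ones, also means that bytes such as `.` or `*` in the needle are not interpreted as regex metacharacters.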

Shall I update the title of the thread to something more suitable for people facing the same problem? Perhaps something like: How can I build a regex pattern with sequences of non utf-8 bytes?

Again, thanks a lot for your help!

@BurntSushi

BurntSushi Feb 13, 2025
Maintainer

Yeah that title sounds good! I've updated it. Thanks. :-)

Answer selected by BurntSushi

aapoalas
Aug 21, 2025

Hey, I'm using the regex crate to implement the ECMAScript RegExp type in the Nova JavaScript engine; the engine uses WTF-8 as the internal representation for strings, so matching over arbitrary bytes is fairly important. Thanks for enabling the feature!

I lack a bit of imagination as to how I should rewrite the ECMAScript RegExp notation (that a user's code will give me) to do arbitrary byte matching, though. An example I'm currently looking at is this:

var re68 = /^([#.]?)((?:[\w\u0128-\uffff*_-]|\\.)*)/;

Compiling this RegExp directly fails (correctly) with "Unicode not allowed here" for the [\w\u0128-\uffff*_-] group. I'm then left wondering how I could or should restate this kind of group in a way that the regex crate would accept. I assume I cannot actually use ranges to do that, since the provided Unicode range decomposes into multiple WTF-8 bytes. Would the correct choice then be to split the Unicode parts out of the character group and join them on the outside with |? E.g., something like this: [\w*_-]|[\xAB-CD][\xEF-\xGH]|...?
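
The premise that the range does not fit in a single byte class can be illustrated with the standard library. This sketch only inspects the two endpoints of the range and is not a solution to the rewriting problem; `endpoint_lengths` is a hypothetical name, and the sketch relies on the fact that UTF-8 and WTF-8 encode non-surrogate codepoints identically:

```rust
/// Hypothetical helper: the UTF-8 byte lengths of a range's two endpoints.
fn endpoint_lengths(start: char, end: char) -> (usize, usize) {
    (start.len_utf8(), end.len_utf8())
}

fn main() {
    let (lo, hi) = endpoint_lengths('\u{128}', '\u{FFFF}');
    // Different lengths mean the range spans several multi-byte shapes, so
    // one `[\xNN-\xMM]` byte class cannot describe it.
    println!("U+0128 takes {} bytes, U+FFFF takes {} bytes", lo, hi);
}
```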

1 reply

@BurntSushi

BurntSushi Aug 21, 2025
Maintainer

Please open a new question and please include an MRE of what you're trying to do. This means you share inputs, actual output, desired output and the commands necessary for someone else to reproduce your output.
