How can I build a regex pattern with sequences of non utf-8 bytes? #1253

Answered by BurntSushi
johomo asked this question in Q&A

First of all, thank you all for maintaining such a great crate 🫶

Now, straight to the point.

The signature of regex::bytes::Regex::new takes a regular expression as a &str, and a str must be valid UTF-8. This effectively means that Regex cannot be used to find a sequence of UTF-8-invalid bytes in a haystack of bytes.

As an example, the following snippet attempts to find a needle b"\x9c\xe5" (which is not valid utf8) in a haystack of bytes.

use regex::bytes::RegexBuilder;

fn main() {
    let hay: &[u8] = b"The following bytes are not UTF8 valid: \x9c\xe5";

    let pattern_as_bytes: &[u8] = b"\x9c\xe5";

    let re = RegexBuilder::new(pattern_as_bytes).unicode(false).build().expect("Invalid regex pattern");
    //       ----------------- ^^^^^^^^^^^^^^^^ expected `&str`, found `&[u8]`
    //       |
    //       arguments to this function are incorrect

    // Find all occurrences of the pattern in the text
    for mat in re.find_iter(hay) {
        println!("Found match at: {:?}", mat);
    }
}

For the sake of completeness, this exact example can be accomplished with memchr::memmem::find_iter.
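As a rough illustration of that literal-search route, here is a dependency-free sketch. `find_all` is a hypothetical helper, not part of any crate, and its naive scan only stands in for what memchr::memmem::find_iter does far more efficiently.

```rust
/// Returns the start offset of every (possibly overlapping) occurrence of
/// `needle` in `hay`. Naive O(n * m) scan, for illustration only.
fn find_all(hay: &[u8], needle: &[u8]) -> Vec<usize> {
    // `windows(0)` panics, so an empty needle is treated as "no matches".
    if needle.is_empty() {
        return Vec::new();
    }
    hay.windows(needle.len())
        .enumerate()
        .filter(|(_, window)| *window == needle)
        .map(|(start, _)| start)
        .collect()
}

fn main() {
    let hay: &[u8] = b"The following bytes are not UTF8 valid: \x9c\xe5";
    let needle: &[u8] = b"\x9c\xe5";

    // No regex and no &str involved: both sides are plain byte slices.
    for start in find_all(hay, needle) {
        println!("Found match at: {}..{}", start, start + needle.len());
    }
}
```

This only covers a single literal needle; anything like alternation or repetition still needs a regex.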

My question is: What is the motivation to only allow patterns as instances of &str?

To be honest, I'm not sure if it makes any sense to allow patterns as bytes. For instance, how would you write a pattern such as "match either bytes \x9c\xe5 or bytes b"bar""? Would it be br"(\x9c\xe5|bar)"? Does this even make sense?

I feel I am missing something big. That's why I'm really interested in your point of view as experts in this topic.

Thanks a lot for your time.


This effectively means that Regex cannot be used to find a sequence of UTF-8-invalid bytes in a haystack of bytes.

That's incorrect. Here's a counter-example:

use regex::bytes::Regex;
fn main() {
 let haystack = b"foo bar\xFF baz";
 let re = Regex::new(r"(?-u)\xFF").unwrap();
 assert_eq!(re.find(haystack).map(|r| r.range()), Some(7..8));
}

Playground link.

I feel I am missing something big.

You might have missed this section in the docs: https://docs.rs/regex/latest/regex/bytes/index.html#syntax

The relevant sections there are:

The u flag can be disabled even when disabling it might cause the regex to match invalid UTF-8. When the u flag is disabled, the regex is said to be in "ASCII compatible" mode.

and

Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints. For example, in ASCII compatible mode, \xFF matches the literal byte \xFF, while in Unicode mode, \xFF is the Unicode codepoint U+00FF that matches its UTF-8 encoding of \xC3\xBF. Similarly for octal notation when enabled.

While the pattern itself has to be valid UTF-8, you can match arbitrary byte sequences using hex escapes when Unicode mode is disabled.
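A small standard-library sketch of why both statements hold at once: the pattern text is plain ASCII (so it is a valid &str), while the byte it names is not valid UTF-8 on its own. `utf8_bytes` is a hypothetical helper introduced only for this illustration.

```rust
/// UTF-8 encoding of a single code point, as owned bytes.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // The pattern is just the ASCII characters `(`, `?`, `-`, `u`, `)`,
    // `\`, `x`, `F`, `F`, so it is trivially valid UTF-8.
    let pattern = r"(?-u)\xFF";
    assert!(pattern.is_ascii());

    // The byte that the escape refers to is not valid UTF-8 by itself,
    // which is why it cannot appear literally inside a &str.
    assert!(std::str::from_utf8(&[0xFFu8]).is_err());

    // In Unicode mode, `\xFF` would instead mean the code point U+00FF,
    // whose UTF-8 encoding is the two bytes C3 BF.
    assert_eq!(utf8_bytes('\u{FF}'), vec![0xC3, 0xBF]);

    println!("pattern {:?} is ASCII; U+00FF encodes as {:02X?}", pattern, utf8_bytes('\u{FF}'));
}
```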

But yeah this is a good question! This design was absolutely intentional and it is very much an important point that arbitrary bytes can be matched even though a regex pattern itself has to be valid UTF-8.

In theory, the implementation could be changed to support patterns that aren't valid UTF-8. Or even changed to parse straight from a &[u8]. But in practice this is usually not advantageous and it's better for comprehensibility reasons to just require that the pattern be valid UTF-8.


Thanks for your quick reply, that was really helpful.

So, at the end of the day, I was trying to build a regex pattern with sequences of bytes that may not be valid UTF-8.
The point that I was missing was that sequences of literal bytes (not necessarily valid UTF-8) in patterns written as raw string literals r"" must be written in hexadecimal notation.

So, if you have a sequence of bytes like this: let bytes: Vec<u8> = vec![0x9c, 0xe5];, they must be converted into r"\x9c\xe5". For example, you can convert from a Vec<u8> to a valid pattern in a String using let bytes_as_hex = bytes.iter().map(|byte| format!(r"\x{:02x}", byte)).collect::<String>();.

For example:

let bytes: Vec<u8> = vec![0x9c, 0xe5];
let bytes_as_string: String = bytes.iter().map(|byte| format!(r"\x{:02x}", byte)).collect();
// Matches either bytes `\x9c\xe5` (not valid UTF-8) or bytes `b"bar"`
let pattern = format!(r"(?-u)({}|bar)", bytes_as_string);
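The conversion above could be wrapped in a pair of helpers along these lines. This is only a sketch: `hex_escape` and `alternation_pattern` are hypothetical names, and the resulting string is assumed to be handed to regex::bytes::Regex::new afterwards. Escaping every byte, ASCII included, sidesteps any regex metacharacters among the needle bytes.

```rust
/// Renders every byte as a two-digit `\xNN` escape, so the result is pure
/// ASCII (and therefore a valid `String`) whatever the input bytes are.
fn hex_escape(bytes: &[u8]) -> String {
    bytes.iter().map(|byte| format!(r"\x{:02x}", byte)).collect()
}

/// Builds a pattern that matches any one of the given byte sequences,
/// with Unicode mode disabled via the `(?-u)` flag.
fn alternation_pattern(needles: &[&[u8]]) -> String {
    let alternatives: Vec<String> = needles.iter().map(|n| hex_escape(n)).collect();
    format!(r"(?-u)({})", alternatives.join("|"))
}

fn main() {
    let invalid: &[u8] = &[0x9c, 0xe5]; // not valid UTF-8
    let ascii: &[u8] = b"bar";

    let pattern = alternation_pattern(&[invalid, ascii]);
    // The pattern string would then go to regex::bytes::Regex::new(&pattern).
    println!("{}", pattern);
}
```

Here `bar` is emitted as `\x62\x61\x72` rather than as literal text, which is less readable but means the helper never has to decide which bytes are safe to leave unescaped.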

Shall I update the thread title to something more suitable for people facing the same problem? Perhaps something like: How can I build a regex pattern with sequences of non utf-8 bytes?

Again, thanks a lot for your help!


Yeah that title sounds good! I've updated it. Thanks. :-)

Answer selected by BurntSushi

Hey, I'm using the regex crate to implement the ECMAScript RegExp type in the Nova JavaScript engine; the engine uses WTF-8 as the internal representation for strings, so matching over arbitrary bytes is a fairly important thing. Thanks for enabling the feature!

I lack a bit of imagination as to how I should rewrite the ECMAScript RegExp notation (that a user's code will give me) to do arbitrary byte matching, though. An example I'm currently looking at is this:

var re68 = /^([#.]?)((?:[\w\u0128-\uffff*_-]|\\.)*)/;

Compiling this RegExp directly fails (correctly) with "Unicode not allowed here" for the [\w\u0128-\uffff*_-] group. I'm then left wondering how I could/should restate this kind of group in a way that the regex crate would accept. I assume I cannot actually use ranges to do that, since the provided Unicode range decomposes into multiple WTF-8 bytes. Would the correct choice then be to split the Unicode parts out of the character group and join them on the outside with |? E.g. something like this: [\w*_-]|[\xAB-\xCD][\xEF-\xGH]|...?
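To make the "decomposes into multiple bytes" point concrete, this standard-library sketch prints the UTF-8 encodings at both ends of the range and at the boundary between two-byte and three-byte sequences. `utf8_bytes` is a hypothetical helper; note that Rust's char cannot hold surrogates, which WTF-8 allows, so this only covers the plain UTF-8 side and does not answer how the class should be rewritten.

```rust
/// UTF-8 encoding of a single code point, as owned bytes.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // U+0128 and U+07FF need two bytes; U+0800 and U+FFFF need three.
    // A single byte-level range therefore cannot cover \u0128-\uffff.
    for c in ['\u{128}', '\u{7FF}', '\u{800}', '\u{FFFF}'] {
        let bytes = utf8_bytes(c);
        println!("U+{:04X} -> {} bytes: {:02X?}", c as u32, bytes.len(), bytes);
    }
}
```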


Please open a new question and please include an MRE of what you're trying to do. This means you share inputs, actual output, desired output and the commands necessary for someone else to reproduce your output.
