I have a function that translates a Unicode character into a custom 8-bit encoding:

const fn char_to_custom_encoding(c: char) -> u8;

I'd like to apply it at compile time to a string literal so that only the translated array is stored and can be accessed instantly at runtime, with a signature along the lines of

const fn literal_to_custom_encoding(input: &'static str) -> &'static [u8]

The source string is expected to contain non-ASCII characters encoded as multi-byte UTF-8 sequences, so processing it naively byte by byte as if it were ASCII, as is done in this blog post, is not an option. str.chars() cannot be used at compile time either, as it is not a const method.
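To make the byte-versus-character mismatch concrete, here is a minimal sketch. The constant names are illustrative only, and the character é (U+00E9) is written as an escape so the example does not depend on the file's encoding.

```rust
// Illustrative names; not part of the original code.
const SAMPLE: &str = "\u{e9}"; // the single character 'é'

// `str::len` and `str::as_bytes` are const fns, so both values are
// computed at compile time.
const BYTE_LEN: usize = SAMPLE.len();
const FIRST_TWO_BYTES: [u8; 2] = {
    let bytes = SAMPLE.as_bytes();
    [bytes[0], bytes[1]]
};
```

One character occupies two bytes here, so passing each byte to char_to_custom_encoding separately would yield two unrelated outputs instead of one.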

So far, my best bet seems to be to follow the general approach described in the aforementioned blog post and manually detect UTF-8 sequences during iteration. However, aside from requiring a hand-written decoder when a built-in one exists, this leaves the resulting array padded with zeros: the array is sized by the input's byte length, but each multi-byte character becomes a single output value, so the output doesn't fill it completely. The crate is also no_std, which makes returning a Vec nonviable even at runtime.
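The padding problem can be sketched as follows. This assumes a hypothetical helper, scalar_count, that counts characters by skipping UTF-8 continuation bytes; it is only an illustration of the size mismatch, not a full decoder.

```rust
/// Hypothetical helper: counts Unicode scalar values by counting every byte
/// that is not a UTF-8 continuation byte (bit pattern 10xxxxxx).
/// Relies on `&str` always holding valid UTF-8.
const fn scalar_count(s: &str) -> usize {
    let bytes = s.as_bytes();
    let mut count = 0usize;
    let mut i = 0usize;
    while i < bytes.len() {
        if (bytes[i] & 0xC0) != 0x80 {
            count += 1;
        }
        i += 1;
    }
    count
}

const TEXT: &str = "h\u{e9}llo \u{20ac}"; // "héllo €"

// Sized by byte length: three slots would stay zero after encoding.
const PADDED: [u8; TEXT.len()] = [0u8; TEXT.len()];
// Sized by character count: exactly one slot per encoded character.
const EXACT: [u8; scalar_count(TEXT)] = [0u8; scalar_count(TEXT)];
```

Because both lengths are computed in const context, either can be used as an array size; only the second avoids trailing zeros.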

Is there a better approach that still guarantees the transformation happens at compile time? I've considered using macros, but that seems to leave the problem of iterating over UTF-8 characters largely unsolved (the const_format crate does something similar internally, but I'm having a hard time understanding its code). I'd also have to reimplement the conversion function as a macro, making it impossible to use at runtime even if desired.

asked yesterday
3 Comments

  • A &'static [u8] doesn't make much sense as a return type, especially without alloc. What memory is it supposed to reference? Commented 23 hours ago
  • @cafce25 If the function is only allowed to be called at compile time (I could make a separate one for runtime transformation), then it could theoretically put the transformed data into a static array that's stored directly in the executable, kind of like a custom string literal. Commented 20 hours ago
  • But Rust doesn't have compile-time-only functions. And you explicitly exclude a macro for not being callable at runtime. Commented 20 hours ago

2 Answers

The proc macro required should be quite simple: parse a string literal from the macro input and convert it to a byte string literal for the macro output (using the quote and syn crates):

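use proc_macro::TokenStream;
use quote::ToTokens as _;
// Assumes the crate layout shown further down, where the no_std crate
// custom_encoding exports char_to_custom_encoding.
use custom_encoding::char_to_custom_encoding;

// Expands custom_encoding!("...") into a byte string literal holding the encoded text.
// Functions tagged #[proc_macro] must be pub.
#[proc_macro]
pub fn custom_encoding(input: TokenStream) -> TokenStream {
    if let syn::Lit::Str(s) = syn::parse_macro_input!(input as syn::Lit) {
        // Plain runtime iteration is fine: this code runs inside the compiler.
        let encoded = s.value().chars().map(char_to_custom_encoding).collect::<Vec<u8>>();
        syn::LitByteStr::new(&encoded, s.span()).into_token_stream().into()
    } else {
        quote::quote! {
            compile_error!("expected string literal")
        }
        .into()
    }
}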

Playground (modified from example code to work in the playground)

Using non-const functions and allocation is fine here, as the macro itself runs at compile time and has to live in a separate crate anyway. Your crate structure will need to be somewhat more complicated, looking something like the following:

├── custom_encoding          // contains char_to_custom_encoding, no_std
│   ├── Cargo.toml
│   └── src
│       └── lib.rs
├── custom_encoding_macro    // a proc-macro crate, contains custom_encoding! macro, depends on custom_encoding, uses std
│   ├── Cargo.toml
│   └── src
│       └── lib.rs
├── Cargo.toml               // depends on custom_encoding and custom_encoding_macro, no_std
└── src
    ├── main.rs
    └── lib.rs
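The encoding step the macro performs on the literal's value is ordinary iterator code, which can be tried outside a proc-macro crate. The sketch below uses a hypothetical stand-in for char_to_custom_encoding (ASCII maps to itself, é to 0x80, anything else to b'?'); the real mapping is not given in the question.

```rust
/// Hypothetical stand-in for the real `char_to_custom_encoding`.
/// The mapping is invented for illustration only.
const fn char_to_custom_encoding(c: char) -> u8 {
    if c.is_ascii() {
        c as u8
    } else if c == '\u{e9}' {
        0x80 // 'é'
    } else {
        b'?'
    }
}

/// The same chars/map/collect chain used inside the macro body.
fn encode(input: &str) -> Vec<u8> {
    input.chars().map(char_to_custom_encoding).collect()
}
```

Since the function stays a plain const fn in the no_std crate, it remains callable at runtime as well as from the macro crate.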
answered 6 hours ago

1 Comment

I started writing this before OP submitted their answer, but I think it still stands as a possible solution to their problem.

I ended up writing a basic UTF-8 decoder for this purpose and wrapping it in a macro. The blog post linked in the question definitely helped, especially its later example of transcoding UTF-8 to UTF-16 at compile time.

I had initially missed that the length calculation doesn't have to be implemented as a macro and can be a const function itself; this greatly simplified the implementation:

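// Decodes the UTF-8 sequence starting at byte offset start_pos.
// Returns the decoded char and the number of bytes it occupies.
const fn decode_utf8_char(s: &str, start_pos: usize) -> (char, usize) {
    let s = s.as_bytes();
    let mut first_byte = s[start_pos];
    if first_byte.is_ascii() {
        (first_byte as char, 1usize)
    } else {
        // Use u32 for temporary values to carry out validation later
        let mut res = 0u32;
        let mut cont_byte_count = 0usize;
        // Each 1 bit after the top bit of the lead byte announces one continuation byte.
        // Shifting the lead byte left moves the next length bit into position 0x40.
        while first_byte & 0x40 != 0 {
            cont_byte_count += 1;
            // Append the 6 payload bits of the next continuation byte.
            res = (res << 6) | (s[start_pos + cont_byte_count] & 0x3F) as u32;
            first_byte <<= 1;
        }
        // The lead byte's payload bits have been shifted left cont_byte_count times;
        // shifting by a further 5 per continuation byte places them above the
        // 6 * cont_byte_count bits already collected.
        res |= (first_byte as u32 & 0x7F) << (cont_byte_count * 5);
        (
            char::from_u32(res).expect("Failed to decode a UTF-8 character"),
            cont_byte_count + 1,
        )
    }
}

// Number of characters (not bytes) in s, i.e. the length of the encoded output.
const fn utf8_len(s: &str) -> usize {
    let mut length = 0usize;
    let mut input_pos = 0usize;
    while input_pos < s.len() {
        let (_, len) = decode_utf8_char(s, input_pos);
        input_pos += len;
        length += 1;
    }
    length
}

// Encodes s into an array of exactly LEN bytes; LEN must equal utf8_len(s).
const fn literal_to_custom_encoding_internal<const LEN: usize>(s: &str) -> [u8; LEN] {
    let mut output = [0u8; LEN];
    let mut input_pos = 0usize;
    let mut output_pos = 0usize;
    while input_pos < s.len() {
        let (ch, len) = decode_utf8_char(s, input_pos);
        output[output_pos] = char_to_custom_encoding(ch);
        input_pos += len;
        output_pos += 1;
    }
    output
}

macro_rules! literal_to_custom_encoding {
    ($s:literal) => {{
        // Binding to a const forces the conversion to run at compile time.
        const STRING: [u8; utf8_len($s)] =
            literal_to_custom_encoding_internal::<{ utf8_len($s) }>($s);
        STRING
    }};
}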

This can then be used as

literal_to_custom_encoding!("Hello, world!");
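The macro yields the array by value. If a &'static [u8] is wanted, as in the question's signature, one option is to borrow a const item inside the macro. The sketch below shows only that pattern: encode_ascii and encoded_slice! are hypothetical names, and the ASCII-uppercasing encoder is a stand-in chosen to keep the example short, so it is valid only for ASCII input.

```rust
/// Hypothetical stand-in encoder (ASCII input only): uppercases each byte.
const fn encode_ascii<const LEN: usize>(s: &str) -> [u8; LEN] {
    let bytes = s.as_bytes();
    let mut out = [0u8; LEN];
    let mut i = 0usize;
    while i < LEN {
        out[i] = bytes[i].to_ascii_uppercase();
        i += 1;
    }
    out
}

/// Hypothetical wrapper: computes the array in a const item, then hands out
/// a `'static` borrow of it instead of the array itself.
macro_rules! encoded_slice {
    ($s:literal) => {{
        const ARRAY: [u8; $s.len()] = encode_ascii::<{ $s.len() }>($s);
        const SLICE: &'static [u8] = &ARRAY;
        SLICE
    }};
}

// Usable wherever a constant expression is required, including statics.
static GREETING: &[u8] = encoded_slice!("Hello");
```

The same borrow-a-const shape should carry over to the macro above by replacing the stand-in with literal_to_custom_encoding_internal and utf8_len.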
answered 7 hours ago
