\$\begingroup\$

We receive messages from a system that contain hex codes representing the bytes of UTF-8 encoded characters.

For example:

var str = "Test =64 =C2=AE =E1=A6=92 test";

To decode these codes into a UTF-8 string, I've added a simple function that does 3 regex replacements:

protected static string ReplaceHexCodesInString(string input)
{
    var output = input;
    var encoding = Encoding.UTF8;

    // 3-byte sequences: lead byte E0-EF followed by two more bytes
    var regTripleHex = new Regex("=(E[0-9A-F])=([0-9A-F]{2})=([0-9A-F]{2})");
    output = regTripleHex.Replace(output, m => encoding.GetString(new[]{
        byte.Parse(m.Groups[1].Value, System.Globalization.NumberStyles.HexNumber),
        byte.Parse(m.Groups[2].Value, System.Globalization.NumberStyles.HexNumber),
        byte.Parse(m.Groups[3].Value, System.Globalization.NumberStyles.HexNumber)
    }));

    // 2-byte sequences: lead byte C0-DF followed by one more byte
    var regDoubleHex = new Regex("=([C-D][0-9A-F])=([0-9A-F]{2})");
    output = regDoubleHex.Replace(output, m => encoding.GetString(new[]{
        byte.Parse(m.Groups[1].Value, System.Globalization.NumberStyles.HexNumber),
        byte.Parse(m.Groups[2].Value, System.Globalization.NumberStyles.HexNumber)
    }));

    // Whatever is left is treated as a single byte
    var regRemainingHex = new Regex("=([0-9A-F]{2})");
    output = regRemainingHex.Replace(output, m => encoding.GetString(new[]{
        byte.Parse(m.Groups[1].Value, System.Globalization.NumberStyles.HexNumber)
    }));

    return output;
}

This seems to work as expected for what's currently in those messages.
Note that the messages don't contain 4-byte UTF-8 characters (e.g. 0xF0 0x90 0x8C 0xB8 = 𐌸).

But can this be simplified?
Perhaps there's already a standard function?

I searched, but haven't found a good standard built-in C# function that already does this type of conversion.

Well, except for an example that uses a function from System.Net.Mail.
But that approach seems very error-prone and requires a very specific format.

var input = "bl=61=C2=B0"; 
var output = System.Net.Mail.Attachment.CreateAttachmentFromString("", "=?utf-8?Q?" + input.Trim() +"?=").Name;
t3chb0t
asked Feb 5, 2018 at 16:06
\$\endgroup\$
  • Your data is encoded as quoted-printable. Maybe this keyword helps you find an existing library function. It definitely exists somewhere. Commented Feb 5, 2018 at 17:51
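
To illustrate what decoding this kind of input involves, here is a minimal hand-rolled sketch. It is not the library function the comment alludes to, and the names QuotedPrintableSketch, Decode and Flush are made up for this example. It assumes the input uses only =XX hex escapes and "=" followed by CRLF as a soft line break; the other quoted-printable rules are not covered. Unlike the regex in the question, it also accepts lowercase hex digits and escape runs of any length, so 4-byte sequences work too.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuotedPrintableSketch
{
    // Hypothetical helper: decodes =XX escapes as UTF-8 bytes.
    public static string Decode(string input)
    {
        var output = new StringBuilder(input.Length);
        var pending = new List<byte>(); // bytes of the current run of =XX escapes

        for (var i = 0; i < input.Length; i++)
        {
            var hasTwoMore = input[i] == '=' && i + 2 < input.Length;

            if (hasTwoMore && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
            {
                // "=XX": collect the byte, decode later once the run is complete
                pending.Add(Convert.ToByte(input.Substring(i + 1, 2), 16));
                i += 2;
                continue;
            }

            if (hasTwoMore && input[i + 1] == '\r' && input[i + 2] == '\n')
            {
                // "=" + CRLF: soft line break, produces no output
                i += 2;
                continue;
            }

            // Ordinary character (including a lone "="): flush the run, copy as-is
            Flush(output, pending);
            output.Append(input[i]);
        }

        Flush(output, pending);
        return output.ToString();
    }

    private static void Flush(StringBuilder output, List<byte> pending)
    {
        if (pending.Count == 0) return;
        output.Append(Encoding.UTF8.GetString(pending.ToArray()));
        pending.Clear();
    }
}
```

Collecting the bytes of a whole run before calling Encoding.UTF8.GetString means the code never has to know how long a character's byte sequence is, which is what the three separate regexes in the question work around.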

1 Answer

\$\begingroup\$

Are you willing to use % instead of = as the escape character?

If so, Uri.UnescapeDataString should be sufficient. If not, you can always call Replace("=", "%") first and use UnescapeDataString anyway.

Uri.UnescapeDataString("Test =64 =C2=AE =E1=A6=92 test".Replace("=", "%"))
//Test d ® ᦒ test
answered Feb 5, 2018 at 16:42
\$\endgroup\$
  • Seems nice at first glance. I'll test that tomorrow. The replace probably needs some tweaking, because not every "=" belongs to a hex code. Commented Feb 5, 2018 at 16:47
  • I've changed it to a one-liner that only targets those hex codes: new Regex("(?:=[0-9A-F]{2})+").Replace(input, m => Uri.UnescapeDataString(m.Value.Replace("=","%"))) And after looking up quoted-printable, it seems literal "=" signs get hex-encoded as well. Thanks. Commented Feb 6, 2018 at 9:49
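
For reference, the one-liner from the last comment could be wrapped in a reusable method roughly as follows. This is only a sketch: the class name HexCodeDecoder, the method name Decode and the cached static Regex are choices made for this example. The pattern itself is taken unchanged from the comment, so only uppercase hex digits are matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HexCodeDecoder
{
    // One or more consecutive "=XX" groups (uppercase hex, as in the sample input).
    private static readonly Regex HexRun =
        new Regex("(?:=[0-9A-F]{2})+", RegexOptions.Compiled);

    // Hypothetical wrapper: rewrites each run of =XX codes as %XX and lets
    // Uri.UnescapeDataString decode the bytes as UTF-8.
    public static string Decode(string input)
    {
        return HexRun.Replace(
            input,
            m => Uri.UnescapeDataString(m.Value.Replace("=", "%")));
    }
}
```

With the sample input from the question, HexCodeDecoder.Decode("Test =64 =C2=AE =E1=A6=92 test") should give "Test d ® ᦒ test", while a "=" that is not followed by two hex digits, as in "a = b", is left untouched because the pattern does not match it.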
