Decode string with hex character codes to UTF-8 characters

Question 1

From a system we receive messages that contain codes that represent UTF-8 characters.

For example:

var str="Test =64 =C2=AE =E1=A6=92 test";

To decode these codes to UTF-8, I've added a simple function that does three regex replacements:

// Requires: using System.Text; using System.Text.RegularExpressions;
protected static string ReplaceHexCodesInString(string input)
{
    var output = input;
    var encoding = Encoding.UTF8;

    // Three-byte sequences: lead byte E0..EF followed by two more bytes.
    var regTripleHex = new Regex("=(E[0-9A-F])=([0-9A-F]{2})=([0-9A-F]{2})");
    output = regTripleHex.Replace(output, m => encoding.GetString(new[]{
        byte.Parse(m.Groups[1].Value, System.Globalization.NumberStyles.HexNumber),
        byte.Parse(m.Groups[2].Value, System.Globalization.NumberStyles.HexNumber),
        byte.Parse(m.Groups[3].Value, System.Globalization.NumberStyles.HexNumber)
    }));

    // Two-byte sequences: lead byte C0..DF followed by one more byte.
    var regDoubleHex = new Regex("=([C-D][0-9A-F])=([0-9A-F]{2})");
    output = regDoubleHex.Replace(output, m => encoding.GetString(new[]{
        byte.Parse(m.Groups[1].Value, System.Globalization.NumberStyles.HexNumber),
        byte.Parse(m.Groups[2].Value, System.Globalization.NumberStyles.HexNumber)
    }));

    // Whatever is left is decoded as a single byte.
    var regRemainingHex = new Regex("=([0-9A-F]{2})");
    output = regRemainingHex.Replace(output, m => encoding.GetString(new[]{
        byte.Parse(m.Groups[1].Value, System.Globalization.NumberStyles.HexNumber)
    }));

    return output;
}

This seems to work as expected for what's currently in those messages.
Note that the messages don't contain 4-byte UTF-8 characters (e.g. 0xF0 0x90 0x8C 0xB8 = 𐌸).
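
One possible simplification, sketched here under the assumption that every code is written as "=" plus two uppercase hex digits: match each run of consecutive codes, turn the run into a byte array, and decode it in one call. Because the whole run is decoded together, the length of each UTF-8 sequence no longer matters, so 4-byte characters would be covered too. The class and method names (HexRunDecoder, Decode) are hypothetical and not part of the original function.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class HexRunDecoder
{
    // One or more consecutive "=XX" groups (uppercase hex, as in the sample message).
    private static readonly Regex HexRun = new Regex("(?:=[0-9A-F]{2})+");

    public static string Decode(string input)
    {
        return HexRun.Replace(input, m =>
        {
            // Each group is exactly 3 characters: '=' plus two hex digits.
            var bytes = new byte[m.Length / 3];
            for (int i = 0; i < bytes.Length; i++)
            {
                bytes[i] = byte.Parse(m.Value.Substring(i * 3 + 1, 2), NumberStyles.HexNumber);
            }
            // Decode the whole run at once so multi-byte sequences stay together.
            return Encoding.UTF8.GetString(bytes);
        });
    }
}
```

With this sketch, HexRunDecoder.Decode("Test =64 =C2=AE =E1=A6=92 test") yields "Test d ® ᦒ test", and "=F0=90=8C=B8" yields 𐌸. An "=" that is not followed by two uppercase hex digits is left untouched.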

But can this be simplified?
Perhaps there's already a standard function?

I searched, but haven't found a good standard built-in C# function that already does this type of conversion.

Well, except for an example that uses a function from System.Net.Mail.
But it seems very error-prone and requires a very specific format.

var input = "bl=61=C2=B0"; 
var output = System.Net.Mail.Attachment.CreateAttachmentFromString("", "=?utf-8?Q?" + input.Trim() +"?=").Name;

Comment

Your data is encoded as quoted-printable. Maybe this keyword helps you find an existing library function. It definitely exists somewhere.
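
For reference, a minimal hand-rolled quoted-printable decoder might look like the sketch below. It is an illustration, not a library API: the name QuotedPrintableSketch is hypothetical, the payload is assumed to be UTF-8, and only the core rules are covered ("=XX" escapes, including "=3D" for a literal "=", and soft line breaks where "=" ends a line).

```csharp
using System;
using System.IO;
using System.Text;

public static class QuotedPrintableSketch
{
    public static string Decode(string input)
    {
        using (var buffer = new MemoryStream())
        {
            int i = 0;
            while (i < input.Length)
            {
                if (input[i] != '=')
                {
                    // Copy the literal run up to the next '=' unchanged.
                    int next = input.IndexOf('=', i);
                    if (next < 0) next = input.Length;
                    byte[] literal = Encoding.UTF8.GetBytes(input.Substring(i, next - i));
                    buffer.Write(literal, 0, literal.Length);
                    i = next;
                }
                else if (i + 2 < input.Length && input[i + 1] == '\r' && input[i + 2] == '\n')
                {
                    i += 3; // soft line break "=\r\n": dropped
                }
                else if (i + 1 < input.Length && input[i + 1] == '\n')
                {
                    i += 2; // soft line break with a bare "\n": dropped
                }
                else if (i + 2 < input.Length && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
                {
                    // "=XX": one raw byte (this sketch accepts upper- and lowercase hex).
                    buffer.WriteByte(Convert.ToByte(input.Substring(i + 1, 2), 16));
                    i += 3;
                }
                else
                {
                    buffer.WriteByte((byte)'='); // stray '=': kept as-is
                    i++;
                }
            }
            // Decode all collected bytes together so multi-byte characters survive.
            return Encoding.UTF8.GetString(buffer.ToArray());
        }
    }
}
```

Collecting raw bytes first and decoding once at the end means a multi-byte character is handled the same way whether it spans two, three, or four escapes.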

Answer

Are you willing to use % instead of =?

If so, Uri.UnescapeDataString should be sufficient. If not, you can always call Replace("=", "%") and use UnescapeDataString anyway.

Uri.UnescapeDataString("Test =64 =C2=AE =E1=A6=92 test".Replace("=", "%"))
//Test d ® ᦒ test

Comment

Seems nice at first glance. I'll test that tomorrow. It probably needs some tweaking on the replace, because not every "=" will belong to a hex code.
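
To make that concern concrete, here is a small sketch of the blanket replacement wrapped in a helper (the name BlanketReplaceDemo.Naive is hypothetical). It shows two side effects: every literal "=" is rewritten before decoding, and any percent escape already present in the input gets decoded along with the converted ones.

```csharp
using System;

public static class BlanketReplaceDemo
{
    // Replaces every '=' with '%' and then percent-decodes the result.
    public static string Naive(string input)
    {
        return Uri.UnescapeDataString(input.Replace("=", "%"));
    }
}
```

For an input such as "x=1", the "=" is not part of a hex code, yet it does not survive the call. For "=41%42", both the converted "=41" and the pre-existing "%42" are decoded, giving "AB".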

Comment

I've changed it to a one-liner that only targets those hex codes:

new Regex("(?:=[0-9A-F]{2})+").Replace(input, m => Uri.UnescapeDataString(m.Value.Replace("=","%")))

And after looking up quoted-printable, it seems that literal "=" signs get hex-encoded as well (as =3D). Thanks.

Bruno Costa · Accepted Answer · 2018-02-05 16:42:46Z


LukStorms · Comment · 2018-02-05 16:47:55 +00:00
LukStorms · Comment · 2018-02-06 09:49:28 +00:00
