We receive messages from a system that contain hex codes representing the bytes of UTF-8 encoded characters.
For example:
var str="Test =64 =C2=AE =E1=A6=92 test";
To decode these codes, I've added a simple function that performs three regex replacements:
protected static string ReplaceHexCodesInString(string input)
{
    var output = input;
    var encoding = Encoding.UTF8;

    // Three-byte sequences: a lead byte E0-EF followed by two more hex-coded bytes.
    var regTripleHex = new Regex("=(E[0-9A-F])=([0-9A-F]{2})=([0-9A-F]{2})");
    output = regTripleHex.Replace(output, m => encoding.GetString(new[]{
        byte.Parse(m.Groups[1].Value, System.Globalization.NumberStyles.HexNumber),
        byte.Parse(m.Groups[2].Value, System.Globalization.NumberStyles.HexNumber),
        byte.Parse(m.Groups[3].Value, System.Globalization.NumberStyles.HexNumber)
    }));

    // Two-byte sequences: a lead byte C0-DF followed by one more hex-coded byte.
    var regDoubleHex = new Regex("=([C-D][0-9A-F])=([0-9A-F]{2})");
    output = regDoubleHex.Replace(output, m => encoding.GetString(new[]{
        byte.Parse(m.Groups[1].Value, System.Globalization.NumberStyles.HexNumber),
        byte.Parse(m.Groups[2].Value, System.Globalization.NumberStyles.HexNumber)
    }));

    // Whatever is left is decoded as a single byte.
    var regRemainingHex = new Regex("=([0-9A-F]{2})");
    output = regRemainingHex.Replace(output, m => encoding.GetString(new[]{
        byte.Parse(m.Groups[1].Value, System.Globalization.NumberStyles.HexNumber)
    }));

    return output;
}
This seems to work as expected for what's currently in those messages.
Note that the messages don't contain 4-byte UTF-8 characters
(e.g. 0xF0 0x90 0x8C 0xB8 = 𐌸).
But can this be simplified?
Perhaps there's already a standard function?
I searched, but haven't found a good standard built-in C# function that already does this type of conversion.
Well, except for an example that uses a function from System.Net.Mail.
But it seems very error-prone and requires a very specific format.
var input = "bl=61=C2=B0";
var output = System.Net.Mail.Attachment.CreateAttachmentFromString("", "=?utf-8?Q?" + input.Trim() +"?=").Name;
Your data is encoded as quoted-printable. Maybe this keyword helps you find an existing library function. It definitely exists somewhere. – Roland Illig, Feb 5, 2018 at 17:51
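To make the quoted-printable hint concrete, here is a minimal sketch of a decoder for that format, assuming the input only uses upper-case =XX byte codes and "=" followed by CRLF as a soft line break. The class and method names (QuotedPrintableSketch, Decode) are made up for illustration and are not a framework API. Because it collects whole runs of bytes before decoding them, it is not limited to 1-, 2- or 3-byte sequences.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration; not part of the .NET framework.
public static class QuotedPrintableSketch
{
    public static string Decode(string input)
    {
        var output = new StringBuilder(input.Length);
        var pending = new List<byte>(); // bytes of the current run of =XX codes

        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '=' && i + 2 < input.Length)
            {
                if (IsHex(input[i + 1]) && IsHex(input[i + 2]))
                {
                    // "=XX": one encoded byte.
                    pending.Add(byte.Parse(input.Substring(i + 1, 2), NumberStyles.HexNumber));
                    i += 2;
                    continue;
                }
                if (input[i + 1] == '\r' && input[i + 2] == '\n')
                {
                    // "=" + CRLF: soft line break, dropped from the output.
                    i += 2;
                    continue;
                }
            }

            // Literal character: decode any collected bytes first, then copy it.
            Flush(output, pending);
            output.Append(input[i]);
        }

        Flush(output, pending);
        return output.ToString();
    }

    // Upper-case hex only, matching the [0-9A-F] class used in the question.
    private static bool IsHex(char c)
    {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F');
    }

    private static void Flush(StringBuilder output, List<byte> pending)
    {
        if (pending.Count == 0) return;
        output.Append(Encoding.UTF8.GetString(pending.ToArray()));
        pending.Clear();
    }
}
```

With the sample from the question, QuotedPrintableSketch.Decode("Test =64 =C2=AE =E1=A6=92 test") yields "Test d ® ᦒ test", and an "=" that is not followed by two hex digits or a CRLF is copied through unchanged.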
1 Answer
Are you willing to use % instead of =?
If so, Uri.UnescapeDataString should be sufficient. If not, you can always call Replace("=", "%") and use UnescapeDataString anyway.
Uri.UnescapeDataString("Test =64 =C2=AE =E1=A6=92 test".Replace("=", "%"))
//Test d ® ᦒ test
Seems nice at first glance. I'll test that tomorrow. It probably needs some tweaking on the replace, because not every "=" will belong to a hex code. – LukStorms, Feb 5, 2018 at 16:47
I've changed it to a one-liner that only targets those hex codes:
new Regex("(?:=[0-9A-F]{2})+").Replace(input, m => Uri.UnescapeDataString(m.Value.Replace("=","%")))
And after looking up quoted-printable, it seems the "=" signs themselves also get hex-encoded. Thanks. – LukStorms, Feb 6, 2018 at 9:49
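For reference, the one-liner from this comment can be wrapped into a small self-contained helper along the following lines. This is only a sketch: the class name HexRunDecoder and the cached Regex field are assumptions added for illustration, while the pattern and the Uri.UnescapeDataString call are taken from the comment as written.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the one-liner from the comment above.
public static class HexRunDecoder
{
    // One or more consecutive "=XX" codes (upper-case hex), matched as a single run
    // so that multi-byte UTF-8 sequences are decoded together.
    private static readonly Regex HexRun = new Regex("(?:=[0-9A-F]{2})+", RegexOptions.Compiled);

    public static string Decode(string input)
    {
        // Only the matched runs are rewritten to %XX and unescaped;
        // any other "=" or "%" in the text is left untouched.
        return HexRun.Replace(input, m => Uri.UnescapeDataString(m.Value.Replace("=", "%")));
    }
}
```

Called as HexRunDecoder.Decode("Test =64 =C2=AE =E1=A6=92 test"), this gives the same "Test d ® ᦒ test" as the answer, while text such as "a = b, 100%41" passes through unchanged because nothing in it matches the pattern. How runs that are not valid UTF-8 come out is left to Uri.UnescapeDataString and is not handled here.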