\$\begingroup\$

How can this code, which generates a URL-friendly slug, be shortened?

// remove accent
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(input);
input = System.Text.Encoding.UTF8.GetString(bytes);
// make it all lower case
input = input.ToLower();
// remove stop words
input = System.Text.RegularExpressions.Regex.Replace(input, "\\b" + string.Join("\\b|\\b", ENGLISH_STOP_WORDS) + "\\b", "");
// remove entities
input = System.Text.RegularExpressions.Regex.Replace(input, @"&\w+;", "");
// remove anything that is not letters, numbers, dash, or space
input = System.Text.RegularExpressions.Regex.Replace(input, @"[^a-z0-9\-\s]", "");
// replace spaces
input = input.Replace(' ', '-');
// collapse dashes
input = System.Text.RegularExpressions.Regex.Replace(input, @"-{2,}", "-");
// collapse spaces
input = System.Text.RegularExpressions.Regex.Replace(input, @"\s+", " ").Trim();
// Trim dashes and spaces
input = input.Trim(' ').Trim('-').Trim(' '); // double-trim the spaces in case dashes were covering them
return input;
Jamal
asked Sep 2, 2012 at 8:03
\$\endgroup\$
  • \$\begingroup\$ What is the final intent of this snippet? \$\endgroup\$ Commented Sep 2, 2012 at 9:19
  • \$\begingroup\$ I think it’s quite obvious, given the comments: it should generate a URL-friendly slug. \$\endgroup\$ Commented Sep 2, 2012 at 10:14

1 Answer

\$\begingroup\$
// remove accent

Actually, no. The code under that comment is just a lossless round trip to UTF-8 and back, which doesn’t change the text.
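To check that claim, here is a minimal sketch that wraps the two lines from the question in a method. The class and method names are made up for illustration, and the claim assumes well-formed text (no unpaired surrogates).

```csharp
using System;
using System.Text;

static class Utf8RoundTripDemo
{
    // Encodes to UTF-8 bytes and decodes them again, exactly as the
    // "remove accent" step in the question does.
    public static string RoundTrip(string input)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(input);
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Calling `Utf8RoundTripDemo.RoundTrip("Crème brûlée")` hands back the same string, accents included.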

For the rest, I’d coalesce the regular expressions; if nothing else, one pass over the string is more efficient than three. I’d also import the System.Text.RegularExpressions namespace to get rid of the overlong explicit qualification. The "collapse spaces" step makes no sense, since every space has already been replaced by a dash at that point.

You can also coalesce the Trim calls into one, since Trim accepts several characters.

Ignoring for now that the accent removal doesn’t work, this leaves us with:

input = input.ToLower();
// remove stop words, entities and anything that is not letters, numbers, dash, or space
string stopWords = string.Format("\\b{0}\\b", string.Join("\\b|\\b", ENGLISH_STOP_WORDS));
input = Regex.Replace(input, stopWords + @"|&\w+;|[^a-z0-9\-\s]", "");
// replace spaces
input = input.Replace(' ', '-');
// collapse dashes
input = Regex.Replace(input, @"-{2,}", "-");
// Trim dashes and spaces
input = input.Trim(' ', '-');
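For anyone who wants to try this out, the following is one possible self-contained sketch of the same idea, not the exact code above. It makes a few assumptions of its own: the stop-word list is a short hypothetical stand-in for `ENGLISH_STOP_WORDS` (which the question does not show), the words are passed through `Regex.Escape`, `ToLowerInvariant` replaces `ToLower`, and the "replace spaces" and "collapse dashes" steps are merged into one regex that also covers tabs and newlines.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SlugSketch
{
    // Hypothetical stand-in for the question's ENGLISH_STOP_WORDS.
    static readonly string[] EnglishStopWords = { "a", "an", "and", "of", "the" };

    public static string Slugify(string input)
    {
        input = input.ToLowerInvariant();

        // One pass: stop words, entities, and any character that is not
        // a-z, 0-9, a dash, or whitespace.
        string stopWords = string.Join("|", EnglishStopWords.Select(w => Regex.Escape(w)));
        input = Regex.Replace(input, @"\b(?:" + stopWords + @")\b|&\w+;|[^a-z0-9\-\s]", "");

        // Turn every run of whitespace and dashes into a single dash.
        input = Regex.Replace(input, @"[\s-]+", "-");

        // Drop any dash left at either end.
        return input.Trim('-');
    }
}
```

With this stop-word list, `SlugSketch.Slugify("The Quick &amp; Brown Fox!")` yields `quick-brown-fox`.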

To actually remove accents, you need to normalize the Unicode string so that each accented character is decomposed into a base character followed by combining marks, and then remove those combining marks:

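using System.Globalization;
using System.Text;

static string RemoveDiacritics(string stIn) {
    // Decompose each accented character into a base character plus combining marks.
    string stFormD = stIn.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();
    for (int ich = 0; ich < stFormD.Length; ich++) {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        // Keep everything except the combining (non-spacing) marks.
        if (uc != UnicodeCategory.NonSpacingMark) {
            sb.Append(stFormD[ich]);
        }
    }
    return sb.ToString();
}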
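As a small illustration of what the normalization step buys you, here is a compact LINQ variant of the same technique. The class and method names are hypothetical. Note that characters with no canonical decomposition, such as "ø", pass through unchanged and would still be dropped by the later `[^a-z0-9\-\s]` filter.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

static class DiacriticsDemo
{
    // Same idea as RemoveDiacritics: decompose, then drop non-spacing marks.
    public static string Strip(string text)
    {
        var kept = text.Normalize(NormalizationForm.FormD)
                       .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
        return new string(kept.ToArray());
    }
}
```

For example, `DiacriticsDemo.Strip("Crème Brûlée")` gives `Creme Brulee`. In the slug code this call would go first, before lowercasing and the regex cleanup.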
Shog9
answered Sep 2, 2012 at 12:20
\$\endgroup\$
  • \$\begingroup\$ Nice, tested and works like a charm. Interesting to see what others come up with. \$\endgroup\$ Commented Sep 2, 2012 at 19:39
  • \$\begingroup\$ Why not make the regex a compiled regex? Regex theRegex = new Regex(stopWords + @"|&\w+;|[^a-z0-9\-\s]", RegexOptions.Compiled); then call theRegex.Replace(...). This should make the calls a little quicker and use fewer resources. \$\endgroup\$ Commented Sep 4, 2012 at 5:23
  • \$\begingroup\$ @Jeff True, it’s potentially very useful. But this should never be the default option because it creates a memory leak: there is no way of unloading the compiled regex from memory until the end of the application domain. As such, only unchanging, frequently used regexes should be compiled. But this is probably the case here anyway. \$\endgroup\$ Commented Sep 4, 2012 at 7:01
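The compiled-regex suggestion from the comments could look roughly like the sketch below, which builds the pattern once and keeps it in a static field so the compilation cost is paid a single time. The class name, method name, and stop-word list are assumptions for illustration; only `RegexOptions.Compiled` and the pattern come from the discussion above.

```csharp
using System;
using System.Text.RegularExpressions;

static class SlugCleaner
{
    // Hypothetical stand-in for the question's ENGLISH_STOP_WORDS.
    static readonly string[] EnglishStopWords = { "a", "an", "and", "of", "the" };

    // Built once and reused for every call; the pattern never changes.
    static readonly Regex Cleanup = new Regex(
        @"\b(?:" + string.Join("|", EnglishStopWords) + @")\b|&\w+;|[^a-z0-9\-\s]",
        RegexOptions.Compiled);

    // Expects input that has already been lowercased.
    public static string Clean(string loweredInput)
    {
        return Cleanup.Replace(loweredInput, "");
    }
}
```

`SlugCleaner.Clean` only covers the removal step; the space-to-dash replacement and trimming would follow as before.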
