Removing accents from certain characters

Question 1

I have a method that removes accents from certain characters. The problem is the sheer number of characters I am expected to handle: essentially, every accented Latin character that maps to one of the 26 basic English letters (A through Z) has to be reduced to that letter. Performance is a hard requirement. It has to be lightning fast, as I have to run this on every character within a string, and process many large strings at a time.

Currently, I use a gigantic switch statement to detect what character it is, and return the appropriate A through Z "naked" character, while preserving case.

As of now, my switch looks something like the following:

switch (input)
{
 case 'À': // 0192
 case 'Á': // 0193
 case 'Â': // 0194
 case 'Ã': // 0195
 case 'Ä': // 0196
 case 'Å': // 0197
 case 'Ā': // 0256
 case 'Ă': // 0258
 case 'Ą': // 0260
 return 'A';
 case 'Ç': // 0199
 case 'Ć': // 0262
 case 'Ĉ': // 0264
 case 'Ċ': // 0266
 case 'Č': // 0268
 return 'C';
 case 'Ď': // 0270
 case 'Đ': // 0272
 return 'D';
 // Other upper case characters
 case 'à': // 0224
 case 'á': // 0225
 case 'â': // 0226
 case 'ã': // 0227
 case 'ä': // 0228
 case 'å': // 0229
 case 'ā': // 0257
 case 'ă': // 0259
 case 'ą': // 0261
 return 'a';
 case 'ç': // 0231
 case 'ć': // 0263
 case 'ĉ': // 0265
 case 'ċ': // 0267
 case 'č': // 0269
 return 'c';
 case 'ď': // 0271
 case 'đ': // 0273
 return 'd';
 // Other lower case characters
 default:
 return input;
}

As you can probably imagine, this method is over 200 lines, and this is the only thing it does.

private char RemoveAccent(char input)
{
 switch (input)
 {
 // You saw all the case statements
 }
}

Literally, that is it. My questions come down to the following, and this is more of a question of performance/better ways of handling the situation.

I know I can use a Regex to do the same thing very easily. It's what I used in the beginning as a shortcut. The problem is that the regex is phenomenally slow. Essentially, I was looping over the 26 letters, using the regex "[ALPHACHARACTER]" for each one, and replacing every match with ALPHACHARACTER.
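For reference, the regex approach described above might have looked roughly like the following. This is a reconstruction under assumptions, not the original code: the class name `RegexAccentRemover`, the `Patterns` table, and the choice of one pattern per letter and case are all hypothetical, and only A, C, and D are filled in. It illustrates where the cost comes from: every pattern triggers another full scan of the string.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexAccentRemover
{
    // Hypothetical table: one character class per base letter and case.
    // Only A, C and D are listed, mirroring the switch excerpt.
    private static readonly KeyValuePair<string, string>[] Patterns =
    {
        new KeyValuePair<string, string>("[ÀÁÂÃÄÅĀĂĄ]", "A"),
        new KeyValuePair<string, string>("[ÇĆĈĊČ]", "C"),
        new KeyValuePair<string, string>("[ĎĐ]", "D"),
        new KeyValuePair<string, string>("[àáâãäåāăą]", "a"),
        new KeyValuePair<string, string>("[çćĉċč]", "c"),
        new KeyValuePair<string, string>("[ďđ]", "d"),
    };

    public static string RemoveAccents(string input)
    {
        foreach (var entry in Patterns)
        {
            // Each call scans the entire string again, once per pattern.
            input = Regex.Replace(input, entry.Key, entry.Value);
        }
        return input;
    }
}
```

With a full alphabet the string is scanned dozens of times, whereas a per-character switch or table lookup touches each character once.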

This method, however, was horrendous in performance. Over a set of input x, with data logic y, it was a whopping 1500ms.

Now, when I exclude this method from the logic, it performed at about 110ms. This is with the same input x and logic y, mind you. The only difference is that this method had a body of return input;.

So, I went to the switch method, which is very high in performance, and very low in maintainability. (The Calculate Code Metrics in Visual Studio puts it at a 41 for this method.) This one performed at 120ms with the exact same remaining logic 'y' and input 'x' as the Regex option. This allows me to exclude all other logic as being the issue.
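A comparison like the one above can be reproduced with a small `Stopwatch` harness. The sketch below is only an illustration: the `Timing` class and its `Measure` method are hypothetical names, and it makes no claim about how the numbers quoted here were actually obtained.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs `action` once as a warm-up, then `iterations` more times,
    // and returns the elapsed wall-clock time of the timed runs in milliseconds.
    public static long Measure(Action action, int iterations)
    {
        action(); // warm-up call so JIT compilation is not counted

        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();

        return stopwatch.ElapsedMilliseconds;
    }
}
```

Calling `Timing.Measure(() => ProcessAll(inputs), 10)` once per variant (regex, switch, and the `return input;` stub) keeps the input and the remaining logic identical, so any difference comes from the accent-removal method alone. `ProcessAll` and `inputs` are placeholders for the surrounding logic and data.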

The only other option I can think of is to use an if statement on character ranges to at least make it more maintainable. My concern with that approach is its performance.

if ((input >= 'À' && input <= 'Å') || (input >= 'Ā' && input <= 'Ą' && input % 2 == 0))
 return 'A';
else if ((input == 'Ç') || (input >= 'Ć' && input <= 'Č' && input % 2 == 0))
 return 'C';
else if (input >= 'Ď' && input <= 'Đ' && input % 2 == 0)
 return 'D';
// Other upper case characters
else if ((input >= 'à' && input <= 'å') || (input >= 'ā' && input <= 'ą' && input % 2 == 1))
 return 'a';
else if ((input == 'ç') || (input >= 'ć' && input <= 'č' && input % 2 == 1))
 return 'c';
else if (input >= 'ď' && input <= 'đ' && input % 2 == 1)
 return 'd';
// Other lower case characters
return input;

Also, this is not all of the inputs. The 200 lines I mentioned earlier don't even cover a quarter of the characters I have to support. The thing is, I personally find the switch more readable.

My questions are:

Is there a significantly more readable way to write this without significantly losing performance?
Is my switch statement the way to go if I favour performance over maintainability?
Should I stick with it regardless of what metrics Visual Studio might mention if it's more readable/understandable to me, and already has acceptable performance?
Is there something already in .NET that does what I am doing and might perform faster?

The only reason I ask these things is that I feel like I'm doing something wrong, like I'm missing something in the way I approach this.

Personally, I think that since I find it readable, and it's already fast enough, there's no point to change it. The only problem is that I still have a lot of options to add to it, so it will only grow in size. The management of it could get a bit out of hand.

Answer

This appears to be a duplicate of this question. The link suggests using .NET's String.Normalize. If it's too slow, you could simply create an associative array (e.g., a Dictionary that maps char->char) for constant-time lookup. This is going to be large, too, but I would think it's probably easier to maintain.
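A minimal sketch of how the two suggestions could fit together, assuming the commonly used decompose-and-filter technique: `String.Normalize(NormalizationForm.FormD)` splits an accented letter into its base letter plus combining marks, and the marks are then dropped. The class name `AccentStripper` and the `Fallback` table are hypothetical. One caveat worth checking against the real character list: letters such as Đ, Ø, and Ł have no canonical decomposition, so normalization alone leaves them untouched. Here the dictionary is used only as a fallback for those, which is an assumption of this sketch and not something the answer prescribes; the same `TryGetValue` lookup would also work with a full char-to-char table.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AccentStripper
{
    // Hypothetical fallback table for letters without a canonical
    // decomposition (a stroke is part of the letter, not a combining mark).
    private static readonly Dictionary<char, char> Fallback = new Dictionary<char, char>
    {
        { 'Đ', 'D' }, { 'đ', 'd' },
        { 'Ø', 'O' }, { 'ø', 'o' },
        { 'Ł', 'L' }, { 'ł', 'l' },
    };

    public static string RemoveAccents(string input)
    {
        // FormD splits e.g. 'É' into 'E' followed by U+0301 (combining acute accent).
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var builder = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Drop the combining marks exposed by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue;
            }

            // Map the characters normalization cannot handle; pass the rest through.
            char mapped;
            builder.Append(Fallback.TryGetValue(c, out mapped) ? mapped : c);
        }

        return builder.ToString();
    }
}
```

Because this works on whole strings, it replaces the per-character `RemoveAccent(char)` call with a single call per string. Whether it is fast enough for a given workload is something only a measurement on that workload can settle.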

Comment

Interesting, that seems to be working quite well, and is just as quick as my switch. (Maybe even faster.) I think it's safe to say that this resolves my issue, and I'm not sure why I decided to do it the hard way. (I'm not sure how I missed that SO question. I guess I just assumed no one else has this issue.)

Ethan · Accepted Answer · 2015-06-12 17:42:05Z · edited May 23, 2017 at 12:40 by Community Bot

Der Kommissar · commented Jun 12, 2015 at 17:50
