Remove unwanted characters from a string

Question 1

From what I've seen in other posts, if I know the unwanted characters in advance, I can use string.Replace(). But in my case any character can appear in the input string, and I only want to keep the characters that I want (without changing their order, of course).

private string RemoveUnwantedChar(string input)
{
    string correctString = "";
    for (int i = 0; i < input.Length; i++)
    {
        if (char.IsDigit(input[i]) || input[i] == '.' || input[i] == '-' || input[i] == 'n'
            || input[i] == 'u' || input[i] == 'm' || input[i] == 'k' || input[i] == 'M'
            || input[i] == 'G' || input[i] == 'H' || input[i] == 'z' || input[i] == 'V'
            || input[i] == 's' || input[i] == '%')
            correctString += input[i];
    }
    return correctString;
}

Characters that I want:

0123456789
numkMGHzVs%-

How can I tidy this code to be neater and more readable?

Question 2

Cross-posted on Stack Overflow; the question has an accepted answer there.

Question 3

Note that IsDigit accepts any Unicode digit, not just the ASCII 0 to 9.

Question 4

@CodesInChaos I am a bit confused, can you please elaborate on that? I thought IsDigit checks for 0-9, while IsNumber checks for even more, like subscripts and fractions?

Question 5

@Liren There are digits besides 0-9: the full-width 0-9 and digits from various scripts, e.g. ୫, ൬, ᠑
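To make the difference concrete, here is a small sketch that contrasts char.IsDigit with a plain range check. The class name DigitCheck and the helper IsAsciiDigit are made up for this illustration; they are not part of the original post.

```csharp
using System;

public static class DigitCheck
{
    // Hypothetical helper: true only for the ASCII digits '0'..'9'.
    public static bool IsAsciiDigit(char c)
    {
        return c >= '0' && c <= '9';
    }

    // Describes how both checks treat a single character.
    public static string Describe(char c)
    {
        return string.Format("U+{0:X4}: IsDigit={1}, IsAsciiDigit={2}",
            (int)c, char.IsDigit(c), IsAsciiDigit(c));
    }
}
```

Calling DigitCheck.Describe on '7', '\uFF17' (full-width seven) and '\u0663' (Arabic-Indic three) should report IsDigit as true for all three, while the range check accepts only the first.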

Question 6

By having a const string which contains all of your wanted chars, you could either make a simple call to Contains() or check whether IndexOf() returns a value > -1.
Using string concatenation in a loop is mostly a bad idea. Use a StringBuilder instead.
Omitting braces {}, although they are optional for single-line if statements, is a bad idea because it makes your code error-prone.

Implementing the mentioned points will lead to

private const string allowedCharacters = "numkMGHzVs%-.";
private string RemoveUnwantedChar(string input)
{
    StringBuilder builder = new StringBuilder(input.Length);
    for (int i = 0; i < input.Length; i++)
    {
        if (char.IsDigit(input[i]) || allowedCharacters.Contains(input[i]))
        {
            builder.Append(input[i]);
        }
    }
    return builder.ToString();
}
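The IndexOf() alternative mentioned in the first point could look roughly like the sketch below. The class and method names (IndexOfFilter, KeepAllowed) are hypothetical, and spelling out the digits in the constant instead of calling char.IsDigit is an assumption of this sketch, made so that only ASCII 0-9 pass.

```csharp
using System.Text;

public static class IndexOfFilter
{
    // Assumption of this sketch: the digits are listed explicitly,
    // so non-ASCII Unicode digits are filtered out as well.
    private const string AllowedCharacters = "0123456789numkMGHzVs%-.";

    public static string KeepAllowed(string input)
    {
        StringBuilder builder = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // IndexOf(char) returns -1 when the character is not found.
            if (AllowedCharacters.IndexOf(c) > -1)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```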

@Caricorc made a good suggestion in the comments:

In my opinion allowedCharacters should be an argument to the function to allow reusability.

So pass allowedCharacters as an optional parameter, with an additional check using IsNullOrEmpty().

If performance is an issue, you could also pass a HashSet<char> to the method, or have an overloaded method like so:

private string RemoveUnwantedChar(string input, string allowedCharacters = "0123456789numkMGHzVs%-.")
{
    if (string.IsNullOrEmpty(allowedCharacters)) { return input; }
    return RemoveUnwantedChar(input, new HashSet<char>(allowedCharacters));
}
private string RemoveUnwantedChar(string input, HashSet<char> allowedCharacters)
{
    if (allowedCharacters.Count == 0) { return input; }
    StringBuilder builder = new StringBuilder(input.Length);
    for (int i = 0; i < input.Length; i++)
    {
        if (allowedCharacters.Contains(input[i]))
        {
            builder.Append(input[i]);
        }
    }
    return builder.ToString();
}

With allowedCharacters passed in as a parameter, you can reuse the method somewhere else.

Question 7

In my opinion allowedCharacters should be an argument to the function to allow reusability.

Question 8

I think char.IsDigit(input[i]) || is a problem in the second function, as the user may or may not want to allow digits.

Question 9

Perfect :) My last suggestion is converting the allowed chars to a set to reduce time complexity from n^2 to n. (Actually I should have posted all of this as an answer, but now it is too late to convert it.)

Question 10

I feel that RemoveUnwantedCharacters should have a blacklist parameter, not a whitelist parameter. This really should be ConformToWhitelist or something - we're not removing unwanted characters so much as we're keeping wanted characters. A subtle difference, but a difference nevertheless. I would assume RemoveUnwantedCharacters would ask me for characters I don't want!

Question 11

@corsiKa Very good point, actually! I felt the same way after that; I have now changed my method name to FilterString.
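The naming point can be illustrated with a pair of methods whose names say which kind of list they expect. This is only a sketch: CharFilter, KeepOnly and Remove are hypothetical names, not ones proposed in the thread.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CharFilter
{
    // Whitelist: keep only the characters found in 'allowed'.
    public static string KeepOnly(string input, ISet<char> allowed)
    {
        return Filter(input, allowed, true);
    }

    // Blacklist: drop every character found in 'unwanted'.
    public static string Remove(string input, ISet<char> unwanted)
    {
        return Filter(input, unwanted, false);
    }

    // Shared loop: a character is kept when its set membership
    // matches 'keepWhenInSet'.
    private static string Filter(string input, ISet<char> set, bool keepWhenInSet)
    {
        StringBuilder builder = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (set.Contains(c) == keepWhenInSet)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

With this split, the method name tells the caller whether the second argument lists characters to keep or characters to drop.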

Question 12

Whenever I see filtering, I tend to think LINQ.

private string RemoveUnwantedCharacters(string input, IEnumerable<char> allowedCharacters)
{
    var filtered = input.ToCharArray()
        .Where(c => allowedCharacters.Contains(c))
        .ToArray();
    return new String(filtered);
}

You can call it like this:

string filteredString = RemoveUnwantedCharacters(inputString, "0123456789numkMGHzVs%-.");

Code is shorter
The intent is clear - it basically reads as "filtered is input where allowed characters contains this character", which is pretty self-explanatory
Allowed characters is a parameter, so you can reuse the method in various places. If you're using the same set of allowed characters in a lot of places, stick them in some sort of settings store.

Question 13

Using AsEnumerable on the input would be even more LINQ-ish ;-)

Question 14

1) A string is already an IEnumerable<char>. Both AsEnumerable and ToCharArray are useless here. 2) No need for a lambda. Just use .Where(allowedCharacters.Contains)
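Both points from this comment can be sketched as follows. The class and method names (LinqFilter, KeepAllowed) are hypothetical, and copying the allowed characters into a HashSet<char> is an extra step borrowed from the set suggestion elsewhere in the thread rather than something this comment asks for.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LinqFilter
{
    public static string KeepAllowed(string input, IEnumerable<char> allowedCharacters)
    {
        // Copy into a HashSet once so each lookup is O(1) on average.
        HashSet<char> allowed = new HashSet<char>(allowedCharacters);

        // A string is an IEnumerable<char>, so Where applies directly;
        // allowed.Contains is passed as a method group instead of a lambda.
        return new string(input.Where(allowed.Contains).ToArray());
    }
}
```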

Question 15

@CodesInChaos while .Where(allowedCharacters.Contains) works, it takes me much more mental effort to understand than .Where(c => allowedCharacters.Contains(c)). I think the increased clarity is well worth the 8 extra keypresses.

Question 16

@CarlLeth - I think that's a personal preference thing; I much prefer the more concise version.

Question 17

@Liren Same effect from opposite routes. The .Contains() is checking "does that collection contain this character" while the .Any() is checking "are any of the characters in this collection the same as this one". They end up doing the same thing, but what you're doing with .Where() and .Any() is manually doing what .Contains() is actually intended for.

Question 18

Same answer as on SO

Whenever you have to search for literals, Regex is the way to go.

public string RemoveUnwantedChar(string input) {
    StringBuilder stringBuilder = new StringBuilder();
    foreach (var match in Regex.Matches(input, "[0-9numkMGHzVs%\\-.]")) {
        stringBuilder.Append(match.ToString());
    }
    return stringBuilder.ToString();
}

Code is shorter
Very easy to expand
Easy to read
Easy to follow the Code

Second solution: a nice one-liner, as Taemyr suggested:

public string RemoveUnwantedChar(string input) {
    return Regex.Replace(input, "[^0-9numkMGHzVs%\\-.]", "");
}

// Edit: changed from string concatenation to a StringBuilder implementation for better performance, especially for large inputs

// Edit 2: escaped the dash; for more info see https://stackoverflow.com/questions/9589074/regex-should-hyphens-be-escaped

Question 19

You have presented an alternative solution, but haven't reviewed the code. Please explain your reasoning (how your solution works and how it improves upon the original) so that the author can learn from your thought process.

Question 20

String concatenation in a loop, while present in the original code, is an anti-pattern (negative impact on performance, possibly a problem for large inputs)

Question 21

First, you need to escape the dash. Second, your function body can be replaced with return Regex.Replace(input,"[^0-9numkMGHzVs%\\-.]","")

Question 22

You can alternatively put the hyphen at the beginning or end of the character class.
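As a sketch of that alternative, the hyphen below sits at the end of the character class and is therefore not escaped. The class and method names (RegexFilter, KeepAllowed) are hypothetical, and caching the Regex in a static field is a choice of this sketch, not something from the answer.

```csharp
using System.Text.RegularExpressions;

public static class RegexFilter
{
    // The hyphen is the last character of the class, so it is read as a
    // literal and needs no escaping; '.' is also literal inside [...].
    private static readonly Regex Unwanted = new Regex("[^0-9numkMGHzVs%.-]");

    public static string KeepAllowed(string input)
    {
        // Replace every character outside the whitelist with nothing.
        return Unwanted.Replace(input, "");
    }
}
```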

Accepted Answer · Heslacher · 2015-11-02 10:12:53Z

