Looking to optimize this method so it will run quicker, but can't seem to find anything. WOrking in .NET 3.5. ExcludedWords and NewCodes are both HashSet of strings.
private bool isValid(String code)
{
String pattern = "[a-zA-Z]{3}";
if (chkLessThan2Letters.Checked && Regex.IsMatch(code, pattern))
{
return false;
}
foreach (string s in excludedWords)
{
if (code.Contains(s))
{
return false;
}
}
if (newCodes.Contains(code))
{
return false;
}
return true;
}
5 Answers 5
Before you optimize this, make sure that it actually needs optimization and that you're not wasting your time. Have you checked that this is a performance bottleneck in your application?
One way to optimize this is to store your excluded words in a datastructure like a Trie. This will let you see if some word in in the trie potentially faster than iterating through a list.
-
\$\begingroup\$ Can this work if I don't have pairs? I only have 1 datum, the new word. There is no key associated with it. \$\endgroup\$MAW74656– MAW746562012年05月04日 18:12:59 +00:00Commented May 4, 2012 at 18:12
-
\$\begingroup\$ yeah. You can just make the key=new word, value=true. You don't really care about the value, you just want the improved lookup performance. \$\endgroup\$Oleksi– Oleksi2012年05月04日 18:13:55 +00:00Commented May 4, 2012 at 18:13
Assuming this is really your bottleneck, a few options:
Create your Regex ahead of time and precompile it.
If excludedWords doesn't change while you're running, try creating a long OR Regex, like
(badWord1|badWord2|xyz)
and precompiling it. This should create a fairly efficient search-tree that will be faster than performing a repeated Contains call.
-
\$\begingroup\$ Would a longer OR Regex work with more than 500 items? At what point is there too many items? How could I precompile the regex? \$\endgroup\$MAW74656– MAW746562012年05月04日 18:03:47 +00:00Commented May 4, 2012 at 18:03
-
\$\begingroup\$ @MAW74656, for compilation, use this constructor and pass in
RegexOptions.Compiled
. You'll want to keep that instance around and use it for all your matching. As far as length, I believe it should work fine for over 500 items, but I recommend trying it out. A long Regex might take a little while to compile when you first construct it, but performance once constructed should be quite good. \$\endgroup\$Dan Bryant– Dan Bryant2012年05月04日 19:14:09 +00:00Commented May 4, 2012 at 19:14 -
\$\begingroup\$ -What about 5 million? I'm not just being difficult, I'd like to ability to efficiency handle large lists. \$\endgroup\$MAW74656– MAW746562012年05月04日 19:53:17 +00:00Commented May 4, 2012 at 19:53
-
\$\begingroup\$ @MAW74656, I haven't used a Regex on that scale, but it wouldn't be hard for you to try. I suspect the compilation time would be somewhat long, but the compiled performance would be significantly better than a brute-force repeated Contains search. \$\endgroup\$Dan Bryant– Dan Bryant2012年05月04日 20:05:01 +00:00Commented May 4, 2012 at 20:05
-
\$\begingroup\$ So, I can simply loop through excluded words and use a stringbuilder to manually build the regex? \$\endgroup\$MAW74656– MAW746562012年05月04日 20:14:44 +00:00Commented May 4, 2012 at 20:14
This could help:
private static readonly Regex matchCode = new Regex("[a-zA-Z]{3}", RegexOptions.Compiled);
private bool IsValid(string code)
{
if (this.chkLessThan2Letters.Checked && matchCode.IsMatch(code))
{
return false;
}
if (this.excludedWords.Any(code.Contains))
{
return false;
}
return !this.newCodes.Contains(code);
}
-
\$\begingroup\$ What does
this.excludedWords.Any(code.Contains)
do different than what I have? \$\endgroup\$MAW74656– MAW746562012年05月04日 19:47:07 +00:00Commented May 4, 2012 at 19:47 -
\$\begingroup\$ It takes it out of your hands and puts it into the hands of LINQ which likely employs various optimizations internally. \$\endgroup\$Jesse C. Slicer– Jesse C. Slicer2012年05月04日 20:08:03 +00:00Commented May 4, 2012 at 20:08
-
\$\begingroup\$ But its the same effect? If it finds any excluded words within code it will return true? \$\endgroup\$MAW74656– MAW746562012年05月04日 20:10:45 +00:00Commented May 4, 2012 at 20:10
-
1\$\begingroup\$ @MAW74656 yes. That code is equivalent to:
this.excludedWords.Any(word => code.Contains(word))
\$\endgroup\$ANeves– ANeves2012年05月04日 20:40:34 +00:00Commented May 4, 2012 at 20:40 -
\$\begingroup\$ @MAW74656 yes, the effect is identical to your original
foreach
loop with theContains
. \$\endgroup\$Jesse C. Slicer– Jesse C. Slicer2012年05月04日 21:50:19 +00:00Commented May 4, 2012 at 21:50
I was going to make this a comment to Dan's answer, but it started getting quite long (and I wanted to try and actually code it to see if I remembered how).
Assuming excludedWords
is big and doesn't change between builds of the application (or if it does you have the ability to replace a single dll), you should build a dll containing the compiled regex representing it to get a much better search then you would if you used that linq expression.
using System.Collections.Generic;
using System.Reflection;
using System.Reflection.Emit;
using System.Text;
using System.Text.RegularExpressions;
namespace RegexLibAssemblyBuilder
{
public class RegexCompilationTest
{
public static IEnumerable<string> ExcludedWords() {...}
public static void Main()
{
//build pattern
var excludedWordsRegex = new StringBuilder();
bool first = true;
foreach (var word in ExcludedWords())
{
if (!first)
{
excludedWordsRegex.Append("|");
}
excludedWordsRegex.Append(Regex.Escape(word));
first = false;
}
// Define regular expression here
var expr = new RegexCompilationInfo(excludedWordsRegex.ToString(),
RegexOptions.None,
"ExcludedWordsRegex",
"Utilities.RegularExpressions",
true);
//add other assembly attributes you want just like this
ConstructorInfo ctor = typeof(System.Reflection.AssemblyTitleAttribute).GetConstructor(new[] { typeof(string) });
CustomAttributeBuilder[] attBuilder = { new CustomAttributeBuilder(ctor, new object[] { "General-purpose library of compiled regular expressions" }) };
Regex.CompileToAssembly(new[] { expr }, new AssemblyName("RegexLib, Version=1.0.0.1001, Culture=neutral, PublicKeyToken=null"), attBuilder);
}
}
}
Then in code you could use it like this (after linking your project to this new assembly):
private static readonly Regex excludedWords= new ExcludedWordsRegex();
private bool IsValid(string code)
{
if (this.chkLessThan2Letters.Checked && matchCode.IsMatch(code))
{
return false;
}
if (excludedWords.IsMatch(code))
{
return false;
}
...
-
\$\begingroup\$ -Should this perform differently than just buidling the regex in memory? Does the extra dll help performance? \$\endgroup\$MAW74656– MAW746562012年05月08日 20:41:14 +00:00Commented May 8, 2012 at 20:41
-
\$\begingroup\$ This has the same performance as building the regex in memory when it is used, but it moves the initialization time of the regex out to a compile time task instead of a first time hit during runtime. \$\endgroup\$Bill Barry– Bill Barry2012年05月09日 13:12:39 +00:00Commented May 9, 2012 at 13:12
-
\$\begingroup\$ Ok, then I don't want to use that, because the exclusion file changes at runtime, the user can chose the one they want to use, then it stays the same for the entire run and all the iterations of the loops. \$\endgroup\$MAW74656– MAW746562012年05月09日 21:36:56 +00:00Commented May 9, 2012 at 21:36
I am basing my answer on Jesse's. It would be unwise to throw away some of his optimizations. The only improvement I can offer is not using Regex but doing the equivalent check manually, because it is such a simple regex. If you have to use a different one, then this will not work. Note that I also made the methods static, made them take every parameter explicitly, thus making this code more "functional". Also, it looks less like Java because method starts with a capital letter and I use string
instead of String
. You are probably storing stuff as a HashSet but I am taking Collection as an input. Firstly, I was not sure which exact container you have. Secondly, this should not affect the speed, but if you prefer to do it differently, then you may do so.
// using System.Linq;
private static bool IsValidCode(string code, bool checkLessThanTwoLetters, ICollection<string> excludedWords, ICollection<string> newCodes)
{
// TODO: The naming is confusing but all I did was re-factored the code.
if (checkLessThanTwoLetters && ContainsThreeLetterCode(code: code))
{
return false;
}
if (excludedWords.Any(code.Contains))
{
return false;
}
return !newCodes.Contains(code);
}
// I updated this method thanks to a comment by Bill.
private static bool ContainsThreeLetterCode(string code)
{
if (code == null || code.Length != 3)
{
return false;
}
int ctr = 0;
foreach(char ch in code)
{
ctr = Char.IsLetter(ch) ? ctr + 1 : 0;
if (ctr == 3)
{
return true;
}
}
return false;
}
-
\$\begingroup\$ The regex matches any string with at least 3 letters, not exactly 3 letters, but yes a simple method would be better. \$\endgroup\$Bill Barry– Bill Barry2012年05月05日 00:48:28 +00:00Commented May 5, 2012 at 0:48
-
\$\begingroup\$ @Bill Barry, I am not so sure. I checked the pattern using regexplanet.com/advanced/java/index.html and Pot would match while Potato would not. I know Regex has a ton of options, but the way I read it was - the letter MUST repeat three times. \$\endgroup\$Leonid– Leonid2012年05月05日 01:02:07 +00:00Commented May 5, 2012 at 1:02
-
\$\begingroup\$ Firstly the tags on this question suggest C#, but more importantly, the regex is not bound by either
^
or$
. Thus it can match anywhere in the string. If you want, here is a silverlight app that will test .NET regular expressions; try it yourself: regexhero.net/tester \$\endgroup\$Bill Barry– Bill Barry2012年05月05日 01:22:09 +00:00Commented May 5, 2012 at 1:22 -
\$\begingroup\$ @Bill Barry, thanks for the link. I updated my code. \$\endgroup\$Leonid– Leonid2012年05月05日 18:27:19 +00:00Commented May 5, 2012 at 18:27
newCodes
? Using aHashSet<string>
here will be beneficial. \$\endgroup\$newCodes.Contains
should be very fast, you could do that one first and save time on average. \$\endgroup\$