Below is the method that I have written for reading from a text file. While reading, I need to match a line string to a given regex, and if it matches, I need to add the line string to a collection.
private static void GetOrigionalRGBColours(string txtFile)
{
string tempLineValue;
Regex regex = new Regex(@"^\d+.?\d* \d+.?\d* \d+.?\d* SRGB$");
using (StreamReader inputReader = new StreamReader(txtFile))
{
while (null != (tempLineValue = inputReader.ReadLine()))
{
if (regex.Match(tempLineValue).Success
&& tempLineValue != "1 1 1 SRGB"
&& tempLineValue != "0 0 0 SRGB")
{
string[] rgbArray = tempLineValue.Split(' ');
RGBColour rgbColour = new RGBColour() { Red = Convert.ToDecimal(rgbArray[0]), Green = Convert.ToDecimal(rgbArray[1]), Blue = Convert.ToDecimal(rgbArray[2]) };
originalColourList.Add(rgbColour);
}
}
}
}
When this method is run for a text file of 4MB having 28653 lines, it takes around 3 minutes just to finish the above method. Also, as a result of the above run, originalColourList
is populated with 582 items.
How can I improve the performance of this method? I had truncated the original text file for testing purpose. The actually text file size may go up to 60MB.
5 Answers 5
One option is to not perform any regular expression check at all. You can perform a quick match by checking if the line ends with SRGB
and the try the split and convert and simply continue if it fails. Something along these lines:
private static void GetOrigionalRGBColours(string txtFile)
{
using (StreamReader inputReader = new StreamReader(txtFile))
{
while (!inputReader.EndOfStream)
{
var line = inputReader.ReadLine();
if (line.EndsWith(" SRGB")
&& line != "1 1 1 SRGB"
&& line != "0 0 0 SRGB")
{
try
{
string[] rgbArray = line.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
if (rgbArray.Length == 4)
{
RGBColour rgbColour = new RGBColour() { Red = Convert.ToDecimal(rgbArray[0]), Green = Convert.ToDecimal(rgbArray[1]), Blue = Convert.ToDecimal(rgbArray[2]) };
originalColourList.Add(rgbColour);
}
}
catch (FormatException) {}
catch (OverflowException) {}
}
}
}
}
Alternatively to the try-catch
you can use decimal.TryParse()
and ignore the line if it returns false
for any of the 3 entries.
Update: Version with decimal.TryParse
.
private static void GetOrigionalRGBColours(string txtFile)
{
using (StreamReader inputReader = new StreamReader(txtFile))
{
while (!inputReader.EndOfStream)
{
var line = inputReader.ReadLine();
if (line.EndsWith(" SRGB")
&& line != "1 1 1 SRGB"
&& line != "0 0 0 SRGB")
{
string[] rgbArray = line.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
if (rgbArray.Length != 4) { continue; }
decimal red;
var hasRed = decimal.TryParse(rgbArray[0], out red);
decimal green;
var hasGreen = decimal.TryParse(rgbArray[1], out green);
decimal blue;
var hasBlue = decimal.TryParse(rgbArray[2], out blue);
if (hasRed && hasGreen && hasBlue)
{
RGBColour rgbColour = new RGBColour() { Red = red, Green = green, Blue = blue };
originalColourList.Add(rgbColour);
}
}
}
}
}
I like that one better because I find try {} catch {}
block always a bit ugly . It also conveys the intend of the code better.
-
\$\begingroup\$ In my test case reduced execution time by 90%! Nice! \$\endgroup\$Andris– Andris2014年01月31日 06:52:38 +00:00Commented Jan 31, 2014 at 6:52
-
\$\begingroup\$ Also change the two
tempLineValue
s toline
for the code to compile! \$\endgroup\$Andris– Andris2014年01月31日 07:03:54 +00:00Commented Jan 31, 2014 at 7:03 -
\$\begingroup\$ @Andris good catch, missed those, fixed \$\endgroup\$ChrisWue– ChrisWue2014年01月31日 07:06:56 +00:00Commented Jan 31, 2014 at 7:06
-
\$\begingroup\$ @Andris: You already seem to have a benchmark around. Do you mind running the
TryParse
version? I'd be interested to see if it makes a difference. \$\endgroup\$ChrisWue– ChrisWue2014年01月31日 08:01:11 +00:00Commented Jan 31, 2014 at 8:01 -
\$\begingroup\$ The
TryParse
variant look a little bit faster but I feel the difference is close to the margin of error of my ad-hoc test so I would say the difference is insignificant. \$\endgroup\$Andris– Andris2014年01月31日 08:22:02 +00:00Commented Jan 31, 2014 at 8:22
First, a sample of the input data would be helpful for testing. That said:
- If you're using .NET 4 or greater you can use the File.ReadLines() method to read all the lines at once which may speed things up.
- Try to narrow the complexity of the regex pattern. Usages of * and + are going to be slower than other options. Or find a way to avoid regex all together.
- It looks like you're trying to match decimal numbers with the
\d+.?\d*
pattern, but the.
character matches all characters. You may want the pattern\d+\.?\d*
. - Are you sure the input data is in the form of decimal colors? RGB implementations are almost always either bytes or integers.
-
2\$\begingroup\$ +1 especially for catching
\.
; this could easily lead to much backtracking assuming no literal.
and only single spaces between numbers. \$\endgroup\$Michael Urman– Michael Urman2014年01月30日 13:17:30 +00:00Commented Jan 30, 2014 at 13:17
One significant performance problem here is that you are doing two String operations on every line (well, almost every one). First, the line match, second the split. The second operation is also creating new String instances....
(included JoeClarks fix for the decimal point in the regex)
it is actually quite simple to bring this down to a single operation:
private static void GetOrigionalRGBColours(string txtFile)
{
string tempLineValue;
Regex regex = new Regex(@"^(\d+\.?\d*) (\d+\.?\d*) (\d+\.?\d*) SRGB$", RegexOptions.Compiled);
using (StreamReader inputReader = new StreamReader(txtFile))
{
while (null != (tempLineValue = inputReader.ReadLine()))
{
Match match = regex.Match(tempLineValue);
if (match.Success)
{
Decimal red = Convert.ToDecimal(match.Groups[1].Value);
Decimal green = Convert.ToDecimal(match.Groups[2].Value);
Decimal blue = Convert.ToDecimal(match.Groups[3].Value);
if ( !(red == 1 && green == 1 && blue == 1)
&& !(red == 0 && green == 0 && blue == 0))
{
originalColourList.Add(new RGBColour(red, green, blue));
}
}
}
}
}
-
\$\begingroup\$ I have run your code through a profiler and it looks to me that due to your regex being more complex with the inclusion of groups the execution time got longer. I have even observed a slight increase in runtime. Around +10% in execution time. \$\endgroup\$Andris– Andris2014年01月31日 07:00:06 +00:00Commented Jan 31, 2014 at 7:00
-
\$\begingroup\$ If I add the
RegexOptions.Compiled
to the ctor of regex I get a speed improvement of 40% in your code of the total execution time. \$\endgroup\$Andris– Andris2014年01月31日 07:01:37 +00:00Commented Jan 31, 2014 at 7:01 -
\$\begingroup\$ Also note that you have to write
Convert.ToDecimal(match.Groups[1].Value);
in order for your code to compile. \$\endgroup\$Andris– Andris2014年01月31日 07:02:14 +00:00Commented Jan 31, 2014 at 7:02 -
\$\begingroup\$ @Andris . of course, and of course. Thanks for running things through, I have u[dated the answer with your suggestions. \$\endgroup\$rolfl– rolfl2014年01月31日 11:15:53 +00:00Commented Jan 31, 2014 at 11:15
I have run your code but it finishes for me for an input file of 290000 lines of 8MB under a second. Could you provide a sample of the file?
Update: My guess at this point is that the non matching lines which appear to be very long consume the most time. You could try to exclude them by checking for something simple like if the first character is a digit or if their length is below some threshold.
Can you provided some details on the implementation of RGBColour
and what kind of collection originalColourList
is. For my test I have assumed originalColourList
is List<RGBColour>
and RGBColour
is a PODS like:
class RGBColour {
public decimal Red { get; set; }
public decimal Green {get; set;}
public decimal Blue {get; set;}
}
I would imagine your while loop is slowing things down,
If you read in the complete file using File.ReadAllLines()
then call Regex.Matches()
, foreache'd through each match,
it should be faster.
on a side note I personally like to store my large sql and regex filters/queries as a private const , obviously only if they ARE constant. though in this case it seems so.
Match
to about half! Definitely worth doing if you plan to use the same regexp a lot. \$\endgroup\$RegexOptions.Compiled
payed for itself after about 10000Match
es. \$\endgroup\$