Speeding up thousands of string parses

Question 1

I have a mapping application which takes string arguments in the form of string arrays. I parse these and then perform an action on the map. One of these is MoveVertex which lets the user pass in the name of an existing shape/feature on the map, the index of the vertex to move, and a new x,y location to move that vertex to.

I'm currently getting thousands of these in a row, all for the same shape. It takes anywhere from 3-10ish seconds to parse them all and complete the action. I'm looking for any improvements I could make.

One thing I did so far was stop the map from refreshing every time a vertex is moved. This helped a bit, but it's still slow.

Example of the string[] commands being sent:

command[0] = "move" //command type
command[1] = "vertex" //what to move
command[2] = "testLine[10]" //move the vertex at index 10 in testLine's list of vertices
command[3] = "x" //new x value for the vertex
command[4] = "y" //new y value for the vertex

Here's the function that does the work:

public static void MoveVertex(string[] command, Layer inMemoryFeatureLayer, Map map)
{
 int index; //this will be the index of the vertex in the collection of vertices of the shape
 string featureName; //name of the shape/feature that has the vertex to move
 if (command.Length < 5) //can't be more than 5 parameters
 {
 return;
 }
 try
 {
 if (!command[2].Contains('[')) //if it doesn't contain brackets which contain an index of the vertices
 {
 return;
 }
 const string pattern = @"\[(.*?)\]";
 var query = command[2];
 var matches = Regex.Matches(query, pattern); //Gets anything inside the brackets
 index = Convert.ToInt32(matches[0].Groups[1].Value); //should be an int
 featureName = command[2].Substring(0, command[2].IndexOf('[')).ToUpper(); //everything before the bracket is the name of the object
 }
 catch (Exception ex)
 {
 return;
 }
 if (!double.TryParse(command[3], out double longitude) || !double.TryParse(command[4], out double latitude))
 {
 return;
 }
 try
 {
 BaseShape shape;
 inMemoryFeatureLayer.Open(); //make sure the layer is open
 if (inMemoryFeatureLayer.Name.Contains(GlobalVars.LineType)) if it's the layer holding line shapes
 {
 shape = (LineShape)inMemoryFeatureLayer.FeatureSource.GetFeatureById(featureName, ReturningColumnsType.NoColumns).GetShape();
 ((LineShape)shape).Vertices[index] = new Vertex(longitude, latitude); //set the vertex to a new location
 }
 else //it's the layer holdilng polygone shapes
 {
 shape = (PolygonShape)inMemoryFeatureLayer.FeatureSource.GetFeatureById(featureName, ReturningColumnsType.NoColumns).GetShape();
 ((PolygonShape)shape).OuterRing.Vertices[index] = new Vertex(longitude, latitude); //set the vertex to a new location
 }
 inMemoryFeatureLayer.EditTools.BeginTransaction();
 inMemoryFeatureLayer.EditTools.Update(shape);
 inMemoryFeatureLayer.EditTools.CommitTransaction();
 map.Refresh(); //this won't happen because I suspend refreshes beforehand
 }
 catch (Exception ex)
 {
 //log it
 }
}

Question 2

if (command.Length < 5) //can't be more than 5 parameters - you mean less?

Question 3

Have you tried profiling your code? Executing the parsing part of the code a thousand times takes about 1-2 ms on my system, and while the suggestions here make that roughly 20% faster, that's not going to make much of a difference overall.

Question 4

Pieter's sugestion is good. Maybe GetFeatureById is a bottleneck, or maybe it's really fast; we can't tell from this snippet, but you can tell by using Visual Studio's code profiling tools.

Question 5

This is another good example of how you should not optimize your code ;-) First measure, than optimize when you know exactly what needs to be optimized. All these reviews are virtually in vain because they do not signifficantly improve your code but a little bit of its readability and as far as performance is concerned, irrelevant.

Question 6

You might see an improvement if you use compiled regular expressions instead of interpreted. Check out this article docs.microsoft.com/en-us/dotnet/standard/base-types/… which shows how to use the RegexOptions.Compiled option

Question 7

You should take a look at the possibility to name groups in regex patterns; your match pattern can then be a oneliner:

 const string pattern = @"^(?<name>\w+)\[(?<index>\d+)\]$";
 Match match = Regex.Match(command[2], pattern);
 if (match.Success)
 {
 string featureName = match.Groups["name"].Value;
 int index = int.Parse(match.Groups["index"].Value);
 try
 {
 // TODO: use the information
 }
 catch (Exception ex)
 {
 // TODO: Notify consumer
 }
 }
 else
 {
 // TODO: notify consumer
 }

As shown, the pattern for the index is changed from .*? to \d+ which is more precise and hence more efficient because it doesn't need the non-greedy operator ?.

if (command.Length < 5) //can't be more than 5 parameters
{
 return;
}

When meeting an error you just return from the method without doing anything. Shouldn't you notify the consumer/log the situation/condition?

You don't write from where the command information comes from, but if you read it from a text file, I think I would parse the string information while reading the file and then create a set of command objects defined as something like:

public class CommandInfo
{
 public string Action { get; set; }
 public string Type { get; set; }
 public string ListName { get; set; }
 public int Index { get; set; }
 public Point Vertex { get; set; }
}

Or whatever is suitable for your needs

This may not improve performance but it is more "type safe" than just dispatching an array of strings around.

According to efficiency:

You write, that you have thousands of commands for the same shape, but it seems that you requery a set of shapes inMemoryFeatureLayer.FeatureSource for every command. Whouldn't it be more efficient if you group the commands for each shape, so that you only have to query for the shape once and then can execute all the related commands?

TIP: If you are using Visual Studio 2019, you get help from intellisense and color coding if you compose the regex pattern directly in the call to Match or Matches:

Match match = Regex.Match(command[2], @"^(?<name>\w+)\[(?<index>\d+)\]$");

UPDATE

FYI: there has been some discussion in the comments to various answers about which approach is the most efficient, when extracting the name of the list of vertices and the index from a string like command[2] = "testLine[10]" => "testLine" and 10. So I ran some comparison tests for the following methods:

const string pattern = @"^(?<name>\w+)\[(?<index>\d+)\]$";
Regex regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.Singleline);
void DoRegex()
{
 foreach (string item in data)
 {
 Match match = regex.Match(item);
 if (match.Success)
 {
 string featureName = match.Groups["name"].Value;
 int index = int.Parse(match.Groups["index"].Value);
 testResult += index ^ featureName.GetHashCode();
 }
 else
 {
 throw new InvalidOperationException("DoRegex match");
 }
 }
}
void DoSplitString()
{
 foreach (string item in data)
 {
 string[] split = item.Split('[', ']');
 string featureName = split[0];
 if (int.TryParse(split[1], out int index))
 testResult += index ^ featureName.GetHashCode();
 else
 throw new InvalidOperationException("DoSplitString TryParse");
 }
}
void DoIndexOf()
{
 foreach (string item in data)
 {
 int endIndex = item.IndexOf('[');
 string featureName = item.Substring(0, endIndex);
 if (int.TryParse(item.Substring(endIndex + 1, item.Length - endIndex - 2), out int index))
 testResult += index ^ featureName.GetHashCode(); 
 else
 throw new InvalidOperationException("DoIndexOf TryParse");
 }
}
void DoStringBuilder()
{
 foreach (string item in data)
 {
 string featureName = "";
 int index = 0;
 char stop = '[';
 StringBuilder builder = new StringBuilder();
 foreach (char ch in item)
 {
 if (ch == stop)
 {
 if (stop == '[')
 {
 featureName = builder.ToString();
 stop = ']';
 }
 else
 {
 if (int.TryParse(builder.ToString(), out index))
 testResult += index ^ featureName.GetHashCode();
 else
 throw new InvalidOperationException("DoStringBuilder TryParse");
 break;
 }
 builder.Clear();
 }
 else
 {
 builder.Append(ch);
 }
 }
 }
}

I use a compiled instance for Regex, which reduces the duration for this test with about 600 ms compared to an uncompiled version - when it is compiled once and the same instance is reused for all tests. testResult serves as a simple validation.

The dataset is generated as (the data creation is not part of the measures):

const int testCount = 2000;
const int randSeed = 5;
Random rand = new Random(randSeed);
IList<string> data;
int testResult = 0;
IEnumerable<string> CreateData(int count)
{
 string chars = "abcdefghijklmnopqrstuwxyz";
 for (int i = 0; i < count; i++)
 {
 yield return $"{(string.Join("", Enumerable.Range(1, rand.Next(10, 30)).Select(n => chars[rand.Next(0, chars.Length)])))}[{rand.Next(0, 100)}]";
 }
}
public void Run()
{
 data = CreateData(1000).ToList();
 ...

I ran each method 2000 times in order to measure some realistic averages with the same dataset with 1000 strings for all measures. The result is rather disappointing for the Regex approach:

testResult: 2052366768
Result for test DoSplitString:
Iterations: 2000
Average: 0,34116 Milliseconds
Min: 0,29660 Milliseconds
Max: 1,17520 Milliseconds
Total: 682,31420 Milliseconds
testResult: 2052366768
Result for test DoRegex:
Iterations: 2000
Average: 1,69160 Milliseconds
Min: 1,48380 Milliseconds
Max: 6,16170 Milliseconds
Total: 3383,19930 Milliseconds
testResult: 2052366768
Result for test DoIndexOf:
Iterations: 2000
Average: 0,22634 Milliseconds
Min: 0,19200 Milliseconds
Max: 0,93010 Milliseconds
Total: 452,68460 Milliseconds
testResult: 2052366768
Result for test DoStringBuilder:
Iterations: 2000
Average: 0,36898 Milliseconds
Min: 0,33100 Milliseconds
Max: 1,36560 Milliseconds
Total: 737,96910 Milliseconds

IndexOf is by far the most efficient

As t3chb0t writes in his comments, the regex pattern can be optimized significantly to

const string pattern = @"^(?<name>.+)\[(?<index>.+)\]$";
static readonly Regex regex = new Regex(pattern, RegexOptions.Compiled);

Running the test with this pattern gives this result:

testResult: 2052366768
Result for test DoRegex:
Iterations: 2000
Average: 0,97967 Milliseconds
Truncated Average: 0,97777 Milliseconds
Min: 0,84020 Milliseconds
Max: 4,91550 Milliseconds
Total: 1959,33570 Milliseconds

Question 8

I see you've changed the evil wildcard match .*? to \d+ - it'd be good to mention that this is more efficient and also more exact. I believe this is the catastrophic backtracking or here

Question 9

About regex coloring; this is also possible for any string when you comment it with jetbrains helpers; see Regular Expressions Assistance

Question 10

Your DoIndexOf is cheating because DoRegex must do more work as you put such constraints on it as \w+ which the former does not do. When you use (.+) the regex becomes rougly 300ms faster for 1mln strings. Alternatively add a validation loop that would work the same as \w+ basically in every other method.

Question 11

The fastest and most fair regex is @"^(.+)\[(.+)\]" created like this new Regex(pattern, RegexOptions.Compiled); which additionaly improves its performance by another 300ms as SingleLine adds a check agianst \n. Removing the match.Success improves the regex method by another 30-50ms.

Question 12

If you made a second answer with the comparative review, you would have gotten 2 +1's from me :p

Question 13

One possible place you can improve performance is here:

const string pattern = @"\[(.*?)\]";
var query = command[2];
var matches = Regex.Matches(query, pattern); //Gets anything inside the brackets
index = Convert.ToInt32(matches[0].Groups[1].Value); //should be an int
featureName = command[2].Substring(0, command[2].IndexOf('[')).ToUpper();

You're parsing the same string twice. Using the Split method for such a simple parsing would probably improve that:

var separators = new char[] { '[', ']' };
var parts = command[2].Split(separators,StringSplitOptions.RemoveEmptyEntries);
index = Convert.ToInt32(parts[1]);
featureName = parts[0].ToUpper();

Question 14

Why would you want to use Regular Expressions anyway if the base format of the string is always the same?

Name[Number] seems like an easy pattern. Just iterate through the characters one by one and store all characters in the first string until you reach the first bracket. Then store the numbers in the second string (until you reach the closing bracket). Getting rid of the Regular Expression should make it faster.

Question 15

Hash all string constants

I'm presuming you only have a limited number of keywords ("move", "vertex" etc.). Hash all of those with something fast - CRC32 is perfectly adequate.

Split the string into space-separated tokens, as usual, and calculate the hash of the first and second tokens. Then you just need to compare the calculated hash with the hashes of the available commands. If you have a large number of keywords, this can yield very significant improvements.

Ditch the "Contains" and the regex

To check whether it contains a bracket, the code has to scan over the entire variable name. It then scans the entire string again to find the start and end of the brackets. And the processing necessary for the regex pattern-matching is not fast.

All you actually need is "IndexOf". If there's no bracket, IndexOf returns -1. If there's a bracket, you've found the start of the number. A second "IndexOf" to look for the closing bracket, starting at that index, will give you the end (or -1 if there isn't a closing bracket). Then all you need to do is copy the substring.

Question 16

Don't ditch one thing to replace it with a another similar, albeit improved thing. Go with Florian's approach of parsing in a single pass instead. en.wikipedia.org/wiki/Out_of_the_frying_pan_into_the_fire

Question 17

@dfhwze I'd disagree: I expect making a readable and efficient version of Florian's suggestion will be much more effort than throwing together a pair of IndexOf and Substring calls.

Question 18

@VisualMelon it all boils down how crucial performance is. I figured the OP holds performance very high, so Florian's approach is the superior one.

Question 19

@dfhwze I don't think that is a given: I'd personally expect this one to be faster, because of the added overheads of dealing with a StringBuilder (or similar) implied in Florin's approach, and substring and indexof have dodgy implementations underneath with special support for ASCII and such. If I can be bothered, I might put a benchmark together later (of course anyone else would be welcome to do so before me) since that's as close to an answer as we'll get without the rest of the OP's code.

Question 20

(Of course, no-one would do that, because it is silly; they would loop through to find the brackets and then use SubString, at which point the question is whether a loop is more efficient than IndexOf or not (which I'd expect depends on a few things))

Question 21

The first obvious optimization would be to not call Regex.Matches(string, string) which compiles the regular expression every single time. Compiling the regex is an expensive operation.

Instead, create a Regex (which compiles the expression exactly once) and keep it around during your several thousand invocations. That should by itself do the trick (because, well, a few thousand matches is really no challenge to a modern computer).

Also, if you are desperate, you could do without regex. You could switch by string length, which only leaves you to distinguish between x and y (all others have distinct string lengths).
Or, you could look at the first character (or the first, and another one, if you maybe have more keywords). Some minimal perfect hash generators (I forgot which one, gperf maybe?) apply that trick for very quickly matching a known set of keywords. Brackets are only possible in a particular context, so you don't really need to match for them all the time.

user73941user73941 · Accepted Answer · 2019-07-02 06:01:55Z

You should take a look at the possibility to name groups in regex patterns; your match pattern can then be a oneliner:

 const string pattern = @"^(?<name>\w+)\[(?<index>\d+)\]$";
 Match match = Regex.Match(command[2], pattern);
 if (match.Success)
 {
 string featureName = match.Groups["name"].Value;
 int index = int.Parse(match.Groups["index"].Value);
 try
 {
 // TODO: use the information
 }
 catch (Exception ex)
 {
 // TODO: Notify consumer
 }
 }
 else
 {
 // TODO: notify consumer
 }

As shown, the pattern for the index is changed from .*? to \d+ which is more precise and hence more efficient because it doesn't need the non-greedy operator ?.

if (command.Length < 5) //can't be more than 5 parameters
{
 return;
}

When meeting an error you just return from the method without doing anything. Shouldn't you notify the consumer/log the situation/condition?

You don't write from where the command information comes from, but if you read it from a text file, I think I would parse the string information while reading the file and then create a set of command objects defined as something like:

public class CommandInfo
{
 public string Action { get; set; }
 public string Type { get; set; }
 public string ListName { get; set; }
 public int Index { get; set; }
 public Point Vertex { get; set; }
}

Or whatever is suitable for your needs

This may not improve performance but it is more "type safe" than just dispatching an array of strings around.

According to efficiency:

You write, that you have thousands of commands for the same shape, but it seems that you requery a set of shapes inMemoryFeatureLayer.FeatureSource for every command. Whouldn't it be more efficient if you group the commands for each shape, so that you only have to query for the shape once and then can execute all the related commands?

TIP: If you are using Visual Studio 2019, you get help from intellisense and color coding if you compose the regex pattern directly in the call to Match or Matches:

Match match = Regex.Match(command[2], @"^(?<name>\w+)\[(?<index>\d+)\]$");

UPDATE

FYI: there has been some discussion in the comments to various answers about which approach is the most efficient, when extracting the name of the list of vertices and the index from a string like command[2] = "testLine[10]" => "testLine" and 10. So I ran some comparison tests for the following methods:

const string pattern = @"^(?<name>\w+)\[(?<index>\d+)\]$";
Regex regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.Singleline);
void DoRegex()
{
 foreach (string item in data)
 {
 Match match = regex.Match(item);
 if (match.Success)
 {
 string featureName = match.Groups["name"].Value;
 int index = int.Parse(match.Groups["index"].Value);
 testResult += index ^ featureName.GetHashCode();
 }
 else
 {
 throw new InvalidOperationException("DoRegex match");
 }
 }
}
void DoSplitString()
{
 foreach (string item in data)
 {
 string[] split = item.Split('[', ']');
 string featureName = split[0];
 if (int.TryParse(split[1], out int index))
 testResult += index ^ featureName.GetHashCode();
 else
 throw new InvalidOperationException("DoSplitString TryParse");
 }
}
void DoIndexOf()
{
 foreach (string item in data)
 {
 int endIndex = item.IndexOf('[');
 string featureName = item.Substring(0, endIndex);
 if (int.TryParse(item.Substring(endIndex + 1, item.Length - endIndex - 2), out int index))
 testResult += index ^ featureName.GetHashCode(); 
 else
 throw new InvalidOperationException("DoIndexOf TryParse");
 }
}
void DoStringBuilder()
{
 foreach (string item in data)
 {
 string featureName = "";
 int index = 0;
 char stop = '[';
 StringBuilder builder = new StringBuilder();
 foreach (char ch in item)
 {
 if (ch == stop)
 {
 if (stop == '[')
 {
 featureName = builder.ToString();
 stop = ']';
 }
 else
 {
 if (int.TryParse(builder.ToString(), out index))
 testResult += index ^ featureName.GetHashCode();
 else
 throw new InvalidOperationException("DoStringBuilder TryParse");
 break;
 }
 builder.Clear();
 }
 else
 {
 builder.Append(ch);
 }
 }
 }
}

I use a compiled instance for Regex, which reduces the duration for this test with about 600 ms compared to an uncompiled version - when it is compiled once and the same instance is reused for all tests. testResult serves as a simple validation.

The dataset is generated as (the data creation is not part of the measures):

const int testCount = 2000;
const int randSeed = 5;
Random rand = new Random(randSeed);
IList<string> data;
int testResult = 0;
IEnumerable<string> CreateData(int count)
{
 string chars = "abcdefghijklmnopqrstuwxyz";
 for (int i = 0; i < count; i++)
 {
 yield return $"{(string.Join("", Enumerable.Range(1, rand.Next(10, 30)).Select(n => chars[rand.Next(0, chars.Length)])))}[{rand.Next(0, 100)}]";
 }
}
public void Run()
{
 data = CreateData(1000).ToList();
 ...

I ran each method 2000 times in order to measure some realistic averages with the same dataset with 1000 strings for all measures. The result is rather disappointing for the Regex approach:

testResult: 2052366768
Result for test DoSplitString:
Iterations: 2000
Average: 0,34116 Milliseconds
Min: 0,29660 Milliseconds
Max: 1,17520 Milliseconds
Total: 682,31420 Milliseconds
testResult: 2052366768
Result for test DoRegex:
Iterations: 2000
Average: 1,69160 Milliseconds
Min: 1,48380 Milliseconds
Max: 6,16170 Milliseconds
Total: 3383,19930 Milliseconds
testResult: 2052366768
Result for test DoIndexOf:
Iterations: 2000
Average: 0,22634 Milliseconds
Min: 0,19200 Milliseconds
Max: 0,93010 Milliseconds
Total: 452,68460 Milliseconds
testResult: 2052366768
Result for test DoStringBuilder:
Iterations: 2000
Average: 0,36898 Milliseconds
Min: 0,33100 Milliseconds
Max: 1,36560 Milliseconds
Total: 737,96910 Milliseconds

IndexOf is by far the most efficient

As t3chb0t writes in his comments, the regex pattern can be optimized significantly to

const string pattern = @"^(?<name>.+)\[(?<index>.+)\]$";
static readonly Regex regex = new Regex(pattern, RegexOptions.Compiled);

Running the test with this pattern gives this result:

testResult: 2052366768
Result for test DoRegex:
Iterations: 2000
Average: 0,97967 Milliseconds
Truncated Average: 0,97777 Milliseconds
Min: 0,84020 Milliseconds
Max: 4,91550 Milliseconds
Total: 1959,33570 Milliseconds

I see you've changed the evil wildcard match .*? to \d+ - it'd be good to mention that this is more efficient and also more exact. I believe this is the catastrophic backtracking or here
About regex coloring; this is also possible for any string when you comment it with jetbrains helpers; see Regular Expressions Assistance
Your DoIndexOf is cheating because DoRegex must do more work as you put such constraints on it as \w+ which the former does not do. When you use (.+) the regex becomes rougly 300ms faster for 1mln strings. Alternatively add a validation loop that would work the same as \w+ basically in every other method.
The fastest and most fair regex is @"^(.+)\[(.+)\]" created like this new Regex(pattern, RegexOptions.Compiled); which additionaly improves its performance by another 300ms as SingleLine adds a check agianst \n. Removing the match.Success improves the regex method by another 30-50ms.
If you made a second answer with the comparative review, you would have gotten 2 +1's from me :p

Stack Exchange Network

Speeding up thousands of string parses

5 Answers 5

Hash all string constants

Ditch the "Contains" and the regex

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Speeding up thousands of string parses

5 Answers 5

Hash all string constants

Ditch the "Contains" and the regex

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions