I have a mapping application which takes string arguments in the form of string arrays. I parse these and then perform an action on the map. One of these is MoveVertex
which lets the user pass in the name of an existing shape/feature on the map, the index of the vertex to move, and a new x,y location to move that vertex to.
I'm currently getting thousands of these in a row, all for the same shape. It takes anywhere from 3-10ish seconds to parse them all and complete the action. I'm looking for any improvements I could make.
One thing I did so far was stop the map from refreshing every time a vertex is moved. This helped a bit, but it's still slow.
Example of the string[] commands being sent:
command[0] = "move" //command type
command[1] = "vertex" //what to move
command[2] = "testLine[10]" //move the vertex at index 10 in testLine's list of vertices
command[3] = "x" //new x value for the vertex
command[4] = "y" //new y value for the vertex
Here's the function that does the work:
public static void MoveVertex(string[] command, Layer inMemoryFeatureLayer, Map map)
{
int index; //this will be the index of the vertex in the collection of vertices of the shape
string featureName; //name of the shape/feature that has the vertex to move
if (command.Length < 5) //can't be more than 5 parameters
{
return;
}
try
{
if (!command[2].Contains('[')) //if it doesn't contain brackets which contain an index of the vertices
{
return;
}
const string pattern = @"\[(.*?)\]";
var query = command[2];
var matches = Regex.Matches(query, pattern); //Gets anything inside the brackets
index = Convert.ToInt32(matches[0].Groups[1].Value); //should be an int
featureName = command[2].Substring(0, command[2].IndexOf('[')).ToUpper(); //everything before the bracket is the name of the object
}
catch (Exception ex)
{
return;
}
if (!double.TryParse(command[3], out double longitude) || !double.TryParse(command[4], out double latitude))
{
return;
}
try
{
BaseShape shape;
inMemoryFeatureLayer.Open(); //make sure the layer is open
if (inMemoryFeatureLayer.Name.Contains(GlobalVars.LineType)) if it's the layer holding line shapes
{
shape = (LineShape)inMemoryFeatureLayer.FeatureSource.GetFeatureById(featureName, ReturningColumnsType.NoColumns).GetShape();
((LineShape)shape).Vertices[index] = new Vertex(longitude, latitude); //set the vertex to a new location
}
else //it's the layer holdilng polygone shapes
{
shape = (PolygonShape)inMemoryFeatureLayer.FeatureSource.GetFeatureById(featureName, ReturningColumnsType.NoColumns).GetShape();
((PolygonShape)shape).OuterRing.Vertices[index] = new Vertex(longitude, latitude); //set the vertex to a new location
}
inMemoryFeatureLayer.EditTools.BeginTransaction();
inMemoryFeatureLayer.EditTools.Update(shape);
inMemoryFeatureLayer.EditTools.CommitTransaction();
map.Refresh(); //this won't happen because I suspend refreshes beforehand
}
catch (Exception ex)
{
//log it
}
}
5 Answers 5
You should take a look at the possibility to name groups in regex patterns; your match pattern can then be a oneliner:
const string pattern = @"^(?<name>\w+)\[(?<index>\d+)\]$";
Match match = Regex.Match(command[2], pattern);
if (match.Success)
{
string featureName = match.Groups["name"].Value;
int index = int.Parse(match.Groups["index"].Value);
try
{
// TODO: use the information
}
catch (Exception ex)
{
// TODO: Notify consumer
}
}
else
{
// TODO: notify consumer
}
As shown, the pattern for the index is changed from .*?
to \d+
which is more precise and hence more efficient because it doesn't need the non-greedy operator ?
.
if (command.Length < 5) //can't be more than 5 parameters { return; }
When meeting an error you just return from the method without doing anything. Shouldn't you notify the consumer/log the situation/condition?
You don't write from where the command
information comes from, but if you read it from a text file, I think I would parse the string information while reading the file and then create a set of command objects defined as something like:
public class CommandInfo
{
public string Action { get; set; }
public string Type { get; set; }
public string ListName { get; set; }
public int Index { get; set; }
public Point Vertex { get; set; }
}
Or whatever is suitable for your needs
This may not improve performance but it is more "type safe" than just dispatching an array of strings around.
According to efficiency:
You write, that you have thousands of commands for the same shape, but it seems that you requery a set of shapes inMemoryFeatureLayer.FeatureSource
for every command. Whouldn't it be more efficient if you group the commands for each shape, so that you only have to query for the shape once and then can execute all the related commands?
TIP: If you are using Visual Studio 2019, you get help from intellisense and color coding if you compose the regex pattern directly in the call to Match
or Matches
:
Match match = Regex.Match(command[2], @"^(?<name>\w+)\[(?<index>\d+)\]$");
UPDATE
FYI: there has been some discussion in the comments to various answers about which approach is the most efficient, when extracting the name of the list of vertices and the index from a string like command[2] = "testLine[10]"
=> "testLine" and 10
. So I ran some comparison tests for the following methods:
const string pattern = @"^(?<name>\w+)\[(?<index>\d+)\]$";
Regex regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.Singleline);
void DoRegex()
{
foreach (string item in data)
{
Match match = regex.Match(item);
if (match.Success)
{
string featureName = match.Groups["name"].Value;
int index = int.Parse(match.Groups["index"].Value);
testResult += index ^ featureName.GetHashCode();
}
else
{
throw new InvalidOperationException("DoRegex match");
}
}
}
void DoSplitString()
{
foreach (string item in data)
{
string[] split = item.Split('[', ']');
string featureName = split[0];
if (int.TryParse(split[1], out int index))
testResult += index ^ featureName.GetHashCode();
else
throw new InvalidOperationException("DoSplitString TryParse");
}
}
void DoIndexOf()
{
foreach (string item in data)
{
int endIndex = item.IndexOf('[');
string featureName = item.Substring(0, endIndex);
if (int.TryParse(item.Substring(endIndex + 1, item.Length - endIndex - 2), out int index))
testResult += index ^ featureName.GetHashCode();
else
throw new InvalidOperationException("DoIndexOf TryParse");
}
}
void DoStringBuilder()
{
foreach (string item in data)
{
string featureName = "";
int index = 0;
char stop = '[';
StringBuilder builder = new StringBuilder();
foreach (char ch in item)
{
if (ch == stop)
{
if (stop == '[')
{
featureName = builder.ToString();
stop = ']';
}
else
{
if (int.TryParse(builder.ToString(), out index))
testResult += index ^ featureName.GetHashCode();
else
throw new InvalidOperationException("DoStringBuilder TryParse");
break;
}
builder.Clear();
}
else
{
builder.Append(ch);
}
}
}
}
I use a compiled instance for Regex
, which reduces the duration for this test with about 600 ms compared to an uncompiled version - when it is compiled once and the same instance is reused for all tests. testResult
serves as a simple validation.
The dataset is generated as (the data creation is not part of the measures):
const int testCount = 2000;
const int randSeed = 5;
Random rand = new Random(randSeed);
IList<string> data;
int testResult = 0;
IEnumerable<string> CreateData(int count)
{
string chars = "abcdefghijklmnopqrstuwxyz";
for (int i = 0; i < count; i++)
{
yield return $"{(string.Join("", Enumerable.Range(1, rand.Next(10, 30)).Select(n => chars[rand.Next(0, chars.Length)])))}[{rand.Next(0, 100)}]";
}
}
public void Run()
{
data = CreateData(1000).ToList();
...
I ran each method 2000 times in order to measure some realistic averages with the same dataset with 1000 strings for all measures. The result is rather disappointing for the Regex
approach:
testResult: 2052366768
Result for test DoSplitString:
Iterations: 2000
Average: 0,34116 Milliseconds
Min: 0,29660 Milliseconds
Max: 1,17520 Milliseconds
Total: 682,31420 Milliseconds
testResult: 2052366768
Result for test DoRegex:
Iterations: 2000
Average: 1,69160 Milliseconds
Min: 1,48380 Milliseconds
Max: 6,16170 Milliseconds
Total: 3383,19930 Milliseconds
testResult: 2052366768
Result for test DoIndexOf:
Iterations: 2000
Average: 0,22634 Milliseconds
Min: 0,19200 Milliseconds
Max: 0,93010 Milliseconds
Total: 452,68460 Milliseconds
testResult: 2052366768
Result for test DoStringBuilder:
Iterations: 2000
Average: 0,36898 Milliseconds
Min: 0,33100 Milliseconds
Max: 1,36560 Milliseconds
Total: 737,96910 Milliseconds
IndexOf
is by far the most efficient
As t3chb0t writes in his comments, the regex pattern can be optimized significantly to
const string pattern = @"^(?<name>.+)\[(?<index>.+)\]$";
static readonly Regex regex = new Regex(pattern, RegexOptions.Compiled);
Running the test with this pattern gives this result:
testResult: 2052366768
Result for test DoRegex:
Iterations: 2000
Average: 0,97967 Milliseconds
Truncated Average: 0,97777 Milliseconds
Min: 0,84020 Milliseconds
Max: 4,91550 Milliseconds
Total: 1959,33570 Milliseconds
-
1\$\begingroup\$ I see you've changed the evil wildcard match
.*?
to\d+
- it'd be good to mention that this is more efficient and also more exact. I believe this is the catastrophic backtracking or here \$\endgroup\$t3chb0t– t3chb0t2019年07月02日 06:41:37 +00:00Commented Jul 2, 2019 at 6:41 -
1\$\begingroup\$ About regex coloring; this is also possible for any string when you comment it with jetbrains helpers; see Regular Expressions Assistance \$\endgroup\$t3chb0t– t3chb0t2019年07月02日 07:45:31 +00:00Commented Jul 2, 2019 at 7:45
-
1\$\begingroup\$ Your
DoIndexOf
is cheating becauseDoRegex
must do more work as you put such constraints on it as\w+
which the former does not do. When you use(.+)
the regex becomes rougly 300ms faster for 1mln strings. Alternatively add a validation loop that would work the same as\w+
basically in every other method. \$\endgroup\$t3chb0t– t3chb0t2019年07月03日 06:37:22 +00:00Commented Jul 3, 2019 at 6:37 -
1\$\begingroup\$ The fastest and most fair regex is
@"^(.+)\[(.+)\]"
created like thisnew Regex(pattern, RegexOptions.Compiled);
which additionaly improves its performance by another 300ms asSingleLine
adds a check agianst\n
. Removing thematch.Success
improves the regex method by another 30-50ms. \$\endgroup\$t3chb0t– t3chb0t2019年07月03日 06:44:58 +00:00Commented Jul 3, 2019 at 6:44 -
1\$\begingroup\$ If you made a second answer with the comparative review, you would have gotten 2 +1's from me :p \$\endgroup\$dfhwze– dfhwze2019年07月03日 07:04:53 +00:00Commented Jul 3, 2019 at 7:04
One possible place you can improve performance is here:
const string pattern = @"\[(.*?)\]";
var query = command[2];
var matches = Regex.Matches(query, pattern); //Gets anything inside the brackets
index = Convert.ToInt32(matches[0].Groups[1].Value); //should be an int
featureName = command[2].Substring(0, command[2].IndexOf('[')).ToUpper();
You're parsing the same string twice. Using the Split
method for such a simple parsing would probably improve that:
var separators = new char[] { '[', ']' };
var parts = command[2].Split(separators,StringSplitOptions.RemoveEmptyEntries);
index = Convert.ToInt32(parts[1]);
featureName = parts[0].ToUpper();
Why would you want to use Regular Expressions anyway if the base format of the string is always the same?
Name[Number]
seems like an easy pattern. Just iterate through the characters one by one and store all characters in the first string until you reach the first bracket. Then store the numbers in the second string (until you reach the closing bracket). Getting rid of the Regular Expression should make it faster.
Hash all string constants
I'm presuming you only have a limited number of keywords ("move", "vertex" etc.). Hash all of those with something fast - CRC32 is perfectly adequate.
Split the string into space-separated tokens, as usual, and calculate the hash of the first and second tokens. Then you just need to compare the calculated hash with the hashes of the available commands. If you have a large number of keywords, this can yield very significant improvements.
Ditch the "Contains" and the regex
To check whether it contains a bracket, the code has to scan over the entire variable name. It then scans the entire string again to find the start and end of the brackets. And the processing necessary for the regex pattern-matching is not fast.
All you actually need is "IndexOf". If there's no bracket, IndexOf returns -1. If there's a bracket, you've found the start of the number. A second "IndexOf" to look for the closing bracket, starting at that index, will give you the end (or -1 if there isn't a closing bracket). Then all you need to do is copy the substring.
-
1\$\begingroup\$ Don't ditch one thing to replace it with a another similar, albeit improved thing. Go with Florian's approach of parsing in a single pass instead. en.wikipedia.org/wiki/Out_of_the_frying_pan_into_the_fire \$\endgroup\$dfhwze– dfhwze2019年07月02日 15:51:45 +00:00Commented Jul 2, 2019 at 15:51
-
1\$\begingroup\$ @dfhwze I'd disagree: I expect making a readable and efficient version of Florian's suggestion will be much more effort than throwing together a pair of
IndexOf
andSubstring
calls. \$\endgroup\$VisualMelon– VisualMelon2019年07月02日 15:59:25 +00:00Commented Jul 2, 2019 at 15:59 -
\$\begingroup\$ @VisualMelon it all boils down how crucial performance is. I figured the OP holds performance very high, so Florian's approach is the superior one. \$\endgroup\$dfhwze– dfhwze2019年07月02日 16:01:29 +00:00Commented Jul 2, 2019 at 16:01
-
\$\begingroup\$ @dfhwze I don't think that is a given: I'd personally expect this one to be faster, because of the added overheads of dealing with a
StringBuilder
(or similar) implied in Florin's approach, and substring and indexof have dodgy implementations underneath with special support for ASCII and such. If I can be bothered, I might put a benchmark together later (of course anyone else would be welcome to do so before me) since that's as close to an answer as we'll get without the rest of the OP's code. \$\endgroup\$VisualMelon– VisualMelon2019年07月02日 16:05:41 +00:00Commented Jul 2, 2019 at 16:05 -
\$\begingroup\$ (Of course, no-one would do that, because it is silly; they would loop through to find the brackets and then use
SubString
, at which point the question is whether a loop is more efficient thanIndexOf
or not (which I'd expect depends on a few things)) \$\endgroup\$VisualMelon– VisualMelon2019年07月02日 16:08:29 +00:00Commented Jul 2, 2019 at 16:08
The first obvious optimization would be to not call Regex.Matches(string, string)
which compiles the regular expression every single time. Compiling the regex is an expensive operation.
Instead, create a Regex
(which compiles the expression exactly once) and keep it around during your several thousand invocations. That should by itself do the trick (because, well, a few thousand matches is really no challenge to a modern computer).
Also, if you are desperate, you could do without regex. You could switch by string length, which only leaves you to distinguish between x
and y
(all others have distinct string lengths).
Or, you could look at the first character (or the first, and another one, if you maybe have more keywords). Some minimal perfect hash generators (I forgot which one, gperf maybe?) apply that trick for very quickly matching a known set of keywords. Brackets are only possible in a particular context, so you don't really need to match for them all the time.
RegexOptions.Compiled
option \$\endgroup\$