Parsing and extracting all IDs from a document (using regex with capture)

Question 1

I'm interested in parsing and extracting the IDs found in a document. The IDs are found throughout the document in this format:

IDStart=1FG3392MQ9&IDEnd

"1FG3392MQ9" is an example of such an ID (it will always appear between the IDStart and IDEnd tokens, as in the above example). There will be text before, and possibly after, each occurrence, and an ID could appear once or many times on a given line.

Here's a utility function I have to parse and extract them:

Public Shared Function ParseALLDocumentIDs(text As String) As List(Of String)
    Dim allParsedIDs As New List(Of String)
    Dim IDregex As String = "IDStart=(.+?)&IDEnd"
    Dim matches As MatchCollection = Regex.Matches(text, IDregex, RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    For Each m As Match In matches
        If m.Success Then
            Dim parsedID = m.GetFristCapture()
            If Not String.IsNullOrWhiteSpace(parsedID) Then
                allParsedIDs.Add(parsedID)
            End If
        End If
    Next
    Return allParsedIDs
End Function

Upon each match, the following extension method is used (defined in another module) to get the first group capture of a match:

<Extension()>
Public Function GetFristCapture(m As Match) As String
    ' Gets the first group capture "(...)" in a regex match.
    ' Returns empty string if not a match or there is no group capture found
    Dim capturedValue As String = ""
    If m IsNot Nothing AndAlso m.Success AndAlso m.Groups.Count > 1 Then
        ' Note: The first group is the entire match itself,
        ' so Group(1) is what we want, not Group(0)
        If m.Groups(1).Success Then
            capturedValue = m.Groups(1).Value.Trim
        End If
    End If
    Return capturedValue
End Function

Questions:

How does the Regex look? Am I missing anything? The idea was to capture all text between any occurrences of those two tokens, and do it non-greedily so that it doesn't capture across multiple tokens.
How does the overall approach look for getting these IDs? Is there a cleaner way?

hjpotter92 hjpotter92 8,9011 gold badge26 silver badges49 bronze badges · Answer 1 · 2016-09-22 05:34:43Z

Currently, you're extracting (non-greedily) everything appearing between IDStart= and &IDEnd. Since this is an ID, and IDs generally consist of alphanumeric characters, you can replace the lazy wildcard in the capture group with a [a-z0-9]+ pattern (RegexOptions.IgnoreCase, which you already pass, makes it match uppercase letters as well). The & of &IDEnd will take care of limiting your result to the text between those two tokens.

However, if that is not the case, make sure the group uses .+? rather than .*? so that you won't receive captures for a string like IDStart=&IDEnd.

If the first rule applies to your case, then you can also remove the .Trim call from your GetFristCapture function.
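A minimal sketch of what the tightened pattern might look like, assuming the IDs really are limited to ASCII letters and digits (the question does not confirm this). The module and function names here are hypothetical. RegexOptions.Multiline is left out because it only changes how ^ and $ behave, and the pattern uses neither.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Public Module IdParserSketch
    ' Assumption: an ID consists only of ASCII letters and digits.
    ' IgnoreCase lets [a-z0-9] match uppercase letters and also makes
    ' the IDStart= / &IDEnd tokens case-insensitive.
    Private ReadOnly IdPattern As New Regex("IDStart=([a-z0-9]+)&IDEnd", RegexOptions.IgnoreCase)

    Public Function ParseAllDocumentIds(text As String) As List(Of String)
        Dim ids As New List(Of String)
        For Each m As Match In IdPattern.Matches(text)
            ' The character class cannot match whitespace or an empty string,
            ' so neither Trim nor an IsNullOrWhiteSpace check is needed.
            ids.Add(m.Groups(1).Value)
        Next
        Return ids
    End Function
End Module
```

With this pattern, an input such as "x IDStart=1FG3392MQ9&IDEnd y IDStart=&IDEnd z" should yield only 1FG3392MQ9, because the empty pair cannot satisfy the + quantifier.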

NOTE: I assume that GetFristCapture is a typo. If not, you should rename it to GetFirstCapture.

In the GetFristCapture function itself, do not create a temporary variable. Just return from the If block itself:

<Extension()>
Public Function GetFristCapture(m As Match) As String
    ' Gets the first group capture "(...)" in a regex match.
    ' Returns empty string if not a match or there is no group capture found
    If m IsNot Nothing AndAlso m.Success AndAlso m.Groups.Count > 1 Then
        ' Note: The first group is the entire match itself,
        ' so Group(1) is what we want, not Group(0)
        Return m.Groups(1).Value
    End If
    Return ""
End Function

Note that I removed the nested If block as well. Once m.Groups.Count > 1 holds, Groups(1) exists, and its Value is an empty string if the group did not take part in the match, so you can return Groups(1).Value directly.
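To see why the inner Success check is not needed, here is a small sketch. The pattern with an optional group is hypothetical and used only for illustration; it forces a successful match in which group 1 does not participate.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Public Module GroupValueSketch
    Public Sub Main()
        ' Hypothetical pattern: the trailing ? makes group 1 optional.
        Dim m As Match = Regex.Match("IDStart=&IDEnd", "IDStart=([a-z0-9]+)?&IDEnd", RegexOptions.IgnoreCase)

        Console.WriteLine(m.Success)                ' True: the overall match succeeds
        Console.WriteLine(m.Groups.Count)           ' 2: group 0 plus group 1
        Console.WriteLine(m.Groups(1).Success)      ' False: group 1 did not participate
        Console.WriteLine(m.Groups(1).Value.Length) ' 0: Value is an empty string
    End Sub
End Module
```

Since Value is already empty in that case, returning it directly gives the same result as falling through to Return "".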

gamesmad gamesmad 3361 silver badge5 bronze badges · Answer 2 · 2016-09-22 14:17:31Z

It's all a bit complicated in my opinion. Perhaps you could do it like this, without the need for regex at all.

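Public Shared Function GetDocumentIds(input As String) As List(Of String)
    Dim results = New List(Of String)()
    Dim sections = input.Split(New String() {"IDStart="}, StringSplitOptions.None)
    ' Start at index 1: sections(0) is the text before the first "IDStart="
    ' and never contains an ID.
    For i As Integer = 1 To sections.Length - 1
        Dim parts = sections(i).Split(New String() {"&IDEnd"}, StringSplitOptions.None)
        ' parts.Length > 1 means an "&IDEnd" actually followed this "IDStart=".
        If parts.Length > 1 AndAlso Not String.IsNullOrWhiteSpace(parts(0)) Then
            results.Add(parts(0))
        End If
    Next
    Return results
End Function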
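A variant of the same no-regex idea, sketched with IndexOf instead of Split so that no intermediate arrays are built and text outside a complete IDStart=...&IDEnd pair is never treated as an ID. The names ScanDocumentIds, StartToken and EndToken are hypothetical. Two assumptions to be aware of: tokens are matched case-insensitively, to mirror the RegexOptions.IgnoreCase in the question, and, unlike the regex (where . does not match a newline), this would capture an ID that spans a line break.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module IndexOfSketch
    Private Const StartToken As String = "IDStart="
    Private Const EndToken As String = "&IDEnd"

    Public Function ScanDocumentIds(input As String) As List(Of String)
        Dim results As New List(Of String)
        Dim position As Integer = 0
        While True
            ' Find the next start token at or after the current position.
            Dim startIndex = input.IndexOf(StartToken, position, StringComparison.OrdinalIgnoreCase)
            If startIndex < 0 Then Exit While

            ' The ID begins right after the start token.
            Dim idStart = startIndex + StartToken.Length
            Dim endIndex = input.IndexOf(EndToken, idStart, StringComparison.OrdinalIgnoreCase)
            If endIndex < 0 Then Exit While ' start token without an end token

            Dim id = input.Substring(idStart, endIndex - idStart)
            If Not String.IsNullOrWhiteSpace(id) Then results.Add(id)

            ' Continue scanning after the end token just consumed.
            position = endIndex + EndToken.Length
        End While
        Return results
    End Function
End Module
```

For an input like "intro IDStart=1FG3392MQ9&IDEnd mid IDStart=&IDEnd IDStart=AB12&IDEnd tail", this should return 1FG3392MQ9 and AB12, skipping both the leading text and the empty pair.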
