I'm interested in parsing and extracting the IDs found in a document. The IDs are found throughout the document in this format:
IDStart=1FG3392MQ9&IDEnd
The "1FG3392MQ9" is an example of such an ID (it will always appear between the IDStart and IDEnd tokens as in the above example). There will be text before and possibly after each occurrence and they could appear once on a given line or many times on a line.
Here's a utility function I have to parse and extract them:
Public Shared Function ParseALLDocumentIDs(text As String) As List(Of String)
    Dim allParsedIDs As New List(Of String)
    Dim IDregex As String = "IDStart=(.+?)&IDEnd"
    Dim matches As MatchCollection = Regex.Matches(text, IDregex, RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    For Each m As Match In matches
        If m.Success Then
            Dim parsedID = m.GetFristCapture()
            If Not String.IsNullOrWhiteSpace(parsedID) Then
                allParsedIDs.Add(parsedID)
            End If
        End If
    Next
    Return allParsedIDs
End Function
Upon each match, the following extension method is used (defined in another module) to get the first group capture of a match:
<Extension()>
Public Function GetFristCapture(m As Match) As String
    ' Gets the first group capture "(...)" in a regex match.
    ' Returns empty string if not a match or there is no group capture found
    Dim capturedValue As String = ""
    If m IsNot Nothing AndAlso m.Success AndAlso m.Groups.Count > 1 Then
        ' Note: The first group is the entire match itself,
        ' so Group(1) is what we want, not Group(0)
        If m.Groups(1).Success Then
            capturedValue = m.Groups(1).Value.Trim
        End If
    End If
    Return capturedValue
End Function
Questions:
How does the Regex look? Am I missing anything? The idea was to capture all text between any occurrences of those two tokens, and do it non-greedily so that it doesn't capture across multiple tokens.
How does the overall approach to getting these IDs look? Is there a cleaner way?
2 Answers
Currently, you're extracting (non-greedily) everything appearing between the IDStart= and &IDEnd tokens. Since this is an ID, and IDs are generally drawn from an alphanumeric character set, you can replace the .*? with an [a-z0-9]+ pattern (RegexOptions.IgnoreCase, which you already pass, makes it match uppercase letters too). The & of &IDEnd will take care of limiting your result to the text between those two tokens.
However, if that is not your case, you should still replace the .*? with .+? so that you won't receive results/captures for a string like: IDStart=&IDEnd.
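To make the difference between the three patterns concrete, here is a minimal sketch that runs each of them over one sample line. The module name, the Captures helper, and the sample text are made up for illustration and are not part of the original code. Note one side effect it exposes: .+? does skip the empty ID, but on a line that also contains a later token it keeps expanding until it reaches the next &IDEnd, whereas the character class simply fails at the empty ID and moves on.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module PatternComparison
    ' Hypothetical helper: returns the first capture group of every match, joined with "|".
    Function Captures(text As String, pattern As String) As String
        Dim found As New List(Of String)()
        For Each m As Match In Regex.Matches(text, pattern, RegexOptions.IgnoreCase)
            found.Add(m.Groups(1).Value)
        Next
        Return String.Join("|", found)
    End Function

    Sub Main()
        ' Sample line with one empty ID followed by one real ID (made-up data).
        Dim sample As String = "x IDStart=&IDEnd y IDStart=1FG3392MQ9&IDEnd z"

        ' Prints "|1FG3392MQ9": the empty ID is captured as an empty string.
        Console.WriteLine(Captures(sample, "IDStart=(.*?)&IDEnd"))

        ' Prints "&IDEnd y IDStart=1FG3392MQ9": one match that runs across both tokens.
        Console.WriteLine(Captures(sample, "IDStart=(.+?)&IDEnd"))

        ' Prints "1FG3392MQ9": the empty ID cannot match, the real one does.
        Console.WriteLine(Captures(sample, "IDStart=([a-z0-9]+)&IDEnd"))
    End Sub
End Module
```

If the IDs can contain characters outside letters and digits, the character class would need to be widened accordingly; that depends on data only you can see.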
If the first rule applies to your case, you can also remove the .Trim call from your GetFristCapture function.
NOTE: I assume that GetFristCapture is a typo. If not, you should rename it to GetFirstCapture.
In the GetFristCapture function itself, do not create a temporary variable. Just return from the If block itself:
Public Function GetFristCapture(m As Match) As String
    ' Gets the first group capture "(...)" in a regex match.
    ' Returns empty string if not a match or there is no group capture found
    If m IsNot Nothing AndAlso m.Success AndAlso m.Groups.Count > 1 Then
        ' Note: The first group is the entire match itself,
        ' so Group(1) is what we want, not Group(0)
        Return m.Groups(1).Value
    End If
    Return ""
End Function
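Note that I removed the nested If block as well. Because your condition checks m.Groups.Count > 1, both Groups(0) and Groups(1) are guaranteed to exist, and the Value of a group that did not participate in the match is an empty string anyway. Just return Groups(1).Value from there.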
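Putting the suggestions together, the whole thing could shrink to something like the sketch below. This is only one possible shape, assuming the IDs really are alphanumeric: the module name, the corrected method casing, and the named group "id" are my own illustrative choices, not something from the original code. RegexOptions.Multiline is left out because it only changes how ^ and $ behave, and this pattern uses neither.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module DocumentIdParser
    ' Assumption: IDs consist of letters and digits only.
    ' The group name "id" is an illustrative choice.
    Private ReadOnly IdPattern As New Regex("IDStart=(?<id>[a-z0-9]+)&IDEnd", RegexOptions.IgnoreCase)

    Public Function ParseAllDocumentIds(text As String) As List(Of String)
        If String.IsNullOrEmpty(text) Then Return New List(Of String)()
        ' Every element of the collection is a successful match, and the pattern
        ' cannot capture an empty or whitespace ID, so no extra checks are needed.
        Return IdPattern.Matches(text).
            Cast(Of Match)().
            Select(Function(m) m.Groups("id").Value).
            ToList()
    End Function

    Sub Main()
        Dim ids As List(Of String) = ParseAllDocumentIds("a IDStart=1FG3392MQ9&IDEnd b idstart=abc123&idend")
        ' Prints "1FG3392MQ9, abc123"
        Console.WriteLine(String.Join(", ", ids))
    End Sub
End Module
```

With the group addressed by name, the separate extension method is no longer needed for this use case.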
It's all a bit complicated in my opinion. Perhaps you could do it like this, without the need for regex at all.
Public Shared Function GetDocumentIds(input As String) As List(Of String)
    Dim results = New List(Of String)()
    Dim sections = input.Split(New String() {"IDStart="}, StringSplitOptions.None)
    ' sections(0) is the text before the first "IDStart=", so start at index 1.
    For i = 1 To sections.Length - 1
        Dim parts = sections(i).Split(New String() {"&IDEnd"}, StringSplitOptions.None)
        ' Keep the ID only if the closing "&IDEnd" token was actually present.
        If parts.Length > 1 AndAlso Not String.IsNullOrWhiteSpace(parts(0)) Then
            results.Add(parts(0))
        End If
    Next
    Return results
End Function
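One thing to keep in mind with Split is that it compares the tokens case-sensitively, while the regex in the question passes RegexOptions.IgnoreCase. If case-insensitive tokens matter, a regex-free alternative is to scan with IndexOf and StringComparison.OrdinalIgnoreCase. The following is a rough sketch of that idea, not a drop-in replacement; the module name and the sample input are made up.

```vbnet
Imports System
Imports System.Collections.Generic

Module IndexOfScanner
    Public Function GetDocumentIds(input As String) As List(Of String)
        Const StartToken As String = "IDStart="
        Const EndToken As String = "&IDEnd"
        Dim results As New List(Of String)()
        Dim position As Integer = 0

        While position < input.Length
            ' Find the next start token, ignoring case.
            Dim startIndex As Integer = input.IndexOf(StartToken, position, StringComparison.OrdinalIgnoreCase)
            If startIndex < 0 Then Exit While

            ' The ID begins right after the start token and runs up to the next end token.
            Dim idIndex As Integer = startIndex + StartToken.Length
            Dim endIndex As Integer = input.IndexOf(EndToken, idIndex, StringComparison.OrdinalIgnoreCase)
            If endIndex < 0 Then Exit While   ' unterminated token: nothing more to extract

            Dim id As String = input.Substring(idIndex, endIndex - idIndex)
            If Not String.IsNullOrWhiteSpace(id) Then results.Add(id)

            ' Continue scanning after the end token.
            position = endIndex + EndToken.Length
        End While

        Return results
    End Function

    Sub Main()
        Dim ids As List(Of String) = GetDocumentIds("pre IDStart=ABC&IDEnd mid idstart=xyz&idend IDStart=&IDEnd IDStart=open")
        ' Prints "ABC, xyz"
        Console.WriteLine(String.Join(", ", ids))
    End Sub
End Module
```

The text before the first token, the empty ID, and the unterminated trailing token are all skipped.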