3
\$\begingroup\$

I'm interested in parsing and extracting the IDs found in a document. The IDs are found throughout the document in this format:

IDStart=1FG3392MQ9&IDEnd

The "1FG3392MQ9" is an example of such an ID (it will always appear between the IDStart and IDEnd tokens as in the above example). There will be text before, and possibly after, each occurrence, and an ID could appear once or many times on a given line.

Here's a utility function I have to parse and extract them:

Public Shared Function ParseALLDocumentIDs(text As String) As List(Of String)
    Dim allParsedIDs As New List(Of String)
    Dim IDregex As String = "IDStart=(.+?)&IDEnd"
    Dim matches As MatchCollection = Regex.Matches(text, IDregex, RegexOptions.IgnoreCase Or RegexOptions.Multiline)
    For Each m As Match In matches
        If m.Success Then
            Dim parsedID = m.GetFristCapture()
            If Not String.IsNullOrWhiteSpace(parsedID) Then
                allParsedIDs.Add(parsedID)
            End If
        End If
    Next
    Return allParsedIDs
End Function

Upon each match, the following extension method is used (defined in another module) to get the first group capture of a match:

<Extension()>
Public Function GetFristCapture(m As Match) As String
    ' Gets the first group capture "(...)" in a regex match.
    ' Returns empty string if not a match or there is no group capture found
    Dim capturedValue As String = ""
    If m IsNot Nothing AndAlso m.Success AndAlso m.Groups.Count > 1 Then
        ' Note: The first group is the entire match itself,
        ' so Group(1) is what we want, not Group(0)
        If m.Groups(1).Success Then
            capturedValue = m.Groups(1).Value.Trim
        End If
    End If
    Return capturedValue
End Function

Questions:

  1. How does the Regex look? Am I missing anything? The idea was to capture all text between any occurrences of those two tokens, and do it non-greedily so that it doesn't capture across multiple tokens.

  2. How does the overall approach to getting these IDs look? Is there a cleaner way?

Jamal
asked Sep 21, 2016 at 18:34
\$\endgroup\$

2 Answers

1
\$\begingroup\$

Currently, you're extracting (non-greedily) everything that appears between IDStart= and &IDEnd. Since this is an ID, and IDs are generally made up of alphanumeric characters, you can replace the .+? with a [a-z0-9]+ pattern (RegexOptions.IgnoreCase is already set, so it also matches the uppercase letters in your example). Because & is not part of that character class, the capture cannot run past the & of &IDEnd, so the result stays limited to the text between those two tokens.

However, if that is not your case, keep the .+? rather than switching to .*? so that you won't receive an empty capture for a string like: IDStart=&IDEnd.

If the first suggestion applies to your case, you can also remove the .Trim call from your GetFristCapture function, since the character class cannot match whitespace.
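As a rough sketch of how the tightened pattern could look in use (assuming IDs really are limited to ASCII letters and digits; the module name, function name and sample string here are made up for illustration):

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module IdParsingSketch
    ' Assumption: an ID consists only of ASCII letters and digits.
    Private ReadOnly IdPattern As New Regex("IDStart=([a-z0-9]+)&IDEnd", RegexOptions.IgnoreCase)

    Public Function ParseAllDocumentIds(text As String) As List(Of String)
        Dim ids As New List(Of String)
        For Each m As Match In IdPattern.Matches(text)
            ' Group 1 needs at least one letter or digit and cannot contain
            ' whitespace, so no emptiness check or Trim is required here.
            ids.Add(m.Groups(1).Value)
        Next
        Return ids
    End Function

    Sub Main()
        Dim sample As String = "foo IDStart=1FG3392MQ9&IDEnd bar IDStart=&IDEnd IDStart=abc123&IDEnd"
        ' Prints: 1FG3392MQ9, abc123
        Console.WriteLine(String.Join(", ", ParseAllDocumentIds(sample)))
    End Sub
End Module
```

The empty IDStart=&IDEnd pair in the sample produces no match at all, because the character class requires at least one character and can never consume the &.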

NOTE: I assume that GetFristCapture is a typo. If not, you should rename it to GetFirstCapture.

In the GetFristCapture function itself, you do not need a temporary variable. Just return from the If block:

Public Function GetFristCapture(m As Match) As String
    ' Gets the first group capture "(...)" in a regex match.
    ' Returns empty string if not a match or there is no group capture found
    If m IsNot Nothing AndAlso m.Success AndAlso m.Groups.Count > 1 Then
        ' Note: The first group is the entire match itself,
        ' so Groups(1) is what we want, not Groups(0)
        Return m.Groups(1).Value
    End If
    Return ""
End Function

Note that I removed the nested If block as well. Once m.Groups.Count > 1 holds, both Groups(0) and Groups(1) exist, and a group that did not take part in the match simply has an empty Value, so you can return Groups(1).Value directly.
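To illustrate why the inner Success check is not needed, here is a small sketch with a made-up pattern whose optional group does not take part in the match (the pattern and input are purely illustrative, not from your code):

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupValueSketch
    Sub Main()
        ' The optional group (b)? matches nothing in "ac".
        Dim m As Match = Regex.Match("ac", "a(b)?c")
        Console.WriteLine(m.Success)               ' True: the overall match succeeds
        Console.WriteLine(m.Groups(1).Success)     ' False: the group did not participate
        Console.WriteLine(m.Groups(1).Value = "")  ' True: its Value is the empty string
    End Sub
End Module
```

So returning m.Groups(1).Value gives the same empty string that the original temporary variable would have held.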

answered Sep 22, 2016 at 5:34
\$\endgroup\$
0
\$\begingroup\$

It's all a bit complicated in my opinion. Perhaps you could do it like this, without the need for regex at all.

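Public Shared Function GetDocumentIds(input As String) As List(Of String)
    Dim results = New List(Of String)()
    Dim sections = input.Split(New String() {"IDStart="}, StringSplitOptions.None)
    ' Start at index 1: sections(0) is the text before the first "IDStart=", not an ID.
    For i As Integer = 1 To sections.Length - 1
        Dim section = sections(i)
        Dim endIndex = section.IndexOf("&IDEnd", StringComparison.Ordinal)
        ' Skip sections that have no closing token.
        If endIndex < 0 Then Continue For
        Dim result = section.Substring(0, endIndex)
        If Not String.IsNullOrWhiteSpace(result) Then
            results.Add(result)
        End If
    Next
    Return results
End Function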
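The same token-scanning idea can also be written with IndexOf. This is only a sketch under my own naming (ScanDocumentIds and the sample text are hypothetical), but it makes explicit that text before the first IDStart= is skipped, and it lets the comparison ignore case the way the original regex did:

```vbnet
Imports System
Imports System.Collections.Generic

Module TokenScanSketch
    Private Const StartToken As String = "IDStart="
    Private Const EndToken As String = "&IDEnd"

    Public Function ScanDocumentIds(input As String) As List(Of String)
        Dim results As New List(Of String)
        Dim position As Integer = 0
        Do
            ' Find the next start token at or after the current position.
            Dim startIndex As Integer = input.IndexOf(StartToken, position, StringComparison.OrdinalIgnoreCase)
            If startIndex < 0 Then Exit Do
            Dim idStart As Integer = startIndex + StartToken.Length
            ' Find the first end token that follows it.
            Dim endIndex As Integer = input.IndexOf(EndToken, idStart, StringComparison.OrdinalIgnoreCase)
            If endIndex < 0 Then Exit Do ' no closing token left in the text
            Dim id As String = input.Substring(idStart, endIndex - idStart)
            If Not String.IsNullOrWhiteSpace(id) Then results.Add(id)
            position = endIndex + EndToken.Length
        Loop
        Return results
    End Function

    Sub Main()
        Dim sample As String = "text before IDStart=1FG3392MQ9&IDEnd middle idstart=ABC&idend after"
        ' Prints: 1FG3392MQ9, ABC
        Console.WriteLine(String.Join(", ", ScanDocumentIds(sample)))
    End Sub
End Module
```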
answered Sep 22, 2016 at 14:17
\$\endgroup\$
