5
\$\begingroup\$

I am reading a ~120 MB log file (~300 Million lines of text).

... 
30/05/2014 15:07:18 0000030-0000738 Net SourceID 11000006
30/05/2014 15:07:18 0000030-0000738 Communication RouteID 108
30/05/2014 15:14:44 0000030-0002027 Communication Route_Complete Already Completed
30/05/2014 15:07:40 0000030-0000739 Communication RouteID Selected 108
...

Each line is created programatically, so the MakeLogLine function can extract the exact information needed.

MakeLogLine

Private Function MakeLogLine(ByVal l As String, Optional ByVal lineNum As Integer = 0) As LogLine
 ' Trim any leading / trailing whitespace.'
 l = l.Trim()
 ' Extracts route Id separately to keep track of the ones that have been tagged.'
 Dim rid As String = ExtractRouteId(l)
 If rid IsNot Nothing Then
 Call Me.linesWithRouteIds.Add(New rIdLines With {.line = line, .rid = rid})
 End If
 Return New LogLine With {.lstr = l,
 .threadId = ExtractThreadId(l),
 .requestId = ExtractRequestId(l),
 .lineNum = lineNum,
 .sourceId = ExtractSourceId(l),
 .performanceId = ExtractPerformanceId(l),
 .routeId = rid,
 .methodName = ExtractMethodName(l),
 .time = ExtractTime(l)}
End Function

ExtractRouteId

An example of how I extract the text from the line.

Private Function ExtractRouteId(ByVal l As String) As String
 Dim rid As String
 If l.Length < 59 Then Return Nothing
 rid = l.Substring(50)
 If rid.Length < "RouteID X".Length Then Return Nothing
 If Not rid(0) = "R"c Then Return Nothing
 If Not rid(5) = "I"c Then
 ' Catches Route_Completed .. lines.'
 Return Nothing
 End If
 ' Assume its a RouteID by this stage.'
 ' Trim the RouteID from the string.'
 rid = rid.Substring(8, rid.Length - 8)
 ' For RouteID Selected.'
 Try
 If rid.Length > 8 AndAlso rid(0) = "S"c Then Return rid.Substring(9, rid.Length - 9)
 Catch ex As Exception
 Return rid
 End Try
 Return rid
End Function

GetAllLines

''' <summary>
''' Gets all the lines in the file at the path, puts them in the list of lines.
''' </summary>
''' <returns>True if successful, false if not.</returns>
Public Function GetAllLines() As Boolean
 If Not CheckFile() Then Return Nothing
 Const MaxLineSize As Integer = 1024 * 4
 Dim buffer(MaxLineSize - 1) As Char
 Dim charCount As Integer
 Dim builder As New System.Text.StringBuilder(62)
 Dim Split As String()
 Dim lineNum As Integer = 0
 ' Create a new list of log lines.'
 Me.lines = New List(Of LogLine)
 Using reader As New System.IO.StreamReader(Me.filename)
 Do Until reader.EndOfStream
 'Read the next blocksize(4)KB of the file.'
 charCount = reader.Read(buffer, 0, MaxLineSize)
 With builder
 Call .Clear()
 Call .Append(buffer, 0, charCount)
 If charCount = MaxLineSize Then
 'Read up to the next line break and append to the block just read.'
 Call .Append(reader.ReadLine())
 End If
 Split = .ToString.Split(CChar(Environment.NewLine))
 End With
 For LineNo As Integer = 0 To Split.Length - 1
 Call Me.lines.Add(MakeLogLine(Split(LineNo), lineNum))
 lineNum += 1
 Next
 Loop
 End Using
 ' Successful.'
 Return True
End Function


My Current Run times:

For a 120MB file, with 2,076,313 lines is 7.2 seconds. For a 200MB file, with 3,357,072 lines is 12.4 seconds.


GetAllLines
I am reading the file in chunks then appending new lines where necessary. This is the fastest way I have found.
But 7 Seconds is still too slow.

How can I improve on this?

ExtractRouteId
I am trying to eliminate as many calculations as possible, but is there any way I can improve upon this?

RubberDuck
31.1k6 gold badges73 silver badges176 bronze badges
asked Sep 23, 2014 at 11:16
\$\endgroup\$
7
  • 1
    \$\begingroup\$ I'm not sure what it would do for speed, but any reason you're not using the ReadLine method? \$\endgroup\$ Commented Sep 23, 2014 at 11:48
  • 2
    \$\begingroup\$ You don't need to write Call \$\endgroup\$ Commented Sep 23, 2014 at 12:43
  • \$\begingroup\$ @the_lotus why do I not need to write Call? \$\endgroup\$ Commented Sep 23, 2014 at 13:07
  • \$\begingroup\$ @JamesThorpe I have just tested the ReadLine method and the performance decreases by 0.5s for the 120MB file and .8s for the 200MB file. \$\endgroup\$ Commented Sep 23, 2014 at 13:09
  • 1
    \$\begingroup\$ Welcome to Code Review! Please do not add, remove, or edit code in a question after you've received an answer. Please see the meta question: What you can and cannot do after receiving answers. \$\endgroup\$ Commented Sep 23, 2014 at 16:20

1 Answer 1

7
\$\begingroup\$

Good

The shortcircuit condition aka AndAlso is used.
The StreamReader is enclosed inside a using statement.
Magic numbers are hidden behind const.
StringBuilder is constructed using a initial size.

Bad
StringBuilder is constructed using a initial size which is not high enough.
Using of Call, which is kept only for backward compatibility.
Returning Nothing instead of False (GetAllLines).
Returning Nothing instead of String.Empty (ExtractRouteId).
Using of single character parameters.

Inconsistent parameter naming
Sometimes camelCased which is the right choice regarding the naming convention, but sometime PascalCased which is wrong for parameters/variables.

Refactoring

Let us start with the GetAllLines() method. If we read the methodname we would expect that the method will return lines, but instead it returns True or Nothing.
So for testing purpose we should change the return type to a List(Of LogLine) and add an inputparameter fileName As String. In this way the method won't be coupled tightly to the Me object what as a guess will be a form.
(削除) The variable lineNum is only used to count the added LogLines which is useless, so let us remove it. (削除ここまで)
As we use Split = .ToString.Split(CChar(Environment.NewLine)) at each iteration of the Do until loop, we should extract CChar(Environment.NewLine) somehow outside the loop.
We should also rename the array Split to something more meaningful and something which isn't a known methodname.
Removing the With builder .. End With will save us one indentionlevel which makes our code more readable.
Last the CheckFile() method should be renamed to IsFileValid() and should have an inputparameter fileName As String.

Public Function GetAllLines(fileName As String) As List(Of LogLine)
 Dim logLines As New List(Of LogLine)()
 If Not CheckFile(fileName) Then Return logLines
 Const MaxLineSize As Integer = 4096
 Const stringBuilderCapacity As Integer = 8192
 Dim buffer(MaxLineSize - 1) As Char
 Dim charCount As Integer
 Dim builder As New System.Text.StringBuilder(stringBuilderCapacity)
 Dim singleLines As String()
 Dim newLineChars As Char() = Environment.NewLine.ToCharArray()
 Dim lineNumber As Integer = 0
 Using reader As New System.IO.StreamReader(fileName)
 Do Until reader.EndOfStream 
 charCount = reader.Read(buffer, 0, MaxLineSize)
 builder.Append(buffer, 0, charCount)
 If reader.Peek() >= 0 Then
 builder.Append(reader.ReadLine())
 End If
 singleLines = builder.ToString().Split(newLineChars)
 For Each singleLine As String In singleLines
 logLines.Add(MakeLogLine(singleLine, lineNumber))
 lineNumber += 1
 Next
 builder.Clear()
 Loop
 End Using
 Return logLines
End Function

Now let us take a look at the ExtractRouteId() method.
We will for readability change the name of the inputparameter to currentLine.
For the same purpose we will rename rid to routeId.
As we could count the length of RouteID X by hand, we should create a const out of our result and use it, but as we return String.Empty if the Length < 59 we won't need to check if rid.Length < "RouteID X" at all as this will be true.

We are only interested in the part of the string which starts at an specified index, so we can use the overloaded SubString() method which takes only the startindex and returns the string starting at this index till the end.

Private Const minRouteIdLength As Integer = 9
Private Function ExtractRouteId(ByVal currentLine As String) As String
 Dim routeId As String = String.Empty
 If currentLine.Length < 59 OrElse _
 Not (currentLine(50) = "R"c) OrElse _
 Not (currentLine(55) = "I"c) Then
 Return routeId
 End If
 routeId = currentLine.Substring(58)
 If routeId.Length > minRouteIdLength AndAlso routeId(0) = "S"c Then
 routeId = routeId.Substring(minRouteIdLength)
 End If
 Return routeId
End Function

As I need to leave the office now, I will just place the MakeLogLine() method here and will add some explanation tomorrow.

Private Function MakeLogLine(ByVal currentLine As String, Optional lineNumber As Integer = 0) As LogLine
 currentLine = currentLine.Trim()
 ' Extracts route Id separately to keep track of the ones that have been tagged.'
 Dim routeId As String = ExtractRouteId(currentLine)
 If String.IsNullOrEmpty(routeId) Then
 Me.linesWithRouteIds.Add(New rIdLines With {.line = line, .rid = routeId})
 End If
 Return New LogLine With {.lstr = currentLine,
 .threadId = ExtractThreadId(currentLine),
 .requestId = ExtractRequestId(currentLine),
 .lineNum = lineNumber,
 .sourceId = ExtractSourceId(currentLine),
 .performanceId = ExtractPerformanceId(currentLine),
 .routeId = routeId ,
 .methodName = ExtractMethodName(currentLine),
 .time = ExtractTime(currentLine)}
End Function
answered Sep 23, 2014 at 13:34
\$\endgroup\$
2
  • \$\begingroup\$ Thank you so much for going through everything! Could you tell me the advantage of using String.IsNullOrEmpty and String.empty, over Nothing. I have also optimized the other functions based on your above recommendations, although I was struggling to see how I could improve ExtractTime, could you please have a look above? Thankyou! ~Harry \$\endgroup\$ Commented Sep 23, 2014 at 16:12
  • \$\begingroup\$ @Mumf For a string this does't do any difference, but for keeping the code consistent, as you e.g can't have a 'Nothing' as a boolean like in the GetAllLines() method. \$\endgroup\$ Commented Sep 24, 2014 at 5:33

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.