I am reading a ~120 MB log file (~300 Million lines of text).
...
30/05/2014 15:07:18 0000030-0000738 Net SourceID 11000006
30/05/2014 15:07:18 0000030-0000738 Communication RouteID 108
30/05/2014 15:14:44 0000030-0002027 Communication Route_Complete Already Completed
30/05/2014 15:07:40 0000030-0000739 Communication RouteID Selected 108
...
Each line is created programatically, so the MakeLogLine
function can extract the exact information needed.
MakeLogLine
Private Function MakeLogLine(ByVal l As String, Optional ByVal lineNum As Integer = 0) As LogLine
' Trim any leading / trailing whitespace.'
l = l.Trim()
' Extracts route Id separately to keep track of the ones that have been tagged.'
Dim rid As String = ExtractRouteId(l)
If rid IsNot Nothing Then
Call Me.linesWithRouteIds.Add(New rIdLines With {.line = line, .rid = rid})
End If
Return New LogLine With {.lstr = l,
.threadId = ExtractThreadId(l),
.requestId = ExtractRequestId(l),
.lineNum = lineNum,
.sourceId = ExtractSourceId(l),
.performanceId = ExtractPerformanceId(l),
.routeId = rid,
.methodName = ExtractMethodName(l),
.time = ExtractTime(l)}
End Function
ExtractRouteId
An example of how I extract the text from the line.
Private Function ExtractRouteId(ByVal l As String) As String
Dim rid As String
If l.Length < 59 Then Return Nothing
rid = l.Substring(50)
If rid.Length < "RouteID X".Length Then Return Nothing
If Not rid(0) = "R"c Then Return Nothing
If Not rid(5) = "I"c Then
' Catches Route_Completed .. lines.'
Return Nothing
End If
' Assume its a RouteID by this stage.'
' Trim the RouteID from the string.'
rid = rid.Substring(8, rid.Length - 8)
' For RouteID Selected.'
Try
If rid.Length > 8 AndAlso rid(0) = "S"c Then Return rid.Substring(9, rid.Length - 9)
Catch ex As Exception
Return rid
End Try
Return rid
End Function
GetAllLines
''' <summary>
''' Gets all the lines in the file at the path, puts them in the list of lines.
''' </summary>
''' <returns>True if successful, false if not.</returns>
Public Function GetAllLines() As Boolean
If Not CheckFile() Then Return Nothing
Const MaxLineSize As Integer = 1024 * 4
Dim buffer(MaxLineSize - 1) As Char
Dim charCount As Integer
Dim builder As New System.Text.StringBuilder(62)
Dim Split As String()
Dim lineNum As Integer = 0
' Create a new list of log lines.'
Me.lines = New List(Of LogLine)
Using reader As New System.IO.StreamReader(Me.filename)
Do Until reader.EndOfStream
'Read the next blocksize(4)KB of the file.'
charCount = reader.Read(buffer, 0, MaxLineSize)
With builder
Call .Clear()
Call .Append(buffer, 0, charCount)
If charCount = MaxLineSize Then
'Read up to the next line break and append to the block just read.'
Call .Append(reader.ReadLine())
End If
Split = .ToString.Split(CChar(Environment.NewLine))
End With
For LineNo As Integer = 0 To Split.Length - 1
Call Me.lines.Add(MakeLogLine(Split(LineNo), lineNum))
lineNum += 1
Next
Loop
End Using
' Successful.'
Return True
End Function
My Current Run times:
For a 120MB file, with 2,076,313 lines is 7.2 seconds. For a 200MB file, with 3,357,072 lines is 12.4 seconds.
GetAllLines
I am reading the file in chunks then appending new lines where necessary.
This is the fastest way I have found.
But 7 Seconds is still too slow.
How can I improve on this?
ExtractRouteId
I am trying to eliminate as many calculations as possible, but is there any way I can improve upon this?
-
1\$\begingroup\$ I'm not sure what it would do for speed, but any reason you're not using the ReadLine method? \$\endgroup\$James Thorpe– James Thorpe2014年09月23日 11:48:56 +00:00Commented Sep 23, 2014 at 11:48
-
2\$\begingroup\$ You don't need to write Call \$\endgroup\$the_lotus– the_lotus2014年09月23日 12:43:08 +00:00Commented Sep 23, 2014 at 12:43
-
\$\begingroup\$ @the_lotus why do I not need to write Call? \$\endgroup\$Harry Mumford-Turner– Harry Mumford-Turner2014年09月23日 13:07:54 +00:00Commented Sep 23, 2014 at 13:07
-
\$\begingroup\$ @JamesThorpe I have just tested the ReadLine method and the performance decreases by 0.5s for the 120MB file and .8s for the 200MB file. \$\endgroup\$Harry Mumford-Turner– Harry Mumford-Turner2014年09月23日 13:09:32 +00:00Commented Sep 23, 2014 at 13:09
-
1\$\begingroup\$ Welcome to Code Review! Please do not add, remove, or edit code in a question after you've received an answer. Please see the meta question: What you can and cannot do after receiving answers. \$\endgroup\$RubberDuck– RubberDuck2014年09月23日 16:20:58 +00:00Commented Sep 23, 2014 at 16:20
1 Answer 1
Good
The shortcircuit condition aka AndAlso
is used.
The StreamReader
is enclosed inside a using
statement.
Magic numbers are hidden behind const.
StringBuilder
is constructed using a initial size.
Bad
StringBuilder
is constructed using a initial size which is not high enough.
Using of Call, which is kept only for backward compatibility.
Returning Nothing
instead of False
(GetAllLines).
Returning Nothing
instead of String.Empty
(ExtractRouteId).
Using of single character parameters.
Inconsistent parameter naming
Sometimes camelCased
which is the right choice regarding the naming convention, but sometime PascalCased
which is wrong for parameters/variables.
Refactoring
Let us start with the GetAllLines()
method. If we read the methodname we would expect that the method will return lines, but instead it returns True
or Nothing
.
So for testing purpose we should change the return type to a List(Of LogLine)
and add an inputparameter fileName As String
. In this way the method won't be coupled tightly to the Me
object what as a guess will be a form.
(削除) The variable lineNum
is only used to count the added LogLines
which is useless, so let us remove it. (削除ここまで)
As we use Split = .ToString.Split(CChar(Environment.NewLine))
at each iteration of the Do until
loop, we should extract CChar(Environment.NewLine)
somehow outside the loop.
We should also rename the array Split
to something more meaningful and something which isn't a known methodname.
Removing the With builder .. End With
will save us one indentionlevel which makes our code more readable.
Last the CheckFile()
method should be renamed to IsFileValid()
and should have an inputparameter fileName As String
.
Public Function GetAllLines(fileName As String) As List(Of LogLine)
Dim logLines As New List(Of LogLine)()
If Not CheckFile(fileName) Then Return logLines
Const MaxLineSize As Integer = 4096
Const stringBuilderCapacity As Integer = 8192
Dim buffer(MaxLineSize - 1) As Char
Dim charCount As Integer
Dim builder As New System.Text.StringBuilder(stringBuilderCapacity)
Dim singleLines As String()
Dim newLineChars As Char() = Environment.NewLine.ToCharArray()
Dim lineNumber As Integer = 0
Using reader As New System.IO.StreamReader(fileName)
Do Until reader.EndOfStream
charCount = reader.Read(buffer, 0, MaxLineSize)
builder.Append(buffer, 0, charCount)
If reader.Peek() >= 0 Then
builder.Append(reader.ReadLine())
End If
singleLines = builder.ToString().Split(newLineChars)
For Each singleLine As String In singleLines
logLines.Add(MakeLogLine(singleLine, lineNumber))
lineNumber += 1
Next
builder.Clear()
Loop
End Using
Return logLines
End Function
Now let us take a look at the ExtractRouteId()
method.
We will for readability change the name of the inputparameter to currentLine
.
For the same purpose we will rename rid
to routeId
.
As we could count the length of RouteID X
by hand, we should create a const out of our result and use it, but as we return String.Empty
if the Length < 59
we won't need to check if rid.Length < "RouteID X"
at all as this will be true.
We are only interested in the part of the string which starts at an specified index, so we can use the overloaded SubString()
method which takes only the startindex and returns the string starting at this index till the end.
Private Const minRouteIdLength As Integer = 9
Private Function ExtractRouteId(ByVal currentLine As String) As String
Dim routeId As String = String.Empty
If currentLine.Length < 59 OrElse _
Not (currentLine(50) = "R"c) OrElse _
Not (currentLine(55) = "I"c) Then
Return routeId
End If
routeId = currentLine.Substring(58)
If routeId.Length > minRouteIdLength AndAlso routeId(0) = "S"c Then
routeId = routeId.Substring(minRouteIdLength)
End If
Return routeId
End Function
As I need to leave the office now, I will just place the MakeLogLine()
method here and will add some explanation tomorrow.
Private Function MakeLogLine(ByVal currentLine As String, Optional lineNumber As Integer = 0) As LogLine
currentLine = currentLine.Trim()
' Extracts route Id separately to keep track of the ones that have been tagged.'
Dim routeId As String = ExtractRouteId(currentLine)
If String.IsNullOrEmpty(routeId) Then
Me.linesWithRouteIds.Add(New rIdLines With {.line = line, .rid = routeId})
End If
Return New LogLine With {.lstr = currentLine,
.threadId = ExtractThreadId(currentLine),
.requestId = ExtractRequestId(currentLine),
.lineNum = lineNumber,
.sourceId = ExtractSourceId(currentLine),
.performanceId = ExtractPerformanceId(currentLine),
.routeId = routeId ,
.methodName = ExtractMethodName(currentLine),
.time = ExtractTime(currentLine)}
End Function
-
\$\begingroup\$ Thank you so much for going through everything! Could you tell me the advantage of using String.IsNullOrEmpty and String.empty, over Nothing. I have also optimized the other functions based on your above recommendations, although I was struggling to see how I could improve ExtractTime, could you please have a look above? Thankyou! ~Harry \$\endgroup\$Harry Mumford-Turner– Harry Mumford-Turner2014年09月23日 16:12:57 +00:00Commented Sep 23, 2014 at 16:12
-
\$\begingroup\$ @Mumf For a string this does't do any difference, but for keeping the code consistent, as you e.g can't have a 'Nothing' as a boolean like in the
GetAllLines()
method. \$\endgroup\$Heslacher– Heslacher2014年09月24日 05:33:05 +00:00Commented Sep 24, 2014 at 5:33