The code below extracts data from one web page - I emulate search, select all results from the list and when the list appears (42000 items) I loop through these items.
I get an id value from their href to create a proper link (example href
is href="javascript:NeuFenster('rb_id=570964&land_abk=bw'
, so I don't want any JS interaction while I can just create working links). Then I navigate to this link and extract whole text from last table row.
The main reason that I post this code is a memory issue. From all my testing here is what happens:
iexplore.exe
process starts for the main result page, with a helper 32 bitiexplore.exe
*32;- memory used by
iexplore.exe
*32 increases up to 308 MB and when result page is fully loaded, it remains +/- 2MB constant through whole extraction process; - when a popup window (subpage for each result) is open, another
iexplore.exe
*32 opens, but unlike in the previous process, its memory usage never stops increasing, despitePopUpWindow.Quit
andset PopupWindow = nothing
; - a third
iexplore.exe
process starts once first pop-up window is open, it has a constant memory usage around 24 MB. I have a problem understanding why such process appears at all; - each time I have scraped around 6000 elements, IE automation error appears. I have concluded that maybe it's because at this point all
iexplore.exe
*32 processes together consume around 500MB of memory, maybe it's some limit. - when I run this macro on other old PC, which has hardly any free RAM, automation errors appear usually around 10 times faster and therefore using this macro is impossible on that PC.
Is there any way to stop this increasing memory usage? Maybe I should set IHTMLElement
objects to nothing as well?
Are there any other things which could be upgrade efficiency / transparency of the code (apart from moving code to a module, applying multithreading and using XML)?
Option Explicit
Private Sub CommandButton1_Click()
Dim htmlele As IHTMLElement, htmlele2 As IHTMLElement
Dim ie As Object, PopupWindow As Object
Dim Link As String, LinkID As String
Dim i As Long
Dim ws As Worksheet
'Dim TimeCount As Double
Set ws = ThisWorkbook.Sheets("results")
With ws
Application.StatusBar = False
Application.ScreenUpdating = False
Application.DisplayStatusBar = True
Link = "https://www.handelsregisterbekanntmachungen.de/?aktion=suche#Ergebnis"
Set ie = CreateObject("InternetExplorer.Application")
ie.Navigate Link
'ie.Visible = True
Do Until ie.ReadyState = 4 And ie.Busy = False
DoEvents
Loop
'click on search button, search criteria are default
For Each htmlele In ie.document.getElementsByTagName("input")
If htmlele.Value = "Suche starten" Then
htmlele.Click
Do Until ie.ReadyState = 4 And ie.Busy = False
DoEvents
Loop
Exit For
End If
Next htmlele
'find the "big" dropdown list with > 100 items, choose "all" and fire on change event to get all results listed
For Each htmlele In ie.document.getElementsByTagName("select")
If htmlele.Children.Length > 100 Then
htmlele.Value = "all"
htmlele.selectedIndex = htmlele.Children.Length - 1
htmlele.FireEvent ("onchange")
Exit For
End If
Next htmlele
Do Until ie.ReadyState = 4 And ie.Busy = False
DoEvents
Loop
i = 1
'TimeCount = Timer
For Each htmlele In ie.document.getElementsByTagName("a")
If InStr(htmlele.href, "javascript:NeuFenster") <> 0 Then
If htmlele.textContent <> vbNullString Then
i = i + 1
LinkID = Mid(htmlele.href, InStr(htmlele.href, "id=") + 3, 6)
Link = "https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php?rb_id=" & LinkID & "&land_abk=bw"
Set PopupWindow = CreateObject("InternetExplorer.Application")
PopupWindow.Navigate Link
'PopupWindow.Visible = True
Do Until PopupWindow.ReadyState = 4 And PopupWindow.Busy = False
DoEvents
Loop
For Each htmlele2 In PopupWindow.document.getElementsByTagName("tbody")
.Range("a" & i).Value = htmlele2.Children(htmlele2.Children.Length - 1).textContent
Next htmlele2
PopupWindow.Quit
Set PopupWindow = Nothing
'update status bar every 100 items
If i Mod 100 = 0 Then Application.StatusBar = i
End If
End If
Next htmlele
End With
ie.Quit
'Debug.Print Timer - TimeCount
Application.ScreenUpdating = True
Application.StatusBar = False
Exit Sub
End Sub
-
\$\begingroup\$ Welcome to Code Review! I hope you get some good answers. \$\endgroup\$Phrancis– Phrancis2016年10月02日 02:32:44 +00:00Commented Oct 2, 2016 at 2:32
-
1\$\begingroup\$ The pop-up you create for each individual case has no scripts--it's vanilla HTML. Rather than using a second IE process, you may want to look into XMLHTTP. \$\endgroup\$Jacob Manaker– Jacob Manaker2016年10月02日 19:41:07 +00:00Commented Oct 2, 2016 at 19:41
-
\$\begingroup\$ Yes, that's exactly what I want to learn next. Still some pop up pages which I have encountered had scripts and for them I must find a solution to this memory problem. \$\endgroup\$Ryszard Jędraszyk– Ryszard Jędraszyk2016年10月03日 15:44:19 +00:00Commented Oct 3, 2016 at 15:44
1 Answer 1
That is a lot of work for a little button Click
handler: your UserForm
is running the show, it's doing much more than just collecting user input - it is the program. This is a design pattern known as "Smart UI", where the whole application logic is implemented in the UI. That's good for prototyping, but quickly becomes an unmaintainable tangle of globals and click handlers with hundreds of lines of code and countless responsibilities; you'll want to take a look at UserForm best practices on SO Documentation to see how you can separate presentation concerns from the actual application logic.
So I'll pretend that code isn't in some click handler and actually a specialized method of some dedicated class module. But how specialized is it?
By dividing the procedure into multiple smaller ones that do one thing, you will gain in maintainability, and more importantly you will be reducing your objects' scopes.
In VBA the smallest possible scope is at procedure level. This means an object reference doesn't go out of scope until a procedure exits; if in theory setting a reference to Nothing
destroys it, in practice if it's still used in the same scope, then it's still in scope.
By splitting up your logic in smaller scopes you give much clearer clues to the VBA runtime: objects don't need to be set to Nothing
when they only ever live long enough to do their thing; in fact an object going out of scope is the only reliable way to make sure VBA cleans it up properly. PopupWindow
doesn't belong in the same scope as ie
.
Then there's the separation of concerns: collecting inputs (reading the scraped data) doesn't belong in the same procedure as outputting results (writing to some target worksheet).
There should be a function responsible for collecting the data, stuffing it into an array and returning that array to the caller; then another procedure takes that array and a given Worksheet
object, and writes all results in one single operarion, no looping is necessary. By separating inputs from outputs you'll be specializing each procedure, making the whole thing more performant. You don't need ScreenUpdating
turned off while you're collecting the data - you could use the Application.StatusBar
to update some progress indicator as you go (say once every 5%), and that wouldn't affect performance much - what's costing you here is the fact that you're writing to individual cells in a nested loop.
Extract procedures from each loop body, reduce variables/objects' scopes as much as possible, and separate concerns - you'll gain performance, lose memory footprint, and end up with overall more maintainable code.
-
\$\begingroup\$ Thanks for an answer, I will definitely separate the code into more scopes and give the feedback about how the memory behaves. As for arrays, I am quite hesitant in this particular case - outputting to cells lets me preserve all data in worksheet in case of disconnects or an error while interacting with website. Without multi-threading performance hit is almost none, but I will probably try at least creating arrays for every single row with data. \$\endgroup\$Ryszard Jędraszyk– Ryszard Jędraszyk2016年10月02日 19:00:45 +00:00Commented Oct 2, 2016 at 19:00
-
\$\begingroup\$ VBA/COM is single-threaded, you can't write multithreaded code in VBA. If you know the size of the ouptut 2D array from the start, you can dimension it once, populate it, and dump it onto the worksheet in one, single, instantaneous write operation; you're not going to "lose" any of the collected data if you handle runtime errors properly. \$\endgroup\$Mathieu Guindon– Mathieu Guindon2016年10月02日 19:04:16 +00:00Commented Oct 2, 2016 at 19:04
-
1\$\begingroup\$ I can't mullti-thread in VBA, but I can use VBA to create many VB script files that will do the work. \$\endgroup\$Ryszard Jędraszyk– Ryszard Jędraszyk2016年10月02日 19:59:37 +00:00Commented Oct 2, 2016 at 19:59
-
\$\begingroup\$ Unfortunately creating more scopes - Public Subs - especially separating the one responsible for creating PopupWindow instances didn't resolve memory problem, it's all the same. \$\endgroup\$Ryszard Jędraszyk– Ryszard Jędraszyk2016年10月02日 22:39:07 +00:00Commented Oct 2, 2016 at 22:39