I navigate with IE, do various things, then select the "all results" option from a list and fire its onclick event. Once all results have been listed, I loop through their URLs, using the following code to gather data from each URL.
What can I do to improve the speed of data scraping (apart from VBScript multithreading)?
Timings for extracting 1000 links:
- IE navigation with visible = false and no images: 3 min 45 sec
- msxml2.XMLHTTP: 3 min 30 sec
- msxml2.serverXMLHTTP: 1 min 40 sec

Setting .setRequestHeader "Content-Type", "text/xml" and .setRequestHeader "Accept-Encoding", "compress, gzip" didn't affect performance.
Is it better to declare and create the objects within this Sub (meaning inside the loop over 42,000 links) or outside, as public variables, so they are reused? (A sketch of the second option follows the code below.)
Public Sub PrintCompanyData()
    'Link, WS and ItemCount are module-level variables set by the calling code.
    Dim oDom As Object: Set oDom = CreateObject("htmlFile")
    Dim htmlelePopUp As IHTMLElement
    Dim unformattedData As String

    With CreateObject("msxml2.serverXMLHTTP") 'CreateObject("msxml2.xmlhttp")
        .Open "GET", Link, False
        '.setRequestHeader "Content-Type", "text/xml"
        '.setRequestHeader "Accept-Encoding", "compress, gzip"
        .send
        If .ReadyState = 4 And .Status = 200 Then
            oDom.body.innerHTML = .responseText
            With WS
                'Write the text of the last child of each tbody to column A, stripping line feeds.
                For Each htmlelePopUp In oDom.getElementsByTagName("tbody")
                    unformattedData = htmlelePopUp.Children(htmlelePopUp.Children.Length - 1).innerText
                    .Range("a" & ItemCount + 1).Value2 = Replace(unformattedData, Chr(10), vbNullString)
                Next htmlelePopUp
            End With
        End If
    End With

    Set oDom = Nothing
End Sub
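A rough sketch of the second option, with the request and DOM objects created once and reused for every link. The Links collection and the WS parameter are assumed names for illustration only; the per-link extraction is elided because it would be the same as in PrintCompanyData.

Public Sub PrintAllCompanyData(ByVal Links As Collection, ByVal WS As Worksheet)
    'Sketch only: one request object and one htmlfile DOM are reused for every
    'link instead of being created 42,000 times. Links is assumed to hold URLs.
    Dim http As Object: Set http = CreateObject("msxml2.serverXMLHTTP")
    Dim oDom As Object: Set oDom = CreateObject("htmlFile")
    Dim Link As Variant
    Dim ItemCount As Long

    For Each Link In Links
        http.Open "GET", CStr(Link), False
        http.send
        If http.Status = 200 Then
            oDom.body.innerHTML = http.responseText
            '... same tbody extraction and worksheet write as in PrintCompanyData ...
            ItemCount = ItemCount + 1
        End If
    Next Link
End Sub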
2 Answers
It's likely not a huge improvement, because the vast majority of this program's time will be spent going over the network, but there is a micro-optimization you can make here.
Instead of late binding like this:

    With CreateObject("msxml2.serverXMLHTTP")

you can early bind your objects, eliminating the time COM takes to do a dynamic lookup. However, the time this saves will be minuscule compared to the total runtime.
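For example, a sketch of the early-bound equivalent, assuming a project reference to "Microsoft XML, v6.0" has been set under Tools > References:

    'Early binding: requires a reference to "Microsoft XML, v6.0".
    Dim http As MSXML2.ServerXMLHTTP60
    Set http = New MSXML2.ServerXMLHTTP60
    http.Open "GET", Link, False
    http.send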
I'm not bashing on VBA, I <3 VBA, but it's just not the right tool for the job if you need it to run faster than this. What you really need is a language capable of multi-threading and asynchronous programming, to make the most of the CPU time while you're waiting for those HTTP requests to return. Sounds like it's time to learn some C# or F# to me. Even PowerShell may be a good option.
- Using early binding reduced the runtime by 2% on average. – Ryszard Jędraszyk, Oct 29, 2016 at 22:23
The line of code .Open "GET", Link, False causes the process to be asynchronous. Change the third parameter to True and the process will run asynchronously. Apparently, VBA is underappreciated.
- "Runs asynchronously"... Your first two sentences are contradictory, and it's not clear how the third relates to anything. Asynchronous... and then what? Sleep-wait until it finishes? VBA is single-threaded; surely you need an event sink to respond to asynchronous processing, and that changes the OP's code quite drastically. Mind illustrating? Or was this post meant to be a comment on the OP? – Mathieu Guindon, Sep 26, 2017 at 3:35
- Yes, the correction is: "The line of code .Open "GET", Link, False causes the process to be synchronous." A synchronous process places other demands on the coded application that are too involved to discuss here. – Warren Hall, Sep 24, 2019 at 11:15
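To illustrate the comment above, a minimal sketch of using the asynchronous form from VBA without a full event sink: send with True, then poll readyState with DoEvents. This is essentially the sleep-wait mentioned in the comment; it keeps Excel responsive but does not overlap requests, so it is shown for illustration rather than as a performance fix.

    'Sketch: asynchronous send followed by a DoEvents polling loop.
    With CreateObject("msxml2.serverXMLHTTP")
        .Open "GET", Link, True           'True = send asynchronously
        .send
        Do While .readyState <> 4         '4 = request completed
            DoEvents                      'yield so Excel stays responsive
        Loop
        If .Status = 200 Then Debug.Print Len(.responseText) & " characters received"
    End With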