With some help I've put together a macro to scrape webpage elements by tag ID from URLs (listed in column A of an Excel sheet) and insert the text into columns B, C and D. The code loops from the 1st row to the 1000th. So far so good.
The problem is that it seems to be taking SOOOO much time. Is there any way to speed up the process? I feel like my use of IE pages isn't optimal.
(Please note that I'm aware that, because of On Error Resume Next, the code copies the previous row's data as new, but I have an easy manual workaround to clean the data, which works fine until I get around to fixing that issue.)
Sub getCountry()
'
' GetCountry Macro
'
' Keyboard Shortcut: Ctrl+k
'
Dim IE As Object: Set IE = CreateObject("InternetExplorer.Application")
Dim ws As Worksheet: Set ws = ThisWorkbook.Sheets("Sheet1")
Dim Country As String
Dim Category As String
Dim Network As String
Dim n As Integer
n = 729
For n = 729 To 1000
With IE
Visible = False
.navigate ws.Range("A" & n).Value
Do
DoEvents
Loop Until .readyState = 4
End With
On Error Resume Next
Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
On Error Resume Next
Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
On Error Resume Next
Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
ws.Range("B" & n).Value2 = Country
ws.Range("C" & n).Value2 = Category
ws.Range("D" & n).Value2 = Network
Next n
IE.Quit
End Sub
-
Is there a reason you did this in VBA, instead of scraping it with a more specialized tool and exporting to CSV (which Excel can import perfectly fine)? – Mast ♦ Sep 30, 2017 at 20:03
-
Probably because of a lack of knowledge and a strong liking for Excel. I don't know of any tool specialized in scraping a long list of varying URLs, but I'm open to suggestions. As for the edit, I don't get the edit of the title; I thought phrasing it as a question was recommended. Thanks for the feedback and time! – Michael McCarthy Sep 30, 2017 at 20:06
-
As for the edit, no, the current title is more appropriate. See this meta for more information. – Mast ♦ Sep 30, 2017 at 20:13
2 Answers
The indentation is terribly misleading.
Compare yours:
For n = 729 To 1000
With IE
Visible = False
.navigate ws.Range("A" & n).Value
Do
DoEvents
Loop Until .readyState = 4
End With
To how Rubberduck's Smart Indenter default settings do it:
For n = 729 To 1000
    With IE
        Visible = False
        .navigate ws.Range("A" & n).Value
        Do
            DoEvents
        Loop Until .readyState = 4
    End With
If I glance at your code, I've no idea where the With
block ends, and at second glance it appears to end in the middle of a Do
loop body (which would be illegal) - it takes a third read to pick up the matching Loop
keyword hidden in the code.
Before modifying anything, fix the indentation.
Running Rubberduck inspections reveals that the variable Visible
is never declared, which means Option Explicit
isn't specified at the top of the module, which means VBA will happily compile typos, and that can easily lead to very hard-to-debug problems. ALWAYS specify Option Explicit
, and declare all variables.
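A minimal sketch of the fix (assuming the intent was to hide the browser, i.e. IE.Visible = False; with Option Explicit the missing dot becomes a compile error instead of a silent assignment to an implicit variable):
Option Explicit 'at the very top of the module

Sub getCountry()
    Dim IE As Object: Set IE = CreateObject("InternetExplorer.Application")
    With IE
        .Visible = False 'the leading dot makes this IE.Visible; without it, "Visible" is an undeclared variable
        '...
    End With
End Sub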
The macro is implicitly Public
, n
should probably be declared As Long
, and the procedure name should be PascalCase
, i.e. GetCountry
, although the name suggests it's getting some value, which usually implies a Function
that returns that value - Sub
procedures do something, they don't get stuff.
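For illustration, a hypothetical signature along those lines (the name and parameter are made up for the example, not taken from the original code):
Public Function GetCountry(ByVal doc As Object) As String
    'returns the country text scraped from the supplied document
    GetCountry = Trim$(doc.getElementByID("youtube-user-page-country").innerText)
End Function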
On Error Resume Next
will make your macro keep running in error state, which makes the assignments fail and, as you noted, the previous iteration's value gets repeated. Why not "reset" the 3 values at the end of each iteration? That way a failed getElementByID
would write an empty string instead of duplicating data. That said...
On Error Resume Next
Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
On Error Resume Next
Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
On Error Resume Next
Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
Error handling doesn't work that way; only the first OERN changes anything, the other two are perfectly redundant. If you want to restore error handling you need to do On Error GoTo 0
:
On Error Resume Next 'suppresses run-time errors
Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
Err.Clear 'clears error state
On Error GoTo 0 'restores run-time errors
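The "reset" idea mentioned above is then simply clearing the three variables at the end of each iteration, right before Next n:
Country = vbNullString
Category = vbNullString
Network = vbNullString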
As for speeding things up: even assuming lightning-speed scraping, this would be the bottleneck:
ws.Range("B" & n).Value2 = Country ws.Range("C" & n).Value2 = Category ws.Range("D" & n).Value2 = Network
You know how many iterations you're making, therefore you know from the start how many rows you're going to be writing to the worksheet; the fastest way to write to a worksheet isn't one cell at a time.
Declare a 2D array, populate it in your loop, and dump the array onto the worksheet to perform all Range
writes in one single, instantaneous operation. Otherwise you're having Excel raise Worksheet.Change
events, repainting itself, and possibly re-calculating things every time you write to a cell.
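A minimal sketch of that, using the bounds from the question (an illustration, not a drop-in replacement):
Dim results(729 To 1000, 1 To 3) As String 'one row per URL, one column per field
Dim n As Long

For n = 729 To 1000
    'scrape Country, Category, Network as before, then:
    results(n, 1) = Country
    results(n, 2) = Category
    results(n, 3) = Network
Next n

'one single write covering the whole block
ws.Range("B729:D1000").Value2 = results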
You could "cheat" and do Application.ScreenUpdating = False
at the top and set it back to True
at the end of the procedure, but that wouldn't be any more efficient code. Go with the 2D array.
The main problem with scraping in a browser is this: along with the data (the only thing you need is the HTML), the browser downloads everything (HTML, CSS, JS) and renders it, which wastes even more time.
The best solution for non-interactive scraping is to use an HTTP request and get only the HTML as the response. That way you are working with the HTML DOM without the rendering overhead.
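A minimal sketch of that approach (MSXML2.XMLHTTP plus a late-bound htmlfile document for parsing; the element IDs are the ones from the question, and error handling is omitted for brevity):
Dim http As Object, doc As Object
Set http = CreateObject("MSXML2.XMLHTTP.6.0")
Set doc = CreateObject("htmlfile")

http.Open "GET", ws.Range("A" & n).Value, False 'synchronous request, no browser involved
http.send
doc.body.innerHTML = http.responseText 'parse the returned HTML without rendering it

Country = Trim$(doc.getElementById("youtube-user-page-country").innerText)
Category = Trim$(doc.getElementById("youtube-user-page-channeltype").innerText)
Network = Trim$(doc.getElementById("youtube-user-page-network").innerText)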
Storing and writing the data in Excel should be done Mat's way: store everything in a 2D array and write it to the sheet once. Writing one cell at a time is brutally inefficient.