
With some help I've put together a macro to scrape webpage elements by element ID from URLs (listed in column A of an Excel sheet) and insert the text into columns B, C and D. The code loops from the 1st row to the 1000th. So far so good.

The problem is that it seems to be taking SOOOO much time. Is there any way to speed up the process? I feel like my use of IE pages isn't optimal.

(Please note I'm aware that, on error, the code copies the previous row's data as if it were new, but I have an easy manual workaround to clean the data, which works fine until I get around to fixing that issue.)

Sub getCountry()
'
' GetCountry Macro
'
' Keyboard Shortcut: Ctrl+k
'
 Dim IE As Object: Set IE = CreateObject("InternetExplorer.Application")
 Dim ws As Worksheet: Set ws = ThisWorkbook.Sheets("Sheet1")
 Dim Country As String
 Dim Category As String
 Dim Network As String
 Dim n As Integer
 n = 729
 For n = 729 To 1000
 With IE
 Visible = False
 .navigate ws.Range("A" & n).Value
 Do
 DoEvents
 Loop Until .readyState = 4
 End With
 On Error Resume Next
 Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
 On Error Resume Next
 Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
 On Error Resume Next
 Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
 ws.Range("B" & n).Value2 = Country
 ws.Range("C" & n).Value2 = Category
 ws.Range("D" & n).Value2 = Network
 Next n
 IE.Quit
 End Sub
asked Sep 30, 2017 at 19:49
  • Is there a reason you did this in VBA, instead of scraping it with a more specialized tool and exporting to CSV (which Excel can import perfectly fine)? Commented Sep 30, 2017 at 20:03
  • Probably because of a lack of knowledge and a strong liking for Excel. I don't know of any tool specialized in scraping a long list of varying URLs, but I'm open to suggestions. As for the edit, I don't understand the edit of the title; I thought phrasing it as a question was recommended. Thanks for the feedback and time! Commented Sep 30, 2017 at 20:06
  • As for the edit, no, the current title is more appropriate. See this meta for more information. Commented Sep 30, 2017 at 20:13

2 Answers


The indentation is terribly misleading.

Compare yours:

 For n = 729 To 1000
 With IE
 Visible = False
 .navigate ws.Range("A" & n).Value
 Do
 DoEvents
 Loop Until .readyState = 4
 End With

To how Rubberduck's Smart Indenter default settings do it:

For n = 729 To 1000
    With IE
        Visible = False
        .navigate ws.Range("A" & n).Value
        Do
            DoEvents
        Loop Until .readyState = 4
    End With

If I glance at your code, I've no idea where the With block ends, and at second glance it appears to end in the middle of a Do loop body (which would be illegal) - it takes a third read to pick up the matching Loop keyword hidden in the code.

Before modifying anything, fix the indentation.

Running Rubberduck inspections reveals that the variable Visible is never declared (inside the With IE block you almost certainly meant .Visible, with a leading dot), which means Option Explicit isn't specified at the top of the module, which means VBA will happily compile typos, and that can easily lead to very hard-to-debug problems. ALWAYS specify Option Explicit, and declare all variables.

The macro is implicitly Public, n should probably be declared As Long, and the procedure name should be PascalCase, i.e. GetCountry, although the name suggests it's getting some value, which usually implies a Function that returns that value - Sub procedures do something, they don't get stuff.
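
For illustration, here is a minimal sketch (not a full rewrite) of how the declarations and the navigation loop might look with Option Explicit, a Long counter, and an explicit .Visible member call - this assumes the intent of Visible = False was to hide the IE window:

Option Explicit

Public Sub GetCountry()
    Dim ie As Object
    Set ie = CreateObject("InternetExplorer.Application")

    Dim ws As Worksheet
    Set ws = ThisWorkbook.Worksheets("Sheet1")

    Dim n As Long
    For n = 729 To 1000
        With ie
            .Visible = False 'leading dot: sets ie.Visible, not an undeclared variable
            .navigate ws.Range("A" & n).Value
            Do
                DoEvents
            Loop Until .readyState = 4
        End With
        '...scrape and store the values here...
    Next n

    ie.Quit
End Sub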

On Error Resume Next will make your macro keep running in error state, which makes the assignments fail and, as you noted, the previous iteration's value gets repeated. Why not "reset" the 3 values at the end of each iteration? That way a failed getElementByID would write an empty string instead of duplicating data. That said...

 On Error Resume Next
 Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
 On Error Resume Next
 Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
 On Error Resume Next
 Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)

Error handling doesn't work that way; only the first On Error Resume Next changes anything, the other two are completely redundant. If you want to restore error handling you need to do On Error GoTo 0:

 On Error Resume Next 'suppresses run-time errors
 Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
 Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
 Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
 Err.Clear 'clears error state
 On Error GoTo 0 'restores run-time errors

As for speeding things up, assuming lightning-speed scraping this would be the bottleneck:

 ws.Range("B" & n).Value2 = Country
 ws.Range("C" & n).Value2 = Category
 ws.Range("D" & n).Value2 = Network

You know how many iterations you're making, therefore you know from the start how many rows you're going to be writing to the worksheet; the fastest way to write to a worksheet isn't one cell at a time.

Declare a 2D array, populate it in your loop, and dump the array onto the worksheet to perform all Range writes in one single, instantaneous operation. Otherwise you're having Excel raise Worksheet.Change events, repainting itself, and possibly re-calculating things every time you write to a cell.

You could "cheat" and do Application.ScreenUpdating = False at the top and set it back to True at the end of the procedure, but that wouldn't be any more efficient code. Go with the 2D array.

answered Sep 30, 2017 at 20:29

The main problem with scraping in a browser is this: along with the data you actually need (the HTML), the browser downloads everything (HTML, CSS, JS) and then renders it, wasting even more time.

The best solution for non-interactive scraping is to issue an HTTP request and get only the HTML as the response. That way you are working with the HTML DOM without any rendering overhead.
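
For example (a minimal sketch, assuming the element IDs from the question and that the values appear in the served HTML rather than being injected by JavaScript afterwards), an MSXML2.XMLHTTP request parsed through an htmlfile object could replace the IE automation:

Dim http As Object
Set http = CreateObject("MSXML2.XMLHTTP")

Dim html As Object
Set html = CreateObject("htmlfile")

'synchronous GET of the URL in column A (same ws and n as in the question)
http.Open "GET", ws.Range("A" & n).Value, False
http.send

'load the raw HTML into a DOM without rendering it
html.body.innerHTML = http.responseText

Dim country As String
country = Trim$(html.getElementById("youtube-user-page-country").innerText)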

Storing and writing the data in Excel should be done Mat's way: store everything in a 2D array and write it to the sheet once. Writing one cell at a time is brutally inefficient.

answered Oct 1, 2017 at 6:38