1
\$\begingroup\$

After a long try I've been able to create a script in vba which can successfully handle webpages with lazy-load. It can reach the bottom of a slow loading webpage if the hardcoded number of the loop is set accurately. I tried with few such pages and found it working flawlessly. The one I'm pasting below is created using finance.yahoo site. It can parse the title of different news after going down to a certain level of that page according to the loop I've defined. Now, what I wanna expect to have is do the same thing without using hardcoded delay what I've already used in my script. Thanks in advance for any guidance to the improvement.

Here is what I've written:

Sub Web_Data()
 Dim IE As New InternetExplorer, html As HTMLDocument
 Dim storage As Object, posts As Object
 With IE
 .Visible = True
 .navigate "https://finance.yahoo.com/"
 Do While .readyState <> READYSTATE_COMPLETE: Loop
 Set html = .document
 End With
 Application.Wait Now() + TimeValue("00:00:005")
 For scroll_down = 1 To 10
 Set storage = html.getElementsByClassName("StretchedBox")
 html.parentWindow.scrollBy 0, 99999
 Application.Wait Now() + TimeValue("00:00:005")
 Next scroll_down
 For Each posts In storage
 Row = Row + 1: Cells(Row, 1) = posts.ParentNode.innerText
 Next posts
 IE.Quit
End Sub

Reference to add to the library:

1. Microsoft Internet Controls
2. Microsoft HTML Object Library
asked Dec 19, 2017 at 20:14
\$\endgroup\$
2
  • \$\begingroup\$ The count of elements for StretchedBox increases as you move down the page (when it loads more content). With that site in your code, the unordered list (class="Mb(0) Ov(h) P(0) Wow(bw)") gets extended as end of page is near. Test keep going to end of page resulted 218 entries of "StretchedBox". Perhaps if you want all the headlines, you have to keep scrolling until the number of "StretchedBox" no longer increasing. You may also speed up loading if you disable image loading in the IE. \$\endgroup\$ Commented Dec 19, 2017 at 22:34
  • \$\begingroup\$ Yep it is. What if the number of titles of that page is unknown to me and I wanna run my script to get them all without being sure about how many loops it will take to reach there? Basically, If i increase the number of loop in my existing script, I can achieve what i want but I do not wish to hardcode any number. Thanks. \$\endgroup\$ Commented Dec 19, 2017 at 22:49

1 Answer 1

3
\$\begingroup\$

You may want to try this below. First is to try reduce loading times, especially images and videos when scrolling down. Then have a counter to count number of "StretchedBox" elements before the scroll down, then check up to a few more attempts until no more changes (I used 5 in code). You can remove the Debug lines in final version.

Note some "StretchedBox" isn't really an article headline (video related), you will need to work on those to filter out junk.

Option Explicit
Sub Web_Data()
 Const TITLES As String = "StretchedBox"
 Dim sh As Object, regval As String
 ' Change IE Options
 Set sh = CreateObject("WScript.Shell")
 regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Display Inline Images"
 sh.RegWrite regval, "no", "REG_SZ"
 regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Enable Browser Extensions"
 sh.RegWrite regval, "no", "REG_SZ"
 regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Play_Animations"
 sh.RegWrite regval, "no", "REG_SZ"
 regval = "HKCU\Software\Microsoft\Internet Explorer\Main\EnableAlternativeCodec"
 sh.RegWrite regval, "no", "REG_SZ"
 ' Prepare IE
 Dim IE As New InternetExplorer, html As HTMLDocument
 Dim posts As Object, sText As String
 Dim TitlesCount As Long, NoChangesCount As Integer, Row As Long
 With IE
 .Visible = True
 .navigate "https://finance.yahoo.com/"
 Debug.Print Now, "Navigated Start"
 Do Until .readyState = READYSTATE_COMPLETE: Loop
 Debug.Print Now, "Navigate Complete"
 Set html = .document
 End With
 NoChangesCount = 0
 On Error Resume Next
 Do
 DoEvents
 Application.Wait Now() + TimeValue("00:00:02")
 ' Remove all the annoying video parts
 RemoveVideos html
 TitlesCount = GetClassCount(html, TITLES)
 Debug.Print Now, TitlesCount
 html.parentWindow.scrollBy 0, 99999
 If TitlesCount = GetClassCount(html, TITLES) Then
 NoChangesCount = NoChangesCount + 1
 Else
 NoChangesCount = 0
 End If
 Loop Until NoChangesCount = 5 ' If no changes for some attempts, assume end of dynamic page
 On Error GoTo 0
 Columns("A").ClearContents
 For Each posts In html.getElementsByClassName(TITLES)
 sText = WorksheetFunction.Clean(posts.ParentNode.innerText)
 'sText = Trim(posts.ParentNode.innerText)
 If Len(sText) > 0 Then
 Row = Row + 1
 Cells(Row, 1).Value = sText
 End If
 Next posts
 IE.Quit
 ' Restore IE Options
 regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Display Inline Images"
 sh.RegWrite regval, "yes", "REG_SZ"
 regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Enable Browser Extensions"
 sh.RegWrite regval, "yes", "REG_SZ"
 regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Play_Animations"
 sh.RegWrite regval, "yes", "REG_SZ"
 regval = "HKCU\Software\Microsoft\Internet Explorer\Main\EnableAlternativeCodec"
 sh.RegWrite regval, "yes", "REG_SZ"
 Set sh = Nothing
End Sub
Private Function GetClassCount(Doc As HTMLDocument, ClassName As String) As Long
 GetClassCount = Doc.getElementsByClassName(ClassName).Length
End Function
Private Sub RemoveVideos(Doc As HTMLDocument)
 Dim oElement As Object
 For Each oElement In Doc.getElementsByClassName("yvp-main")
 oElement.innerHTML = ""
 Next
End Sub
answered Dec 20, 2017 at 4:50
\$\endgroup\$
1
  • \$\begingroup\$ The code in your dropbox link uses different method to break - it does stop eventually I believe, as it only breaks when NumChanges = TitleCount. With 4 sec wait each iteration, it takes 4 * TitlesCount, yesterday it has around 210, so you will have to wait 840s (14 minutes)! My code will stop if TitlesCount has not been increased 5 times, 2s wait each so, when it reaches the bottom of page, only 10 second is "wasted". Maybe I mislead you with NumChanges. I should have used NoChangesCount. Code updated. \$\endgroup\$ Commented Dec 20, 2017 at 21:41

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.