After a long try I've been able to create a script in vba which can successfully handle webpages with lazy-load. It can reach the bottom of a slow loading webpage if the hardcoded number of the loop is set accurately. I tried with few such pages and found it working flawlessly. The one I'm pasting below is created using finance.yahoo
site. It can parse the title of different news after going down to a certain level of that page according to the loop I've defined. Now, what I wanna expect to have is do the same thing without using hardcoded delay what I've already used in my script. Thanks in advance for any guidance to the improvement.
Here is what I've written:
Sub Web_Data()
Dim IE As New InternetExplorer, html As HTMLDocument
Dim storage As Object, posts As Object
With IE
.Visible = True
.navigate "https://finance.yahoo.com/"
Do While .readyState <> READYSTATE_COMPLETE: Loop
Set html = .document
End With
Application.Wait Now() + TimeValue("00:00:005")
For scroll_down = 1 To 10
Set storage = html.getElementsByClassName("StretchedBox")
html.parentWindow.scrollBy 0, 99999
Application.Wait Now() + TimeValue("00:00:005")
Next scroll_down
For Each posts In storage
Row = Row + 1: Cells(Row, 1) = posts.ParentNode.innerText
Next posts
IE.Quit
End Sub
Reference to add to the library:
1. Microsoft Internet Controls
2. Microsoft HTML Object Library
1 Answer 1
You may want to try this below. First is to try reduce loading times, especially images and videos when scrolling down. Then have a counter to count number of "StretchedBox" elements before the scroll down, then check up to a few more attempts until no more changes (I used 5 in code). You can remove the Debug
lines in final version.
Note some "StretchedBox" isn't really an article headline (video related), you will need to work on those to filter out junk.
Option Explicit
Sub Web_Data()
Const TITLES As String = "StretchedBox"
Dim sh As Object, regval As String
' Change IE Options
Set sh = CreateObject("WScript.Shell")
regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Display Inline Images"
sh.RegWrite regval, "no", "REG_SZ"
regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Enable Browser Extensions"
sh.RegWrite regval, "no", "REG_SZ"
regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Play_Animations"
sh.RegWrite regval, "no", "REG_SZ"
regval = "HKCU\Software\Microsoft\Internet Explorer\Main\EnableAlternativeCodec"
sh.RegWrite regval, "no", "REG_SZ"
' Prepare IE
Dim IE As New InternetExplorer, html As HTMLDocument
Dim posts As Object, sText As String
Dim TitlesCount As Long, NoChangesCount As Integer, Row As Long
With IE
.Visible = True
.navigate "https://finance.yahoo.com/"
Debug.Print Now, "Navigated Start"
Do Until .readyState = READYSTATE_COMPLETE: Loop
Debug.Print Now, "Navigate Complete"
Set html = .document
End With
NoChangesCount = 0
On Error Resume Next
Do
DoEvents
Application.Wait Now() + TimeValue("00:00:02")
' Remove all the annoying video parts
RemoveVideos html
TitlesCount = GetClassCount(html, TITLES)
Debug.Print Now, TitlesCount
html.parentWindow.scrollBy 0, 99999
If TitlesCount = GetClassCount(html, TITLES) Then
NoChangesCount = NoChangesCount + 1
Else
NoChangesCount = 0
End If
Loop Until NoChangesCount = 5 ' If no changes for some attempts, assume end of dynamic page
On Error GoTo 0
Columns("A").ClearContents
For Each posts In html.getElementsByClassName(TITLES)
sText = WorksheetFunction.Clean(posts.ParentNode.innerText)
'sText = Trim(posts.ParentNode.innerText)
If Len(sText) > 0 Then
Row = Row + 1
Cells(Row, 1).Value = sText
End If
Next posts
IE.Quit
' Restore IE Options
regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Display Inline Images"
sh.RegWrite regval, "yes", "REG_SZ"
regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Enable Browser Extensions"
sh.RegWrite regval, "yes", "REG_SZ"
regval = "HKCU\Software\Microsoft\Internet Explorer\Main\Play_Animations"
sh.RegWrite regval, "yes", "REG_SZ"
regval = "HKCU\Software\Microsoft\Internet Explorer\Main\EnableAlternativeCodec"
sh.RegWrite regval, "yes", "REG_SZ"
Set sh = Nothing
End Sub
Private Function GetClassCount(Doc As HTMLDocument, ClassName As String) As Long
GetClassCount = Doc.getElementsByClassName(ClassName).Length
End Function
Private Sub RemoveVideos(Doc As HTMLDocument)
Dim oElement As Object
For Each oElement In Doc.getElementsByClassName("yvp-main")
oElement.innerHTML = ""
Next
End Sub
-
\$\begingroup\$ The code in your dropbox link uses different method to break - it does stop eventually I believe, as it only breaks when
NumChanges = TitleCount
. With 4 sec wait each iteration, it takes4 * TitlesCount
, yesterday it has around 210, so you will have to wait 840s (14 minutes)! My code will stop ifTitlesCount
has not been increased 5 times, 2s wait each so, when it reaches the bottom of page, only 10 second is "wasted". Maybe I mislead you withNumChanges
. I should have usedNoChangesCount
. Code updated. \$\endgroup\$PatricK– PatricK2017年12月20日 21:41:41 +00:00Commented Dec 20, 2017 at 21:41
StretchedBox
increases as you move down the page (when it loads more content). With that site in your code, the unordered list (class="Mb(0) Ov(h) P(0) Wow(bw)"
) gets extended as end of page is near. Test keep going to end of page resulted 218 entries of "StretchedBox". Perhaps if you want all the headlines, you have to keep scrolling until the number of "StretchedBox" no longer increasing. You may also speed up loading if you disable image loading in the IE. \$\endgroup\$