I've created a parser that scrapes content from houzz.com. After requesting the given address, it collects the sub-category links from the left-hand sidebar. Each sub-category spans several pages with pagination, so the parser walks through every page of each sub-category, follows the individual listings it finds there, and parses the content of each one. I've written this to the best of my ability.
Sub HouzzFull()
    Const url = "https://www.houzz.com/professionals/"
    Dim http As New ServerXMLHTTP60
    Dim html As New HTMLDocument, htm As New HTMLDocument, hmm As New HTMLDocument
    Dim str As String, link As String, main As String
    Dim topics As Object, topic As Object
    Dim links As Object, post As Object, gist As HTMLHtmlElement
    Dim i As Long, y As Long, x As Long

    'Fetch the landing page
    With http
        .Open "GET", url, False
        .send
        html.body.innerHTML = .responseText
    End With

    'Sub-category links from the left-hand sidebar
    Set topics = html.getElementsByClassName("sidebar-item-label")

    For i = 1 To topics.Length - 1
        str = topics(i).href
        With http
            .Open "GET", str, False
            .send
            htm.body.innerHTML = .responseText
        End With

        'Pagination links for the current sub-category
        Set links = htm.getElementsByClassName("pageNumber")

        For y = 0 To links.Length - IIf(links.Length > 0, 1, 0)
            'y = 0 uses the page already loaded; later iterations fetch the next pages
            If y > 0 Then
                link = links(y).href
                With http
                    .Open "GET", link, False
                    .send
                    htm.body.innerHTML = .responseText
                End With
            End If

            'Professional listings on the current page
            Set topic = htm.getElementsByClassName("pro-title")

            For Each post In topic
                main = post.href
                With http
                    .Open "GET", main, False
                    .send
                    hmm.body.innerHTML = .responseText
                End With

                'Scrape the fields of interest from the profile page
                For Each gist In hmm.getElementsByClassName("container profile-carded")
                    x = x + 1
                    With gist.getElementsByClassName("profile-full-name")
                        If .Length Then Cells(x, 1) = .Item(0).innerText
                    End With
                    With gist.getElementsByClassName("info-list-text")
                        If .Length > 1 Then Cells(x, 2) = Replace(.Item(1).innerText, "Contact:", "")
                    End With
                    With gist.getElementsByClassName("info-list-text")
                        If .Length > 2 Then Cells(x, 3) = Replace(.Item(2).innerText, "Location:", "")
                    End With
                    With gist.getElementsByClassName("click-to-call-link text-gray-light trackMe")
                        If .Length Then Cells(x, 4) = .Item(0).phone    'custom "phone" attribute on the link
                    End With
                    With gist.getElementsByClassName("proWebsiteLink")
                        If .Length Then Cells(x, 5) = .Item(0).href
                    End With
                Next gist
            Next post
        Next y
    Next i
End Sub
Is there a way to run an audit of just HTML flat files and then scrape each element's text into Excel, to ensure you're using the correct class names and capturing any newly updated ones? – Allen Mattson, Jun 23, 2017 at 8:51
1 Answer
I have a couple of thoughts:
Be explicit when defining your constant url, i.e. give it a type (note that the Private modifier is only valid at module level, so this declaration would sit above the Sub):
Private Const url As String = "https://www.houzz.com/professionals/"
Be consistent in how you group your variable declarations. You've grouped some, but not others.
I'm thinking you can have one function that returns responseText each time you do this:
With http
.Open "GET", url, False
.send
html.body.innerHTML = .responseText
End With
...and the same goes for htm and hmm.
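For example, a minimal sketch of such a helper (the name GetResponseText and its parameters are my own choice, not anything from your code):

Private Function GetResponseText(ByVal http As ServerXMLHTTP60, ByVal requestUrl As String) As String
    'Performs a synchronous GET and hands back the response body
    With http
        .Open "GET", requestUrl, False
        .send
        GetResponseText = .responseText
    End With
End Function

Each call site then collapses to a single assignment, e.g. html.body.innerHTML = GetResponseText(http, url), htm.body.innerHTML = GetResponseText(http, str), and so on.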
When you use With...End With around a single line, the extra nesting doesn't aid readability and defeats the purpose of the construct. IMHO it would be clearer to just name the collection explicitly. I'm thinking something along these lines:
Set foo = gist.getElementsByClassName("profile-full-name")
If foo.Length Then Cells(x, 1) = foo.Item(0).innerText
The code inside For Each post In topic should be indented, just like you indented the body of For Each gist In hmm.getElementsByClassName("container profile-carded").
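For illustration, here is roughly the shape I mean once that block is indented (bodies elided, details as in your code):

For Each post In topic
    main = post.href
    '... fetch the profile page into hmm ...
    For Each gist In hmm.getElementsByClassName("container profile-carded")
        '... write the profile fields to the worksheet ...
    Next gist
Next post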