4
\$\begingroup\$

I've created a parser which is able to parse the full content of houzz.com. After going to the given address, it parses the sub-category links from the left sided bar. Each sub-category links spreads across several pages with pagination. Then it goes to the main page of each sub category links and parses the content over there. I tried to do the whole thing with my level best.

Sub HouzzFull()
Const url = "https://www.houzz.com/professionals/"
Dim http As New ServerXMLHTTP60
Dim html As New HTMLDocument, htm As New HTMLDocument, hmm As New HTMLDocument
Dim str As String, link As String, main As String
Dim topics As Object, topic As Object
Dim links As Object, post As Object, gist As HTMLHtmlElement
With http
 .Open "GET", url, False
 .send
 html.body.innerHTML = .responseText
End With
Set topics = html.getElementsByClassName("sidebar-item-label")
 For i = 1 To topics.Length - 1
 str = topics(i).href
 With http
 .Open "GET", str, False
 .send
 htm.body.innerHTML = .responseText
 End With
 Set links = htm.getElementsByClassName("pageNumber")
 For y = 0 To links.Length - IIf(links.Length > 0, 1, 0)
 If y > 0 Then
 link = links(y).href
 With http
 .Open "GET", link, False
 .send
 htm.body.innerHTML = .responseText
 End With
 End If
 Set topic = htm.getElementsByClassName("pro-title")
 For Each post In topic
 main = post.href
 With http
 .Open "GET", main, False
 .send
 hmm.body.innerHTML = .responseText
 End With
 For Each gist In hmm.getElementsByClassName("container profile-carded")
 x = x + 1
 With gist.getElementsByClassName("profile-full-name")
 If .Length Then Cells(x, 1) = .Item(0).innerText
 End With
 With gist.getElementsByClassName("info-list-text")
 If .Length > 1 Then Cells(x, 2) = Replace(.Item(1).innerText, "Contact:", "")
 End With
 With gist.getElementsByClassName("info-list-text")
 If .Length > 2 Then Cells(x, 3) = Replace(.Item(2).innerText, "Location:", "")
 End With
 With gist.getElementsByClassName("click-to-call-link text-gray-light trackMe")
 If .Length Then Cells(x, 4) = .Item(0).phone
 End With
 With gist.getElementsByClassName("proWebsiteLink")
 If .Length Then Cells(x, 5) = .Item(0).href
 End With
 Next gist
 Next post
 Next y
 Next i
End Sub
Mathieu Guindon
75.5k18 gold badges194 silver badges467 bronze badges
asked May 28, 2017 at 17:16
\$\endgroup\$
1
  • \$\begingroup\$ Is there a way to run an audit of just html flat files then scrape each element text into excel to ensure you're using the correct classnames and capturing any new updated ones? \$\endgroup\$ Commented Jun 23, 2017 at 8:51

1 Answer 1

1
\$\begingroup\$

I have a couple thoughts:

Be explicit in defining your constant called url, i.e.

Private Const url As String = "https://www.houzz.com/professionals/"

Be consistent in how you group your variable declarations. You've grouped some, but not others.

I'm thinking you can have one function that returns responseText each time you do this:

With http
 .Open "GET", url, False
 .send
 html.body.innerHTML = .responseText
End With

...and for htm, and hmm

When you use With, End With but only have one line between them, the excessive nesting doesn't aid readability and defeats the purpose of using With, End With. IMHO it would be clearer to just name it out explicitly.I'm thinking something along these lines:

Set foo = gist.getElementsByClassName("profile-full-name")
If foo.Length Then Cells(x, 1) = foo.Item(0).innerText

The code inside For Each post in Topic should be indented like you indented For Each gist In hmm.getElementsByClassName("container profile-carded").

answered Oct 21, 2017 at 13:39
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.