2
\$\begingroup\$

I've written this whole code to extract cell numbers from a website. It is extracting numbers perfectly but very slowly, and it's also hanging my Form while Extracting.

Imports HtmlAgilityPack
Imports System.Text.RegularExpressions
Public Class Extractor
Shared doc As New HtmlAgilityPack.HtmlDocument()
Public Shared Function ScrapLinks(TextBox1 As TextBox, ListBox1 As ListBox, lbllinks As Label)
Dim hw As New HtmlWeb()
Try
 doc = hw.Load(TextBox1.Text)
 doc.LoadHtml(doc.DocumentNode.SelectSingleNode("//*[@id='ad_list']").InnerHtml())
 For Each link As HtmlNode In doc.DocumentNode.SelectNodes("//a[@href]")
 Dim hrefValue As String = link.GetAttributeValue("href", String.Empty)
 If hrefValue.Contains("/detail/") Then
 If Not ListBox1.Items.Contains(hrefValue) Then
 ListBox1.Items.Add(hrefValue)
 End If
 End If
 Next
Catch ex As Exception
 MsgBox("Error " + ex.Message)
End Try
Return Nothing
End Function
Public Shared Function Scrapnums(lstbox As ListBox,lstnum As ListBox)
Try
 Dim hw As New HtmlWeb()
 doc = hw.Load(lstbox.SelectedItem)
 Dim data = doc.DocumentNode.SelectSingleNode("//*[@class='det_ad f_left']").InnerText
 Dim m As Match = Regex.Match(data, "(\+92|0092)-?\d{3}-?\d{7}|\d{11}|\d{4}-\d{7}")
 If Not lstnum.Items.Contains(m.Value) Then
 lstnum.Items.Add(m.Value)
 End If
Catch ex As Exception
End Try
Return Nothing
End Function
End Class
200_success
145k22 gold badges190 silver badges478 bronze badges
asked Sep 14, 2014 at 16:27
\$\endgroup\$
3
  • \$\begingroup\$ Anyone please help me! \$\endgroup\$ Commented Sep 14, 2014 at 16:33
  • 4
    \$\begingroup\$ Please be patient. Unlike Stack Overflow, it usually takes hours to days for a response — both because we are a smaller community, and because it takes more time to write a code review answer. Furthermore, you've asked on a weekend, when activity levels are lower. \$\endgroup\$ Commented Sep 14, 2014 at 17:00
  • 1
    \$\begingroup\$ @200_success: No-one on SO was willing to help him efficiently scrape phone numbers from a website either, that may well contribute to the silence. \$\endgroup\$ Commented Sep 14, 2014 at 18:37

2 Answers 2

2
\$\begingroup\$

This code has a major design flaw.

  • Always keep your user interface (UI) separated from your data access layer (DAL).

ScrapLinks

  • Should be named ScrapeLinks.
  • If a Function doesn't return anything, then it should be a Sub. In this case, we still want it to be a function.
  • No need to have a shared HtmlDocument as this is created by the HtmlWeb.
  • In every iteration where the conditions are met you add a new value to the ListBox item collection. This will result in a bunch of windows messages being sent. I guess the most costly is WM_PAINT. Scrape the links first, then compare the result with current links. Use the AddRange method to insert new items.
  • A Try-Catch block, whose only purpose is to display an error, should never be used outside the UI layer.

Public Shared Function ScrapeLinks(url As String) As IEnumerable(Of String)
 If (url Is Nothing) Then Throw New ArgumentNullException("url")
 Dim web As New HtmlWeb()
 Dim document As HtmlDocument = web.Load(url)
 If (Not document.DocumentNode Is Nothing) Then
 Dim rootNode As HtmlNode = document.DocumentNode.SelectSingleNode("//*[@id='ad_list']")
 If (Not rootNode Is Nothing) Then
 document.LoadHtml(rootNode.InnerHtml())
 If (Not document.DocumentNode Is Nothing) Then
 Dim childNodes As HtmlNodeCollection = document.DocumentNode.SelectNodes("//a[@href]")
 If (Not childNodes Is Nothing) Then
 Return (
 From childNode As HtmlNode
 In childNodes
 Let href As String = childNode.GetAttributeValue("href", String.Empty)
 Where href.Contains("/detail/")
 Select href
 )
 End If
 End If
 End If
 End If
 Return Nothing
End Function

Scrapnums

  • (Same as above)
  • Should be named ScrapeNumber as it only returns one number.

Public Shared Function ScrapeNumber(url As String) As String
 If (url Is Nothing) Then Throw New ArgumentNullException("url")
 Dim web As New HtmlWeb()
 Dim document As HtmlDocument = web.Load(url)
 Dim root As HtmlNode = document.DocumentNode.SelectSingleNode("//*[@class='det_ad f_left']")
 If (Not root Is Nothing) Then
 Dim regexMatch As Match = Regex.Match(root.InnerText, "(\+92|0092)-?\d{3}-?\d{7}|\d{11}|\d{4}-\d{7}")
 If (Not regexMatch Is Nothing) Then
 Return regexMatch.Value
 End If
 End If
 Return Nothing
End Function
answered Sep 20, 2014 at 9:37
\$\endgroup\$
1
\$\begingroup\$

Don't repeat Ken Thompson's greatest regret: spelling creat() without an "e". The functions ScrapLinks() and Scrapnums() should be called ScrapeLinks() and ScrapeNums(). (Even if identifiers in VB.NET are case-insensitive, consistency would be nice.)

In Scrapnums(), you swallow the exception. If anything fails, the user gets no feedback.

It seems odd that you take an HtmlDocument (obtained from hw.Load()), stringify part of it (using .InnerHtml()), then parse it again to obtain another HtmlDocument.

That said, there may not be much performance to be gained by optimizing this code, as the execution time will probably be dominated by the HTTP request rather than the parsing.

answered Sep 16, 2014 at 4:53
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.