I've written this whole code to extract cell numbers from a website. It is extracting numbers perfectly but very slowly, and it's also hanging my Form while Extracting.
Imports HtmlAgilityPack
Imports System.Text.RegularExpressions
Public Class Extractor
Shared doc As New HtmlAgilityPack.HtmlDocument()
Public Shared Function ScrapLinks(TextBox1 As TextBox, ListBox1 As ListBox, lbllinks As Label)
Dim hw As New HtmlWeb()
Try
doc = hw.Load(TextBox1.Text)
doc.LoadHtml(doc.DocumentNode.SelectSingleNode("//*[@id='ad_list']").InnerHtml())
For Each link As HtmlNode In doc.DocumentNode.SelectNodes("//a[@href]")
Dim hrefValue As String = link.GetAttributeValue("href", String.Empty)
If hrefValue.Contains("/detail/") Then
If Not ListBox1.Items.Contains(hrefValue) Then
ListBox1.Items.Add(hrefValue)
End If
End If
Next
Catch ex As Exception
MsgBox("Error " + ex.Message)
End Try
Return Nothing
End Function
Public Shared Function Scrapnums(lstbox As ListBox,lstnum As ListBox)
Try
Dim hw As New HtmlWeb()
doc = hw.Load(lstbox.SelectedItem)
Dim data = doc.DocumentNode.SelectSingleNode("//*[@class='det_ad f_left']").InnerText
Dim m As Match = Regex.Match(data, "(\+92|0092)-?\d{3}-?\d{7}|\d{11}|\d{4}-\d{7}")
If Not lstnum.Items.Contains(m.Value) Then
lstnum.Items.Add(m.Value)
End If
Catch ex As Exception
End Try
Return Nothing
End Function
End Class
-
\$\begingroup\$ Anyone please help me! \$\endgroup\$Shehryar Iqbal– Shehryar Iqbal2014年09月14日 16:33:25 +00:00Commented Sep 14, 2014 at 16:33
-
4\$\begingroup\$ Please be patient. Unlike Stack Overflow, it usually takes hours to days for a response — both because we are a smaller community, and because it takes more time to write a code review answer. Furthermore, you've asked on a weekend, when activity levels are lower. \$\endgroup\$200_success– 200_success2014年09月14日 17:00:30 +00:00Commented Sep 14, 2014 at 17:00
-
1\$\begingroup\$ @200_success: No-one on SO was willing to help him efficiently scrape phone numbers from a website either, that may well contribute to the silence. \$\endgroup\$The Blue Dog– The Blue Dog2014年09月14日 18:37:18 +00:00Commented Sep 14, 2014 at 18:37
2 Answers 2
This code has a major design flaw.
- Always keep your user interface (UI) separated from your data access layer (DAL).
ScrapLinks
- Should be named
ScrapeLinks
. - If a
Function
doesn't return anything, then it should be aSub
. In this case, we still want it to be a function. - No need to have a shared
HtmlDocument
as this is created by theHtmlWeb
. - In every iteration where the conditions are met you add a new value to the
ListBox
item collection. This will result in a bunch of windows messages being sent. I guess the most costly isWM_PAINT
. Scrape the links first, then compare the result with current links. Use theAddRange
method to insert new items. - A
Try-Catch
block, whose only purpose is to display an error, should never be used outside the UI layer.
Public Shared Function ScrapeLinks(url As String) As IEnumerable(Of String)
If (url Is Nothing) Then Throw New ArgumentNullException("url")
Dim web As New HtmlWeb()
Dim document As HtmlDocument = web.Load(url)
If (Not document.DocumentNode Is Nothing) Then
Dim rootNode As HtmlNode = document.DocumentNode.SelectSingleNode("//*[@id='ad_list']")
If (Not rootNode Is Nothing) Then
document.LoadHtml(rootNode.InnerHtml())
If (Not document.DocumentNode Is Nothing) Then
Dim childNodes As HtmlNodeCollection = document.DocumentNode.SelectNodes("//a[@href]")
If (Not childNodes Is Nothing) Then
Return (
From childNode As HtmlNode
In childNodes
Let href As String = childNode.GetAttributeValue("href", String.Empty)
Where href.Contains("/detail/")
Select href
)
End If
End If
End If
End If
Return Nothing
End Function
Scrapnums
- (Same as above)
- Should be named
ScrapeNumber
as it only returns one number.
Public Shared Function ScrapeNumber(url As String) As String
If (url Is Nothing) Then Throw New ArgumentNullException("url")
Dim web As New HtmlWeb()
Dim document As HtmlDocument = web.Load(url)
Dim root As HtmlNode = document.DocumentNode.SelectSingleNode("//*[@class='det_ad f_left']")
If (Not root Is Nothing) Then
Dim regexMatch As Match = Regex.Match(root.InnerText, "(\+92|0092)-?\d{3}-?\d{7}|\d{11}|\d{4}-\d{7}")
If (Not regexMatch Is Nothing) Then
Return regexMatch.Value
End If
End If
Return Nothing
End Function
Don't repeat Ken Thompson's greatest regret: spelling creat()
without an "e". The functions ScrapLinks()
and Scrapnums()
should be called ScrapeLinks()
and ScrapeNums()
. (Even if identifiers in VB.NET are case-insensitive, consistency would be nice.)
In Scrapnums()
, you swallow the exception. If anything fails, the user gets no feedback.
It seems odd that you take an HtmlDocument
(obtained from hw.Load()
), stringify part of it (using .InnerHtml()
), then parse it again to obtain another HtmlDocument
.
That said, there may not be much performance to be gained by optimizing this code, as the execution time will probably be dominated by the HTTP request rather than the parsing.
Explore related questions
See similar questions with these tags.