13

I am trying to retrieve some information from a website. I want to look for a specific tag/class and then return the contained text value (innerHTML). This is what I have so far:

$request = Invoke-WebRequest -Uri $url -UseBasicParsing
$HTML = New-Object -Com "HTMLFile"
$src = $request.RawContent
$HTML.write($src)
foreach ($obj in $HTML.all) { 
 $obj.getElementsByClassName('some-class-name') 
}

I think there is a problem with converting the HTML into the HTML object, since I see a lot of undefined properties and empty results when I'm trying to "Select-Object" them.

After spending two days on this: how am I supposed to parse HTML with PowerShell?

Since parsing HTML with regex is such a big no-no, what is the alternative? Nothing seems to work.

asked Jun 28, 2019 at 14:53
  • Check out the HTML Agility Pack NuGet package. It's raw .NET, but will help you immensely when dealing with HTML. Commented Jun 28, 2019 at 14:57

3 Answers

12

Since no one else has posted an answer: I managed to get a working solution with the following code:

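# Download the page; -UseBasicParsing skips the Internet Explorer-based DOM parsing
$request = Invoke-WebRequest -Uri $URL -UseBasicParsing
# Create an empty HTML document via the "HTMLFile" COM object (Windows only)
$HTML = New-Object -Com "HTMLFile"
# Use the response body (.Content); .RawContent also contains the HTTP headers
[string]$htmlBody = $request.Content
$HTML.write([ref]$htmlBody)
# $htmlClassName holds the class name to look for
$filter = $HTML.getElementsByClassName($htmlClassName)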

With some URLs the $filter variable came back empty, while it was populated for others. All in all, this might work for your situation, but it seems PowerShell isn't the way to go for more complex parsing.

answered Oct 7, 2019 at 18:25

3 Comments

I would point out that this solution works only on PowerShell deployed on Windows. The COM objects are not available in PowerShell v7.x.x generally.
Use this answer, if .write() throws an error.
$filter shows a lot of properties, and some of them are non-empty, but when I access those properties with dot notation (such as $filter.innerText), nothing is returned. What am I doing wrong?
5

In 2020, with Windows PowerShell 5.x, you do it like this:

$searchClass = "banana" # in this example we parse all elements of class "banana", but you can use any class name you wish
$myURI = "url.com" # replace url.com with any website you want to scrape from
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12 # using TLS 1.2 is vitally important
$req = Invoke-WebRequest -Uri $myURI
$req.ParsedHtml.getElementsByClassName($searchClass) | ForEach-Object { Write-Host $_.innerHTML }
# for extra credit we can parse all the links
$req.ParsedHtml.getElementsByTagName('a') | ForEach-Object { Write-Host $_.href } # outputs all the links
Krzysztof Madej
answered Feb 10, 2020 at 18:51

4 Comments

When I look up IHTMLDocument2 I only see 2 methods, write and close. Where is getElementsByClassName declared? How do I find what other methods are available to the ParsedHtml property?
In 2020, with PowerShell 7.0.3, this unfortunately doesn't work. The response ($req) will not have a property called ParsedHtml. Is this a powershell-classic-only feature?
Try $req = Invoke-WebRequest -Uri $myURI -UseBasicParsing
@BenR "This parameter has been deprecated. Beginning with PowerShell 6.0.0, all Web requests use basic parsing only. This parameter is included for backwards compatibility only and any use of it has no effect on the operation of the cmdlet."
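A script that has to run under both editions could check which one it is in before relying on ParsedHtml. This is only a minimal sketch: the function name Test-ParsedHtmlAvailable is made up for illustration, and it assumes that looking at $PSVersionTable.PSEdition is a good enough test in your environment (it does not verify that the Internet Explorer engine is actually usable).

```powershell
# Hypothetical helper: reports whether Invoke-WebRequest can be expected to
# populate .ParsedHtml. Assumption: only Windows PowerShell offers this, i.e.
# edition 'Desktop', or no PSEdition entry at all in versions before 5.1.
function Test-ParsedHtmlAvailable {
    $PSVersionTable.PSEdition -ne 'Core'
}

if (Test-ParsedHtmlAvailable) {
    'ParsedHtml should be available (unless -UseBasicParsing is passed).'
} else {
    'No ParsedHtml here; feed the .Content string to a separate HTML parser.'
}
```

In the second case, the response still carries the raw HTML text in its Content property, so a third-party parser can take over from there.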
4

If installing a third-party module is an option:

  • The PSParseHTML module wraps the HTML Agility Pack and the AngleSharp .NET libraries (NuGet packages); you can use either for HTML parsing; the latter requires -Engine AngleSharp as an opt-in. As for their respective DOMs (object models):

    • The HTML Agility Pack, which is used by default, provides an object model that is similar to the XML DOM provided by the standard System.Xml.XmlDocument .NET type ([xml]). See this answer for an example of its use.

    • AngleSharp, which requires opt-in via -Engine AngleSharp, is built upon the official W3C specification and therefore provides an HTML DOM as available in web browsers. Notably, this means that its .QuerySelector() and .QuerySelectorAll() methods can be used with the usual CSS selectors, such as shown below.

  • An added advantage of using this module is that it is not just cross-edition, but also cross-platform; that is, you can use it in Windows PowerShell as well as in PowerShell (Core) 7+, and via the latter also on Unix-like platforms.
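To give a feel for the XML-DOM style of object model mentioned above, here is a small sketch that uses the built-in [xml] type itself with an XPath query. It is not general HTML parsing: [xml] only accepts well-formed XML, which most real-world pages are not, and the XHTML fragment and the class name item are invented for the illustration.

```powershell
# Hypothetical, well-formed XHTML fragment; real-world HTML is usually NOT valid XML.
$xhtml = @'
<html>
  <body>
    <ul class="menu">
      <li class="item">First</li>
      <li class="item special">Second</li>
    </ul>
    <p class="note">Not a list item</p>
  </body>
</html>
'@

# Parse the text into a System.Xml.XmlDocument via the [xml] type accelerator.
$doc = [xml] $xhtml

# XPath has no class selector, so match 'item' as a whole word
# inside the space-separated class attribute.
$xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' item ')]"

# Collect the text content of every matching element.
$texts = @($doc.SelectNodes($xpath) | ForEach-Object { $_.InnerText })
$texts
```

The XPath expression is noticeably clumsier than the CSS selector .item, which is one practical reason to prefer an engine that supports CSS selectors when searching by class.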


A self-contained example based on the AngleSharp engine that parses the home page of the English Wikipedia and extracts all HTML elements whose class attribute value is vector-menu-content-list:

# Install the PSParseHTML module on demand
if (-not (Get-Module -ErrorAction Ignore -ListAvailable PSParseHTML)) {
    Write-Verbose "Installing PSParseHTML module for the current user..."
    Install-Module -Scope CurrentUser PSParseHTML -ErrorAction Stop
}
# Using the AngleSharp engine, parse the home page of the English Wikipedia
# into an HTML DOM.
$htmlDom = ConvertFrom-Html -Engine AngleSharp -Url https://en.wikipedia.org
# Extract all HTML elements with a 'class' attribute value of 'vector-menu-content-list'
# and output their text content (.TextContent)
$htmlDom.QuerySelectorAll('.vector-menu-content-list').TextContent
answered Nov 8, 2023 at 16:30
