I created the script below to send a request to a website, then convert the table in the results to an array of PSObjects which I can work with in PowerShell. This uses some nasty hacks (e.g. using a regex to strip HTML tags from my XML to try to improve the likelihood of it parsing as valid XML), but I couldn't find cleaner solutions, and this seems to work.
Any thoughts on where I may have used a nasty hack where a more elegant solution exists?
function Create-Url {
    [CmdletBinding()]
    param (
        #using parameter sets even though only one since we'll likely beef up this method to take other input types in future
        [Parameter(ParameterSetName='UriFormAction', Mandatory = $true)]
        [System.Uri]$Uri
        ,
        [Parameter(ParameterSetName='UriFormAction', Mandatory = $true)]
        [Microsoft.PowerShell.Commands.FormObject]$Form
    )
    process {
        $builder = New-Object System.UriBuilder
        $builder.Scheme = $url.Scheme
        $builder.Host = $url.Host
        $builder.Port = $url.Port
        $builder.Path = $form.Action
        write-output $builder.ToString()
    }
}
function ConvertFrom-HtmlTableRow {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTableRow
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        $headers
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        [switch]$isHeader
    )
    process {
        $cols = $htmlTableRow | select -expandproperty td
        if($isHeader.IsPresent) {
            write-output $cols
        } else {
            $colCount = ($cols | Measure-Object).Count
            <# extra overhead that I dont care about right now
            if(-not (($headers) -or ($headers -eq $null) -or (($headers | Measure-Object).Count -ne $colCount))) {
                $headers = 1..$colCount | %{("Column_{0:00000}" -f $_)}
            }
            #>
            $result = new-object -TypeName PSObject
            1..$colCount | %{
                $i = $_ - 1
                if($headers[$i] -ne $null) {
                    $colName = $headers[$i]
                    $colValue = $cols[$i]
                    write-debug "$colName = $colValue"
                    $result | Add-Member NoteProperty $colName $colValue
                }
            }
            write-output $result
        }
    }
}
function ConvertFrom-HtmlTable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTable
    )
    process {
        #currently only very basic <table><tr><td>...</td></tr></table> structure supported
        #could be improved to better understand tbody, th, nested tables, etc
        #$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow
        #remove any tags that aren't td or tr (simplifies our parsing of the data)
        [xml]$cleanedHtml = '<root>' + ($htmlTable | select -ExpandProperty innerHTML | %{($_ | out-string) -replace '(?:(</?tr)|(</?td))[^>]*(/?>)|(?:<[^>]*>)','$1$2$3'}) + '</root>'
        $headers = $cleanedHtml.root.tr | select -first 1 | ConvertFrom-HtmlTableRow -isHeader
        if ($headers -gt [System.String]::Empty) {
            $cleanedHtml.root.tr | select -skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | select $headers
        }
    }
}
clear-host
[System.Uri]$url = 'http://some.site.with.tables.com/Subnet_Audit.asp' #link to some website
[System.String]$subnet = '123.45.67' #this relates to a specific parameter in the form; in my case the site checks the AV versions of all computers within a given IP range
$rqst = Invoke-WebRequest $url -SessionVariable avsv
$form = $rqst.Forms[0]
$form.Fields["strsubnet"] = $subnet
$url = Create-Url -Uri $url -Form $form
$rqst = Invoke-WebRequest -Uri $url -WebSession $avsv -Method $form.Method -Body $form.Fields
$rqst.ParsedHtml.getElementsByTagName('table') | ConvertFrom-HtmlTable
ps. an updated version of this code is also available here: stackoverflow.com/questions/25918094/… – JohnLBevan, Aug 11, 2015 at 18:48
1 Answer
Parsing HTML
I won't lay into you too badly about using RegEx on HTML since you clearly already know it's a bad idea and are trying to parse it as XML instead. Not a bad idea.
I have to recommend HTML Agility Pack though. It is designed for HTML and it works well with imperfect HTML (unlike the XML parser which is very strict).
It seems to be designed with C# in mind but I've used it in PowerShell before with great success.
Also, you're relying on the parsing done by Invoke-WebRequest, which relies on Internet Explorer. This can have some side effects: for example, it won't work on Server Core (because IE is not installed), and it won't work if it's run as a user who has never opened Internet Explorer before. It also sometimes messes up its check of whether IE is available.
Using the Agility Pack won't rely on any of that. You can still use Invoke-WebRequest -UseBasicParsing to retrieve the page without doing the DOM parsing, and it doesn't need IE for that call.
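As a rough sketch of what that could look like (not a drop-in replacement): the function name ConvertFrom-HtmlTableText and the $AgilityPackPath parameter below are hypothetical, and the code assumes you have downloaded HtmlAgilityPack.dll yourself. It also assumes the first row of the first table holds unique, non-empty header names.

```powershell
function ConvertFrom-HtmlTableText {
    param (
        [Parameter(Mandatory = $true)]
        [string]$Html
        ,
        # Hypothetical parameter: full path to wherever you saved HtmlAgilityPack.dll
        [Parameter(Mandatory = $true)]
        [string]$AgilityPackPath
    )
    Add-Type -Path $AgilityPackPath
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.LoadHtml($Html)

    # Only the first table is handled; SelectSingleNode/SelectNodes return $null on no match
    $table = $doc.DocumentNode.SelectSingleNode('//table')
    if ($null -eq $table) { return }
    $rows = $table.SelectNodes('.//tr')
    if ($null -eq $rows) { return }

    # Treat the first row as the header row, whether it uses th or td cells
    $headerNodes = $rows[0].SelectNodes('th|td')
    if ($null -eq $headerNodes) { return }
    $headers = @($headerNodes | ForEach-Object { $_.InnerText.Trim() })

    foreach ($row in ($rows | Select-Object -Skip 1)) {
        $cellNodes = $row.SelectNodes('td')
        if ($null -eq $cellNodes) { continue }   # skip rows without data cells
        $cells = @($cellNodes | ForEach-Object { $_.InnerText.Trim() })
        $props = [ordered]@{}
        for ($i = 0; $i -lt $headers.Count; $i++) {
            $props[$headers[$i]] = $cells[$i]
        }
        [pscustomobject]$props
    }
}
```

You would feed it the raw markup, e.g. the Content property of the response from Invoke-WebRequest -UseBasicParsing, so neither the IE-based DOM nor the regex clean-up is involved.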
Aliases
This is a personal choice, but I don't like aliases in finished scripts that are intended to be reused. I would replace % with ForEach-Object and ? with Where-Object.
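To illustrate, here is the column-name expression from the question written without aliases, followed by a made-up filter to show the ? case. The variable names are only for the example.

```powershell
# Alias form, as in the question: 1..3 | %{ "Column_{0:00000}" -f $_ }
$columnNames = 1..3 | ForEach-Object { 'Column_{0:00000}' -f $_ }

# Alias form: $columnNames | ?{ $_ -ne 'Column_00002' }
$kept = $columnNames | Where-Object { $_ -ne 'Column_00002' }

$columnNames
$kept
```

The same reasoning would apply to select (Select-Object) elsewhere in the script.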
Create-Url
There's a bug in this. You accept a parameter called $Uri but refer to $url in the function body.
Once you fix that, it has another problem: it ignores the existing path of the passed-in URI, and it doesn't consider what kind of form action it has been given. An absolute action will break, a relative action with a leading / will work, and a relative action with no leading / will work when the original URI has no path but will break otherwise.
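One way to cover all three cases, as a sketch, is to let System.Uri resolve the action against the page's URI using its (Uri, String) constructor. The function name Resolve-FormActionUri is made up, and it takes the action as a plain string rather than a FormObject so it can be tried in isolation.

```powershell
function Resolve-FormActionUri {
    param (
        [Parameter(Mandatory = $true)]
        [System.Uri]$Uri
        ,
        # The form's action attribute; an empty action resolves to the page itself
        [string]$Action = ''
    )
    # Uri(Uri, String) resolves a relative reference against the base URI;
    # an absolute action is returned unchanged
    New-Object System.Uri -ArgumentList $Uri, $Action
}

$base = [System.Uri]'http://example.com/dir/page.asp'
(Resolve-FormActionUri -Uri $base -Action 'submit.asp').AbsoluteUri            # relative, no leading /
(Resolve-FormActionUri -Uri $base -Action '/submit.asp').AbsoluteUri           # relative, leading /
(Resolve-FormActionUri -Uri $base -Action 'https://other.example/x').AbsoluteUri   # absolute
```

In the original script you would pass $form.Action as the second argument.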
ConvertFrom-HtmlTableRow
if($headers[$i] -ne $null)
This could just be written as:
if($headers[$i])
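One caveat worth knowing: the two forms are not strictly equivalent, because the short form also treats an empty string (and 0) as false. For header names that is probably what you want, but the difference can be seen with a few sample values (the $samples list is just for illustration):

```powershell
# Compare an explicit null check with PowerShell's implicit boolean conversion
$samples = @($null, '', 'Name', 0)
$report = foreach ($value in $samples) {
    [pscustomobject]@{
        Value    = $value
        # $null on the left avoids the filtering behaviour -ne has with array operands
        NotNull  = ($null -ne $value)
        IsTruthy = [bool]$value
    }
}
$report
```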
ConvertFrom-HtmlTable
if ($headers -gt [System.String]::Empty)
You can use ![String]::IsNullOrEmpty($headers) for this, but really, it can be shortened to just:
if ($headers)
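Since $headers is normally an array here, it may help to see how arrays behave in both tests. With an array on the left, -gt acts as a filter and returns the matching elements, while a bare if converts the array to a boolean. The sample values below are invented:

```powershell
$headers = @('Host', '', 'Version')

# -gt with an array on the left returns the elements greater than the right-hand side
$filtered = $headers -gt [System.String]::Empty

# Array-to-boolean conversion rules
$emptyArray  = [bool]@()          # no elements: false
$singleEmpty = [bool]@('')        # one element: uses that element's truthiness
$twoEmpty    = [bool]@('', '')    # two or more elements: always true

$filtered
```

So if ($headers) is false for a missing or empty header row, and true as soon as there are two or more cells, even blank ones.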