How to extract specific tables from html file using native powershell commands?


OK, this isn't thoroughly tested but works with your example table in PS 2.0 with IE11:

    # Parsing HTML with IE.
    $oIE = New-Object -ComObject InternetExplorer.Application
    # Navigate needs a full path or URL, so the relative name is resolved first.
    $oIE.Navigate((Resolve-Path "file.html").Path)
    # Waiting until the page has finished loading (ReadyState 4 = complete).
    while ($oIE.Busy -or $oIE.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
    $oHtmlDoc = $oIE.Document

    # Getting table by ID.
    $oTable = $oHtmlDoc.getElementByID("table6")

    # Extracting table rows as a collection.
    $oTbody = $oTable.childNodes | Where-Object { $_.tagName -eq "tbody" }
    $cTrs = $oTbody.childNodes | Where-Object { $_.tagName -eq "tr" }

    # Creating a collection of table headers.
    $cThs = $cTrs[0].childNodes | Where-Object { $_.tagName -eq "th" }
    $cHeaders = @()
    foreach ($oTh in $cThs) {
        $cHeaders += `
            ($oTh.childNodes | Where-Object { $_.tagName -eq "b" }).innerHTML
    }

    # Converting rows to a collection of PS objects exportable to CSV.
    $cCsv = @()
    foreach ($oTr in $cTrs) {
        $cTds = $oTr.childNodes | Where-Object { $_.tagName -eq "td" }
        # Skipping the first row (headers), which has no td cells.
        if ([String]::IsNullOrEmpty($cTds)) { continue }
        $oRow = New-Object PSObject
        for ($i = 0; $i -lt $cHeaders.Count; $i++) {
            $oRow | Add-Member -MemberType NoteProperty -Name $cHeaders[$i] `
                -Value $cTds[$i].innerHTML
        }
        $cCsv += $oRow
    }

    # Closing IE.
    $oIE.Quit()

    # Exporting CSV.
    $cCsv | Export-Csv -Path "file.csv" -NoTypeInformation

Honestly, I didn't aim for optimal code. It's just an example of how you could work with DOM objects in PS and convert them to PS objects.
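
The conversion step itself does not depend on IE. As a rough sketch of just that part, the snippet below builds PS objects from plain string arrays instead of DOM nodes. The helper name ConvertTo-RowObject and the sample data are made up for illustration and are not part of the script above.

```powershell
# Hypothetical helper: builds one PS object from parallel arrays of
# column names and cell values (plain strings here, not DOM nodes).
function ConvertTo-RowObject {
    param(
        [string[]]$Headers,
        [string[]]$Cells
    )
    $oRow = New-Object PSObject
    for ($i = 0; $i -lt $Headers.Count; $i++) {
        # One NoteProperty per column, named after the header.
        $oRow | Add-Member -MemberType NoteProperty -Name $Headers[$i] -Value $Cells[$i]
    }
    $oRow
}

# Sample data, invented for this example.
$cHeaders = @('Name', 'Status')
$cRows = @(@('Server01', 'OK'), @('Server02', 'Failed'))

# Each inner array becomes one object; the loop output is collected directly.
$cCsv = foreach ($cCells in $cRows) {
    ConvertTo-RowObject -Headers $cHeaders -Cells $cCells
}

# The objects can go to ConvertTo-Csv or Export-Csv like any others.
$cCsv | ConvertTo-Csv -NoTypeInformation
```

Collecting the loop output in one assignment avoids growing an array with += on every row, which is the main thing I would change in the original if the tables were large.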


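I see you accepted an answer, but I thought I'd add a RegEx solution here too. No COM objects are needed for this one. Note that it is not PSv2 friendly as written: Get-Content -Raw and the .Groups.Value shorthand on a collection both require PowerShell 3.0 or later.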

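    $Path = 'C:\Path\To\File.html'
    [regex]$regex = "(?s)<TABLE ID=.*?</TABLE>"
    # Reading the whole file as a single string so the regex can span lines.
    $tables = $regex.Matches((Get-Content $Path -Raw)).Groups.Value
    ForEach ($String in $tables) {
        # One array element per line; "\r?\n" copes with both LF and CRLF files.
        $table = $String -split "\r?\n"
        $CurTable = @()
        # The first line holds the TABLE tag, and its ID becomes the file name.
        $CurTableName = ([regex]'TABLE ID="([^"]*)"').Matches($table[0]).Groups[1].Value
        # Header row: joining the <TH><B> cells with commas, then stripping the tags.
        $CurTable += ($table[1] -replace "</B></TH><TH><B>", ",") -replace "</?(TR|TH|B)>"
        # Data rows: everything between the header row and the closing </TABLE> line.
        $CurTable += $table[2..($table.Count - 2)] | ForEach-Object { $_ -replace "</TD><TD>", "," -replace "</?T(D|R)>" }
        $CurTable | ConvertFrom-Csv | Export-Csv "C:\Path\To\Output\$CurTableName.csv" -NoTypeInformation
    }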

That should output one CSV file for each table found, such as table6.csv, table9.csv, etc. If you wanted to output CSVs per HTML file, you could point $Path at a folder instead of a single file and wrap the entire thing in a ForEach loop like:

    ForEach ($File in (Get-ChildItem "$Path\*.html")) {
        # Insert above code here
    }

You would need to modify the $tables = line so that it reads GC $File.FullName -Raw, so that it loads each file as the loop iterates.

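Then just modify the Export-Csv to something like the following. The per-file folder has to exist already, because Export-Csv does not create missing folders: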

    $CurTable | ConvertFrom-Csv | Export-Csv "C:\Path\To\Output\$($File.BaseName)\$CurTableName.csv" -NoTypeInformation

So if you had Server01.html with 3 tables in it you would get a folder named Server01 with 3 CSV files in it, one for each table.
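
To try the regex approach without touching real files, here is a minimal sketch that runs the same replacements against an inline sample and creates the output folder before exporting. The function name ConvertFrom-HtmlTables, the sample HTML, and the temp output folder are all assumptions for illustration. The sample layout (the TABLE tag, each TR row, and the closing tag each on their own line, with at least one data row) is only what the regexes above appear to expect, not something confirmed from your actual file.

```powershell
# Hypothetical helper: returns a hashtable of table ID -> row objects.
# Assumes <TABLE ID="...">, every <TR> row, and </TABLE> sit on separate lines,
# and that cell values contain no commas.
function ConvertFrom-HtmlTables {
    param([string]$Html)
    $result = @{}
    foreach ($m in [regex]::Matches($Html, '(?s)<TABLE ID="([^"]*)".*?</TABLE>')) {
        $lines = $m.Value -split "\r?\n"
        $csv = @()
        # Header row: join <TH><B> cells with commas, then strip the tags.
        $csv += ($lines[1] -replace '</B></TH><TH><B>', ',') -replace '</?(TR|TH|B)>'
        # Data rows: everything between the header row and the </TABLE> line.
        $csv += $lines[2..($lines.Count - 2)] | ForEach-Object {
            $_ -replace '</TD><TD>', ',' -replace '</?T(D|R)>'
        }
        # @() keeps single-row tables as arrays.
        $result[$m.Groups[1].Value] = @($csv | ConvertFrom-Csv)
    }
    $result
}

# Sample input, invented for this example.
$sample = @'
<HTML><BODY>
<TABLE ID="table6">
<TR><TH><B>Name</B></TH><TH><B>Status</B></TH></TR>
<TR><TD>Server01</TD><TD>OK</TD></TR>
<TR><TD>Server02</TD><TD>Failed</TD></TR>
</TABLE>
<TABLE ID="table9">
<TR><TH><B>Disk</B></TH><TH><B>FreeGB</B></TH></TR>
<TR><TD>C:</TD><TD>12</TD></TR>
</TABLE>
</BODY></HTML>
'@

$tables = ConvertFrom-HtmlTables -Html $sample

# Hypothetical output folder, standing in for one named after the HTML file.
$outDir = Join-Path ([IO.Path]::GetTempPath()) 'Server01'
# Export-Csv does not create missing folders, so create it first.
if (-not (Test-Path $outDir)) { New-Item -ItemType Directory -Path $outDir | Out-Null }

foreach ($name in $tables.Keys) {
    $tables[$name] | Export-Csv (Join-Path $outDir "$name.csv") -NoTypeInformation
}
```

Returning objects instead of writing files inside the function makes it easy to inspect a table, for example with $tables['table6'], before anything is exported.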