How to extract specific tables from an HTML file using native PowerShell commands?
OK, this isn't thoroughly tested but works with your example table in PS 2.0 with IE11:
# Parsing HTML with IE.
$oIE = New-Object -ComObject InternetExplorer.Application
$oIE.Navigate("file.html")
# Navigate() returns immediately, so wait until the page has finished loading.
while ($oIE.Busy) { Start-Sleep -Milliseconds 100 }
$oHtmlDoc = $oIE.Document

# Getting table by ID.
$oTable = $oHtmlDoc.getElementByID("table6")

# Extracting table rows as a collection.
$oTbody = $oTable.childNodes | Where-Object { $_.tagName -eq "tbody" }
$cTrs = $oTbody.childNodes | Where-Object { $_.tagName -eq "tr" }

# Creating a collection of table headers.
$cThs = $cTrs[0].childNodes | Where-Object { $_.tagName -eq "th" }
$cHeaders = @()
foreach ($oTh in $cThs) {
    $cHeaders += `
        ($oTh.childNodes | Where-Object { $_.tagName -eq "b" }).innerHTML
}

# Converting rows to a collection of PS objects exportable to CSV.
$cCsv = @()
foreach ($oTr in $cTrs) {
    $cTds = $oTr.childNodes | Where-Object { $_.tagName -eq "td" }
    # Skipping the first row (headers).
    if ([String]::IsNullOrEmpty($cTds)) { continue }
    $oRow = New-Object PSObject
    for ($i = 0; $i -lt $cHeaders.Count; $i++) {
        $oRow | Add-Member -MemberType NoteProperty -Name $cHeaders[$i] `
            -Value $cTds[$i].innerHTML
    }
    $cCsv += $oRow
}

# Closing IE.
$oIE.Quit()

# Exporting CSV.
$cCsv | Export-Csv -Path "file.csv" -NoTypeInformation
Honestly, I didn't aim for optimal code. It's just an example of how you could work with DOM objects in PS and convert them to PS objects.
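To make the "convert them to PS objects" step easier to see on its own, here is a minimal sketch that runs without IE. The $headers and $rows arrays are hypothetical stand-ins for the header texts and cell texts you would read from the DOM; they are not part of the original example.

```powershell
# Hypothetical stand-ins for the header texts and the cell texts of each row.
$headers = @('Name', 'Status')
$rows = @(
    @('Spooler', 'Running'),
    @('BITS', 'Stopped')
)

$objects = @()
foreach ($cells in $rows) {
    # One PSObject per table row, with one NoteProperty per column.
    $row = New-Object PSObject
    for ($i = 0; $i -lt $headers.Count; $i++) {
        $row | Add-Member -MemberType NoteProperty -Name $headers[$i] -Value $cells[$i]
    }
    $objects += $row
}

# Objects shaped like this can be piped to ConvertTo-Csv or Export-Csv.
$objects | ConvertTo-Csv -NoTypeInformation
```

Because every row becomes an object whose property names are the column headers, Export-Csv can write the header line and the data lines without any further work.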
I see you accepted an answer but I thought I'd add a RegEx solution in here too. No COM objects needed for this one, and should be PSv2 friendly I'm pretty sure.
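$Path = 'C:\Path\To\File.html'
[regex]$regex = "(?s)<TABLE ID=.*?</TABLE>"
# ReadAllText and ForEach-Object stand in for Get-Content -Raw and .groups.value,
# which both need PS 3.0 or later.
$tables = $regex.Matches([IO.File]::ReadAllText($Path)) | ForEach-Object { $_.Value }
ForEach($String in $tables){
    # Splitting on \r?\n keeps stray carriage returns out of the CSV data.
    $table = $String -split "\r?\n"
    $CurTable = @()
    $CurTableName = ([regex]'TABLE ID="([^"]*)"').Match($table[0]).Groups[1].Value
    # Header row: turn cell boundaries into commas, then strip the remaining tags.
    $CurTable += ($table[1] -replace "</B></TH><TH><B>",",") -replace "</?(TR|TH|B)>"
    # Data rows: everything between the header row and the closing </TABLE> line.
    $CurTable += $table[2..($table.count-2)]|ForEach{$_ -replace "</TD><TD>","," -replace "</?T(D|R)>"}
    $CurTable | convertfrom-csv | export-csv "C:\Path\To\Output\$CurTableName.csv" -notype
}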
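If you want to see what the replacements actually do before pointing this at real files, here is a small sketch that runs against an inline string. The sample HTML is made up; its layout (one TR per line, upper-case tags, header cells wrapped in B) is an assumption inferred from the patterns in the code above, and the approach will not cope with markup that is laid out differently.

```powershell
# Hypothetical sample input. The layout is an assumption inferred from the regex patterns.
$html = @'
<HTML><BODY>
<TABLE ID="table6">
<TR><TH><B>Name</B></TH><TH><B>Status</B></TH></TR>
<TR><TD>Spooler</TD><TD>Running</TD></TR>
<TR><TD>BITS</TD><TD>Stopped</TD></TR>
</TABLE>
</BODY></HTML>
'@

# (?s) lets "." match newlines, so a single match spans a whole table.
$regex = [regex]'(?s)<TABLE ID=.*?</TABLE>'

$results = @{}
foreach ($match in $regex.Matches($html)) {
    # One element per line: <TABLE ...>, header row, data rows, </TABLE>.
    $lines = $match.Value -split "\r?\n"
    $name = ([regex]'TABLE ID="([^"]*)"').Match($lines[0]).Groups[1].Value

    $csv = @()
    # Header row: cell boundaries become commas, leftover tags are removed.
    $csv += ($lines[1] -replace '</B></TH><TH><B>', ',') -replace '</?(TR|TH|B)>'
    # Data rows: same idea with <TD> cells.
    $csv += $lines[2..($lines.Count - 2)] |
        ForEach-Object { ($_ -replace '</TD><TD>', ',') -replace '</?T(D|R)>' }

    # Keep the parsed rows in memory, keyed by table ID, instead of writing files.
    $results[$name] = @($csv | ConvertFrom-Csv)
}

$results['table6'] | Format-Table -AutoSize
```

Note that cell values containing commas or nested tags would break the generated CSV lines, so treat this as a quick tool for predictable, machine-generated reports.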
That should output one CSV file for each table found, such as table6.csv, table9.csv, etc. If you wanted to output CSVs per HTML file, you could wrap the entire thing in a ForEach loop like:
# $Path must point at the folder that holds the HTML files here, not at a single file.
ForEach($File in (Get-ChildItem "$Path\*.html")){
    # Insert above code here
}
You would need to modify the $tables = line so that it reads $File.FullName instead of the fixed path, so that it loads each file as the loop iterates.
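Then just modify the Export-Csv to something like the following. Export-Csv will not create the per-file folder for you, so create it first:
New-Item -ItemType Directory -Path "C:\Path\To\Output\$($File.BaseName)" -Force | Out-Null
$CurTable | convertfrom-csv | export-csv "C:\Path\To\Output\$($File.BaseName)\$CurTableName.csv" -notype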
So if you had Server01.html with 3 tables in it you would get a folder named Server01 with 3 CSV files in it, one for each table.
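For reference, here is a rough sketch of how the per-file loop and the per-table export could fit together. The function name Export-HtmlTables and its -Path and -OutputRoot parameters are made up for illustration, and it relies on the same layout assumptions as the regex code above (one row per line, upper-case tags).

```powershell
function Export-HtmlTables {
    param(
        [string]$Path,        # hypothetical: folder containing the *.html files
        [string]$OutputRoot   # hypothetical: folder that gets one subfolder per HTML file
    )
    $tableRegex = [regex]'(?s)<TABLE ID=.*?</TABLE>'
    $idRegex    = [regex]'TABLE ID="([^"]*)"'

    foreach ($file in (Get-ChildItem -Path (Join-Path $Path '*.html'))) {
        # Export-Csv does not create missing folders, so make the target first.
        $outDir = Join-Path $OutputRoot $file.BaseName
        New-Item -ItemType Directory -Path $outDir -Force | Out-Null

        $html = [IO.File]::ReadAllText($file.FullName)
        foreach ($match in $tableRegex.Matches($html)) {
            $lines = $match.Value -split "\r?\n"
            $name  = $idRegex.Match($lines[0]).Groups[1].Value

            # Header line first, then one CSV line per data row.
            $csv  = @(($lines[1] -replace '</B></TH><TH><B>', ',') -replace '</?(TR|TH|B)>')
            $csv += $lines[2..($lines.Count - 2)] |
                ForEach-Object { ($_ -replace '</TD><TD>', ',') -replace '</?T(D|R)>' }

            $csv | ConvertFrom-Csv |
                Export-Csv -Path (Join-Path $outDir "$name.csv") -NoTypeInformation
        }
    }
}
```

You would call it with something like Export-HtmlTables -Path 'C:\Path\To' -OutputRoot 'C:\Path\To\Output', which should leave a Server01 folder holding table6.csv, table9.csv and so on for a Server01.html input.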