Use getElementById on HTMLElement instead of HTMLDocument Use getElementById on HTMLElement instead of HTMLDocument vba vba

Use getElementById on HTMLElement instead of HTMLDocument


Sub Scrape()    Dim Browser As InternetExplorer    Dim Document As htmlDocument    Dim Elements As IHTMLElementCollection    Dim Element As IHTMLElement    Set Browser = New InternetExplorer    Browser.Visible = True    Browser.navigate "http://www.stackoverflow.com"    Do While Browser.Busy And Not Browser.readyState = READYSTATE_COMPLETE        DoEvents    Loop    Set Document = Browser.Document    Set Elements = Document.getElementById("hmenus").getElementsByTagName("li")    For Each Element In Elements        Debug.Print Element.innerText        'Questions        'Tags        'Users        'Badges        'Unanswered        'Ask Question    Next Element    Set Document = Nothing    Set Browser = NothingEnd Sub


I don't like it either.

So use javascript:

Public Function GetJavaScriptResult(doc as HTMLDocument, jsString As String) As String    Dim el As IHTMLElement    Dim nd As HTMLDOMTextNode    Set el = doc.createElement("INPUT")    Do        el.ID = GenerateRandomAlphaString(100)    Loop Until Document.getElementById(el.ID) Is Nothing    el.Style.display = "none"    Set nd = Document.appendChild(el)    doc.parentWindow.ExecScript "document.getElementById('" & el.ID & "').value = " & jsString    GetJavaScriptResult = Document.getElementById(el.ID).Value    Document.removeChild ndEnd FunctionFunction GenerateRandomAlphaString(Length As Long) As String    Dim i As Long    Dim Result As String    Randomize Timer    For i = 1 To Length        Result = Result & Chr(Int(Rnd(Timer) * 26 + 65 + Round(Rnd(Timer)) * 32))    Next i    GenerateRandomAlphaString = ResultEnd Function

Let me know if you have any problems with this; I've changed the context from a method to a function.

By the way, what version of IE are you using? I suspect you're on < IE8. If you upgrade to IE8 I presume it'll update shdocvw.dll to ieframe.dll and you will be able to use document.querySelector/All.

Edit

Comment response which isn't really a comment:Basically the way to do this in VBA is to traverse the child nodes. The problem is you don't get the correct return types. You could fix this by making your own classes that (separately) implement IHTMLElement and IHTMLElementCollection; but that's WAY too much of a pain for me to do it without getting paid :). If you're determined, go and read up on the Implements keyword for VB6/VBA.

Public Function getSubElementsByTagName(el As IHTMLElement, tagname As String) As Collection    Dim descendants As New Collection    Dim results As New Collection    Dim i As Long    getDescendants el, descendants    For i = 1 To descendants.Count        If descendants(i).tagname = tagname Then            results.Add descendants(i)        End If    Next i    getSubElementsByTagName = resultsEnd FunctionPublic Function getDescendants(nd As IHTMLElement, ByRef descendants As Collection)    Dim i As Long    descendants.Add nd    For i = 1 To nd.Children.Length        getDescendants nd.Children.Item(i), descendants    Next iEnd Function


Thanks to dee for the answer above with the Scrape() subroutine. The code worked perfectly as written, and I was able to then convert the code to work with the specific website I am trying to scrape.

I do not have enough reputation to upvote or to comment, but I do actually have some minor improvements to add to dee's answer:

  1. You will need to add the VBA Reference via "Tools\References" to "Microsoft HTML Object Library in order for the code to compile.

  2. I commented out the Browser.Visible line and added the comment as follows

    'if you need to debug the browser page, uncomment this line:'Browser.Visible = True
  3. And I added a line to close the browser before Set Browser = Nothing:

    Browser.Quit

Thanks again dee!

ETA: this works on machines with IE9, but not machines with IE8. Anyone have a fix?

Found the fix myself, so came back here to post it. The ClassName function is available in IE9. For this to work in IE8, you use querySelectorAll, with a dot preceding the class name of the object you are looking for:

'Set repList = doc.getElementsByClassName("reportList") 'only works in IE9, not in IE8Set repList = doc.querySelectorAll(".reportList")       'this works in IE8+