html parsing of cricinfo scorecards html parsing of cricinfo scorecards vba vba

html parsing of cricinfo scorecards


There are 2 techniques that I use for "VBA". I will describe them 1 by one.

1) Using FireFox / Firebug Addon / Fiddler

2) Using Excel's inbuilt facility to get data from the web

Since this post will be read by many so I will even cover the obvious. Please feel free to skip whatever part you know


1) Using FireFox / Firebug Addon / Fiddler


FireFox : http://en.wikipedia.org/wiki/Firefox Free download (http://www.mozilla.org/en-US/firefox/new/)

Firebug Addon: http://en.wikipedia.org/wiki/Firebug_%28software%29 Free download (https://addons.mozilla.org/en-US/firefox/addon/firebug/)

Fiddler : http://en.wikipedia.org/wiki/Fiddler_%28software%29 Free download (http://www.fiddler2.com/fiddler2/)

Once you have installed Firefox, install the Firebug Addon. The Firebug Addon lets you inspect the different elements in a webpage. For example if you want to know the name of a button, simply right click on it and click on "Inspect Element with Firebug" and it will give you all the details that you will need for that button.

enter image description here

Another example would be finding the name of a table on a website which has the data that you need scrapped.

I use Fiddler only when I am using XMLHTTP. It helps me to see the exact info being passed when you click on a button. Because of the increase in the number of BOTS which scrape the sites, most sites now, to prevent automatic scrapping, capture your mouse coordinates and pass that information and fiddler actually helps you in debugging that info that is being passed. I will not get into much details here about it as this info can be used maliciously.

Now let's take a simple example on how to scrape the URL posted in your question

http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html

First let's find the name of the table which has that info. Simply right click on the table and click on "Inspect Element with Firebug" and it will give you the below snapshot.

enter image description here

So now we know that our data is stored in a table called "inningsBat1" If we can extract the contents of that table to an Excel file then we can definitely work with the data to do our analysis. Here is sample code which will dump that table in Sheet1

Before we proceed, I would recommend, closing all Excel and starting a fresh instance.

Launch VBA and insert a Userform. Place a command button and a webcrowser control. Your Userform might look like this

enter image description here

Paste this code in the Userform code area

Option Explicit'~~> Set Reference to Microsoft HTML Object LibraryPrivate Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)Private Sub CommandButton1_Click()    Dim URL As String    Dim oSheet As Worksheet    Set oSheet = Sheets("Sheet1")    URL = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html"    PopulateDataSheets oSheet, URL    MsgBox "Data Scrapped. Please check " & oSheet.NameEnd SubPublic Sub PopulateDataSheets(wsk As Worksheet, URL As String)    Dim tbl As HTMLTable    Dim tr As HTMLTableRow    Dim insertRow As Long, Row As Long, col As Long    On Error GoTo whoa    WebBrowser1.navigate URL    WaitForWBReady    Set tbl = WebBrowser1.Document.getElementById("inningsBat1")    With wsk        .Cells.Clear        insertRow = 0        For Row = 0 To tbl.Rows.Length - 1            Set tr = tbl.Rows(Row)            If Trim(tr.innerText) <> "" Then                If tr.Cells.Length > 2 Then                    If tr.Cells(1).innerText <> "Total" Then                        insertRow = insertRow + 1                        For col = 0 To tr.Cells.Length - 1                            .Cells(insertRow, col + 1) = tr.Cells(col).innerText                        Next                    End If                End If            End If        Next    End Withwhoa:    Unload MeEnd SubPrivate Sub Wait(ByVal nSec As Long)    nSec = nSec + Timer    While Timer < nSec       DoEvents        Sleep 100    WendEnd SubPrivate Sub WaitForWBReady()    Wait 1    While WebBrowser1.ReadyState <> 4        Wait 3    WendEnd Sub

Now run your Userform and click on the Command button. You will notice that the data is dumped in Sheet1. See snapshot

enter image description here

Similarly you can scrape other info as well.


2) Using Excel's inbuilt facility to get data from the web


I believe you are using Excel 2007 so I will take that as an example to scrape the above mentioned link.

Navigate to Sheet2. Now navigate to Data Tab and click on the button "From Web" on the extreme right. See snapshot.

enter image description here

Enter the url in the "New Web Query Window" and click on "Go"

Once the page is uploaded, select the relevant table that you want to import by clicking on the small arrow as shown in the snapshot. Once done, click on "Import"

enter image description here

Excel will then ask you where you want the data to be imported. Select the relevant cell and click on OK. And you are done! The data will be imported to the cell which you specified.

If you wish you can record a macro and automate this as well :)

Here is the macro that I recorded.

Sub Macro1()    With ActiveSheet.QueryTables.Add(Connection:= _    "URL;http://www.espncricinfo.com/big-bash-league-2011/engine/match/524915.html" _    , Destination:=Range("$A$1"))        .Name = "524915"        .FieldNames = True        .RowNumbers = False        .FillAdjacentFormulas = False        .PreserveFormatting = True        .RefreshOnFileOpen = False        .BackgroundQuery = True        .RefreshStyle = xlInsertDeleteCells        .SavePassword = False        .SaveData = True        .AdjustColumnWidth = True        .RefreshPeriod = 0        .WebSelectionType = xlSpecifiedTables        .WebFormatting = xlWebFormattingNone        .WebTables = """inningsBat1"""        .WebPreFormattedTextToColumns = True        .WebConsecutiveDelimitersAsOne = True        .WebSingleBlockTextImport = False        .WebDisableDateRecognition = False        .WebDisableRedirections = False        .Refresh BackgroundQuery:=False    End WithEnd Sub

Hope this helps. Let me know if you still have some queries.

Sid


For anyone else interested in this I ended up using the code below based on Siddhart Rout's earlier answer

  • XMLHttp was significantly quicker than automating IE
  • the code generates a CSV file for each series to be dowloaded (held in the X variable)
  • the code dumps each match to a regular 29 row range (regardless of how many players batted) to facillitate easier analysis later on

enter image description here

    Public Sub PopulateDataSheets_XML()    Dim URL As String    Dim ws As Worksheet    Dim lngRow As Long    Dim lngRecords As Long    Dim lngWrite As Long    Dim lngSpare As Long    Dim lngInnings As Long    Dim lngRow1 As Long    Dim X(1 To 15, 1 To 4) As String    Dim objFSO As Object    Dim objTF As Object    Dim xmlHttp As Object    Dim htmldoc As HTMLDocument    Dim htmlbody As htmlbody    Dim tbl As HTMLTable    Dim tr As HTMLTableRow    Dim strInnings As String    s = Timer()    Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")    Set objFSO = CreateObject("scripting.filesystemobject")    X(1, 1) = "http://www.espncricinfo.com/indian-premier-league-2011/engine/match/"    X(1, 2) = 501198    X(1, 3) = 501271    X(1, 4) = "indian-premier-league-2011"    X(2, 1) = "http://www.espncricinfo.com/big-bash-league-2011/engine/match/"    X(2, 2) = 524915    X(2, 3) = 524945    X(2, 4) = "big-bash-league-2011"    X(3, 1) = "http://www.espncricinfo.com/ausdomestic-2010/engine/match/"    X(3, 2) = 461028    X(3, 3) = 461047    X(3, 4) = "big-bash-league-2010"    Set htmldoc = New HTMLDocument    Set htmlbody = htmldoc.body    For lngRow = 1 To UBound(X, 1)        If Len(X(lngRow, 1)) = 0 Then Exit For        Set objTF = objFSO.createtextfile("c:\temp\" & X(lngRow, 4) & ".csv")        For lngRecords = X(lngRow, 2) To X(lngRow, 3)            URL = X(lngRow, 1) & lngRecords & ".html"            xmlHttp.Open "GET", URL            xmlHttp.send            Do While xmlHttp.Status <> 200                DoEvents            Loop            htmlbody.innerHTML = xmlHttp.responseText            objTF.writeline X(lngRow, 1) & lngRecords & ".html"            For lngInnings = 1 To 2            strInnings = "Innings " & lngInnings                objTF.writeline strInnings                Set tbl = Nothing                On Error Resume Next                Set tbl = htmlbody.Document.getElementById("inningsBat" & lngInnings)                On Error GoTo 0                If Not tbl Is Nothing Then                    lngWrite = 0                    For lngRow1 = 0 To tbl.Rows.Length - 1                        Set tr = tbl.Rows(lngRow1)                        If Trim(tr.innerText) <> vbNewLine Then                            If tr.Cells.Length > 2 Then                                If tr.Cells(1).innerText <> "Extras" Then                                    If Len(tr.Cells(1).innerText) > 0 Then                                        objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)                                        lngWrite = lngWrite + 1                                    End If                                Else                                    objTF.writeline strInnings & "-" & lngWrite & "," & Trim(tr.Cells(1).innerText) & "," & Trim(tr.Cells(3).innerText)                                    lngWrite = lngWrite + 1                                    Exit For                                End If                            End If                        End If                    Next                    For lngSpare = 12 To lngWrite Step -1                        objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)                    Next                Else                    For lngSpare = 1 To 13                        objTF.writeline strInnings & "-" & lngWrite + (12 - lngSpare)                    Next                End If            Next        Next    Next    'Call ConsolidateSheetsEnd Sub


RegEx is not a complete solution for parsing HTML because it is not guaranteed to be regular.

You should use the HtmlAgilityPack to query the HTML. This will allow you to use the CSS selectors to query the HTML similar to how you do it with jQuery.