Extract Data from PDF and Add to Worksheet Extract Data from PDF and Add to Worksheet vba vba

Extract Data from PDF and Add to Worksheet


You can open the PDF file and extract its contents using the Adobe library (which I believe you can download from Adobe as part of the SDK, but it comes with certain versions of Acrobat as well)

Make sure to add the Library to your references too (On my machine it is the Adobe Acrobat 10.0 Type Library, but not sure if that is the newest version)

Even with the Adobe library it is not trivial (you'll need to add your own error-trapping etc):

Function getTextFromPDF(ByVal strFilename As String) As String   Dim objAVDoc As New AcroAVDoc   Dim objPDDoc As New AcroPDDoc   Dim objPage As AcroPDPage   Dim objSelection As AcroPDTextSelect   Dim objHighlight As AcroHiliteList   Dim pageNum As Long   Dim strText As String   strText = ""   If (objAvDoc.Open(strFilename, "") Then      Set objPDDoc = objAVDoc.GetPDDoc      For pageNum = 0 To objPDDoc.GetNumPages() - 1         Set objPage = objPDDoc.AcquirePage(pageNum)         Set objHighlight = New AcroHiliteList         objHighlight.Add 0, 10000 ' Adjust this up if it's not getting all the text on the page         Set objSelection = objPage.CreatePageHilite(objHighlight)         If Not objSelection Is Nothing Then            For tCount = 0 To objSelection.GetNumText - 1               strText = strText & objSelection.GetText(tCount)            Next tCount         End If      Next pageNum      objAVDoc.Close 1   End If   getTextFromPDF = strTextEnd Function

What this does is essentially the same thing you are trying to do - only using Adobe's own library. It's going through the PDF one page at a time, highlighting all of the text on the page, then dropping it (one text element at a time) into a string.

Keep in mind what you get from this could be full of all kinds of non-printing characters (line feeds, newlines, etc) that could even end up in the middle of what look like contiguous blocks of text, so you may need additional code to clean it up before you can use it.

Hope that helps!


I know this is an old issue but I just had to do this for a project at work, and I am very surprised that nobody has thought of this solution yet:Just open the .pdf with Microsoft word.

The code is a lot easier to work with when you are trying to extract data from a .docx because it opens in Microsoft Word. Excel and Word play well together because they are both Microsoft programs. In my case, the file of question had to be a .pdf file. Here's the solution I came up with:

  1. Choose the default program to open .pdf files to be Microsoft Word
  2. The first time you open a .pdf file with word, a dialogue box pops up claiming word will need to convert the .pdf into a .docx file. Click the check box in the bottom left stating "do not show this message again" and then click OK.
  3. Create a macro that extracts data from a .docx file. I used MikeD's Code as a resource for this.
  4. Tinker around with the MoveDown, MoveRight, and Find.Execute methods to fit the need of your task.

Yes you could just convert the .pdf file to a .docx file but this is a much simpler solution in my opinion.


Over time, I have found that extracting text from PDFs in a structured format is tough business. However if you are looking for an easy solution, you might want to consider XPDF tool pdftotext.

Pseudocode to extract the text would include:

  1. Using SHELL VBA statement to extract the text from PDF to a temporary file using XPDF
  2. Using sequential file read statements to read the temporary file contents into a string
  3. Pasting the string into Excel

Simplified example below:

    Sub ReadIntoExcel(PDFName As String)        'Convert PDF to text        Shell "C:\Utils\pdftotext.exe -layout " & PDFName & " tempfile.txt"        'Read in the text file and write to Excel        Dim TextLine as String        Dim RowNumber as Integer        Dim F1 as Integer        RowNumber = 1        F1 = Freefile()        Open "tempfile.txt" for Input as #F1            While Not EOF(#F1)                Line Input #F1, TextLine                ThisWorkbook.WorkSheets(1).Cells(RowNumber, 1).Value = TextLine                RowNumber = RowNumber + 1            Wend        Close #F1    End Sub