read an MSWord file into R

First, readLines() is not the right tool here: a Word file is a binary format, not a plain (ASCII) text file.
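To see why readLines() struggles, it can help to look at the first few bytes of the file. The sketch below is only a rough heuristic in base R: the function name sniff_file_type() is made up for illustration, and a ZIP signature only suggests a .docx, since other formats use ZIP containers too.

```r
# Classify a file by its leading "magic" bytes (rough heuristic only).
# sniff_file_type() is a hypothetical helper, not part of any package.
sniff_file_type <- function(path) {
  bytes <- readBin(path, what = "raw", n = 8)

  zip_magic <- as.raw(c(0x50, 0x4B, 0x03, 0x04))  # "PK..": ZIP container, used by .docx
  ole_magic <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0))  # OLE compound file, used by legacy .doc
  pdf_magic <- as.raw(c(0x25, 0x50, 0x44, 0x46))  # "%PDF"

  starts_with <- function(magic) {
    length(bytes) >= length(magic) &&
      identical(bytes[seq_along(magic)], magic)
  }

  if (starts_with(zip_magic)) {
    "zip container (possibly .docx)"
  } else if (starts_with(ole_magic)) {
    "OLE compound file (possibly legacy .doc)"
  } else if (starts_with(pdf_magic)) {
    "PDF"
  } else {
    "other (possibly plain text)"
  }
}

# Demonstration on a throwaway plain-text file
tmp <- tempfile(fileext = ".txt")
writeLines(c("id value", "1 3.5"), tmp)
sniff_file_type(tmp)
```

Only files that fall into the last category are reasonable candidates for readLines().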

The Word-related function in the tm package is readDOC(), but both it and the third-party tool it requires (Antiword) target older Word files (up to Word 2003) and will not work with newer .docx files.

The best I can suggest is to try readPDF(), also found in the tm package. Note that it requires the pdftotext tool to be installed on your system. That is easy on Linux; I have no idea about Windows. Alternatively, find a Windows tool that converts PDF to plain ASCII text files (not Word files). The output should open and display correctly in Notepad on Windows; then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.
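If you would rather not go through tm, roughly the same conversion can be sketched by calling pdftotext directly from base R. This assumes pdftotext is installed and on the PATH; pdf_to_lines() is a hypothetical helper name. Note that pdftotext does no OCR, so a scanned PDF without a text layer will probably come back empty.

```r
# Convert a PDF to plain text with the external pdftotext tool, then read it.
# Assumption: pdftotext is installed and on the PATH.
# pdf_to_lines() is a hypothetical helper, not part of any package.
pdf_to_lines <- function(pdf_path, keep_layout = TRUE) {
  exe <- unname(Sys.which("pdftotext"))
  if (!nzchar(exe)) {
    stop("pdftotext was not found on the PATH")
  }

  txt_path <- tempfile(fileext = ".txt")

  # -layout asks pdftotext to preserve the physical layout, which tends
  # to keep table columns aligned
  args <- c(if (keep_layout) "-layout", shQuote(pdf_path), shQuote(txt_path))

  status <- system2(exe, args)
  if (status != 0 || !file.exists(txt_path)) {
    stop("pdftotext failed for ", pdf_path)
  }

  readLines(txt_path, warn = FALSE)
}

# Usage, with a made-up file name:
# lines <- pdf_to_lines("report.pdf")
```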

Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.

In case it helps anyone else, it appears there's a new package dedicated specifically to reading text data, including Word files (also the newer .docx format).

I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.

  1. I converted the PDF to MSWord with Acrobat X Pro.

  2. The original tables had solid vertical lines separating the columns. It turns out these lines were disrupting the format of the data when I converted the MSWord file to a text file, but I was able to delete them from the MSWord file before creating the text file.

  3. I converted the MSWord file to a text file after deleting the vertical lines in Step 2.

  4. The resulting text files still require extensive editing, but at least the data are largely present in a format R can read, so I will not have to re-enter all the data in the PDFs by hand, which saves many hours of work.
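Some of the remaining editing can be done in R itself. The sketch below is only an illustration: the sample lines, the column names and the row pattern are all made up, and real files will need their own pattern. It keeps the lines that look like data rows and passes them to read.table().

```r
# Hypothetical lines, standing in for readLines("converted_table.txt")
raw_lines <- c(
  "Table 1. Counts by site",
  "",
  "Site   Year   Count",
  "A      1998   12",
  "B      1998   7",
  "Page 3",
  "C      1999   15"
)

# Keep only lines shaped like a data row:
# a label, a four-digit year, and an integer count
is_row <- grepl("^\\S+\\s+\\d{4}\\s+\\d+\\s*$", raw_lines, perl = TRUE)

# Parse the surviving lines as whitespace-separated columns
dat <- read.table(
  text = raw_lines[is_row],
  col.names = c("site", "year", "count"),
  stringsAsFactors = FALSE
)

str(dat)
```

Titles, blank lines, the header row and page markers are dropped because they do not match the pattern, so it is worth checking which lines were discarded before trusting the result.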