Best way to extract text from a Word doc without using COM/automation?

python ms-word

(Same answer as extracting text from MS word files in python)

Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:

    document = opendocx('Hello world.docx')

    # This location is where most document content lives
    docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

    # Extract all text
    print getdocumenttext(document)

See Python DocX site

100% Python, no COM, no .NET, no Java, no parsing serialized XML with regexes.


I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped this in Python functions, so it is easy to use from the parsing system (which is written in Python).

    import os

    def doc_to_text_catdoc(filename):
        (fi, fo, fe) = os.popen3('catdoc -w "%s"' % filename)
        fi.close()
        retval = fo.read()
        erroroutput = fe.read()
        fo.close()
        fe.close()
        if not erroroutput:
            return retval
        else:
            raise OSError("Executing the command caused an error: %s" % erroroutput)

    # similar doc_to_text_antiword()

The -w switch to catdoc turns off line wrapping, BTW.
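Note that os.popen3 only exists in Python 2. On Python 3.7 or later, the same idea might look roughly like the sketch below. It assumes catdoc (or antiword) is installed and on the PATH; run_text_converter is a hypothetical helper name introduced here, not part of any library, and treating a non-zero exit status as an error is an addition to the original logic.

```python
import subprocess


def run_text_converter(argv):
    """Run an external converter and return whatever it writes to stdout.

    Hypothetical helper: argv is the full command line as a list, so the
    filename never passes through a shell and needs no quoting.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    # Like the original function, treat any stderr output as a failure.
    # A non-zero exit status is treated the same way (an assumption).
    if result.returncode != 0 or result.stderr:
        raise OSError("Executing the command caused an error: %s" % result.stderr)
    return result.stdout


def doc_to_text_catdoc(filename):
    # -w turns off catdoc's line wrapping
    return run_text_converter(["catdoc", "-w", filename])


def doc_to_text_antiword(filename):
    # Assumes antiword is installed; called with its default options
    return run_text_converter(["antiword", filename])
```

If the tool is not installed, subprocess.run raises FileNotFoundError, which is a subclass of OSError, so callers can catch one exception type for both cases.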


If all you want to do is extract text from Word files (.docx), you can do it with Python alone. As Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this:

    try:
        from xml.etree.cElementTree import XML
    except ImportError:
        from xml.etree.ElementTree import XML
    import zipfile

    """
    Module that extracts text from MS XML Word document (.docx).
    (Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
    """

    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'


    def get_docx_text(path):
        """
        Take the path of a docx file as argument, return the text in unicode.
        """
        document = zipfile.ZipFile(path)
        xml_content = document.read('word/document.xml')
        document.close()
        tree = XML(xml_content)

        paragraphs = []
        # iter() replaces getiterator(), which was removed in Python 3.9
        for paragraph in tree.iter(PARA):
            texts = [node.text
                     for node in paragraph.iter(TEXT)
                     if node.text]
            if texts:
                paragraphs.append(''.join(texts))

        return '\n\n'.join(paragraphs)
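To try the unzip-and-parse approach without a real Word file, the sketch below builds a tiny in-memory archive that contains only word/document.xml and runs the same kind of extraction on it. extract_paragraphs and make_minimal_docx are hypothetical names made up for this illustration, and the archive is deliberately incomplete: it is enough for the parser, but it is not a valid .docx that Word could open.

```python
import io
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def extract_paragraphs(source):
    """Return the text of each non-empty paragraph.

    source may be a path or a file-like object; zipfile accepts both.
    """
    with zipfile.ZipFile(source) as archive:
        tree = XML(archive.read('word/document.xml'))
    paragraphs = []
    for paragraph in tree.iter(W + 'p'):
        # A paragraph's text is split across runs; join the w:t pieces.
        text = ''.join(node.text for node in paragraph.iter(W + 't') if node.text)
        if text:
            paragraphs.append(text)
    return paragraphs


def make_minimal_docx(paragraph_runs):
    """Build a stripped-down test archive: one list of run strings per paragraph.

    Run text is inserted without XML escaping, so keep it free of < and &.
    """
    body = ''.join(
        '<w:p>%s</w:p>' % ''.join('<w:r><w:t>%s</w:t></w:r>' % run for run in runs)
        for runs in paragraph_runs
    )
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body>%s</w:body></w:document>' % body
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', xml)
    buffer.seek(0)
    return buffer


sample = make_minimal_docx([['Hello, ', 'world'], [], ['Second paragraph']])
print(extract_paragraphs(sample))
```

This prints ['Hello, world', 'Second paragraph']: the two runs of the first paragraph are joined, and the empty paragraph is skipped, which matches the behaviour of the function above.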
