Extracting text from MS Word files in Python


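Use the python-docx package (a third-party library, installed with pip install python-docx and imported as docx). It reads .docx files, not the older binary .doc format. Here's how to extract all the text from a document: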

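    import docx  # provided by the python-docx package

    # filename is the path to a .docx file
    document = docx.Document(filename)
    # Join the text of every body paragraph, separated by blank lines
    docText = '\n\n'.join(paragraph.text for paragraph in document.paragraphs)
    print(docText)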

See the python-docx site for details.

Also check out Textract, which pulls out tables, etc.

Parsing XML with regexes invokes Cthulhu. Don't do it!


You could make a subprocess call to antiword. Antiword is a Linux command-line utility for dumping the text out of a Word document. It works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as an RPM, or you could compile it yourself.
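For what it's worth, here is a minimal sketch of that subprocess call. The function names antiword_command and doc_to_text are made up for illustration, and it assumes antiword is on your PATH and prints the text to stdout as UTF-8 (the actual encoding depends on your locale and antiword's character mapping, so adjust as needed):

```python
import shutil
import subprocess


def antiword_command(path, executable="antiword"):
    """Build the argument list for the call (hypothetical helper)."""
    return [executable, str(path)]


def doc_to_text(path, executable="antiword"):
    """Return the text that antiword prints for the given Word file."""
    # Fail early with a clear message if the tool is not installed.
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} was not found on PATH")
    result = subprocess.run(
        antiword_command(path, executable),
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

You would call it as doc_to_text('/path/to/file/mydocument.doc'). Passing the arguments as a list rather than a shell string means file names with spaces or quotes don't need escaping.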


benjamin's answer is a pretty good one. I have just consolidated...

    import re
    import zipfile

    docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
    content = docx.read('word/document.xml').decode('utf-8')
    # Strip every XML tag, leaving only the text between the tags
    cleaned = re.sub('<(.|\n)*?>', '', content)
    print(cleaned)
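If you'd rather not run a regex over the XML, the same idea can be sketched with the standard library's XML parser. This is only a rough variant of the snippet above, not a full converter: docx_to_text is a made-up name, and it assumes the body text lives in w:t elements grouped under w:p paragraph elements in word/document.xml, so headers, footnotes and the like are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    # A .docx file is a zip archive; the main body is word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each paragraph is split into runs; the text sits in w:t elements.
        pieces = [node.text for node in para.iter(W_NS + "t") if node.text]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Unlike stripping tags, this keeps paragraph boundaries and lets the parser decode entities such as &amp;amp; for you.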