Extracting text from MS Word files in Python
Use the python-docx package (installed with pip install python-docx, imported as docx). Here's how to extract all the text from a .docx file:
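    import docx  # provided by the python-docx package

    document = docx.Document(filename)  # filename: path to a .docx file
    # Join the text of every paragraph, separated by blank lines.
    docText = '\n\n'.join(paragraph.text for paragraph in document.paragraphs)
    print(docText)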
See the python-docx documentation for more details.
Also check out Textract, which can pull out tables and other content as well.
Parsing XML with regexes invokes Cthulhu. Don't do it!
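
If you'd rather not pull in a dependency, a real XML parser from the standard library does the job where a regex would not. A .docx file is a zip archive whose main body lives in word/document.xml, so something like the rough sketch below should work for simple documents. The helper name docx_paragraphs is my own (hypothetical, not part of any library), and it makes simplifying assumptions: it ignores headers, footers and footnotes, and a paragraph nested inside another (a text box, say) would have its text counted twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in a .docx file.

    Hypothetical helper for illustration; accepts a path or a file-like object.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is spread over one or more <w:t> elements.
        paragraphs.append(
            "".join(node.text or "" for node in para.iter(W_NS + "t"))
        )
    return paragraphs
```

Calling print('\n\n'.join(docx_paragraphs(filename))) then gives roughly the same output as the python-docx version, though python-docx is still the better choice once you need styles, tables as tables, or anything beyond plain text.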