How to read the contents of a table in an MS Word file using Python?


Jumping in rather late in life, but thought I'd put this out anyway: now (2015), you can use the pretty neat python-docx library: https://python-docx.readthedocs.org/en/latest/. And then:

from docx import Document

wordDoc = Document('<path to docx file>')
for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)
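If you'd rather get the table contents back as data instead of printing them, something along these lines should work. It is only a sketch: `table_to_rows` and `read_docx_tables` are names I made up, and the code relies on nothing beyond the `tables`, `rows`, `cells` and `text` attributes used above.

```python
def table_to_rows(table):
    """Return one table as a list of rows, each row a list of cell strings.

    Works on any object that exposes .rows, .cells and .text the way the
    table objects in the loop above do.
    """
    return [[cell.text for cell in row.cells] for row in table.rows]


def read_docx_tables(path):
    """Hypothetical helper: read every table in the .docx file at `path`."""
    # Imported here so table_to_rows stays usable without python-docx installed.
    from docx import Document

    return [table_to_rows(table) for table in Document(path).tables]
```

Calling `read_docx_tables('<path to docx file>')` would then give one nested list per table, which is easy to hand to `csv.writer` or similar.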


Here is what works for me in Python 2.7:

import win32com.client as win32

word = win32.Dispatch("Word.Application")
word.Visible = 0  # keep the Word window hidden
word.Documents.Open("MyDocument")  # path to your document
doc = word.ActiveDocument

To see how many tables your document has:

doc.Tables.Count

Then you can select the table you want by its index. Note that, unlike Python, COM indexing starts at 1:

table = doc.Tables(1)

To select a cell:

table.Cell(Row=1, Column=1)

To get its content:

table.Cell(Row=1, Column=1).Range.Text
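One thing to watch for: the string that comes back from `Range.Text` for a cell normally ends with Word's end-of-cell marker (a carriage return followed by `\x07`), so you usually want to strip it. Here is a rough sketch of that, plus a loop that reads a whole table using the 1-based indexing mentioned above. `clean_cell_text` and `com_table_to_rows` are hypothetical names, and the loop assumes a regular grid without merged cells.

```python
CELL_END = "\r\x07"  # end-of-cell marker Word appends to a cell's Range.Text


def clean_cell_text(raw):
    """Strip the trailing end-of-cell marker from a cell's text.

    rstrip() removes any run of trailing '\r' and '\x07' characters, so
    empty paragraphs at the end of a cell are dropped as well.
    """
    return raw.rstrip(CELL_END)


def com_table_to_rows(table):
    """Read a Word COM table into a list of rows of cleaned cell strings.

    COM indexes from 1, so both loops run from 1 to Count inclusive.
    Assumes a regular grid; Cell() can fail on tables with merged cells.
    """
    rows = []
    for r in range(1, table.Rows.Count + 1):
        row = []
        for c in range(1, table.Columns.Count + 1):
            row.append(clean_cell_text(table.Cell(Row=r, Column=c).Range.Text))
        rows.append(row)
    return rows
```

With the `doc` object from above, `com_table_to_rows(doc.Tables(1))` would give the first table as a nested list.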

Hope that this helps.

EDIT:

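An example of a function that returns a column's index based on its heading:

def Column_index(header_text):
    # COM indexes from 1, so scan columns 1..Count of the header row.
    for i in range(1, table.Columns.Count + 1):
        # Range.Text ends with Word's end-of-cell marker ('\r\x07'); strip it before comparing.
        if table.Cell(Row=1, Column=i).Range.Text.rstrip('\r\x07') == header_text:
            return i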

Then you can access the cell you want like this, for example:

table.Cell(Row=1, Column=Column_index("The Column Header")).Range.Text
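If you need several columns, it may be simpler to scan the header row once and keep the result in a dict, rather than looping over the columns on every lookup. A possible sketch follows; `header_columns` and its `header_row` parameter are my own invention, not part of the Word object model.

```python
def header_columns(table, header_row=1):
    """Map each header text in `header_row` to its 1-based column index."""
    columns = {}
    for i in range(1, table.Columns.Count + 1):
        # Drop the end-of-cell marker Word appends to a cell's text.
        text = table.Cell(Row=header_row, Column=i).Range.Text.rstrip("\r\x07")
        # If the same header text appears twice, keep the first column.
        columns.setdefault(text, i)
    return columns
```

After `cols = header_columns(table)`, a cell in row 2 could be read with `table.Cell(Row=2, Column=cols["The Column Header"]).Range.Text`. A missing header raises `KeyError` here instead of silently yielding `None`.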


I found a simple code snippet in a blog post, "Reading Table Contents Using Python" by etienne.

The great thing about this is that you don't need any non-standard Python libraries installed.

The format of a docx file is described by the Office Open XML standard.

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

# A .docx file is a zip archive; the main body lives in word/document.xml.
with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            # node.text is None for an empty text element, hence the "or ''".
            print(''.join(node.text or '' for node in cell.iter(TEXT)))
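The same idea can be wrapped in a function that returns the tables as nested lists instead of printing them. This is a sketch, not the blog's code: `read_tables` is a name I picked, and joining the paragraphs of a cell with newlines is my own choice. Like the snippet above, it uses `iter()`, which also descends into nested tables, so their rows and cells get folded into the outer table.

```python
import zipfile
import xml.etree.ElementTree as ET

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_tables(docx_file):
    """Return every table in a .docx as a list of rows of cell strings.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))

    tables = []
    for table in root.iter(W + 'tbl'):
        rows = []
        for row in table.iter(W + 'tr'):
            cells = []
            for cell in row.iter(W + 'tc'):
                # One string per paragraph; paragraphs joined by newlines.
                paragraphs = [
                    ''.join(t.text or '' for t in p.iter(W + 't'))
                    for p in cell.iter(W + 'p')
                ]
                cells.append('\n'.join(paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

`read_tables('<path to docx file>')[0]` would then be the first table, as a list of rows.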