How do I extract the text content from a word document with PHP?


Try creating the reader first, and check that it can read the file before loading it:

$source = "word.doc";

// create your reader object
$phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');

// read source
if ($phpWordReader->canRead($source)) {
    $phpWord = $phpWordReader->load($source);
    ... // rest of your code
}

This answer is based on a PHPWord example and the API documentation.


Rather than checking each element class for text individually, you can test whether the element has a getText() method:

// start with an empty string so the .= below does not hit an undefined variable
$text = '';
$sections = $phpWord->getSections();
foreach ($sections as $s) {
    $els = $s->getElements();
    /** @var ElementTest $e */
    foreach ($els as $e) {
        $class = get_class($e);
        if (method_exists($class, 'getText')) {
            $text .= $e->getText();
        } else {
            $text .= "\n";
        }
    }
}
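The loop above only looks at the direct children of each section. If some elements are containers that hold further elements, the same getText() check can be applied recursively. The following is a minimal sketch under that assumption, not part of PHPWord itself: collectText() is a hypothetical helper that relies only on duck typing (getElements() for containers, getText() for leaves), and it assumes getText() returns either a string or another element.

```php
<?php
declare(strict_types=1);

/**
 * Recursively collect text from an element tree.
 * Hypothetical helper: it only assumes that containers expose
 * getElements() and that text-bearing elements expose getText().
 */
function collectText(object $node): string
{
    // Containers: walk the children, then end the block with a newline.
    if (method_exists($node, 'getElements')) {
        $text = '';
        foreach ($node->getElements() as $child) {
            $text .= collectText($child);
        }
        return $text . "\n";
    }

    // Leaves with text: getText() may return a string or another element.
    if (method_exists($node, 'getText')) {
        $value = $node->getText();
        return is_object($value) ? collectText($value) : (string) $value;
    }

    // Anything else contributes a line break, as in the loop above.
    return "\n";
}
```

With a loaded document you would call it once per section, for example by appending collectText($s) to $text inside the foreach over $phpWord->getSections(). Elements that expose their children through other methods are not covered by this sketch.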


You can extract text from a Word document using catdoc: http://www.wagner.pp.ru/~vitus/software/catdoc/

It can be installed on Ubuntu using

sudo apt-get install catdoc

Once you have catdoc working on your system, you can call it from PHP using shell_exec():

<?php
$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');
print $text;
?>

Be sure to substitute (fullpath) with the actual paths to catdoc and to your Word document.
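If the document path comes from user input or may contain spaces, it is worth quoting it before it reaches the shell. Here is a small sketch of that idea using PHP's escapeshellarg(). The function names buildCatdocCommand() and extractWithCatdoc() are made up for illustration, and the paths are whatever you would otherwise put in place of (fullpath).

```php
<?php
declare(strict_types=1);

/**
 * Build a catdoc command line with both paths quoted for the shell.
 */
function buildCatdocCommand(string $catdocPath, string $docPath): string
{
    return escapeshellarg($catdocPath) . ' ' . escapeshellarg($docPath);
}

/**
 * Run catdoc on a file and return its output, or null on failure.
 */
function extractWithCatdoc(string $catdocPath, string $docPath): ?string
{
    // Do not start a process for a file that cannot be read.
    if (!is_readable($docPath)) {
        return null;
    }

    $output = shell_exec(buildCatdocCommand($catdocPath, $docPath));

    // shell_exec() gives null or false when the command fails or prints nothing.
    return is_string($output) ? $output : null;
}
```

You would then check the result for null before printing it, instead of printing the return value of shell_exec() directly.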

EDIT ---- Addition

If you can save your files as .docx rather than .doc, it is a little bit easier. A .docx file is a zip archive whose main text is stored in word/document.xml, so you can use unzip rather than catdoc.

Simply replace:

$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');

with

$text = shell_exec("/(fullpath)/unzip -p /(fullpath)/word.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");

You could use this same technique with most other command-line document-to-text converters: just replace the command in shell_exec() with one that works on your system. For other Unix/Linux alternatives, see "How to extract just plain text from .doc & .docx files? (unix)".
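For the .docx case, the unzip step can also be done inside PHP instead of through the shell. The sketch below swaps unzip and sed for PHP's ZipArchive class and strip_tags(); it assumes the zip extension is installed, and docxToText() is a hypothetical name. It is a rough equivalent of the pipeline above, not a full .docx parser: it reads only word/document.xml and treats each closing paragraph tag as a line break.

```php
<?php
declare(strict_types=1);

/**
 * Read the main document part of a .docx file and reduce it to plain text.
 * Returns null if the archive or the document part cannot be read.
 */
function docxToText(string $docxPath): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        return null;
    }

    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null;
    }

    // Turn paragraph ends into newlines before the tags are stripped.
    $xml = str_replace('</w:p>', "\n", $xml);

    // Strip the markup, then decode XML entities such as &amp;.
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

Unlike the sed expression, this keeps non-ASCII characters and paragraph breaks, which may or may not be what you want.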

For other PHP alternatives, see "How to extract text from word file .doc,docx,.xlsx,.pptx php".