How do I extract the text content from a word document with PHP?
Try creating the reader first, then check that it can read the file before loading it:

    $source = "word.doc";

    // create your reader object
    $phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');

    // read source
    if ($phpWordReader->canRead($source)) {
        $phpWord = $phpWordReader->load($source);
        ... // rest of your code
    }
Answer is based on this example and API documentation
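Rather than checking each element class for text, you can loop over the sections and call getText() wherever it exists:

    $text = ''; // initialise before appending, otherwise PHP raises an undefined variable warning
    $sections = $phpWord->getSections();
    foreach ($sections as $s) {
        $els = $s->getElements();
        /** @var ElementTest $e */
        foreach ($els as $e) {
            $class = get_class($e);
            // only some element types expose getText()
            if (method_exists($class, 'getText')) {
                $text .= $e->getText();
            } else {
                $text .= "\n";
            }
        }
    }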
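The loop above only looks at the top-level elements of each section, so text nested inside a container can be missed. Below is a minimal recursive sketch of the same method_exists() idea. The function name collectText() is hypothetical, and it assumes that container elements expose getElements() and that text-bearing elements expose getText(); check this against the PHPWord version you have installed.

```php
<?php
// Hypothetical helper: collects text from an element and anything nested in it.
// Assumes containers expose getElements() and text elements expose getText().
function collectText(object $element): string
{
    $text = '';

    // Containers: recurse into the children, then end the block with a newline.
    if (method_exists($element, 'getElements')) {
        foreach ($element->getElements() as $child) {
            $text .= collectText($child);
        }
        return $text . "\n";
    }

    // Leaf elements: getText() may return a string, null, or another element.
    if (method_exists($element, 'getText')) {
        $value = $element->getText();
        if (is_string($value)) {
            return $value;
        }
        return is_object($value) ? collectText($value) : '';
    }

    // Anything else (breaks, images, ...) contributes no text.
    return '';
}

// Usage, with $phpWord being the object returned by $phpWordReader->load($source):
//   $text = '';
//   foreach ($phpWord->getSections() as $section) {
//       $text .= collectText($section);
//   }
```

Because the helper only relies on method names, it does not need to know the concrete element classes, which is the same trade-off the loop above makes.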
You can extract text from a Word document using catdoc: http://www.wagner.pp.ru/~vitus/software/catdoc/
It can be installed on Ubuntu using
sudo apt-get install catdoc
Once you have catdoc working on your system, you can call it from PHP using shell_exec():

    <?php
    $text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');
    print $text;
    ?>

Be sure to substitute (fullpath) with the actual paths to catdoc and to your Word document.
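If the document path comes from user input or may contain spaces, it is safer to escape it before it reaches the shell. The following is a minimal sketch using PHP's escapeshellarg(); the helper names buildCatdocCommand() and extractDocText() are hypothetical, and the paths in the usage comment are placeholders, not the real locations on your system.

```php
<?php
// Hypothetical helper: builds the catdoc command with both paths shell-escaped.
function buildCatdocCommand(string $catdocPath, string $docPath): string
{
    return escapeshellarg($catdocPath) . ' ' . escapeshellarg($docPath);
}

// Hypothetical helper: runs catdoc and returns its output,
// or null when shell_exec() produced nothing usable.
function extractDocText(string $catdocPath, string $docPath): ?string
{
    $output = shell_exec(buildCatdocCommand($catdocPath, $docPath));
    return is_string($output) ? $output : null;
}

// Usage (placeholder paths, adjust to your system):
//   $text = extractDocText('/usr/bin/catdoc', '/var/uploads/my report.doc');
//   if ($text !== null) {
//       print $text;
//   }
```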
EDIT ---- Addition
If you can save your files as .docx rather than .doc, it is a little easier. A .docx file is a zip archive whose main text lives in word/document.xml, so you can use unzip rather than catdoc.
Simply replace:
$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');
with
$text = shell_exec("/(fullpath)/unzip -p /(fullpath)/word.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");
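If you would rather not shell out at all, the same idea can be sketched in plain PHP. Note that this swaps unzip and sed for PHP's ZipArchive class and strip_tags(), so it assumes the zip extension is enabled; the function name extractDocxText() is hypothetical.

```php
<?php
// Hypothetical helper: reads word/document.xml from a .docx and strips the markup.
// Assumes the PHP zip extension (ZipArchive) is available.
function extractDocxText(string $docxPath): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        return null; // not a readable zip archive
    }

    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null; // archive has no main document part
    }

    // End each paragraph with a newline so paragraphs do not run together.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);

    // Drop the tags, then decode XML entities such as &amp;.
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// Usage (placeholder path):
//   $text = extractDocxText('/path/to/word.docx');
```

Like the sed version, this is a rough extraction: it ignores headers, footers and footnotes, which are stored in separate parts of the archive.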
You can use the same technique with most other command-line document-to-text converters. Just replace the command in shell_exec() with the one that works on your system. See "How to extract just plain text from .doc & .docx files? (unix)" for other Unix/Linux alternatives.
For other PHP alternatives, see "How to extract text from word file .doc,docx,.xlsx,.pptx php".