Using ElasticSearch and/or Solr as a datastore for MS Office and PDF documents

pdf solr elasticsearch ms-office

Both Solr and Elasticsearch will index the content of the document. Solr has that built in; Elasticsearch needs a plugin. It is easy either way, and both use Tika under the covers.

Neither of them will store the document itself. You can try making them do it, but they are not designed for it and you will suffer.

Additionally, neither Solr nor Elasticsearch is currently recommended as primary storage. They can do it, but it is not as mission-critical for them as it is for, say, a filesystem implementation.

So, I would recommend having the files somewhere else and using Solr/Elasticsearch for searching only. That's where they shine.
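A minimal sketch of that split, assuming the file stays on disk (or in any blob store) and only the extracted text plus a pointer back to the file goes into the search index. The field names (path, sha256, size_bytes, content) and the helper build_search_doc are hypothetical, and the text extraction step is assumed to have happened already.

```python
import hashlib
import json
from pathlib import Path


def build_search_doc(file_path: Path, extracted_text: str) -> dict:
    """Build the JSON-serializable record to send to the search engine.

    The binary itself is NOT included; only a pointer to it is.
    All field names here are illustrative assumptions.
    """
    data = file_path.read_bytes()
    return {
        "path": str(file_path),                      # where the real file lives
        "sha256": hashlib.sha256(data).hexdigest(),  # lets you detect changed files
        "size_bytes": len(data),
        "content": extracted_text,                   # the part that gets searched
    }


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "report.pdf"
        sample.write_bytes(b"%PDF-1.4 placeholder bytes")
        doc = build_search_doc(sample, "Quarterly report text")
        print(json.dumps(doc, indent=2))
```

A search hit then gives you the path, and the application fetches the original file from wherever it is actually stored.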


I would try the Elasticsearch attachment plugin. Details can be found here:

https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/mapper-attachments.html

https://github.com/elasticsearch/elasticsearch-mapper-attachments

It's built on top of Apache Tika:

http://tika.apache.org/1.7/formats.html

Attachment Type

The attachment type allows indexing different "attachment" type fields (encoded as base64), for example Microsoft Office formats, open document formats, ePub, HTML, and so on (the full list is on the Tika formats page linked above).
The attachment type is provided as a plugin extension. The plugin is a simple zip file that can be downloaded and placed under the $ES_HOME/plugins directory. It will be detected automatically and the attachment type will be added.
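
For illustration, the request bodies for an attachment field might be built roughly as follows. This is only a sketch: it assumes the plugin's "attachment" mapping type, the field name "file" and both helper names are hypothetical, the type-name wrapper around the mapping is omitted, and actually sending the JSON to Elasticsearch is left out.

```python
import base64
import json


def attachment_mapping(field: str = "file") -> dict:
    """Mapping fragment declaring `field` as an attachment-type field."""
    return {"properties": {field: {"type": "attachment"}}}


def attachment_document(raw: bytes, field: str = "file") -> dict:
    """Document body: the binary content, base64-encoded as ASCII text."""
    return {field: base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    # Placeholder bytes stand in for a real PDF or Office file.
    body = attachment_document(b"%PDF-1.4 placeholder bytes")
    print(json.dumps(attachment_mapping()))
    print(json.dumps(body))
```

Note that base64 inflates the payload by about a third, which is one more reason to keep the original files outside the index.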

Supported Document Formats

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
iWorks document formats
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Feed and Syndication formats
Help formats
Audio formats
Image formats
Video formats
Java class files and archives
Source code
Mail formats
CAD formats
Font formats
Scientific formats
Executable programs and libraries
Crypto formats


A bit late to the party but this may help someone :)

I had a similar problem and some research led me to fscrawler. Description:

This crawler helps to index binary documents such as PDF, Open Office, MS Office.

Main features:

Local file system (or a mounted drive) crawling: indexes new files, updates existing ones, and removes old ones.
Remote file system crawling over SSH.
REST interface to let you "upload" your binary documents to Elasticsearch.
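
The first feature amounts to comparing the current state of a directory with what was seen on the previous run. The following is a rough sketch of that idea only, not fscrawler's actual implementation; snapshot and diff_snapshots are hypothetical names, and modification time is assumed to be a good enough change signal.

```python
from pathlib import Path


def snapshot(root: Path) -> dict:
    """Map each file under `root` (as a relative path) to its mtime in ns."""
    return {
        str(p.relative_to(root)): p.stat().st_mtime_ns
        for p in root.rglob("*")
        if p.is_file()
    }


def diff_snapshots(old: dict, new: dict) -> dict:
    """Classify paths as added, updated, or removed between two snapshots."""
    common = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),            # index these
        "updated": sorted(k for k in common if new[k] != old[k]),  # re-index these
        "removed": sorted(old.keys() - new.keys()),          # delete from the index
    }


if __name__ == "__main__":
    previous = {"a.pdf": 1, "b.docx": 1}
    current = {"a.pdf": 2, "c.odt": 1}
    print(diff_snapshots(previous, current))
```

Each run would persist its snapshot so the next run has something to compare against.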

