
Best practices for searchable archive of thousands of documents (pdf and/or xml)


In summary: I'm going to be recommending ElasticSearch, but let's break the problem down and talk about how to implement it:

There are a few parts to this:

  1. Extracting the text from your docs to make them indexable
  2. Making this text available as full text search
  3. Returning highlighted snippets of the doc
  4. Knowing where in the doc those snippets are found, to allow for paging
  5. Returning the full doc

What can ElasticSearch provide:

  1. ElasticSearch (like Solr) uses Tika to extract text and metadata from a wide variety of doc formats
  2. It, pretty obviously, provides powerful full text search. It can be configured to analyse each doc in the appropriate language, with stemming, boosting the relevance of certain fields (eg title more important than content), ngrams etc, ie standard Lucene stuff (see the mapping sketch after this list)
  3. It can return highlighted snippets for each search result
  4. It DOESN'T know where those snippets occur in your doc
  5. It can store the original doc as an attachment, or it can store and return the extracted text. But it'll return the whole doc, not a page.
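
To illustrate point (2), a hedged mapping sketch: analyse the text field with the built-in english analyzer (which stems) and give the title field a higher index-time boost. The field names and boost value here are examples only:

curl -XPUT 'http://127.0.0.1:9200/my_index/page/_mapping' -d '{
   "page" : {
      "properties" : {
         "title" : { "type" : "string", "boost" : 2.0 },
         "text"  : { "type" : "string", "analyzer" : "english" }
      }
   }
}'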

You could just send the whole doc to ElasticSearch as an attachment, and you'd get full text search. But the sticking points are (4) and (5) above: knowing where you are in a doc, and returning parts of a doc.
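
As an aside, a minimal sketch of that attachment approach, assuming the mapper-attachments plugin is installed (the "file" field name and index name are examples only):

curl -XPUT 'http://127.0.0.1:9200/my_index/doc/_mapping' -d '{
   "doc" : {
      "properties" : {
         "file" : { "type" : "attachment" }
      }
   }
}'

# the plugin expects the file content base64-encoded
# (GNU base64; -w 0 disables line wrapping)
curl -XPUT 'http://127.0.0.1:9200/my_index/doc/1' -d '{
   "file" : "'"$(base64 -w 0 mydoc.pdf)"'"
}'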

Storing individual pages is probably sufficient for your where-am-I purposes (although you could equally go down to paragraph level), but you want them grouped so that a doc is returned in the search results even if search keywords appear on different pages.

First the indexing part: storing your docs in ElasticSearch:

  1. Use Tika (or whatever you're comfortable with) to extract the text from each doc. Leave it as plain text, or as HTML to preserve some formatting (forget about XML, there's no need for it). See the sketch after this list.
  2. Also extract the metadata for each doc: title, authors, chapters, language, dates etc
  3. Store the original doc in your filesystem, and record the path so that you can serve it later
  4. In ElasticSearch, index a "doc" doc which contains all of the metadata, and possibly the list of chapters
  5. Index each page as a "page" doc, which contains:

    • A parent field which contains the ID of the "doc" doc (see "Parent-child relationship" below)
    • The text
    • The page number
    • Maybe the chapter title or number
    • Any metadata which you want to be searchable

Now for searching. How you do this depends on how you want to present your results - by page, or grouped by doc.

Results by page are easy. This query returns a list of matching pages (each page is returned in full) plus a list of highlighted snippets from the page:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1' -d '{
   "query" : {
      "text" : {
         "text" : "interesting keywords"
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}'

Displaying results grouped by "doc" with highlights from the text is a bit trickier. It can't be done with a single query, but a little client-side grouping will get you there. One approach might be:

Step 1: Do a top-children-query to find the parent ("doc") whose children ("page") best match the query:

curl -XGET 'http://127.0.0.1:9200/my_index/doc/_search?pretty=1' -d '{
   "query" : {
      "top_children" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "score" : "sum",
         "type" : "page",
         "factor" : 5
      }
   }
}'

Step 2: Collect the "doc" IDs from the above query and issue a new query to get the snippets from the matching "page" docs:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1' -d '{
   "query" : {
      "filtered" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "filter" : {
            "terms" : {
               "doc_id" : [ 1, 2, 3 ]
            }
         }
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}'

Step 3: In your app, group the results from the above query by doc and display them.
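
If your client happens to be a shell script, jq can do that grouping; a sketch only, assuming the doc_id field from the indexing section:

curl -s -XGET 'http://127.0.0.1:9200/my_index/page/_search' -d '{
   "query" : {
      "text" : {
         "text" : "interesting keywords"
      }
   }
}' | jq '.hits.hits | group_by(._source.doc_id)'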

With the search results from the second query, you already have the full text of the page which you can display. To move to the next page, you can just search for it:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1' -d '{
   "query" : {
      "constant_score" : {
         "filter" : {
            "and" : [
               {
                  "term" : {
                     "doc_id" : 1
                  }
               },
               {
                  "term" : {
                     "page" : 2
                  }
               }
            ]
         }
      }
   },
   "size" : 1
}'

Or alternatively, give the "page" docs an ID of the form "{doc_id}_{page_num}" (eg 123_2), then you can just retrieve that page directly:

curl -XGET 'http://127.0.0.1:9200/my_index/page/123_2'

Parent-child relationship:

Normally, in ES (and most NoSQL solutions) each doc/object is independent - there are no real relationships. By establishing a parent-child relationship between the "doc" and the "page", ElasticSearch makes sure that the child docs (ie the "page") are stored on the same shard as the parent doc (the "doc").

This enables you to run the top-children-query which will find the best matching "doc" based on the content of the "pages".
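
The relationship is declared in the mapping when the index is created; a minimal sketch using the type names from above:

curl -XPUT 'http://127.0.0.1:9200/my_index' -d '{
   "mappings" : {
      "page" : {
         "_parent" : { "type" : "doc" }
      }
   }
}'

Each "page" must then be indexed with the ?parent= routing parameter (as in the indexing sketch earlier) so that it lands on its parent's shard.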


I've built and maintain an application that indexes and searches 70k+ PDF documents. I found it was necessary to pull the plain text out of the PDFs, store the contents in SQL, and index the SQL table using Lucene. Otherwise, performance was horrible.


Use Sunspot, RSolr, or similar; they handle most major document formats and are built on Solr/Lucene.