Searching over documents stored in Hadoop - which tool to use?


Going with Solr is a good option. I have used it for a scenario similar to the one you describe. Solr can handle really huge data sets because it is a distributed index server.

But to get the metadata for all of these document formats you will need some other tool. Basically, your workflow will be this:

1) Use the Hadoop cluster to store the data.

2) Extract the data in the Hadoop cluster using map/reduce.

3) Do document identification (identify the document type).

4) Extract metadata from these documents.

5) Index the metadata in the Solr server, and store other ingestion information in a database.

6) Solr is a distributed index server, so for each ingestion you could create a new shard or index.

7) When a search is required, search across all the indexes.

8) Solr supports complex searches, so you don't have to build your own search engine.

9) It also handles paging for you.
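
Steps 3, 4, 5, 7 and 9 above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the Solr address, the collection name, the shard addresses and the metadata field names (`filename`, `doctype`, `size_bytes`) are all assumptions, and type detection here looks only at the file extension, where a real pipeline would likely inspect the content. The functions only build the requests you would send to Solr's update and select handlers; they do not perform any network calls.

```python
import json
import mimetypes
import posixpath
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # assumed Solr location


def identify_type(path):
    """Step 3: guess the document type from the file extension only."""
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"


def extract_metadata(path, size_bytes):
    """Step 4: build a flat metadata record (field names are hypothetical)."""
    return {
        "id": path,
        "filename": posixpath.basename(path),
        "doctype": identify_type(path),
        "size_bytes": size_bytes,
    }


def build_update_request(collection, docs):
    """Step 5: URL and JSON body for posting documents to Solr's update handler."""
    url = f"{SOLR_BASE}/{collection}/update?commit=true"
    return url, json.dumps(docs)


def build_search_url(collection, query, shards, page=0, page_size=10):
    """Steps 7 and 9: query several shards at once, one page at a time."""
    params = {
        "q": query,
        "shards": ",".join(shards),   # distributed search across indexes
        "start": page * page_size,    # offset of the first result
        "rows": page_size,            # number of results per page
        "wt": "json",
    }
    return f"{SOLR_BASE}/{collection}/select?{urlencode(params)}"
```

For example, `build_search_url("docs", "doctype:pdf", ["host1:8983/solr/ingest1", "host2:8983/solr/ingest2"], page=2)` would ask for results 20 to 29 across both (hypothetical) per-ingestion indexes.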


We've done exactly this for some of our clients by using Solr as a "secondary indexer" to HBase. Updates to HBase are also sent to Solr, and you can then query against Solr. Typically folks start with HBase and then graft search on. It sounds like you know from the get-go that search is what you want, so you can probably build the secondary indexing into the pipeline that feeds HBase.

You may find, though, that just using Solr does everything you need.
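
To make the "secondary indexer" idea concrete, here is a minimal sketch of the write path. It uses in-memory stand-ins rather than real HBase or Solr clients, so every class and method name below (`InMemoryTable`, `InMemoryIndex`, `IndexedWriter`) is hypothetical. The point is only the pattern: every write goes to the primary store, and the searchable subset of the row is forwarded to the index.

```python
class InMemoryTable:
    """Stand-in for an HBase table: row key -> dict of columns."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, columns):
        # Merge new columns into the existing row and return a copy of it.
        self.rows.setdefault(row_key, {}).update(columns)
        return dict(self.rows[row_key])


class InMemoryIndex:
    """Stand-in for a Solr index: keeps the latest document per id."""

    def __init__(self):
        self.docs = {}

    def add(self, doc):
        self.docs[doc["id"]] = doc

    def search(self, field, value):
        # Exact-match lookup; a real index would do far more than this.
        return sorted(d["id"] for d in self.docs.values() if d.get(field) == value)


class IndexedWriter:
    """Writes to the primary store, then forwards the indexed fields."""

    def __init__(self, table, index, indexed_fields):
        self.table = table
        self.index = index
        self.indexed_fields = indexed_fields

    def put(self, row_key, columns):
        row = self.table.put(row_key, columns)
        doc = {"id": row_key}
        doc.update({f: row[f] for f in self.indexed_fields if f in row})
        self.index.add(doc)


table, index = InMemoryTable(), InMemoryIndex()
writer = IndexedWriter(table, index, ["doctype", "author"])
writer.put("doc1", {"doctype": "pdf", "body": "full text stays in the table"})
writer.put("doc1", {"author": "bob"})
print(index.search("author", "bob"))
```

Only `doctype` and `author` reach the index in this sketch; the `body` column stays in the primary store, which is the usual reason for keeping the two systems separate.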


Another project to look at is Lily, http://www.lilyproject.org/lily/index.html, which has already done the work of integrating Solr with a distributed database.

Also, I do not see why you would not want to use a browser for this application. You are describing exactly what faceted search is. While you certainly could set up a desktop app that communicates with the server (parsing its JSON responses) and displays the results in a thick-client GUI, all of this work is already done for you in the browser. And Solr comes with faceted search out of the box: just follow along with the tutorial.
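
If you do end up parsing the JSON yourself, a faceted query and its response can be handled roughly like this. It is a sketch under assumptions: the base URL and the field names (`doctype`, `author`) are made up, and the parser expects Solr's default JSON layout for field facets, where each field maps to a flat list alternating value and count.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, query, facet_fields, rows=10):
    """Build a select URL that asks Solr for facet counts on the given fields."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", 1),  # hide facet values with no matching documents
    ]
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


def parse_facet_counts(response):
    """Turn the flat [value, count, value, count, ...] lists into dicts."""
    fields = response.get("facet_counts", {}).get("facet_fields", {})
    return {
        name: dict(zip(flat[0::2], flat[1::2]))
        for name, flat in fields.items()
    }


# Hand-written sample in the shape described above, not real server output.
sample = {"facet_counts": {"facet_fields": {"doctype": ["pdf", 12, "doc", 5]}}}
print(parse_facet_counts(sample))
```

Each key in the resulting dict is a value the user can click to narrow the search, typically by adding a filter such as `fq=doctype:pdf` to the next request.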