Index a MySQL database with Apache Lucene, and keep them synchronized


As long as indexing or reindexing runs separately from your application, you will have synchronization problems. Depending on your field of work this may not matter, but for many applications with concurrent users it does.

We had the same problem when a job system ran asynchronous indexing every few minutes. Users would find a product through the search engine and, even after an administrator had removed it from the set of valid products, would still find it in the frontend until the next reindexing job ran. This leads to very confusing, seldom reproducible errors being reported to first-level support.

We saw two possibilities: either couple the business logic tightly to updates of the search index, or implement a tighter asynchronous update task. We did the latter.

In the background, a class runs in a dedicated thread inside the Tomcat application, takes updates, and runs them in parallel. The delay between a backoffice update and its appearance in the frontend is down to 0.5-2 seconds, which greatly reduces the problems for first-level support. It is also as loosely coupled as can be; we could even swap in a different indexing engine.
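
To make the idea concrete, here is a minimal sketch of such a background updater. It is not our production code: the names `SearchIndex`, `InMemoryIndex`, `IndexUpdate` and `IndexUpdateWorker` are made up for illustration, the index is an in-memory stand-in rather than Lucene, and for simplicity the worker applies updates one at a time in arrival order instead of in parallel.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical abstraction over the search index; a Lucene-backed version would wrap an index writer. */
interface SearchIndex {
    void upsert(String id, String content);
    void delete(String id);
}

/** In-memory stand-in so the sketch runs without Lucene on the classpath. */
final class InMemoryIndex implements SearchIndex {
    private final Map<String, String> docs = new ConcurrentHashMap<>();
    @Override public void upsert(String id, String content) { docs.put(id, content); }
    @Override public void delete(String id) { docs.remove(id); }
    boolean contains(String id) { return docs.containsKey(id); }
}

/** One pending change; a null content means "remove this document". */
final class IndexUpdate {
    final String id;
    final String content;
    IndexUpdate(String id, String content) { this.id = id; this.content = content; }
}

/** Dedicated thread that drains a queue of updates and applies them to the index. */
final class IndexUpdateWorker implements Runnable {
    private static final IndexUpdate STOP = new IndexUpdate(null, null); // shutdown marker
    private final BlockingQueue<IndexUpdate> queue = new LinkedBlockingQueue<>();
    private final SearchIndex index;
    private final Thread thread;

    IndexUpdateWorker(SearchIndex index) {
        this.index = index;
        this.thread = new Thread(this, "index-updater");
        this.thread.setDaemon(true); // do not block container shutdown
    }

    void start() { thread.start(); }

    /** Called by the business logic right after it changes the database. */
    void submit(IndexUpdate update) { queue.add(update); }

    /** Processes everything queued so far, then stops the thread. */
    void stop() {
        queue.add(STOP);
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    @Override public void run() {
        try {
            while (true) {
                IndexUpdate u = queue.take(); // blocks until an update arrives
                if (u == STOP) return;
                if (u.content == null) index.delete(u.id);
                else index.upsert(u.id, u.content);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The business logic only ever calls `submit(...)`, so it knows nothing about the indexing engine behind `SearchIndex`; that is where the loose coupling comes from. Applying updates in order from a single thread also avoids a delete being overtaken by an older upsert for the same document, which is something a parallel variant would have to handle explicitly.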


Take a look at the Solr DataImportScheduler approach.
When the web application starts, it spawns a separate Timer thread that periodically fires an HTTP POST against Solr, which then uses the DataImportHandler you set up to pull data from an RDB (and other data sources).
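
A rough sketch of that scheduling pattern in plain Java, assuming nothing about the actual DataImportScheduler code: `PeriodicImportScheduler` and the `importJob` parameter are hypothetical names, and the job is just a `Runnable` standing in for whatever triggers the import (an HTTP POST in the Solr case, or a direct reindex call if you only use Lucene).

```java
import java.util.Timer;
import java.util.TimerTask;

/** Hypothetical helper: runs an import job at a fixed period on a background Timer thread. */
final class PeriodicImportScheduler {
    // Daemon timer so it never keeps the JVM or the servlet container alive on its own.
    private final Timer timer = new Timer("import-scheduler", true);

    /** Starts firing importJob immediately and then every periodMillis milliseconds. */
    void start(final Runnable importJob, long periodMillis) {
        timer.schedule(new TimerTask() {
            @Override public void run() {
                try {
                    importJob.run();
                } catch (RuntimeException e) {
                    // An uncaught exception would terminate the Timer thread,
                    // so swallow (or log) it and wait for the next run.
                }
            }
        }, 0L, periodMillis);
    }

    /** Cancels all future runs; call this when the web application shuts down. */
    void stop() { timer.cancel(); }
}
```

In a web application you would typically call `start(...)` from a `ServletContextListener`'s `contextInitialized` method and `stop()` from `contextDestroyed`, so the timer lives exactly as long as the application.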

Since you're using only Lucene, not Solr, take a look at the DataImportHandler source for ideas.