What is the right way to snyc/import tables from a postgres DB to elasticsearch? What is the right way to snyc/import tables from a postgres DB to elasticsearch? elasticsearch elasticsearch

What is the right way to snyc/import tables from a postgres DB to elasticsearch?


It depends on your use case. A common practice is to handle this on the application layer. Basically what you do is to replicate the actions of one db to the other. So for example if you save one entry in postgres you do the same in elasticsearch.

If you do this however you'll have to have a queuing system in place. Either the queue is integrated on your application layer, e.g. if the save in elasticsearch fails then you can replay the operation. Moreover on your queuing system you'll implement a throttling mechanism in order to not overwhelm elasticsearch. Another approach would be to send events to another app (e.g. logstash etc), so the throttling and persistence will be handled by that system and not your application.

Another approach would be this https://www.elastic.co/blog/logstash-jdbc-input-plugin. You use another system that "polls" your database and sends the changes to elasticsearch. In this case logstash is ideal since it's part of the ELK stack and it has a great integration. Check this too https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html

Another approach is to use the NOTIFY mechanism of postgres to send events to some queue that will handle saving the changes in elasticsearch.


There is a more recent tool called "abc", developped by appbase.ioIt's performance is uncomparable with logstash:- abc is based on go- logstash is jruby

Anybody who's ever used logstash knows that it takes at least 20 seconds just to start.

The same basic table import task from postgresql to elasticsearch takes ~1 min on logstash, and 5 seconds with abc

Pros:

  • Performance
  • Performance
  • Simplicity (no conf)

Cons:

  • More adapted for one-shot imports, the daemon mode is limited
  • Less middlewares (logstash filters) as you are required to write a transform.js file that manually changes events


As anything in life,best is subjective. Your colleague likes to write and maintain code to keep this in sync. There's nothing wrong with that.

I would say the best way would be to use some data pipeline. There's plethora of choices, really overwheleming, you can explore the various solutions which support Postgres and ElasticSearch. Here are options I'm familiar with.

Note that these are tools/platform for your solution, not the solution itself. YOU have to configure, customize and enhance them to fit your definition of in sync