
Trying bulk/ingest "large" amount of documents SQL Db to Elasticsearch


While streaming would make records more readily available than bulk processing, and would reduce the overhead of large object management in the Java container, you can take a hit on latency. Usually in these kinds of scenarios you have to find an optimum bulk size. For this I follow these steps:

1) Build a streaming bulk insert: stream the records, but still send more than one record (or, in your case, build more than one JSON document) at a time; see the sketch below.
2) Experiment with several bulk sizes, for example 10, 100, 1000, 10000, and plot them in a quick graph. Run a sufficient number of records to see whether performance degrades over time: it can be that 10 is extremely fast per record, but that there is an incremental insert overhead (for example primary key maintenance in SQL Server). If you run the same total number of records for every test, the results should be representative of your performance.
3) Interpolate in your graph and maybe try out 3 values between the best values from step 2.

Then use the final result as your optimal streaming bulk insert size.
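A minimal sketch of such a streaming bulk insert in Java, assuming the Elasticsearch High Level REST Client and a JDBC source (the connection string, table, column, and index names are placeholders); re-running it with `bulkSize` set to 10, 100, 1000, 10000 gives you the timings for step 2:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class StreamingBulkIndexer {

    public static void main(String[] args) throws Exception {
        int bulkSize = 1000; // try 10, 100, 1000, 10000 and time each run

        try (RestHighLevelClient client = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("localhost", 9200, "http")));
             Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://localhost;databaseName=source_db"); // placeholder connection
             Statement stmt = conn.createStatement()) {

            // Stream rows from the database instead of loading the whole table into memory
            stmt.setFetchSize(bulkSize);
            ResultSet rs = stmt.executeQuery("SELECT id, payload FROM documents");

            long start = System.nanoTime();
            BulkRequest bulk = new BulkRequest();
            long total = 0;

            while (rs.next()) {
                // Assumption: the row already holds the JSON document to index
                String json = rs.getString("payload");
                bulk.add(new IndexRequest("my-index")
                        .id(rs.getString("id"))
                        .source(json, XContentType.JSON));

                // Flush once the batch reaches the chosen bulk size
                if (bulk.numberOfActions() >= bulkSize) {
                    total += flush(client, bulk);
                    bulk = new BulkRequest();
                }
            }
            if (bulk.numberOfActions() > 0) {
                total += flush(client, bulk); // flush the tail batch
            }

            double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
            System.out.printf("bulkSize=%d docs=%d docs/sec=%.0f%n", bulkSize, total, total / seconds);
        }
    }

    private static long flush(RestHighLevelClient client, BulkRequest bulk) throws Exception {
        BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
        return bulk.numberOfActions();
    }
}
```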

Once you have this value, you can add one more step: run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and adjust your bulk size maybe one more time.
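One way to sketch that parallel step, assuming the source rows can be partitioned by an id range so each worker streams its own slice (the worker count, id bound, and `ingestRange` helper are placeholders for the bulk loop above):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIngest {

    public static void main(String[] args) throws InterruptedException {
        int workers = 4;           // number of parallel ingest workers (tune against throughput)
        long maxId = 10_000_000L;  // assumed upper bound of the id column
        long slice = maxId / workers;

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            long from = i * slice;
            long to = (i == workers - 1) ? maxId : from + slice;
            // Each worker runs the streaming bulk insert on its own id range,
            // e.g. WHERE id >= from AND id < to, with the bulk size found earlier.
            pool.submit(() -> ingestRange(from, to, 1000));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // Placeholder for a per-range variant of the streaming bulk insert sketched above.
    private static void ingestRange(long fromId, long toId, int bulkSize) {
        System.out.printf("indexing ids [%d, %d) with bulkSize=%d%n", fromId, toId, bulkSize);
    }
}
```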

This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.