
Trying bulk/ingest "large" amount of documents SQL Db to Elasticsearch


While streaming would make records more readily available than bulk processing, and would reduce the overhead of large object management in the Java container, you can take a hit on latency. Usually in these kinds of scenarios you have to find an optimum bulk size. For this I follow these steps:

1) Build a streaming bulk insert: stream the records, but still send more than one record (or, in your case, build more than one JSON document) at a time; see the sketch below.
2) Experiment with several bulk sizes, for example 10, 100, 1000, 10000, and plot them in a quick graph. Run a sufficient number of records to see whether performance degrades over time: it can be that 10 is extremely fast per record, but that there is an incremental insert overhead (for example primary key maintenance in SQL Server). If you run the same total number of records for every test, the results should be representative of your performance.
3) Interpolate in your graph and maybe try out 3 values between the best values from step 2.

Then use the final result as your optimal streaming bulk insert size.
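A minimal sketch of such a streaming bulk insert in Java, assuming the Elasticsearch High Level REST Client and a JDBC source (the connection string, table, column, and index names are placeholders); re-running it with `bulkSize` set to 10, 100, 1000, 10000 gives you the timings for step 2:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class StreamingBulkIndexer {

    public static void main(String[] args) throws Exception {
        int bulkSize = 1000; // try 10, 100, 1000, 10000 and time each run

        try (RestHighLevelClient client = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("localhost", 9200, "http")));
             Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://localhost;databaseName=source_db"); // placeholder connection
             Statement stmt = conn.createStatement()) {

            // Stream rows from the database instead of loading the whole table into memory
            stmt.setFetchSize(bulkSize);
            ResultSet rs = stmt.executeQuery("SELECT id, payload FROM documents");

            long start = System.nanoTime();
            BulkRequest bulk = new BulkRequest();
            long total = 0;

            while (rs.next()) {
                // Assumption: the row already holds the JSON document to index
                String json = rs.getString("payload");
                bulk.add(new IndexRequest("my-index")
                        .id(rs.getString("id"))
                        .source(json, XContentType.JSON));

                // Flush once the batch reaches the chosen bulk size
                if (bulk.numberOfActions() >= bulkSize) {
                    total += flush(client, bulk);
                    bulk = new BulkRequest();
                }
            }
            if (bulk.numberOfActions() > 0) {
                total += flush(client, bulk); // flush the tail batch
            }

            double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
            System.out.printf("bulkSize=%d docs=%d docs/sec=%.0f%n", bulkSize, total, total / seconds);
        }
    }

    private static long flush(RestHighLevelClient client, BulkRequest bulk) throws Exception {
        BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
        return bulk.numberOfActions();
    }
}
```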

Once you have this value, you can add one more step: run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and adjust your bulk size maybe one more time.
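One way to sketch that parallel step, assuming the source rows can be partitioned by an id range so each worker streams its own slice (the worker count, id bound, and `ingestRange` helper are placeholders for the bulk loop above):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIngest {

    public static void main(String[] args) throws InterruptedException {
        int workers = 4;           // number of parallel ingest workers (tune against throughput)
        long maxId = 10_000_000L;  // assumed upper bound of the id column
        long slice = maxId / workers;

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            long from = i * slice;
            long to = (i == workers - 1) ? maxId : from + slice;
            // Each worker runs the streaming bulk insert on its own id range,
            // e.g. WHERE id >= from AND id < to, with the bulk size found earlier.
            pool.submit(() -> ingestRange(from, to, 1000));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // Placeholder for a per-range variant of the streaming bulk insert sketched above.
    private static void ingestRange(long fromId, long toId, int bulkSize) {
        System.out.printf("indexing ids [%d, %d) with bulkSize=%d%n", fromId, toId, bulkSize);
    }
}
```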

This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.