How to use Bulk API to store the keywords in ES by using Python



from datetime import datetime
from elasticsearch import Elasticsearch
from elasticsearch import helpers

es = Elasticsearch()

actions = [
    {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j,
        "_source": {
            "any": "data" + str(j),
            "timestamp": datetime.now()
        }
    }
    for j in range(0, 10)
]

helpers.bulk(es, actions)


Although @justinachen's code helped me get started with py-elasticsearch, after looking at the source code let me suggest a simple improvement:

from datetime import datetime
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

j = 0
actions = []
while (j <= 10):
    action = {
        "_index": "tickets-index",
        "_type": "tickets",
        "_id": j,
        "_source": {
            "any": "data" + str(j),
            "timestamp": datetime.now()
        }
    }
    actions.append(action)
    j += 1

helpers.bulk(es, actions)

helpers.bulk() already does the segmentation for you, and by segmentation I mean the chunks sent to the server on each request. If you want to reduce the number of documents sent per chunk, do: helpers.bulk(es, actions, chunk_size=100)
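For example, a minimal sketch (the index name and documents are just placeholders) showing chunk_size and what helpers.bulk() gives back, which as far as I know is a tuple of the number of successfully executed actions and a list of errors:

from datetime import datetime
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Hypothetical documents, just to illustrate chunking.
actions = [
    {"_index": "tickets-index", "_type": "tickets", "_id": j,
     "_source": {"any": "data" + str(j), "timestamp": datetime.now()}}
    for j in range(0, 1000)
]

# Send at most 100 documents per request. helpers.bulk() returns a tuple:
# (number of successfully executed actions, list of errors).
success, errors = helpers.bulk(es, actions, chunk_size=100)
print("indexed:", success, "errors:", errors)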

Some handy info to get started:

helpers.bulk() is just a wrapper around helpers.streaming_bulk, but the former accepts a list, which makes it handy.

helpers.streaming_bulk is based on Elasticsearch.bulk(), so you do not need to worry about which one to choose.

So in most cases, helpers.bulk() should be all you need; if you want to react to per-document results, see the streaming_bulk sketch below.
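Here is a minimal sketch of using helpers.streaming_bulk directly (the index name and generator are just placeholders); it yields one (ok, result) tuple per action as chunks complete:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def generate_actions():
    # Placeholder generator; yield your own documents here.
    for j in range(0, 10):
        yield {
            "_index": "tickets-index",
            "_type": "tickets",
            "_id": j,
            "_source": {"any": "data" + str(j)},
        }

# raise_on_error=False lets us inspect failures instead of raising immediately.
for ok, result in helpers.streaming_bulk(es, generate_actions(),
                                         chunk_size=100,
                                         raise_on_error=False):
    if not ok:
        print("failed:", result)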


(The other approaches mentioned in this thread build a Python list for the ES update, which is not a good solution today, especially when you need to add millions of documents to ES.)

A better approach is to use Python generators -- process gigabytes of data without running out of memory or compromising much on speed.

Below is an example snippet from a practical use case: adding data from an nginx log file to ES for analysis.

from elasticsearch import Elasticsearch, helpers

def decode_nginx_log(_nginx_fd):
    for each_line in _nginx_fd:
        # Filter out the below from each log line
        remote_addr = ...
        timestamp   = ...
        ...

        # Index for elasticsearch. Typically timestamp.
        idx = ...

        es_fields_keys = ('remote_addr', 'timestamp', 'url', 'status')
        es_fields_vals = (remote_addr, timestamp, url, status)

        # We return a dict holding values from each line
        es_nginx_d = dict(zip(es_fields_keys, es_fields_vals))

        # Return the row on each iteration
        yield idx, es_nginx_d   # <- Note the usage of 'yield'


def es_add_bulk(nginx_file):
    # The nginx file can be gzip or just text. Open it appropriately.
    ...

    es = Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}])

    # NOTE the (...) round brackets. This is for a generator.
    k = ({
            "_index": "nginx",
            "_type": "logs",
            "_id": idx,
            "_source": es_nginx_d,
         } for idx, es_nginx_d in decode_nginx_log(_nginx_fd))

    helpers.bulk(es, k)

# Now, just run it.
es_add_bulk('./nginx.1.log.gz')

This skeleton demonstrates the use of generators. You can use this even on a bare machine if you need to, and you can go on expanding it to tailor to your needs quickly.
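For instance, the "open it appropriately" step left as ... above could look something like the sketch below. This is only an illustration: the .gz suffix check is an assumption, and decode_nginx_log() is the generator defined in the snippet above.

import gzip
from elasticsearch import Elasticsearch, helpers

def es_add_bulk(nginx_file):
    # Pick the opener based on the file suffix (assumption: .gz means gzip).
    opener = gzip.open if nginx_file.endswith('.gz') else open

    es = Elasticsearch(hosts=[{'host': 'localhost', 'port': 9200}])

    # 'rt' gives text lines whether the file is gzipped or plain.
    with opener(nginx_file, 'rt') as _nginx_fd:
        # decode_nginx_log() is the generator from the snippet above.
        k = ({
                "_index": "nginx",
                "_type": "logs",
                "_id": idx,
                "_source": es_nginx_d,
             } for idx, es_nginx_d in decode_nginx_log(_nginx_fd))

        helpers.bulk(es, k)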

Python Elasticsearch reference here.