Does Spark not support ArrayList when writing to Elasticsearch?
A bit late to the game, but this is the solution we came up with after running into this yesterday: add 'es.input.json': 'true' to your conf, and then run json.dumps() on your data.
Modifying your example, this would look like:
    import json

    rdd = sc.parallelize([{"key1": ["val1", "val2"]}])
    json_rdd = rdd.map(json.dumps)
    json_rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={
            "es.nodes": "localhost",
            "es.port": "9200",
            "es.resource": "mboyd/mboydtype",
            "es.input.json": "true"
        }
    )
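If you want to sanity-check the write afterwards, you can query the index back. This check is mine, not part of the original answer; a minimal sketch using only the standard library, assuming Python 3 and the same localhost:9200 node and mboyd/mboydtype resource as in the conf above:

    import json
    from urllib.request import urlopen

    # Fetch a few documents back from the index written above.
    with urlopen('http://localhost:9200/mboyd/mboydtype/_search') as resp:
        print(json.dumps(json.load(resp), indent=2))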
Just had this problem, and the solution I landed on was converting all lists to tuples before writing. Converting to JSON does the same thing.
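A minimal sketch of that approach, using the same RDD as in the question; the to_tuples helper is hypothetical, not from any library:

    def to_tuples(value):
        # Recursively replace lists with tuples in nested structures,
        # leaving dict keys and scalar values untouched.
        if isinstance(value, list):
            return tuple(to_tuples(v) for v in value)
        if isinstance(value, dict):
            return {k: to_tuples(v) for k, v in value.items()}
        return value

    rdd = sc.parallelize([{"key1": ["val1", "val2"]}])
    tuple_rdd = rdd.map(to_tuples)  # {"key1": ("val1", "val2")}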
I feel there are a few points missing in the other answers, like the fact that you'll have to return a 2-tuple from your RDD (presumably because saveAsNewAPIHadoopFile expects an RDD of key-value pairs), and that you'll also need the Elasticsearch Hadoop jar file to make it work. So I'll write out the whole process I had to follow to make it work.
Download the Elasticsearch Hadoop jar file. You can download it from the Maven Central repository (the latest version should work in most cases; check their official requirements README for details).
Create a file run.py with the following minimal code snippet for the demonstration:

    import json

    import pymongo_spark
    # pymongo_spark comes from the mongo-hadoop connector; it is only
    # needed if you read from MongoDB and can be dropped otherwise.
    pymongo_spark.activate()

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName('demo').setMaster('local')
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize([{"key1": ["val1", "val2"]}])

    # Serialize each record to JSON, then wrap it in a 2-tuple:
    # the key is a placeholder and the JSON string is the value.
    final_rdd = rdd.map(json.dumps).map(lambda x: ('key', x))
    final_rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={
            "es.nodes" : "<server-ip>",
            "es.port" : "9200",
            "es.resource" : "index_name/doc_type_name",
            "es.input.json": "true"
        }
    )
Run your Spark job with the following command:

    ./bin/spark-submit \
        --jars /path/to/your/jar/file/elasticsearch-hadoop-5.6.4.jar \
        --driver-class-path /path/to/your/jar/file/elasticsearch-hadoop-5.6.4.jar \
        --master yarn \
        /path/to/your/run/file/run.py
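Once the job finishes, you can read the documents back through the same connector to confirm the round trip. This verification step is not part of the original process; it's a sketch under the same assumptions (same jar on the classpath, same <server-ip> and index_name/doc_type_name placeholders):

    # Read the index back through es-hadoop to verify the write.
    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={
            "es.nodes" : "<server-ip>",
            "es.port" : "9200",
            "es.resource" : "index_name/doc_type_name"
        }
    )
    # Each element is a (doc_id, document) pair.
    print(es_rdd.take(1))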
HTH!