
Reading Nested data from ElasticSearch via Spark Scala


Off the top of my head, this error occurs when the schema guessed by the Spark/Elasticsearch connector is not actually compatible with the data being read.

Keep in mind that Elasticsearch is schemaless, while Spark SQL has a "hard" schema. Bridging this gap is not always possible, so it's all just a best effort.

When connecting the two, the connector samples the documents and tries to guess a schema: "field A is a string, field B is an object with two subfields: B.1 is a date, B.2 is an array of strings", and so on.

If it guessed wrong (typically, a given column or subcolumn is guessed as a String, but in some documents it is in fact an array or a number), then the JSON-to-Spark SQL conversion emits this kind of error.

In the words of the documentation:

Elasticsearch treats fields with single or multi-values the same; in fact, the mapping provides no information about this. As a client, it means one cannot tell whether a field is single-valued or not until is actually being read. In most cases this is not an issue and elasticsearch-hadoop automatically creates the necessary list/array on the fly. However in environments with strict schema such as Spark SQL, changing a field actual value from its declared type is not allowed. Worse yet, this information needs to be available even before reading the data. Since the mapping is not conclusive enough, elasticsearch-hadoop allows the user to specify the extra information through field information, specifically es.read.field.as.array.include and es.read.field.as.array.exclude.

So I'd advise you to check whether the schema you reported in your question (the schema guessed by Spark) is actually valid against all your documents.
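As a quick sanity check, you can load the index and print the schema the connector inferred, then compare it against a few raw documents. A minimal sketch, assuming a hypothetical index named "my-index" and a local cluster:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("es-schema-check")
  .config("es.nodes", "localhost")   // hypothetical cluster coordinates
  .config("es.port", "9200")
  .getOrCreate()

// Let the connector sample documents and guess a schema, then inspect it
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .load("my-index")                  // hypothetical index name

df.printSchema()
```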

If it's not, you have a few options going forward:

  1. Correct the mapping individually. If the problem is linked to an array type not being recognized as such, you can fix it with configuration options: the es.read.field.as.array.include (resp. .exclude) option actively tells Spark which properties in the documents are arrays (resp. are not arrays). If a field is unused, es.read.field.exclude will exclude it from Spark altogether, bypassing possible schema issues for it (see the first sketch after this list).

  2. If there is no way to provide a valid schema for all cases (e.g. some field is sometimes a number, sometimes a string, and there is no way to tell), then basically you're stuck going back to the RDD level, and, if need be, converting back to a Dataset / DataFrame once the schema is well defined (see the second sketch below).
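For option 1, a minimal sketch of what the read might look like. The field names here ("tags" and "comments.replies" as fields that really are arrays, "legacy_blob" as an unused field to drop) are assumptions for illustration:

```scala
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  // hypothetical fields that are arrays in some documents
  .option("es.read.field.as.array.include", "tags,comments.replies")
  // hypothetical unused field that keeps breaking the schema: drop it entirely
  .option("es.read.field.exclude", "legacy_blob")
  .load("my-index")                  // hypothetical index name

df.printSchema()
```

For option 2, a sketch of dropping down to the RDD level and rebuilding a DataFrame with an explicit schema; it reuses the `spark` session from above, and the field name and normalization logic are assumptions:

```scala
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import org.elasticsearch.spark._     // adds esRDD to the SparkContext

// Each element is (document id, Map[String, AnyRef]); no schema is imposed yet
val raw = spark.sparkContext.esRDD("my-index")

// Normalize the ambiguous field by hand (here: force it to a string),
// then impose the schema you actually want
val schema = StructType(Seq(StructField("someField", StringType, nullable = true)))
val rows = raw.map { case (_, doc) =>
  Row(doc.get("someField").map(_.toString).orNull)
}
val cleaned = spark.createDataFrame(rows, schema)
cleaned.printSchema()
```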