How can I load Avros in Spark using the schema on-board the Avro file(s)?



To answer my own question:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.JobConf

    // spark-shell -usejavacp -classpath "*.jar"

    val input = "hdfs://hivecluster2/securityx/web_proxy_mef/2014/05/29/22/part-m-00016.avro"
    val jobConf = new JobConf(sc.hadoopConfiguration)
    val rdd = sc.hadoopFile(
      input,
      classOf[AvroInputFormat[GenericRecord]],
      classOf[AvroWrapper[GenericRecord]],
      classOf[NullWritable],
      10)
    val f1 = rdd.first
    val a = f1._1.datum
    a.get("rawLog") // Access Avro fields by name


This works for me:

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
    import org.apache.hadoop.io.NullWritable
    ...
    val path = "hdfs:///path/to/your/avro/folder"
    val avroRDD = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](path)
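Once you have the `RDD[(AvroWrapper[GenericRecord], NullWritable)]`, you unwrap each record with `.datum` and read fields by name. A rough sketch (the field name `"rawLog"` is taken from the first answer above and depends entirely on your schema):

    // Each pair is (AvroWrapper[GenericRecord], NullWritable); the payload is wrapper.datum
    val fields = avroRDD.map { case (wrapper, _) =>
      val record = wrapper.datum
      record.get("rawLog").toString // substitute a field that exists in your schema
    }
    fields.take(5).foreach(println)

Note that `GenericRecord.get` returns `Object` (Avro strings come back as `Utf8`), so call `.toString` before comparing or writing them out.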