How to read specific fields from an Avro-Parquet file in Java?
A couple of things:
- AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$) sets a projection, so that only the columns in ClassB's schema are read.
- The reader.read() method will still return a ClassA object, but the fields that are not present in ClassB will be null.
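To make the second point concrete, here is a minimal plain-Java sketch of the behaviour described above. It does not call Parquet or Avro at all: the class ProjectionSketch, the method project, and the field names id, name and payload are all hypothetical, and maps stand in for the ClassA and ClassB records. It only models the outcome (the full record shape comes back, with non-projected fields set to null), not how the library does it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Plain-Java model of the projection behaviour; it does not use Parquet or Avro.
class ProjectionSketch {

    // Returns a copy of fullRecord that keeps every field name (the "ClassA" shape)
    // but sets each field missing from projectedFields (the "ClassB" fields) to null.
    static Map<String, Object> project(Map<String, Object> fullRecord, Set<String> projectedFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : fullRecord.entrySet()) {
            boolean requested = projectedFields.contains(field.getKey());
            result.put(field.getKey(), requested ? field.getValue() : null);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stored record with three fields.
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("id", 1L);
        stored.put("name", "alice");
        stored.put("payload", "large blob");

        // Hypothetical projection containing only id and name.
        Map<String, Object> read = project(stored, Set.of("id", "name"));
        System.out.println(read); // prints {id=1, name=alice, payload=null}
    }
}
```

So any code that consumes the projected records has to tolerate nulls in the fields that were left out of the projection.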
To use the reader directly you can do the following:
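    // Request only the columns defined in ClassB's schema.
    AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$);

    // The builder is typed to ClassA because the reader still returns ClassA records.
    final ParquetReader.Builder<ClassA> builder = AvroParquetReader.builder(files[0].getPath());
    final ParquetReader<ClassA> reader = builder.withConf(hadoopConf).build();

    ClassA record = null;
    final List<ClassA> list = new ArrayList<>();
    while ((record = reader.read()) != null) {
        list.add(record);
    }
    reader.close();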
Also, if you're planning to use an InputFormat to read the Avro-Parquet file, there is a convenience method, AvroParquetInputFormat.setRequestedProjection. Here is a Spark example:
    final Job job = Job.getInstance(hadoopConf);
    ParquetInputFormat.setInputPaths(job, pathGlob);
    AvroParquetInputFormat.setRequestedProjection(job, ClassB.SCHEMA$);

    @SuppressWarnings("unchecked")
    final JavaPairRDD<Void, ClassA> rdd = sc.newAPIHadoopRDD(job.getConfiguration(),
            AvroParquetInputFormat.class, Void.class, ClassA.class);