How to read specific fields from Avro-Parquet file in Java?


So...

A couple of things:

  • AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$) sets a projection, so only the columns named in ClassB's schema are read.
  • The reader.read() method will still return a ClassA object, but the fields that are not present in ClassB will be null.

To use the reader directly you can do the following:

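    // Ask Parquet to materialize only the columns declared in ClassB's schema.
    AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$);

    // The reader still produces ClassA objects, so the builder is typed on ClassA.
    final ParquetReader.Builder<ClassA> builder = AvroParquetReader.builder(files[0].getPath());
    final List<ClassA> list = new ArrayList<>();
    try (ParquetReader<ClassA> reader = builder.withConf(hadoopConf).build()) {
        ClassA record;
        // read() returns null once the end of the file is reached.
        while ((record = reader.read()) != null) {
            list.add(record);
        }
    }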
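If there is no generated ClassB to supply the projection schema, one option is to derive it at runtime from the full schema. The following is only a sketch under assumptions: the class name ProjectionSketch and the helpers projectionOf and readProjected are hypothetical names introduced here, the records are read as GenericRecord rather than as ClassA, and avro, parquet-avro and hadoop-common are assumed to be on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroReadSupport;
import org.apache.parquet.hadoop.ParquetReader;

public class ProjectionSketch {

    /** Hypothetical helper: builds a record schema with only the named fields of 'full'. */
    static Schema projectionOf(Schema full, String... fieldNames) {
        List<Schema.Field> kept = new ArrayList<>();
        for (String name : fieldNames) {
            Schema.Field f = full.getField(name);
            if (f == null) {
                throw new IllegalArgumentException("No such field: " + name);
            }
            // A Field instance belongs to one schema, so create a copy for the projection.
            kept.add(new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
        }
        // Same name and namespace as the full record, but fewer fields.
        Schema projection =
                Schema.createRecord(full.getName(), full.getDoc(), full.getNamespace(), false);
        projection.setFields(kept);
        return projection;
    }

    /** Hypothetical helper: reads every record of 'file', requesting only the projected columns. */
    static List<GenericRecord> readProjected(Path file, Schema projection) throws IOException {
        Configuration conf = new Configuration();
        AvroReadSupport.setRequestedProjection(conf, projection);

        List<GenericRecord> out = new ArrayList<>();
        try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(file)
                .withDataModel(GenericData.get()) // generic records, no generated classes needed
                .withConf(conf)
                .build()) {
            GenericRecord record;
            // read() returns null at end of file.
            while ((record = reader.read()) != null) {
                out.add(record);
            }
        }
        return out;
    }
}
```

A call such as readProjected(path, projectionOf(ClassA.SCHEMA$, "id", "name")) would then request just those two columns; the field names "id" and "name" are placeholders, not fields known to exist in ClassA. In line with the behaviour described above, any other fields of the returned records should be expected to be null.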

Also, if you plan to use an InputFormat to read the Avro-Parquet file, there is a convenience method for setting the projection. Here is a Spark example:

        final Job job = Job.getInstance(hadoopConf);
        ParquetInputFormat.setInputPaths(job, pathGlob);
        AvroParquetInputFormat.setRequestedProjection(job, ClassB.SCHEMA$);
        @SuppressWarnings("unchecked")
        final JavaPairRDD<Void, ClassA> rdd = sc.newAPIHadoopRDD(job.getConfiguration(), AvroParquetInputFormat.class,
                Void.class, ClassA.class);