Generate metadata for parquet files

Ok, so here is the drill: the metadata can be accessed directly using the Parquet tools. You'll need to get the footers of your parquet file first:

```scala
import scala.collection.JavaConverters.{collectionAsScalaIterableConverter, mapAsScalaMapConverter}
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

val conf = spark.sparkContext.hadoopConfiguration

def getFooters(conf: Configuration, path: String) = {
  val fs = FileSystem.get(conf)
  val footers = ParquetFileReader.readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  footers
}
```

Now you can extract your file metadata as follows:

```scala
def getFileMetadata(conf: Configuration, path: String) = {
  getFooters(conf, path)
    .asScala
    .map(_.getParquetMetadata.getFileMetaData.getKeyValueMetaData.asScala)
}
```

With that helper in place, you can read the metadata of your parquet file:

```scala
getFileMetadata(conf, "/tmp/foo").headOption

// Option[scala.collection.mutable.Map[String,String]] =
//   Some(Map(org.apache.spark.sql.parquet.row.metadata ->
//     {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{"foo":"bar"}},
//     {"name":"txt","type":"string","nullable":true,"metadata":{}}]}))
```
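For context, the `foo -> bar` entry shown above is column-level metadata attached when the data was written. A minimal sketch of producing such a file with Spark (assuming an active `SparkSession` named `spark`; the path `/tmp/foo` is just an example):

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Attach custom metadata to the `id` column; Spark serializes it into the
// org.apache.spark.sql.parquet.row.metadata key in the Parquet footer.
val df = spark.range(5).toDF("id")
val withMeta = df.withColumn(
  "id",
  df("id").as("id", new MetadataBuilder().putString("foo", "bar").build())
)
withMeta.write.mode("overwrite").parquet("/tmp/foo")
```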

We can also use the extracted footers to write a standalone metadata file when needed:

```scala
import org.apache.parquet.hadoop.ParquetFileWriter

def createMetadata(conf: Configuration, path: String) = {
  val footers = getFooters(conf, path)
  ParquetFileWriter.writeMetadataFile(conf, new Path(path), footers)
}
```

I hope this answers your question. You can read more about Spark DataFrames and metadata in awesome-spark's spark-gotchas repo.