Generate metadata for parquet files

Ok, so here is the drill: the metadata can be accessed directly using the Parquet tools. You'll need to get the footers of your parquet file first:

```scala
import scala.collection.JavaConverters.{collectionAsScalaIterableConverter, mapAsScalaMapConverter}
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

val conf = spark.sparkContext.hadoopConfiguration

def getFooters(conf: Configuration, path: String) = {
  val fs = FileSystem.get(conf)
  val footers = ParquetFileReader.readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  footers
}
```

Now you can extract your file metadata as follows:

```scala
def getFileMetadata(conf: Configuration, path: String) = {
  getFooters(conf, path)
    .asScala
    .map(_.getParquetMetadata.getFileMetaData.getKeyValueMetaData.asScala)
}
```

With that helper in place, you can read the metadata of your parquet file:

```scala
getFileMetadata(conf, "/tmp/foo").headOption

// Option[scala.collection.mutable.Map[String,String]] =
//   Some(Map(org.apache.spark.sql.parquet.row.metadata ->
//     {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{"foo":"bar"}},
//     {"name":"txt","type":"string","nullable":true,"metadata":{}}]}))
```
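For context, the `foo -> bar` entry shown above is column-level metadata attached when the data was written. A minimal sketch of producing such a file with Spark (assuming an active `SparkSession` named `spark`; the path `/tmp/foo` is just an example):

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Attach custom metadata to the `id` column; Spark serializes it into the
// org.apache.spark.sql.parquet.row.metadata key in the Parquet footer.
val df = spark.range(5).toDF("id")
val withMeta = df.withColumn(
  "id",
  df("id").as("id", new MetadataBuilder().putString("foo", "bar").build())
)
withMeta.write.mode("overwrite").parquet("/tmp/foo")
```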

We can also use the extracted footers to write a standalone metadata file when needed:

```scala
import org.apache.parquet.hadoop.ParquetFileWriter

def createMetadata(conf: Configuration, path: String) = {
  val footers = getFooters(conf, path)
  ParquetFileWriter.writeMetadataFile(conf, new Path(path), footers)
}
```

I hope this answers your question. You can read more about Spark DataFrames and metadata in awesome-spark's spark-gotchas repo.