Hadoop ORC file - How it works - How to fetch metadata Hadoop ORC file - How it works - How to fetch metadata hadoop hadoop

Hadoop ORC file - How it works - How to fetch metadata


1. and 2. Use Hive and/or HCatalog to create, read, update ORC table structure in the Hive metastore (HCatalog is just a side door than enables Pig/Sqoop/Spark/whatever to access the metastore directly)

2. ALTER TABLE command allows to add/drop columns whatever the storage type, ORC included. But beware of a nasty bug that may crash vectorized reads after that (at least in V0.13 and V0.14)

3. and 4. The term "index" is rather inappropriate. Basically it's just min/max information persisted in the stripe footer at write time, then used at read time for skipping all stripes that are clearly not meeting the WHERE requirements, drastically reducing I/O in some cases (a trick that has become popular in columns stores e.g. InfoBright on MySQL, but also in Oracle Exadata appliances [dubbed "smart scan" by Oracle marketing])

5. Hive works with "row store" formats (Text, SequenceFile, AVRO) and "column store" formats (ORC, Parquet) alike. The optimizer just uses specific strategies and shortcuts on the initial Map phase -- e.g. stripe elimination, vectorized operators -- and of course the serialization/deserialization phases are a bit more elaborate with column stores.


Hey i can not help you with all of your questions but i'll give it a try

  1. you can use the filedump utility to read out the metadata of an ORC-file see here

  2. I am very unsure about the schema evolution but as far as i know ORC does not support evolution.

  3. ORC index stores sum min and max so if your data is totally unstructured you probably would still have to read a lot of data. But since the latest release of ORC you can anable an additional Bloom-Filter which is more accurate in row group elimination. Maybe this could be helpful too orc-user mailing list

  4. ORC provides an index for every column but it's just a light weight index. You store information about min/max and sum on numeric columns in the filefooter, stripefooter and by default every 10000 rows. so it does not take that much space

  5. If you store your table in Orc Fileformat Hive will use an specific ORC Recordreader to extract the rows from the columns. The advantage of columnar storage is that you do not have to read the whole row