Hadoop ORC file - How it works - How to fetch metadata

hadoop hive file-format orc

1. and 2. Use Hive and/or HCatalog to create, read, and update the ORC table structure in the Hive metastore (HCatalog is just a side door that enables Pig/Sqoop/Spark/whatever to access the metastore directly).

2. The ALTER TABLE command lets you add/drop columns whatever the storage type, ORC included. But beware of a nasty bug that may crash vectorized reads after that (at least in V0.13 and V0.14).

3. and 4. The term "index" is rather inappropriate. Basically it's just min/max information persisted in the stripe footer at write time, then used at read time to skip all stripes that clearly cannot meet the WHERE requirements, drastically reducing I/O in some cases (a trick that has become popular in column stores, e.g. InfoBright on MySQL, but also in Oracle Exadata appliances [dubbed "smart scan" by Oracle marketing]).

5. Hive works with "row store" formats (Text, SequenceFile, Avro) and "column store" formats (ORC, Parquet) alike. The optimizer just uses specific strategies and shortcuts on the initial Map phase -- e.g. stripe elimination, vectorized operators -- and of course the serialization/deserialization phases are a bit more elaborate with column stores.
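
To make the min/max trick from points 3 and 4 concrete, here is a minimal sketch in Python. It is a toy model, not ORC's actual reader: the Stripe class, write_stripes and scan_greater_than are hypothetical names invented for illustration, and real stripes hold far more rows and more statistics.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Toy stand-in for an ORC stripe: the rows plus min/max kept in its footer."""
    rows: List[int]
    min_value: int
    max_value: int


def write_stripes(values: List[int], stripe_size: int) -> List[Stripe]:
    """Split values into stripes and record min/max for each one at write time."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Evaluate WHERE col > threshold; return (matches, number of stripes read)."""
    matches = []
    stripes_read = 0
    for stripe in stripes:
        # The footer alone proves that no row in this stripe can match: skip it.
        if stripe.max_value <= threshold:
            continue
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v > threshold)
    return matches, stripes_read


# Sorted data: each stripe covers a narrow range, so most stripes are skipped.
sorted_values = list(range(100))
sorted_matches, sorted_reads = scan_greater_than(write_stripes(sorted_values, 10), 84)

# Interleaved data: every stripe spans almost the full range, so nothing is skipped.
mixed_values = [(i * 10) % 100 + i // 10 for i in range(100)]
mixed_matches, mixed_reads = scan_greater_than(write_stripes(mixed_values, 10), 84)

print(sorted_reads, mixed_reads)
```

Both scans return the same 15 matching rows, but the sorted layout reads only 2 of the 10 stripes while the interleaved layout reads all 10. That is why the I/O reduction only shows up "in some cases": it depends on how the data was laid out at write time.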

Hey, I can't help you with all of your questions, but I'll give it a try.

You can use the file dump utility to read out the metadata of an ORC file (hive --orcfiledump followed by the path to the file).
I am very unsure about schema evolution, but as far as I know ORC does not support it.
The ORC index stores sum, min and max, so if your data is totally unstructured (not sorted or clustered on the column you filter on) you would probably still have to read a lot of data. But since the latest release of ORC you can enable an additional Bloom filter, which is more accurate in row group elimination. The orc-user mailing list may be helpful here too.
ORC provides an index for every column, but it's just a lightweight index. It stores min/max, and sum on numeric columns, in the file footer, in the stripe footer and, by default, every 10000 rows, so it does not take up much space.
If you store your table in the ORC file format, Hive will use a specific ORC RecordReader to extract the rows from the columns. The advantage of columnar storage is that you do not have to read the whole row, only the columns the query needs.
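
As a rough illustration of why a Bloom filter can eliminate row groups that min/max cannot, here is a small Python sketch. It is an assumption-laden toy: the BloomFilter class, build_row_groups and candidate_groups are hypothetical names, the SHA-256 based hashing is chosen for simplicity and is not what ORC uses, and the row groups hold 4 rows instead of the default 10000. In Hive the real feature is switched on per column through the orc.bloom.filter.columns table property.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_row_groups(values, rows_per_group):
    """Record min, max, sum and a Bloom filter for every group of rows."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        bloom = BloomFilter()
        for v in chunk:
            bloom.add(v)
        groups.append({"rows": chunk, "min": min(chunk), "max": max(chunk),
                       "sum": sum(chunk), "bloom": bloom})
    return groups


def candidate_groups(groups, target, use_bloom):
    """Indexes of row groups that must be read for WHERE col = target."""
    selected = []
    for index, group in enumerate(groups):
        if not group["min"] <= target <= group["max"]:
            continue  # ruled out by min/max alone
        if use_bloom and not group["bloom"].might_contain(target):
            continue  # ruled out by the Bloom filter
        selected.append(index)
    return selected


# Unsorted data: every group spans roughly 0..1000, so min/max cannot rule out 500.
values = [0, 1000, 5, 7,   2, 999, 500, 3,   1, 998, 4, 6]
groups = build_row_groups(values, rows_per_group=4)

by_min_max = candidate_groups(groups, 500, use_bloom=False)
by_bloom = candidate_groups(groups, 500, use_bloom=True)
print(by_min_max, by_bloom)
```

With min/max alone all three row groups remain candidates. With the Bloom filter, the group that really contains 500 is always kept, and the other two are dropped unless they happen to hit a false positive, which is unlikely at this filter size.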
