Flexible schema possible with ORC or Parquet format?

I think you may be trying to fit a square peg into a round hole. It sounds like you are ingesting a stream of events with an unknown schema, but you would like to store them in a format that is optimized for a known schema.

One option is to buffer a fixed number of events (say, one million) while tracking the schema as it evolves, then flush the buffer to a file once that count is reached and start buffering again. The drawback is that each file ends up with a different schema, which makes it impractical to query data across multiple files. See the sketch below.
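For illustration, here is a minimal sketch of that buffer-and-flush approach in Python using pyarrow, assuming events arrive as dicts; the batch size and file naming are placeholders, not part of the original question:

```python
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 1_000_000  # illustrative flush threshold
buffer = []
file_index = 0

def ingest(event: dict):
    """Buffer one event; flush a Parquet file when the batch is full."""
    global file_index
    buffer.append(event)
    if len(buffer) >= BATCH_SIZE:
        # pyarrow infers the schema from whatever keys this batch happens
        # to contain, so each flushed file may carry a different schema;
        # that is exactly the drawback described above.
        table = pa.Table.from_pylist(buffer)
        pq.write_table(table, f"events-{file_index:05d}.parquet")
        file_index += 1
        buffer.clear()
```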

A different solution would be to look into schemaless data stores, although you won't get the same price-performance benefits you would from ORC or Parquet on S3.

There are other strategies as well, but your best bet for a long-term solution is to talk to whoever manages the source of the events you are ingesting and find a way to determine the schema up front.
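Once you have that schema, every file can share it. Here is a hedged sketch, again using pyarrow; the field names are invented for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema agreed on with the event producer.
KNOWN_SCHEMA = pa.schema([
    ("event_id", pa.string()),
    ("timestamp", pa.timestamp("ms")),
    ("payload", pa.string()),
])

def flush(events: list, path: str):
    # Building the table against the agreed schema means missing keys
    # become nulls, and every output file is readable with one schema
    # across the whole dataset.
    table = pa.Table.from_pylist(events, schema=KNOWN_SCHEMA)
    pq.write_table(table, path)
```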