Flexible schema possible with ORC or Parquet format?

I think you may be trying to fit a square peg into a round hole. It sounds like you are ingesting a stream of events with an unknown schema, but you would like to store them in a format that is optimized for a known schema.

One option is to buffer a fixed number of events (say, one million) while tracking the schema as it evolves, then flush the buffer to a file once that count is reached and start buffering again. The drawback is that each file ends up with a different schema, which makes it impractical to query data across multiple files. See the sketch below.
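For illustration, here is a minimal sketch of that buffer-and-flush approach in Python using pyarrow, assuming events arrive as dicts; the batch size and file naming are placeholders, not part of the original question:

```python
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 1_000_000  # illustrative flush threshold
buffer = []
file_index = 0

def ingest(event: dict):
    """Buffer one event; flush a Parquet file when the batch is full."""
    global file_index
    buffer.append(event)
    if len(buffer) >= BATCH_SIZE:
        # pyarrow infers the schema from whatever keys this batch happens
        # to contain, so each flushed file may carry a different schema;
        # that is exactly the drawback described above.
        table = pa.Table.from_pylist(buffer)
        pq.write_table(table, f"events-{file_index:05d}.parquet")
        file_index += 1
        buffer.clear()
```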

A different solution would be to look into schemaless data stores, although you won't get the same price-performance benefits you would from ORC or Parquet on S3.

There are other strategies as well, but your best bet for a long-term solution is to talk to whoever manages the source of the events you are ingesting and find a way to determine the schema up front.
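Once you have that schema, every file can share it. Here is a hedged sketch, again using pyarrow; the field names are invented for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical schema agreed on with the event producer.
KNOWN_SCHEMA = pa.schema([
    ("event_id", pa.string()),
    ("timestamp", pa.timestamp("ms")),
    ("payload", pa.string()),
])

def flush(events: list, path: str):
    # Building the table against the agreed schema means missing keys
    # become nulls, and every output file is readable with one schema
    # across the whole dataset.
    table = pa.Table.from_pylist(events, schema=KNOWN_SCHEMA)
    pq.write_table(table, path)
```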