What does c000 mean in c000.snappy.parquet or c000.snappy.orc?

Could anyone please explain what c000 means in c000.snappy.parquet or c000.snappy.orc?


Follow the "Talk is cheap, show me the code" approach. Not everything is documented, and sometimes the only way to find the answer is to read the code.

Consider a file name of the form part-1-2_3-4.parquet, where each number stands for one of the items below:

  1. Split/partition number.

  2. Random UUID, to prevent collisions between different (appending) write jobs.

  3. Unique job/task ID (not always included).

  4. The "c" stands for count. This is a file counter: the number of files already written for this specific partition. It exists because the number of records written to a single file can be capped; when the cap is reached, a new file is started and the counter goes up by one. The value starts at 0, which is why you see c000.

I found it based on this code and this code.
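
To make the role of the counter concrete, here is a minimal Python sketch of the naming scheme described in the list. It is a simplified model, not the actual writer code: the function name `part_file_names`, its parameters, the zero-padding widths, and the sample UUID are all assumptions chosen for illustration, and the optional job/task ID (item 3) is left out.

```python
def part_file_names(split, num_records, max_records_per_file, job_uuid,
                    ext=".snappy.parquet"):
    """Return the file names one task would produce for one partition.

    Simplified model (assumption): a new file is started whenever the
    current file already holds max_records_per_file records. A value of
    0 or less means "no limit", so only the c000 file is produced.
    """
    def name(counter):
        # part-<split>-<uuid>-c<counter><extension>
        return f"part-{split:05d}-{job_uuid}-c{counter:03d}{ext}"

    file_counter = 0
    records_in_file = 0
    names = [name(file_counter)]  # the first file always has counter 0

    for _ in range(num_records):
        if max_records_per_file > 0 and records_in_file >= max_records_per_file:
            # Cap reached: roll over to a new file and bump the counter.
            file_counter += 1
            records_in_file = 0
            names.append(name(file_counter))
        records_in_file += 1

    return names


# Hypothetical UUID, used only to make the output readable.
sample_uuid = "0f1e2d3c-4b5a-4978-8a6b-5c4d3e2f1a0b"
for file_name in part_file_names(3, 5, 2, sample_uuid):
    print(file_name)
```

With 5 records and a cap of 2 records per file, this model yields three files ending in c000, c001 and c002. With no cap, only the c000 file appears, which matches what most people see in their output directories.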