
How to output multiple s3 files in Parquet


In a typical Map-Reduce application, the number of output files equals the number of reduce tasks in your job. So if you want multiple output files, set the number of reduce tasks accordingly:

job.setNumReduceTasks(N);

or alternatively via the system property:

-Dmapreduce.job.reduces=N
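
Which reducer (and therefore which output file) each record lands in is decided by the partitioner; Hadoop's default `HashPartitioner` computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. A minimal pure-Python sketch of that scheme (the `partition` helper and the sample keys are illustrative, not part of the Hadoop API):

```python
def partition(key: str, num_reduces: int) -> int:
    # Mirrors Hadoop's default HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (hash(key) & 0x7FFFFFFF) % num_reduces

N = 4  # corresponds to job.setNumReduceTasks(4)
keys = ["user-1", "user-2", "user-3", "user-1"]

# Group keys by the reducer (and thus the output file) they would go to.
files = {}
for k in keys:
    files.setdefault(partition(k, N), []).append(k)

assert len(files) <= N  # never more output files than reduce tasks
```

Identical keys always hash to the same reducer, which is why every value for a given key ends up in the same output file.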

I don't think it is possible to have one column per file with the Parquet format. A Parquet file is first split into row groups, and only within each row group is the data then split into column chunks, so every column always lives alongside the other columns of its row group rather than in a file of its own.

See the Parquet format documentation for details on the file layout.