How to reduce number of output files in Apache Hive

Limiting the number of output files means limiting the number of reducers. You can do that with the mapred.reduce.tasks property from the Hive shell. Example:

hive>  set mapred.reduce.tasks = 5;
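A minimal sketch of applying this to a single query (the table and the DISTRIBUTE BY trick are illustrative assumptions, not part of the original answer):

```sql
-- Cap this session at 5 reducers, so the query writes at most 5 output files.
set mapred.reduce.tasks = 5;

-- Hypothetical compaction query: rewriting a table through a reduce phase.
-- DISTRIBUTE BY rand() forces a shuffle and spreads rows roughly evenly
-- across the 5 reducers, so the 5 output files end up similar in size.
INSERT OVERWRITE TABLE sales_compacted
SELECT *
FROM sales
DISTRIBUTE BY rand();
```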

But it might affect the performance of your query. Alternatively, you can use the getmerge command from the HDFS shell once your query has finished. This command takes a source directory and a destination file as input and concatenates the files in <src> into a single local destination file.

Usage :

bin/hadoop fs -getmerge <src> <localdst>
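For instance, to collapse the part-files a query left behind into one file (the paths below are hypothetical):

```shell
# Concatenate every part-file in the table directory into one local file.
bin/hadoop fs -getmerge /user/hive/warehouse/mydb.db/sales merged_sales.txt

# If the merged result should live on HDFS again, upload it back.
bin/hadoop fs -put merged_sales.txt /user/hive/output/merged_sales.txt
```

Note that getmerge writes to the local filesystem, so there must be enough local disk for the combined output.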

HTH


See https://community.cloudera.com/t5/Support-Questions/Hive-Multiple-Small-Files/td-p/204038

set hive.merge.mapfiles=true;           -- Merge small files at the end of a map-only job.
set hive.merge.mapredfiles=true;        -- Merge small files at the end of a map-reduce job.
set hive.merge.size.per.task=???;       -- Size (bytes) of merged files at the end of the job.
set hive.merge.smallfiles.avgsize=???;  -- File size (bytes) threshold.
-- When the average output file size of a job is less than this number,
-- Hive will start an additional map-reduce job to merge the output files
-- into bigger files. This is only done for map-only jobs if hive.merge.mapfiles
-- is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
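Filling in the placeholders, a session might look like the following sketch. The sizes are illustrative assumptions, not recommendations; tune them to your block size and workload:

```sql
-- Enable output-file merging for both map-only and map-reduce jobs.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;

-- Aim for merged files of roughly 256 MB each.
set hive.merge.size.per.task=256000000;

-- Trigger the extra merge job only when the average output file
-- is smaller than roughly 16 MB.
set hive.merge.smallfiles.avgsize=16000000;
```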