
Hadoop File Splits : CompositeInputFormat : Inner Join


Unfortunately, CompositeInputFormat has to ignore the block/split size. CompositeInputFormat requires its input files to be sorted and partitioned identically on the join key, so Hadoop cannot pick arbitrary split points: any split boundary would have to fall at the same key in every input, and the framework has no way to guarantee that.
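For context, a typical map-side inner join with CompositeInputFormat looks roughly like the following. This is a minimal sketch using the Hadoop 2.x mapreduce API; the paths /data/left, /data/right, and /data/joined are hypothetical placeholders for two pre-sorted, identically partitioned datasets and an output directory.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InnerJoinSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-side inner join");
        job.setJarByClass(InnerJoinSetup.class);
        job.setInputFormatClass(CompositeInputFormat.class);
        // The join expression names the operation ("inner"), the underlying
        // input format, and the two pre-sorted, identically partitioned inputs.
        job.getConfiguration().set(CompositeInputFormat.JOIN_EXPR,
                CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                        new Path("/data/left"), new Path("/data/right")));
        // Each map() call receives a Text key plus a TupleWritable holding the
        // matching values from both inputs. A real job would set its own mapper
        // to unpack the tuple; here the default identity mapper passes it through.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TupleWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/joined"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```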

The only way around this is to split and partition the files manually into smaller pieces. You can do this by passing the data through a MapReduce job (an identity mapper and identity reducer will do) with a larger number of reducers; each reducer then emits one sorted output file. Just be sure to pass both of your data sets through with the same number of reducers so they stay partitioned identically, as in the sketch below.
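Here is a minimal sketch of such an identity pass, again assuming the Hadoop 2.x mapreduce API and tab-separated Text key/value records. Run it once per dataset with the same reducer count (32 here is an arbitrary example); with the default HashPartitioner, both datasets come out partitioned and sorted identically on the key.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Repartition {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "identity repartition");
        job.setJarByClass(Repartition.class);
        // The base Mapper and Reducer classes pass records through unchanged,
        // so the job's only effect is the shuffle: records are hash-partitioned
        // into N reducers and sorted by key within each partition.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setNumReduceTasks(32);   // use the SAME N for both datasets
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

After both datasets have been run through this job, the resulting part files can be fed to CompositeInputFormat as shown earlier, and the join will get one map task per (smaller) partition instead of one per original file.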