
What is the default MapReduce join used by Apache Hive?


The 'default' join would be the shuffle join, a.k.a. the common join. See JoinOperator.java. It relies on the M/R shuffle to partition the data, and the join is done on the reduce side. Since the shuffle involves a full size-of-data copy, it is slow.
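As a rough sketch (table and column names like orders and customers are made up for illustration), you can force the common join by turning off automatic map-join conversion, so the join stays on the reduce side:

    -- Disable automatic map-join conversion; Hive falls back to the
    -- common (shuffle) join and does the join in the reducers.
    SET hive.auto.convert.join=false;

    SELECT o.order_id, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id;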

A much better option is the MapJoin, see MapJoinOperator.java. This works if you have only one big table and one or more small tables to join against (e.g. a typical star schema). The small tables are scanned first, a hash table is built and uploaded into the HDFS cache, and then the M/R job is launched, which only needs to split one table (the big table). It is much more efficient than the shuffle join, but it requires the small table(s) to fit in the memory of the M/R map tasks. Normally Hive (at least since 0.11) will try to use a MapJoin automatically, but it depends on your configs.
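A sketch of the relevant knobs (the size threshold and the fact/dimension table names are illustrative; exact defaults vary by Hive version):

    -- Let Hive convert the join to a MapJoin when the small side is
    -- below the size threshold.
    SET hive.auto.convert.join=true;
    SET hive.mapjoin.smalltable.filesize=25000000;   -- ~25 MB, illustrative

    -- Older style: an explicit hint naming the small table.
    SELECT /*+ MAPJOIN(d) */ f.sale_id, d.region
    FROM sales_fact f
    JOIN dim_store d ON f.store_id = d.store_id;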

A specialized join is the bucket-sort-merge join, a.k.a. SMB join, see SMBJoinOperator.java. This works if you have two big tables whose bucketing matches on the join key. The M/R job splits can then be arranged so that a map task gets only splits from the two big tables that are guaranteed to overlap on the join key, so the map task can use a hash table to do the join.
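A sketch of what that setup could look like (table names, columns, and the bucket count are made up; the number of buckets has to match on both sides):

    -- Both big tables bucketed and sorted on the join key, same bucket count.
    CREATE TABLE clicks (user_id BIGINT, url STRING)
    CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;

    CREATE TABLE purchases (user_id BIGINT, amount DOUBLE)
    CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS;

    -- Settings that allow Hive to pick the sort-merge-bucket join.
    SET hive.optimize.bucketmapjoin=true;
    SET hive.optimize.bucketmapjoin.sortedmerge=true;
    SET hive.auto.convert.sortmerge.join=true;

    SELECT c.user_id, c.url, p.amount
    FROM clicks c
    JOIN purchases p ON c.user_id = p.user_id;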

There are more details, like skew join support and fallback under out-of-memory conditions, but this should be enough to get you started on investigating your needs.
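For completeness, the skew-join handling is also driven by configs (the key threshold below is illustrative):

    -- Keys that appear more often than the threshold are set aside and
    -- processed in a separate follow-up map join.
    SET hive.optimize.skewjoin=true;
    SET hive.skewjoin.key=100000;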

A very good presentation on the subject of joins is Join Strategies in Hive. Keep in mind that things evolve fast, and a presentation from 2011 is a bit outdated.


Do an EXPLAIN on the Hive query and you can see the execution plan.
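For example (the query shape is illustrative), the plan output shows which join operator Hive picked:

    -- The plan shows whether a Map Join, Common (shuffle) Join, or
    -- SMB Join operator will be used for this query.
    EXPLAIN
    SELECT o.order_id, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id;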