Are Hive's implicit joins always inner joins? Are Hive's implicit joins always inner joins? hadoop hadoop

Are Hive's implicit joins always inner joins?


Not always. Your queries are equivalent. But without WHERE t1.id = t2.id AND t2.id = t3.id it will be CROSS JOIN.

Update:

This is interesting question and I decided to add some demo. Let's create two tables:

A(c1 int, c2 string) and B(c1 int, c2 string).

Load data:

insert into table A select 1, 'row one' union all select 2, 'row two';insert into table B select 1, 'row one' union all select 3, 'row three';

Check data:

hive> select * from A;OK1       row one2       row twoTime taken: 1.29 seconds, Fetched: 2 row(s)hive> select * from B;OK1       row one3       row threeTime taken: 0.091 seconds, Fetched: 2 row(s)

Check cross join (implicit join without where transformed to cross):

hive> select a.c1, a.c2, b.c1, b.c2 from a,b;Warning: Map Join MAPJOIN[14][bigTable=a] in task 'Stage-3:MAPRED' is a cross productWarning: Map Join MAPJOIN[22][bigTable=b] in task 'Stage-4:MAPRED' is a cross productWarning: Shuffle Join JOIN[4][tables = [a, b]] in Stage 'Stage-1:MAPRED' is a cross productOK1       row one 1       row one2       row two 1       row one1       row one 3       row three2       row two 3       row threeTime taken: 54.804 seconds, Fetched: 4 row(s)

Check inner join (implicit join with where works as INNER):

hive> select a.c1, a.c2, b.c1, b.c2 from a,b where a.c1=b.c1;OK1       row one 1       row oneTime taken: 38.413 seconds, Fetched: 1 row(s)

Try to perform left join by adding OR b.c1 is null to the where:

hive> select a.c1, a.c2, b.c1, b.c2 from a,b where (a.c1=b.c1) OR (b.c1 is null);OK1       row one 1       row oneTime taken: 57.317 seconds, Fetched: 1 row(s)

As you can see we got inner join again. or b.c1 is null is ignored

Now left join without where and ON clause (transformed to CROSS):

 select a.c1, a.c2, b.c1, b.c2 from a left join b;OK1       row one 1       row one1       row one 3       row three2       row two 1       row one2       row two 3       row threeTime taken: 37.104 seconds, Fetched: 4 row(s)

As you can see we got cross again.

Try left join with where clause and without ON (works as INNER):

select a.c1, a.c2, b.c1, b.c2 from a left join b where a.c1=b.c1;OK1       row one 1       row oneTime taken: 40.617 seconds, Fetched: 1 row(s)

We got INNER join

Try left join with where clause and without ON+ try to allow nulls:

 select a.c1, a.c2, b.c1, b.c2 from a left join b where a.c1=b.c1 or b.c1 is null;OK1       row one 1       row oneTime taken: 53.873 seconds, Fetched: 1 row(s)

Again got INNER. or b.c1 is null is ignored.

Left join with on clause:

hive> select a.c1, a.c2, b.c1, b.c2 from a left join b on a.c1=b.c1;OK1       row one 1       row one2       row two NULL    NULLTime taken: 48.626 seconds, Fetched: 2 row(s)

Yes, it is true left join.

Left join with on + where (got INNER):

hive> select a.c1, a.c2, b.c1, b.c2 from a left join b on a.c1=b.c1 where a.c1=b.c1;OK1       row one 1       row oneTime taken: 49.54 seconds, Fetched: 1 row(s)

We got INNER because WHERE does not allow NULLS.

Left join with where + allow nulls:

hive> select a.c1, a.c2, b.c1, b.c2 from a left join b on a.c1=b.c1 where a.c1=b.c1 or b.c1 is null;OK1       row one 1       row one2       row two NULL    NULLTime taken: 55.951 seconds, Fetched: 2 row(s)

Yes, it is left join.

Conclusion:

  1. Implicit join works as INNNER (with where) or CROSS if without WHERE clause.
  2. Left join can work as CROSS if without ON and without WHERE, can also work as INNER if WHERE clause does not allows nullsfor right table.
  3. Better use ANSI syntax because it is self-explaining and it is easy to understand what do you expect it to work like. Implicit joins or left joins working as INNER or CROSS are difficult to understand and very prone to error.