Join Tables on Date Range in Hive Join Tables on Date Range in Hive hadoop hadoop

Join Tables on Date Range in Hive


RTFM - quoting LanguageManual Joins

Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.

You may try to move the BETWEEN filter to a WHERE clause, resulting in a lousy partially-cartesian-join followed by a post-processing cleanup. Yuck. Depending on the actual cardinality of your "skill group" table, it may work fast - or take whole days.


If your situation allows, do it in two queries.

First with the full join, which can have the range; Then with an outer join, matching on all the columns, but include a where clause for where one of the fields is null.

Ex:

create table tableC asselect a.*, b.skill_group     from tableA a     ,    tableB b     where a.employee_id= b.employee_id       and a.cal_date >= b.date_start       and a.cal_date <= b.date_end;with c as (select * from TableC)insert into tableCselect a.*, cast(null as string) as skill_groupfrom tableA a   left join c    on (a.employee_id= c.employee_id     and a.cal_date  = c.cal_date)where c.employee_id is null ;


MarkWusinich had a great solution but with one major issue. If table a has an employee ID twice within the date range table c will also have that employee_ID twice (if b was unique if not more) creating 4 records after the join. As such if A is not unique on employee_ID a group by will be necessary. Corrected below:

with C as(select a.employee_id, b.skill_group     from tableA a     ,    tableB b     where a.employee_id= b.employee_id       and a.cal_date >= b.date_start       and a.cal_date <= b.date_endgroup by a.employee_id, b.skill_group) Cselect a.*, c.skill_groupfrom tableA a left join c  on a.employee_id = c.employee_id     and a.cal_date  = c.cal_date;

Please note: If B was somehow intentionally not distinct on (employee_id, skill_group), then my query above would also have to be modified to appropriately reflect that.