Join Tables on Date Range in Hive
RTFM - quoting LanguageManual Joins
Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.
You may try to move the BETWEEN filter to a WHERE clause, resulting in a lousy partially-cartesian-join followed by a post-processing cleanup. Yuck. Depending on the actual cardinality of your "skill group" table, it may work fast - or take whole days.
If your situation allows, do it in two queries.
First with the full join, which can have the range; Then with an outer join, matching on all the columns, but include a where clause for where one of the fields is null.
Ex:
create table tableC asselect a.*, b.skill_group from tableA a , tableB b where a.employee_id= b.employee_id and a.cal_date >= b.date_start and a.cal_date <= b.date_end;with c as (select * from TableC)insert into tableCselect a.*, cast(null as string) as skill_groupfrom tableA a left join c on (a.employee_id= c.employee_id and a.cal_date = c.cal_date)where c.employee_id is null ;
MarkWusinich had a great solution but with one major issue. If table a has an employee ID twice within the date range table c will also have that employee_ID twice (if b was unique if not more) creating 4 records after the join. As such if A is not unique on employee_ID a group by will be necessary. Corrected below:
with C as(select a.employee_id, b.skill_group from tableA a , tableB b where a.employee_id= b.employee_id and a.cal_date >= b.date_start and a.cal_date <= b.date_endgroup by a.employee_id, b.skill_group) Cselect a.*, c.skill_groupfrom tableA a left join c on a.employee_id = c.employee_id and a.cal_date = c.cal_date;
Please note: If B was somehow intentionally not distinct on (employee_id, skill_group), then my query above would also have to be modified to appropriately reflect that.