COLLECT_SET() in Hive (Hadoop) COLLECT_SET() in Hive (Hadoop) hadoop hadoop

COLLECT_SET() in Hive (Hadoop)


Major fail. My solution below. There was no issue with the COLLECT_SET after all, it just trying to COLLECT all of the items, of which there were infinite.

Why? Because I joined on something that wasn't even part of the set. Second join used to be the same ON condition, now it correctly says hit.session_key = evt.session_key

INSERT OVERWRITE TABLE sequence_result_1SELECT sess.session_key as session_key,       sess.remote_address as remote_address,       sess.hit_count as hit_count,       COLLECT_SET(evt.event_id) as event_set,       hit.rsp_timestamp as hit_timestamp,       sess.site_link as site_link    FROM tealeaf_session sess         JOIN site_event evt ON (sess.session_key = evt.session_key)        JOIN site_hit hit   ON (sess.session_key = hit.session_key)    WHERE evt.event_id IN(274,284,55,151)GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_linkORDER BY hit_timestamp;


First thing I would try is getting rid of the sub-select and just join to site_event, then move the event_id filter into the outer where clause and change it to an in(). So something like:

SELECT sess.session_key as session_key,   sess.remote_address as remote_address,   sess.hit_count as hit_count,   COLLECT_SET(evt.event_id) as event_set,   hit.rsp_timestamp as hit_timestamp,   sess.site_link as site_linkFROM site_session sess     JOIN site_event evt ON (sess.session_key = evt.session_key)    JOIN site_hit hit ON (sess.session_key = evt.session_key)WHERE evt.event_id in(274,284,55151)GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_linkORDER BY hit_timestamp;

Additionally, I don't know the sizes of each table, but in general in Hive, you want to keep your largest tables (usually your fact table) on the right hand side of joins to reduce memory usage. The reason being that Hive attempts to hold the left hand side of a join in memory, and streams the right hand side in order to accomplish the join.


I'd guess what's happen is that it's producing a COLLECT_SET() for EACH row that would have been returned.So for every row you're returning, it's returning the entire array produced by COLLECT_SET. That might be taxing and taking a long time.

Check the performance with COLLECT_SET out of the query. If that is speedy enough, push the calculation of COLLECT_SET into a sub-query and then use that column instead of doing the calculation where you are.

I haven't used COLLECT_SET or done any tests, from what your post, that's what I'd first suspect.