COLLECT_SET() in Hive (Hadoop)

hadoop mapreduce user-defined-functions hive

Major fail. My solution below. There was no issue with the COLLECT_SET after all, it just trying to COLLECT all of the items, of which there were infinite.

Why? Because I joined on something that wasn't even part of the set. Second join used to be the same ON condition, now it correctly says hit.session_key = evt.session_key

INSERT OVERWRITE TABLE sequence_result_1SELECT sess.session_key as session_key,       sess.remote_address as remote_address,       sess.hit_count as hit_count,       COLLECT_SET(evt.event_id) as event_set,       hit.rsp_timestamp as hit_timestamp,       sess.site_link as site_link    FROM tealeaf_session sess         JOIN site_event evt ON (sess.session_key = evt.session_key)        JOIN site_hit hit   ON (sess.session_key = hit.session_key)    WHERE evt.event_id IN(274,284,55,151)GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_linkORDER BY hit_timestamp;

hadoop mapreduce user-defined-functions hive

First thing I would try is getting rid of the sub-select and just join to site_event, then move the event_id filter into the outer where clause and change it to an in(). So something like:

SELECT sess.session_key as session_key,   sess.remote_address as remote_address,   sess.hit_count as hit_count,   COLLECT_SET(evt.event_id) as event_set,   hit.rsp_timestamp as hit_timestamp,   sess.site_link as site_linkFROM site_session sess     JOIN site_event evt ON (sess.session_key = evt.session_key)    JOIN site_hit hit ON (sess.session_key = evt.session_key)WHERE evt.event_id in(274,284,55151)GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_linkORDER BY hit_timestamp;

Additionally, I don't know the sizes of each table, but in general in Hive, you want to keep your largest tables (usually your fact table) on the right hand side of joins to reduce memory usage. The reason being that Hive attempts to hold the left hand side of a join in memory, and streams the right hand side in order to accomplish the join.

hadoop mapreduce user-defined-functions hive

I'd guess what's happen is that it's producing a COLLECT_SET() for EACH row that would have been returned.So for every row you're returning, it's returning the entire array produced by COLLECT_SET. That might be taxing and taking a long time.

Check the performance with COLLECT_SET out of the query. If that is speedy enough, push the calculation of COLLECT_SET into a sub-query and then use that column instead of doing the calculation where you are.

I haven't used COLLECT_SET or done any tests, from what your post, that's what I'd first suspect.

CodeHunter

COLLECT_SET() in Hive (Hadoop)

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last