
More efficient query to avoid OutOfMemoryError in Hive


Applying distinct to each dataset before joining them is safer, because joining on non-unique keys will duplicate the data.
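
As a rough sketch of that idea, using the table and column names from the queries further down (which columns you actually need to deduplicate on depends on your data):

Select count(distinct p.imei)
from (select distinct imei, `timestamp`
      from pingdata
      where TO_DATE(`timestamp`) between '2016-06-01' and '2016-07-17'
        and SUBSTR(imei,12,2) in ('04','05')) p
join (select distinct imei, `timestamp`
      from eventdata
      where event = 'Use' and clientversion like '3.2%') e
  on p.imei = e.imei and p.`timestamp` = e.`timestamp`;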

I would recommend partitioning your datasets by the to_date(timestamp) field (yyyy-MM-dd), so that partition pruning works according to your where clause (check that it does). Also partition by the event field if the datasets are too big and contain a lot of rows where event <> 'Use'.
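
A minimal sketch of that layout, assuming a new partitioned copy of the table (the pingdata_part name and the dt partition column are made up here, and the data would have to be reloaded into it):

create table pingdata_part (
  imei string,
  `timestamp` timestamp
)
partitioned by (dt string)   -- dt = to_date(`timestamp`), format 'yyyy-MM-dd'
stored as orc;

-- the where clause can then prune partitions directly:
-- where pd.dt between '2016-06-01' and '2016-07-17'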

It's important to know at which stage it fails; study the exception as well.

If it fails on the mappers, you should optimize your subqueries (add partitions as I mentioned). If it fails on the reducer (the join), you should improve the join somehow, for example try to reduce bytes per reducer:

set hive.exec.reducers.bytes.per.reducer=67108864;

or even less. If it fails on the writer (OrcWriter), try adding a partition to the output table by a substring of imei and `distribute by substr(imei...)` at the end of the query to reduce the pressure on the reducers.
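
About the bytes-per-reducer setting: Hive picks roughly ceil(total reducer input bytes / hive.exec.reducers.bytes.per.reducer) reducers, capped by hive.exec.reducers.max, so lowering the value spreads the same data over more reducers. For example (these values are only a starting point to tune, not a recommendation):

set hive.exec.reducers.bytes.per.reducer=33554432;  -- 32 MB per reducer => more reducers for the same input
set hive.exec.reducers.max=1009;                    -- upper bound on the number of reducers Hive will launch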

Or add one more column with low cardinality and an even distribution, to distribute the data between more reducers evenly:

distribute by substr(imei...), col2

Make sure that the partition column is in the distribute by. This will reduce the number of files written by each reducer and help to get rid of the OOM.
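
A sketch of how the final insert could look with the partition column in the distribute by. Everything here is an assumption for illustration: output_table is taken to be partitioned by (imei_prefix string), and pmod(abs(hash(imei)), 8) stands in for the extra low-cardinality, evenly distributed key:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table output_table partition (imei_prefix)
select
  imei,
  `timestamp`,
  substr(imei, 1, 2) as imei_prefix                         -- dynamic partition column goes last in the select
from pingdata
distribute by substr(imei, 1, 2), pmod(abs(hash(imei)), 8); -- partition expression plus a second, evenly spread key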


In order to improve performance, looking at your query: I would partition the Hive tables by yyyy, mm, dd, or by the first two digits of imei. You will have to decide what to partition on according to how you need to query these tables and the amount of data, but I would vote for yyyy, mm, dd, which will give you a tremendous improvement in performance. See improving-query-performance-using-partitioning.
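
A rough sketch of that yyyy, mm, dd layout (the table name here is a placeholder, and the existing data would need to be reloaded into the partitioned table):

create table pingdata_by_day (
  imei string,
  `timestamp` timestamp
)
partitioned by (yyyy string, mm string, dd string)
stored as orc;

-- queries can then restrict the scan to the relevant partitions, e.g.
-- where yyyy = '2016' and mm in ('06','07')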

But for now, this should give you some improvements:

Select count(distinct(pd.imei))
from pingdata pd
join eventdata ed on pd.imei = ed.imei
where TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
  and pd.timestamp = ed.timestamp
  and SUBSTR(pd.imei,12,2) in ('04','05')
  and ed.event = 'Use' AND ed.clientversion like '3.2%';

If TO_DATE(timestamp) values are inserted on the same day, in other words if both values always have the same date, then the pd.timestamp = ed.timestamp condition should be excluded:

Select count(distinct(pd.imei))
from pingdata pd
join eventdata ed on pd.imei = ed.imei
where TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
  and SUBSTR(pd.imei,12,2) in ('04','05')
  and ed.event = 'Use' AND ed.clientversion like '3.2%';

Try running both queries and compare results. Do let us know the differences and if you find this helpful.