
More efficient query to avoid OutOfMemoryError in Hive


Applying distinct to each dataset before joining them is safer, because joining on non-unique keys will duplicate the data.
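
As a rough sketch of that idea, using the table and column names from the queries further down (which columns you actually need to deduplicate on depends on your data):

Select count(distinct p.imei)
from (select distinct imei, `timestamp`
      from pingdata
      where TO_DATE(`timestamp`) between '2016-06-01' and '2016-07-17'
        and SUBSTR(imei,12,2) in ('04','05')) p
join (select distinct imei, `timestamp`
      from eventdata
      where event = 'Use' and clientversion like '3.2%') e
  on p.imei = e.imei and p.`timestamp` = e.`timestamp`;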

I would recommend partitioning your datasets by the to_date(timestamp) field (yyyy-MM-dd), so that partition pruning works according to your where clause (check that it does). Also partition by the event field if the datasets are too big and contain a lot of rows where event <> 'Use'.
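
A minimal sketch of that layout, assuming a new partitioned copy of the table (the pingdata_part name and the dt partition column are made up here, and the data would have to be reloaded into it):

create table pingdata_part (
  imei string,
  `timestamp` timestamp
)
partitioned by (dt string)   -- dt = to_date(`timestamp`), format 'yyyy-MM-dd'
stored as orc;

-- the where clause can then prune partitions directly:
-- where pd.dt between '2016-06-01' and '2016-07-17'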

It's important to know at which stage it fails; study the exception as well.

If it fails on the mappers, you should optimize your subqueries (add partitions as I mentioned). If it fails on the reducer (the join), you should improve the join somehow, for example try to reduce bytes per reducer:

set hive.exec.reducers.bytes.per.reducer=67108864;

or even less. If it fails on the writer (OrcWriter), try adding a partition to the output table by a substring of imei and `distribute by substr(imei...)` at the end of the query to reduce the pressure on the reducers.
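
About the bytes-per-reducer setting: Hive picks roughly ceil(total reducer input bytes / hive.exec.reducers.bytes.per.reducer) reducers, capped by hive.exec.reducers.max, so lowering the value spreads the same data over more reducers. For example (these values are only a starting point to tune, not a recommendation):

set hive.exec.reducers.bytes.per.reducer=33554432;  -- 32 MB per reducer => more reducers for the same input
set hive.exec.reducers.max=1009;                    -- upper bound on the number of reducers Hive will launch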

Or add one more column with low cardinality and an even distribution, to distribute the data between more reducers evenly:

distribute by substr(imei...), col2

Make sure that the partition column is in the distribute by. This will reduce the number of files written by each reducer and help to get rid of the OOM.
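
A sketch of how the final insert could look with the partition column in the distribute by. Everything here is an assumption for illustration: output_table is taken to be partitioned by (imei_prefix string), and pmod(abs(hash(imei)), 8) stands in for the extra low-cardinality, evenly distributed key:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table output_table partition (imei_prefix)
select
  imei,
  `timestamp`,
  substr(imei, 1, 2) as imei_prefix                         -- dynamic partition column goes last in the select
from pingdata
distribute by substr(imei, 1, 2), pmod(abs(hash(imei)), 8); -- partition expression plus a second, evenly spread key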


In order to improve performance, looking at your query: I would partition the Hive tables by yyyy, mm, dd, or by the first two digits of imei. You will have to decide what to partition on according to how you need to query these tables and the amount of data, but I would vote for yyyy, mm, dd, which will give you a tremendous improvement in performance. See improving-query-performance-using-partitioning.
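
A rough sketch of that yyyy, mm, dd layout (the table name here is a placeholder, and the existing data would need to be reloaded into the partitioned table):

create table pingdata_by_day (
  imei string,
  `timestamp` timestamp
)
partitioned by (yyyy string, mm string, dd string)
stored as orc;

-- queries can then restrict the scan to the relevant partitions, e.g.
-- where yyyy = '2016' and mm in ('06','07')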

But for now, this should give you some improvements:

Select count(distinct(pd.imei))
from pingdata pd
join eventdata ed on pd.imei = ed.imei
where TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
  and pd.timestamp = ed.timestamp
  and SUBSTR(pd.imei,12,2) in ('04','05')
  and ed.event = 'Use' AND ed.clientversion like '3.2%';

If TO_DATE(timestamp) values are inserted on the same day, in other words if both values always have the same date, then the pd.timestamp = ed.timestamp condition should be excluded:

Select count(distinct(pd.imei))
from pingdata pd
join eventdata ed on pd.imei = ed.imei
where TO_DATE(pd.timestamp) between '2016-06-01' AND '2016-07-17'
  and SUBSTR(pd.imei,12,2) in ('04','05')
  and ed.event = 'Use' AND ed.clientversion like '3.2%';

Try running both queries and compare results. Do let us know the differences and if you find this helpful.