HIVE: How does 'LIMIT' on 'SELECT * from' work under-the-hood? HIVE: How does 'LIMIT' on 'SELECT * from' work under-the-hood? hadoop hadoop

HIVE: How does 'LIMIT' on 'SELECT * from' work under-the-hood?


If no optimizer applied, hive end up scanning entire table. But Hive optimizes this with hive.fetch.task.conversion released as part of HIVE-2925, To ease simple queries with simple conditions and not to run MR/Tez at all.

Supported values are none, minimal and more.

none: Disable hive.fetch.task.conversion (value added in Hive 0.14.0 with HIVE-8389)

minimal: SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only

more: SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)

Your question is more likely what happens when minimal or more is set. It just scans through the added files and read rows until reach leastRows() For more refer gitCode, Config and here


Under the hood a "SELECT" in a hive issues a FETCH task instead of spawing a mapreduce task. Think of it like a hadoop fs -get Point to be noted here is FETCH task works only on SELECT * and if you were to select a column fetch might not work.

Source : https://vcfvct.wordpress.com/2016/02/18/make-hive-query-faster-with-fetch-task/