Create table as select percentage subquery in Impala DB Create table as select percentage subquery in Impala DB hadoop hadoop

Create table as select percentage subquery in Impala DB


With your examples, I would try these approaches that, I believe, could work fine.I checked the solution with Impala

CREATE TABLE dirty_table ( id INT, msg STRING)ROW FORMAT DELIMITED FIELDS TERMINATED  BY ','STORED AS TEXTFILE;[localhost.localdomain:21000] > SELECT * FROM dirty_table;Query: SELECT * FROM dirty_tableQuery submitted at: 2020-07-28 17:05:24 (Coordinator: http://localhost.localdomain:25000)Query progress can be monitored at: http://localhost.localdomain:25000/query_plan?query_id=5441d6a46ce61e7b:8e49432600000000+----+-------------+| id | msg         |+----+-------------+| 1  | 13321512121 || 2  | 13121212121 || 3  | 03121212121 || 4  | 13321512121 || 5  | 13121212121 || 6  | 03121212121 || 7  | 13121212121 |+----+-------------+Fetched 7 row(s) in 0.14s

First example

CREATE TABLE IF NOT EXISTS my_temp_table ASSELECT 41 AS rule_id, 49 AS record_id, val1 / val2 AS resultFROM (SELECT COUNT(1) AS val1 FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ) a,     (SELECT COUNT(1) AS val2 FROM dirty_table) b;[localhost.localdomain:21000] > CREATE TABLE IF NOT EXISTS my_temp_table AS                              > SELECT 41 AS rule_id, 49 AS record_id, val1 / val2 AS result                              > FROM (SELECT COUNT(1) AS val1 FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ) a,                              >      (SELECT COUNT(1) AS val2 FROM dirty_table) b;Query: CREATE TABLE IF NOT EXISTS my_temp_table ASSELECT 41 AS rule_id, 49 AS record_id, val1 / val2 AS resultFROM (SELECT COUNT(1) AS val1 FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ) a,     (SELECT COUNT(1) AS val2 FROM dirty_table) b+-------------------+| summary           |+-------------------+| Inserted 0 row(s) |+-------------------+Fetched 1 row(s) in 0.21s[localhost.localdomain:21000] > invalidate metadata;[localhost.localdomain:21000] > SELECT * FROM my_temp_table;Query: select * from my_temp_tableQuery submitted at: 2020-07-28 17:03:44 (Coordinator: http://localhost.localdomain:25000)Query progress can be monitored at: http://localhost.localdomain:25000/query_plan?query_id=47370bf793a09b:29c4dfa000000000+---------+-----------+--------------------+| rule_id | record_id | result             |+---------+-----------+--------------------+| 41      | 49        | 0.7142857142857143 |+---------+-----------+--------------------+Fetched 1 row(s) in 0.13s

Second example

DROP TABLE my_temp_table;CREATE TABLE IF NOT EXISTS my_temp_table AS SELECT result FROM    (WITH q1 AS (      SELECT COUNT(1) AS val FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$'    ),    q2 AS (      SELECT COUNT(1) val2 FROM dirty_table    )    SELECT 100 * q1.val / q2.val2 AS result    FROM q1, q2) t;[localhost.localdomain:21000] > CREATE TABLE IF NOT EXISTS my_temp_table AS                               > SELECT result FROM                              >     (WITH q1 AS (                              >       SELECT COUNT(1) AS val FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$'                              >     ),                              >     q2 AS (                              >       SELECT COUNT(1) val2 FROM dirty_table                              >     )                              >     SELECT 100 * q1.val / q2.val2 AS result                              >     FROM q1, q2) t;Query: CREATE TABLE IF NOT EXISTS my_temp_table ASSELECT result FROM    (WITH q1 AS (      SELECT COUNT(1) AS val FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$'    ),    q2 AS (      SELECT COUNT(1) val2 FROM dirty_table    )    SELECT 100 * q1.val / q2.val2 AS result    FROM q1, q2) t+-------------------+| summary           |+-------------------+| Inserted 1 row(s) |+-------------------+Fetched 1 row(s) in 0.40s[localhost.localdomain:21000] > invalidate metadata;[localhost.localdomain:21000] > SELECT * FROM my_temp_table;Query: SELECT * FROM my_temp_tableQuery submitted at: 2020-07-28 17:08:17 (Coordinator: http://localhost.localdomain:25000)Query progress can be monitored at: http://localhost.localdomain:25000/query_plan?query_id=3447684ef59d0c4:f70779200000000+-------------------+| result            |+-------------------+| 71.42857142857143 |+-------------------+Fetched 1 row(s) in 0.74s


I think a conditional average can do what you want simply and efficiently, with a single table scan:

select avg(case when msg regexp '^[1]([3-9])[0-9]{9}$' then 100.0 else 0 end) resultfrom dirty_table

You can turn this to a create table statement:

create table my_temp_table asselect avg(case when msg regexp '^[1]([3-9])[0-9]{9}$' then 100.0 else 0 end) resultfrom dirty_table