Create table as select percentage subquery in Impala DB
With your examples, I would try these approaches that, I believe, could work fine.I checked the solution with Impala
CREATE TABLE dirty_table ( id INT, msg STRING)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','STORED AS TEXTFILE;[localhost.localdomain:21000] > SELECT * FROM dirty_table;Query: SELECT * FROM dirty_tableQuery submitted at: 2020-07-28 17:05:24 (Coordinator: http://localhost.localdomain:25000)Query progress can be monitored at: http://localhost.localdomain:25000/query_plan?query_id=5441d6a46ce61e7b:8e49432600000000+----+-------------+| id | msg |+----+-------------+| 1 | 13321512121 || 2 | 13121212121 || 3 | 03121212121 || 4 | 13321512121 || 5 | 13121212121 || 6 | 03121212121 || 7 | 13121212121 |+----+-------------+Fetched 7 row(s) in 0.14s
First example
CREATE TABLE IF NOT EXISTS my_temp_table ASSELECT 41 AS rule_id, 49 AS record_id, val1 / val2 AS resultFROM (SELECT COUNT(1) AS val1 FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ) a, (SELECT COUNT(1) AS val2 FROM dirty_table) b;[localhost.localdomain:21000] > CREATE TABLE IF NOT EXISTS my_temp_table AS > SELECT 41 AS rule_id, 49 AS record_id, val1 / val2 AS result > FROM (SELECT COUNT(1) AS val1 FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ) a, > (SELECT COUNT(1) AS val2 FROM dirty_table) b;Query: CREATE TABLE IF NOT EXISTS my_temp_table ASSELECT 41 AS rule_id, 49 AS record_id, val1 / val2 AS resultFROM (SELECT COUNT(1) AS val1 FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ) a, (SELECT COUNT(1) AS val2 FROM dirty_table) b+-------------------+| summary |+-------------------+| Inserted 0 row(s) |+-------------------+Fetched 1 row(s) in 0.21s[localhost.localdomain:21000] > invalidate metadata;[localhost.localdomain:21000] > SELECT * FROM my_temp_table;Query: select * from my_temp_tableQuery submitted at: 2020-07-28 17:03:44 (Coordinator: http://localhost.localdomain:25000)Query progress can be monitored at: http://localhost.localdomain:25000/query_plan?query_id=47370bf793a09b:29c4dfa000000000+---------+-----------+--------------------+| rule_id | record_id | result |+---------+-----------+--------------------+| 41 | 49 | 0.7142857142857143 |+---------+-----------+--------------------+Fetched 1 row(s) in 0.13s
Second example
DROP TABLE my_temp_table;CREATE TABLE IF NOT EXISTS my_temp_table AS SELECT result FROM (WITH q1 AS ( SELECT COUNT(1) AS val FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ), q2 AS ( SELECT COUNT(1) val2 FROM dirty_table ) SELECT 100 * q1.val / q2.val2 AS result FROM q1, q2) t;[localhost.localdomain:21000] > CREATE TABLE IF NOT EXISTS my_temp_table AS > SELECT result FROM > (WITH q1 AS ( > SELECT COUNT(1) AS val FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' > ), > q2 AS ( > SELECT COUNT(1) val2 FROM dirty_table > ) > SELECT 100 * q1.val / q2.val2 AS result > FROM q1, q2) t;Query: CREATE TABLE IF NOT EXISTS my_temp_table ASSELECT result FROM (WITH q1 AS ( SELECT COUNT(1) AS val FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ), q2 AS ( SELECT COUNT(1) val2 FROM dirty_table ) SELECT 100 * q1.val / q2.val2 AS result FROM q1, q2) t+-------------------+| summary |+-------------------+| Inserted 1 row(s) |+-------------------+Fetched 1 row(s) in 0.40s[localhost.localdomain:21000] > invalidate metadata;[localhost.localdomain:21000] > SELECT * FROM my_temp_table;Query: SELECT * FROM my_temp_tableQuery submitted at: 2020-07-28 17:08:17 (Coordinator: http://localhost.localdomain:25000)Query progress can be monitored at: http://localhost.localdomain:25000/query_plan?query_id=3447684ef59d0c4:f70779200000000+-------------------+| result |+-------------------+| 71.42857142857143 |+-------------------+Fetched 1 row(s) in 0.74s
I think a conditional average can do what you want simply and efficiently, with a single table scan:
select avg(case when msg regexp '^[1]([3-9])[0-9]{9}$' then 100.0 else 0 end) resultfrom dirty_table
You can turn this to a create table
statement:
create table my_temp_table asselect avg(case when msg regexp '^[1]([3-9])[0-9]{9}$' then 100.0 else 0 end) resultfrom dirty_table