Can we load Parquet file into Hive directly?
Get schema of the parquet file using parquet tools, for details check link http://kitesdk.org/docs/0.17.1/labs/4-using-parquet-tools-solution.html
and build table using the schema on the top of the file, for details check Create Hive table to read parquet files from parquet/avro schema
Getting the schema is crucial, as you will have to create the table with the appropriate schema first in Hive and then point it to the parquet files.
I had a similar problem, where I had data in one VM and had to move it to another. Here is my walkthrough:
Find out about original Parquet files are (location and schema):
describe formatted users;
andshow create table users;
The latter will get you the schema right away and also point you to the location of HDFShdfs://hostname:port/apps/hive/warehouse/users
Find out about the partitioning of your table
show partitions users;
Copy the table's Parquet files from HDFS to a local directory
hdfs dfs -copyToLocal /apps/hive/warehouse/users
Move them across to the other cluster/VM or where you want them to go
Create the users table on your destination
CREATE USERS ...
by using the same schemaCREATE TABLE users ( name string, ... )PARTITIONED BY (...)STORED AS PARQUET;
Now, move the Parquet files in the respective folder (if needed find out about the location of the table you've just created)
hdfs dfs -ls /apps/hive/warehouse/users/hdfs dfs -copyFromLocal ../temp/* /apps/hive/warehouse/
For each partition, you'll have to point Hive to the respective subdirectory:
alter table users add partition (sign_up_date='19991231') location '/apps/hive/warehouse/users/sign_up_date=19991231';
(you might want to do this with a bash script)
That worked for me, hope it helps.
Don't know if it's a bit "hacky" but I use zeppelin (shipped with ambari). You can simply do the following in combination with spark2:
%spark2import org.apache.spark.sql.SaveModevar df = spark.read.parquet("hdfs:///my_parquet_files/*.parquet");df.write.mode(SaveMode.Overwrite).saveAsTable("imported_table")
The advantage of this way is that you can also import many parquet files even if they have a different schema.