How to load json snappy compressed in HIVE


  1. Put all the files in an HDFS folder and create an external table on top of it. If the file names end in .snappy, Hive recognizes the compression automatically. To have Hive write Snappy-compressed output as well, set the output codec before writing to the table:

         set hive.exec.compress.output=true;
         set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
         set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
         set mapreduce.output.fileoutputformat.compress.type=BLOCK;

         CREATE EXTERNAL TABLE mydirectory_tbl (
           id   string,
           name string
         )
         ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
         -- this is the HDFS/S3 location
         LOCATION '/mydir';

  2. JsonSerDe can parse all complex structures, which is much easier than using json_tuple. Simple attributes in the JSON are mapped to columns as is. Everything in square brackets [] is an array<>, everything in braces {} is a struct<> or map<>, and complex types can be nested. Read the README carefully: https://github.com/rcongiu/Hive-JSON-Serde. It has a section about nested structures and many examples of CREATE TABLE.

  3. If you still want to use json_tuple, create a table with a single STRING column and parse that column with json_tuple. This is much more difficult, though.

  4. Every JSON record must be on a single line (no newlines or \r characters inside a JSON object). The same requirement is mentioned here: https://github.com/rcongiu/Hive-JSON-Serde
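To make the difference between points 2 and 3 concrete, here is a minimal sketch of both approaches over the same folder. The sample record, the table names nested_json_tbl and raw_json_tbl, and the columns tags and address are hypothetical and only serve as illustration; the SerDe jar is assumed to be available to Hive already.

```sql
-- Hypothetical sample record (one JSON object per line):
-- {"id":"1","name":"a","tags":["x","y"],"address":{"city":"Oslo","zip":"0150"}}

-- Approach A: let the SerDe map nested JSON to Hive complex types.
CREATE EXTERNAL TABLE nested_json_tbl (
  id      STRING,
  name    STRING,
  tags    ARRAY<STRING>,                      -- JSON [] becomes array<>
  address STRUCT<city:STRING, zip:STRING>     -- JSON {} becomes struct<>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir';

-- Nested fields are addressed with index and dot notation.
SELECT id, tags[0] AS first_tag, address.city AS city
FROM nested_json_tbl;

-- Approach B: keep each line as one STRING and parse it at query time.
CREATE EXTERNAL TABLE raw_json_tbl (
  json_line STRING
)
STORED AS TEXTFILE
LOCATION '/mydir';

-- json_tuple extracts top-level keys only; nested values need
-- something like get_json_object with a JSONPath expression.
SELECT t.id,
       t.name,
       get_json_object(r.json_line, '$.address.city') AS city
FROM raw_json_tbl r
LATERAL VIEW json_tuple(r.json_line, 'id', 'name') t AS id, name;
```

With approach B every query has to repeat the parsing logic, which is the main reason the SerDe route tends to be easier.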


If your data is partitioned (e.g., by date):

Create the table in Hive

    CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
      filename STRING,
      cnt      BIGINT,
      size     DOUBLE
    )
    PARTITIONED BY (
      `date` STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS
      INPUTFORMAT  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 'folder/path/in/hdfs';

Recover the partitions (until you do, the table appears to be empty)

    MSCK REPAIR TABLE database.table;
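MSCK REPAIR TABLE only picks up partitions whose directories follow Hive's key=value naming under the table location. As a rough sketch, assuming a hypothetical partition value of 2020-01-01, a single partition can also be registered by hand and the result checked afterwards:

```sql
-- Directory layout that MSCK REPAIR TABLE can discover
-- (the date value is a hypothetical example):
--   folder/path/in/hdfs/date=2020-01-01/<data files>

-- Alternative to MSCK: register one partition explicitly.
-- Without a LOCATION clause, Hive expects the data in the
-- date=2020-01-01 subdirectory of the table location.
ALTER TABLE database.table ADD IF NOT EXISTS
  PARTITION (`date` = '2020-01-01');

-- List the partitions the metastore now knows about.
SHOW PARTITIONS database.table;
```

If SHOW PARTITIONS returns nothing after the repair, the directory names most likely do not match the key=value pattern.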