
How to put file to HDFS with Snappy compression


I suggest you write a map-reduce job to compress your data in HDFS. I don't know of a way to compress automatically during a hadoop put operation, so assume it does not exist. One option is to put an already compressed file:

snzip file.tar
hdfs dfs -put file.tar.sz /user/hduser/test/
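One caveat: plain snzip writes the Snappy framing format (hence the .sz suffix above), which Hadoop's own SnappyCodec generally cannot read. If the uploaded file needs to be readable by Hadoop itself, the hadoop-snappy variant used further down is the safer choice; a small sketch, reusing the same example path:

# assumption: hadoop-snappy framing produces file.tar.snappy, which Hadoop's SnappyCodec can read
snzip -t hadoop-snappy file.tar
hdfs dfs -put file.tar.snappy /user/hduser/test/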

Another way is to compress it inside a MapReduce job. As an option, you can use the Hadoop Streaming jar to compress your files within HDFS:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.reduce.tasks=0 \
  -input <input-path> \
  -output $OUTPUT
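The streaming command above does not name a mapper; for a pure pass-through re-compression job a common pattern is to use /bin/cat as the mapper, so records flow through unchanged while the framework writes Snappy-compressed output. A minimal sketch, assuming the same streaming jar and hypothetical input/output directories:

# identity "mapper": records pass through unchanged, output is written with SnappyCodec
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.reduce.tasks=0 \
  -input /user/hduser/test/uncompressed \
  -output /user/hduser/test/compressed \
  -mapper /bin/cat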


Say you have a Spark log file in HDFS that isn't compressed, but you want to turn on spark.eventLog.compress true in spark-defaults.conf and also go back and compress the old logs. The map-reduce approach would make the most sense, but as a one-off you can also use:

snzip -t hadoop-snappy local_file_will_end_in_dot_snappy

And then put it directly back into HDFS.
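For the spark-defaults.conf change mentioned above, a minimal sketch; note that, per the Spark docs, event-log compression uses the codec from spark.io.compression.codec (lz4 by default in recent versions), so pin that to snappy as well if Snappy output is specifically what you want:

spark.eventLog.compress true
spark.io.compression.codec snappy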

Installing snzip may look similar to this:

sudo yum install snappy snappy-devel
curl -O https://dl.bintray.com/kubo/generic/snzip-1.0.4.tar.gz
tar -zxvf snzip-1.0.4.tar.gz
cd snzip-1.0.4
./configure
make
sudo make install

Your round trip for a single file could be:

hdfs dfs -copyToLocal /var/log/spark/apps/application_1512353561403_50748_1 .
snzip -t hadoop-snappy application_1512353561403_50748_1
hdfs dfs -copyFromLocal application_1512353561403_50748_1.snappy /var/log/spark/apps/application_1512353561403_50748_1.snappy
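If the point of the round trip is to replace the uncompressed log, you would presumably also verify the compressed copy and then clean up; a sketch (the -text check assumes the Snappy native libraries are available to the Hadoop client):

# spot-check that Hadoop can decompress the .snappy copy
hdfs dfs -text /var/log/spark/apps/application_1512353561403_50748_1.snappy | head
# then drop the uncompressed original in HDFS and any local temporary copies
hdfs dfs -rm /var/log/spark/apps/application_1512353561403_50748_1
rm -f application_1512353561403_50748_1 application_1512353561403_50748_1.snappy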

Or with gohdfs:

hdfs cat /var/log/spark/apps/application_1512353561403_50748_1 \
  | snzip -t hadoop-snappy > zzz
hdfs put zzz /var/log/spark/apps/application_1512353561403_50748_1.snappy
rm zzz


We solved this with the following approach:

  1. If it is an RDD, convert it to a DataFrame, e.g. RDD.toDF. It does not require parameters, but in case you want to specify the column names you can do it with rdd.toDF("c1","c2","c3").
  2. After converting to a DataFrame, suppose you want to write it in the Parquet file format with Snappy compression; you need to set that through sqlContext:

    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

    or, for gzip compression:

    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

  3. After this, use XXDF.write.parquet("your_path") and the file will be saved with Snappy compression.
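The same idea as a one-off from the command line: the Parquet codec can also be set when launching the shell, so anything written afterwards with .write.parquet(...) comes out Snappy-compressed. A minimal sketch (the output path is hypothetical):

# launch with the Parquet compression codec set up front
spark-shell --conf spark.sql.parquet.compression.codec=snappy
# then, inside the shell: rdd.toDF("c1","c2","c3").write.parquet("/tmp/your_path")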