How to put a file to HDFS with Snappy compression
I suggest you write a MapReduce job to compress your data in HDFS. I don't know of a way to compress automatically during a hadoop put operation, so let's suppose it does not exist. One option is to put an already-compressed file:
snzip file.tar
hdfs dfs -put file.tar.sz /user/hduser/test/
Another way is to compress it inside a MapReduce job. As an option, you can use the Hadoop streaming jar to compress your files within HDFS:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
    -Dmapred.output.compress=true \
    -Dmapred.compress.map.output=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -Dmapred.reduce.tasks=0 \
    -input <input-path> \
    -output $OUTPUT
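Note that the mapred.* property names and the contrib/ jar path above date from old Hadoop/CDH3. On Hadoop 2.x+ the equivalent would look more like the following sketch (the jar path and the /bin/cat identity mapper are assumptions about your setup):

```
# Hadoop 2.x+ property names for the same streaming pass-through job
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -Dmapreduce.output.fileoutputformat.compress=true \
    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -Dmapreduce.job.reduces=0 \
    -input <input-path> \
    -output <output-path> \
    -mapper /bin/cat
```

With zero reducers, the mapper output is written directly as the (compressed) job output, so an identity mapper is enough.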
Say you have a Spark event log in HDFS that isn't compressed, but you want to turn on spark.eventLog.compress true
in spark-defaults.conf
and also compress the old logs. The MapReduce approach would make the most sense, but as a one-off you can also use:
snzip -t hadoop-snappy local_file_will_end_in_dot_snappy
And then put it directly.
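For completeness, enabling compression for future event logs is just a config change. A sketch of spark-defaults.conf, assuming the codec is controlled by spark.io.compression.codec (Spark's general compression codec setting, which event-log compression falls back to):

```
# spark-defaults.conf — compress event logs going forward
spark.eventLog.compress      true
spark.io.compression.codec   snappy
```

Old logs written before this change stay uncompressed, which is why the manual round trip above is still needed.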
Installing snzip may look similar to this:
sudo yum install snappy snappy-devel
curl -O https://dl.bintray.com/kubo/generic/snzip-1.0.4.tar.gz
tar -zxvf snzip-1.0.4.tar.gz
cd snzip-1.0.4
./configure
make
sudo make install
Your round trip for a single file could be:
hdfs dfs -copyToLocal /var/log/spark/apps/application_1512353561403_50748_1 .
snzip -t hadoop-snappy application_1512353561403_50748_1
hdfs dfs -copyFromLocal application_1512353561403_50748_1.snappy /var/log/spark/apps/application_1512353561403_50748_1.snappy
Or with gohdfs:
hdfs cat /var/log/spark/apps/application_1512353561403_50748_1 \
    | snzip -t hadoop-snappy > zzz
hdfs put zzz /var/log/spark/apps/application_1512353561403_50748_1.snappy
rm zzz
We handle this depending on the scenario:
- If it is an RDD, convert it to a DataFrame, e.g.
RDD.toDF
does not require parameters; if you want to specify the column names, you can do it with rdd.toDF("c1", "c2", "c3")
After converting to a DataFrame, suppose you want to write it out in Parquet format with Snappy compression; you need to use sqlContext:
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
or, for gzip compression:
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
After this, use the following command:
XXDF.write.parquet("your_path")
and it will be saved with Snappy compression.
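If you're on Spark 2.x, an alternative to setting a global config is to pass the compression codec per write via the DataFrameWriter's "compression" option. A sketch, where df and the output path are placeholders:

```scala
// Per-write compression instead of a global sqlContext setting.
// "compression" is a documented option of DataFrameWriter.parquet.
df.write
  .option("compression", "snappy")   // or "gzip", "none"
  .parquet("your_path")
```

This keeps the codec choice local to the one write instead of changing it for the whole session.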