
How to put file to HDFS with Snappy compression


I suggest you write a map-reduce job to compress your data in HDFS. I don't know of a way to compress automatically during a hadoop put operation, so assume it does not exist. One option is to put an already compressed file:

snzip file.tar
hdfs dfs -put file.tar.sz /user/hduser/test/
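One caveat: plain snzip writes the Snappy framing format (hence the .sz suffix above), which Hadoop's own SnappyCodec generally cannot read. If the uploaded file needs to be readable by Hadoop itself, the hadoop-snappy variant used further down is the safer choice; a small sketch, reusing the same example path:

# assumption: hadoop-snappy framing produces file.tar.snappy, which Hadoop's SnappyCodec can read
snzip -t hadoop-snappy file.tar
hdfs dfs -put file.tar.snappy /user/hduser/test/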

Another way is to compress it inside a MapReduce job. As an option, you can use the Hadoop Streaming jar to compress your files within HDFS:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.reduce.tasks=0 \
  -input <input-path> \
  -output $OUTPUT
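The streaming command above does not name a mapper; for a pure pass-through re-compression job a common pattern is to use /bin/cat as the mapper, so records flow through unchanged while the framework writes Snappy-compressed output. A minimal sketch, assuming the same streaming jar and hypothetical input/output directories:

# identity "mapper": records pass through unchanged, output is written with SnappyCodec
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapred.reduce.tasks=0 \
  -input /user/hduser/test/uncompressed \
  -output /user/hduser/test/compressed \
  -mapper /bin/cat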


Say you have a Spark log file in HDFS that isn't compressed, but you want to turn on spark.eventLog.compress true in spark-defaults.conf and also go back and compress the old logs. The map-reduce approach would make the most sense, but as a one-off you can also use:

snzip -t hadoop-snappy local_file_will_end_in_dot_snappy

And then put it directly back into HDFS.
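For the spark-defaults.conf change mentioned above, a minimal sketch; note that, per the Spark docs, event-log compression uses the codec from spark.io.compression.codec (lz4 by default in recent versions), so pin that to snappy as well if Snappy output is specifically what you want:

spark.eventLog.compress true
spark.io.compression.codec snappy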

Installing snzip may look similar to this:

sudo yum install snappy snappy-devel
curl -O https://dl.bintray.com/kubo/generic/snzip-1.0.4.tar.gz
tar -zxvf snzip-1.0.4.tar.gz
cd snzip-1.0.4
./configure
make
sudo make install

Your round trip for a single file could be:

hdfs dfs -copyToLocal /var/log/spark/apps/application_1512353561403_50748_1 .
snzip -t hadoop-snappy application_1512353561403_50748_1
hdfs dfs -copyFromLocal application_1512353561403_50748_1.snappy /var/log/spark/apps/application_1512353561403_50748_1.snappy
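If the point of the round trip is to replace the uncompressed log, you would presumably also verify the compressed copy and then clean up; a sketch (the -text check assumes the Snappy native libraries are available to the Hadoop client):

# spot-check that Hadoop can decompress the .snappy copy
hdfs dfs -text /var/log/spark/apps/application_1512353561403_50748_1.snappy | head
# then drop the uncompressed original in HDFS and any local temporary copies
hdfs dfs -rm /var/log/spark/apps/application_1512353561403_50748_1
rm -f application_1512353561403_50748_1 application_1512353561403_50748_1.snappy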

Or with gohdfs:

hdfs cat /var/log/spark/apps/application_1512353561403_50748_1 \
  | snzip -t hadoop-snappy > zzz
hdfs put zzz /var/log/spark/apps/application_1512353561403_50748_1.snappy
rm zzz


We solved this with the following approach:

  1. If it is an RDD, convert it to a DataFrame, e.g. RDD.toDF. It does not require parameters, but in case you want to specify the column names you can do it with rdd.toDF("c1","c2","c3").
  2. After converting to a DataFrame, suppose you want to write it in the Parquet file format with Snappy compression; you need to set that through sqlContext:

    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

    or, for gzip compression:

    sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

  3. After this, use XXDF.write.parquet("your_path") and the file will be saved with Snappy compression.
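The same idea as a one-off from the command line: the Parquet codec can also be set when launching the shell, so anything written afterwards with .write.parquet(...) comes out Snappy-compressed. A minimal sketch (the output path is hypothetical):

# launch with the Parquet compression codec set up front
spark-shell --conf spark.sql.parquet.compression.codec=snappy
# then, inside the shell: rdd.toDF("c1","c2","c3").write.parquet("/tmp/your_path")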