How do I read Snappy compressed files on HDFS without using Hadoop?



I finally found out that I can use the following command to read the contents of a Snappy compressed file on HDFS:

hadoop fs -text /path/filename

On newer Cloudera or HDP releases, the equivalent command is:

hdfs dfs -text /path/filename

If the intent is to download the file in text form for further examination and processing, the output of that command can be redirected to a file on the local system. You can also pipe it to head to view just the first few lines of the file.
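For example (the HDFS path and local filename are placeholders):

```shell
# Decode the Snappy-compressed file and save a plain-text copy locally
hdfs dfs -text /path/filename > /tmp/filename.txt

# Or just preview the first ten lines without downloading the whole file
hdfs dfs -text /path/filename | head -n 10
```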


Please take a look at this post on the Cloudera blog. It explains how to use Snappy with Hadoop. Essentially, Snappy-compressed raw text files are not splittable, so a single file cannot be read in parallel across multiple hosts.

The solution is to use Snappy in a container format, so essentially you're using a Hadoop SequenceFile with its compression set to Snappy. As described in this answer, you can set the property mapred.output.compression.codec to org.apache.hadoop.io.compress.SnappyCodec and set your job's output format to SequenceFileOutputFormat.

And then to read it, you should only need to use SequenceFile.Reader because the codec information is stored in the file header.
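As a sketch of how this looks from the command line, here is a Hadoop Streaming job configured to write Snappy-compressed SequenceFiles (the jar path, input/output paths, and identity mapper/reducer are placeholder assumptions, not from the answer above):

```shell
# Write job output as SequenceFiles compressed with the Snappy codec
# (jar location and HDFS paths are placeholders)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat \
  -input /data/in \
  -output /data/out \
  -mapper cat \
  -reducer cat

# Because the codec name is recorded in each SequenceFile header,
# -text can decode the result with no extra flags
hdfs dfs -text /data/out/part-00000
```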


That's because the Snappy format Hadoop uses adds extra framing metadata that is not understood by plain Snappy libraries such as https://code.google.com/p/snappy/. You need Hadoop's native Snappy codec to decompress a data file that you have downloaded.
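If you do need to decompress a downloaded file locally without Hadoop, one option (an assumption on my part, not mentioned in the answers above) is the snzip utility, which understands the Hadoop-Snappy framing:

```shell
# Hypothetical example: decompress a Hadoop-Snappy file locally with snzip
# (assumes snzip is installed; the filename is a placeholder)
snzip -d -t hadoop-snappy part-00000.snappy
```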