Is Snappy splittable or not splittable?

hadoop snappy

Both statements are correct, but at different levels.

According to the Cloudera blog post http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/:

"One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce. This is different to LZO, where it is possible to index LZO compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing."

This means that if a whole text file is compressed with Snappy as a single stream, the file is NOT splittable. But if each record (or block of records) inside the file is compressed with Snappy separately, the file can be splittable, as with Sequence Files using block compression.

To be clearer, this layout:

<START-FILE>  <START-SNAPPY-BLOCK>     FULL CONTENT  <END-SNAPPY-BLOCK><END-FILE>

is not the same as this one:

<START-FILE>  <START-SNAPPY-BLOCK1>     RECORD1  <END-SNAPPY-BLOCK1>  <START-SNAPPY-BLOCK2>     RECORD2  <END-SNAPPY-BLOCK2>  <START-SNAPPY-BLOCK3>     RECORD3  <END-SNAPPY-BLOCK3><END-FILE>

Snappy blocks are NOT splittable, but files made up of Snappy blocks are splittable.
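The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: zlib stands in for Snappy (Snappy is not in the standard library), and the length-prefixed container with its `offsets` list is a hypothetical format invented for this example. Real Sequence Files locate block boundaries with sync markers rather than an offset list.

```python
import struct
import zlib

# Assumption: zlib is a stand-in for Snappy. Like a single Snappy stream,
# a zlib stream cannot be decoded starting from an arbitrary byte offset.

def compress_whole(records):
    """Layout 1: one compressed stream covering the full content."""
    return zlib.compress(b"\n".join(records))

def compress_per_block(records):
    """Layout 2 (hypothetical container): each record is compressed on its
    own and prefixed with a 4-byte big-endian length."""
    out = bytearray()
    offsets = []  # byte offset where each block starts
    for rec in records:
        block = zlib.compress(rec)
        offsets.append(len(out))
        out += struct.pack(">I", len(block)) + block
    return bytes(out), offsets

def read_from(container, offset):
    """Decode every block from `offset` to the end, ignoring earlier bytes."""
    records = []
    while offset < len(container):
        (size,) = struct.unpack_from(">I", container, offset)
        offset += 4
        records.append(zlib.decompress(container[offset:offset + size]))
        offset += size
    return records

records = [b"RECORD1", b"RECORD2", b"RECORD3"]
whole = compress_whole(records)
container, offsets = compress_per_block(records)

# A second reader can start at a block boundary and work independently.
tail = read_from(container, offsets[1])

# Starting in the middle of the single compressed stream does not work.
try:
    zlib.decompress(whole[len(whole) // 2:])
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False
```

Here `tail` holds the last two records even though the reader never looked at the first block, which is what lets a framework hand different parts of the file to different tasks. The single-stream layout has no such entry points.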


All splittable codecs in Hadoop must implement org.apache.hadoop.io.compress.SplittableCompressionCodec. In the Hadoop source code as of 2.7, org.apache.hadoop.io.compress.SnappyCodec does not implement this interface, so the codec itself is not splittable.


I have just tested this with Spark 1.6.2 on HDFS, using the same number of workers/processors, comparing plain JSON files against the same data compressed with Snappy:

JSON: 4 files of 12 GB each; Spark creates 388 tasks (1 task per HDFS block; 4 * 12 GB / 128 MB = 384)
Snappy: 4 files of 3 GB each; Spark creates 4 tasks (1 task per file)
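The task counts above follow from simple arithmetic, sketched here. The function `estimate_tasks` is a hypothetical helper written for this illustration, not a Spark API, and it assumes one task per 128 MB HDFS block for splittable input and one task per file otherwise.

```python
GB = 1024 ** 3
MB = 1024 ** 2

def estimate_tasks(file_sizes, block_size, splittable):
    """Rough task estimate: one task per block if the files can be split,
    otherwise one task per file."""
    if splittable:
        # Integer ceiling division: a partial last block still needs a task.
        return sum(-(-size // block_size) for size in file_sizes)
    return len(file_sizes)

json_tasks = estimate_tasks([12 * GB] * 4, 128 * MB, splittable=True)
snappy_tasks = estimate_tasks([3 * GB] * 4, 128 * MB, splittable=False)
snappy_if_splittable = estimate_tasks([3 * GB] * 4, 128 * MB, splittable=True)

print(json_tasks, snappy_tasks, snappy_if_splittable)  # 384 4 96
```

The measured 388 tasks for JSON is slightly above the estimate of 384, which would be consistent with files a little larger than exactly 12 GB. The last figure shows what is lost: had the Snappy files been splittable, the same data could have been spread over roughly 96 tasks instead of 4.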

The Snappy files were created like this: .saveAsTextFile("/user/qwant/benchmark_file_format/json_snappy", classOf[org.apache.hadoop.io.compress.SnappyCodec])

So Snappy is not splittable in Spark when applied to JSON text files.

But if you use the Parquet (or ORC) file format instead of JSON, the data will be splittable (even with gzip).
