
Persisting Spark Streaming output


In solution #2, the number of files created can be controlled via the number of partitions of each RDD.

See this example:

// imports needed for the schema and Row below
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// create a Hive table (assume it's already existing)
sqlContext.sql("CREATE TABLE test (id int, txt string) STORED AS PARQUET")

// create an RDD with 2 records and only 1 partition
val rdd = sc.parallelize(List( List(1, "hello"), List(2, "world") ), 1)

// create a DataFrame from the RDD
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("txt", StringType, nullable = false)
))
val df = sqlContext.createDataFrame(rdd.map( Row(_:_*) ), schema)

// this creates a single file, because the RDD has 1 partition
df.write.mode("append").saveAsTable("test")

Now, I guess you can play with the frequency at which you pull data from Kafka, and with the number of partitions of each RDD (by default, one per partition of your Kafka topic, which you can reduce by repartitioning).
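
For illustration, here is a rough sketch of how that could look with a direct Kafka stream in Spark 1.5. The stream setup (ssc, kafkaParams, the topic name) and the parseRecord helper are assumptions for this example, not part of the original answer; the schema and "test" table are the ones defined above.

import kafka.serializer.StringDecoder
import org.apache.spark.sql.Row
import org.apache.spark.streaming.kafka.KafkaUtils

// direct stream: one RDD partition per Kafka topic partition by default
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my_topic"))

stream.foreachRDD { rdd =>
  // reduce the number of partitions (and therefore output files) before writing
  val rows = rdd.coalesce(1).map { case (_, value) => Row(parseRecord(value): _*) } // parseRecord is a hypothetical value-to-columns helper
  val df = sqlContext.createDataFrame(rows, schema)
  df.write.mode("append").saveAsTable("test")
}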

I'm using Spark 1.5 from CDH 5.5.1, and I get the same result using either df.write.mode("append").saveAsTable("test") or your SQL string.


I think the small file problem can be resolved to some extent. You may be getting a large number of files because of the number of Kafka partitions. For example, I have a 12-partition Kafka topic and I write using df.write.mode("append").parquet("/location/on/hdfs").

Now, depending on your requirements, you can add coalesce(1) (or a higher value) to reduce the number of files. Another option is to increase the micro-batch duration: for example, if you can accept a 5-minute delay in writing the data, you can use a micro batch of 300 seconds.
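
A minimal sketch of both knobs together, under the same assumptions as before (stream and schema come from your own setup, and toRow is a hypothetical record-to-Row mapping):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// 5-minute micro batches: fewer, larger writes
val ssc = new StreamingContext(sc, Seconds(300))

stream.foreachRDD { rdd =>
  val df = sqlContext.createDataFrame(rdd.map(toRow), schema)
  // coalesce the DataFrame so each batch writes a single file
  df.coalesce(1).write.mode("append").parquet("/location/on/hdfs")
}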

For the second issue, the batches queue up because you don't have back pressure enabled. First, verify the maximum number of records you can process in a single batch. Once you know that number, you can set the spark.streaming.kafka.maxRatePerPartition value and spark.streaming.backpressure.enabled=true to limit the number of records per micro batch. If you still cannot keep up with the demand, the only options are to increase the number of partitions on the topic or to increase the resources of the Spark application.
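
If it helps, here is a sketch of how those two settings might be applied; the app name and the rate value (10000) are only placeholders, and the rate should come from the per-batch capacity you measured:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-to-hdfs") // hypothetical app name
  // let Spark Streaming adapt the ingestion rate to the processing rate
  .set("spark.streaming.backpressure.enabled", "true")
  // upper bound on records read per second, per Kafka partition
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")

val ssc = new StreamingContext(conf, Seconds(300))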