Reading large number of Excel files into Apache Spark


Use the following code to read Excel files in Spark directly from HDFS using the Hadoop FileSystem API. Note that you still have to use the Apache POI API to parse the data.

import java.io.InputStream
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.poi.ss.usermodel.{Cell, Row, Sheet, Workbook, WorkbookFactory}
import org.apache.spark.{SparkConf, SparkContext}

object Excel {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Excel-read-write").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Open the workbook straight from HDFS through the Hadoop FileSystem API.
    val fs = FileSystem.get(URI.create("hdfs://localhost:9000/user/files/timetable.xlsx"), new Configuration())
    val path = new Path("hdfs://localhost:9000/user/files/timetable.xlsx")
    val in = fs.open(path) // named `in` so it does not shadow the InputStream type
    try read(in) finally in.close()

    sc.stop()
  }

  // Parse the workbook with Apache POI here.
  def read(in: InputStream) = {
  }
}

The read(in: InputStream) method is where you use the Apache POI API to parse the data.
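As a rough illustration of what the body of read could look like, here is a minimal sketch that returns the rows of the first sheet as strings. The object name ExcelReadSketch, the return type, and the choice of the first sheet are assumptions made for the example, not part of the answer above. It relies on WorkbookFactory.create, which accepts both .xls and .xlsx input, and on DataFormatter to render each cell as text.

```scala
import java.io.InputStream

import org.apache.poi.ss.usermodel.{DataFormatter, WorkbookFactory}

import scala.collection.JavaConverters._

// Hypothetical helper object, named only for this example.
object ExcelReadSketch {
  // Returns every physical row of the first sheet as a list of formatted cell strings.
  def read(in: InputStream): Seq[Seq[String]] = {
    val workbook = WorkbookFactory.create(in) // detects .xls vs .xlsx
    try {
      val formatter = new DataFormatter()
      val sheet = workbook.getSheetAt(0) // assumption: data lives in the first sheet
      // toList forces evaluation before the workbook is closed.
      sheet.iterator().asScala.map { row =>
        row.cellIterator().asScala.map(cell => formatter.formatCellValue(cell)).toList
      }.toList
    } finally workbook.close()
  }
}
```

Two caveats: cellIterator skips cells that were never written, so sparse rows come back shorter than the others, and this approach loads the whole workbook into memory, which may be a problem for very large files.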


You can use the Spark Excel library to convert xlsx files to DataFrames directly. See this answer with a detailed example.

As of version 0.8.4, the library does not support streaming and loads all the source rows into memory for conversion.


If you are willing to build your own XLSX-to-CSV converter, the Apache POI Event API would be ideal for this. This API is suitable for spreadsheets with large memory footprints. See what it is about here. Here is an example of XLSX processing with the XSSF event code.
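To give a feel for the event-based approach, the following is a minimal sketch of an XLSX-to-CSV converter built on XSSFReader and XSSFSheetXMLHandler. The names XlsxToCsvSketch, CsvRowHandler, and convert are hypothetical, and the code assumes a reasonably recent POI release (one where SheetContentsHandler.cell takes an XSSFComment argument). It quotes every value and does not pad gaps left by empty cells, so treat it as a starting point rather than a complete converter.

```scala
import java.io.{InputStream, PrintStream}
import javax.xml.parsers.SAXParserFactory

import org.apache.poi.openxml4j.opc.OPCPackage
import org.apache.poi.xssf.eventusermodel.{ReadOnlySharedStringsTable, XSSFReader, XSSFSheetXMLHandler}
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler
import org.apache.poi.xssf.usermodel.XSSFComment
import org.xml.sax.InputSource

import scala.collection.mutable.ArrayBuffer

// Hypothetical converter, named only for this example.
object XlsxToCsvSketch {

  // Buffers the cells of the current row and prints them as one CSV line at row end.
  private class CsvRowHandler(out: PrintStream) extends SheetContentsHandler {
    private val cells = ArrayBuffer.empty[String]

    override def startRow(rowNum: Int): Unit = cells.clear()

    override def cell(ref: String, value: String, comment: XSSFComment): Unit = {
      // Quote every value and escape embedded quotes by doubling them.
      val text = Option(value).getOrElse("").replace("\"", "\"\"")
      cells.append("\"" + text + "\"")
    }

    override def endRow(rowNum: Int): Unit = out.println(cells.mkString(","))

    override def headerFooter(text: String, isHeader: Boolean, tagName: String): Unit = ()
  }

  // Streams every sheet in the workbook to `out`, one CSV line per row.
  def convert(in: InputStream, out: PrintStream): Unit = {
    val pkg = OPCPackage.open(in)
    try {
      val strings = new ReadOnlySharedStringsTable(pkg)
      val reader = new XSSFReader(pkg)
      val styles = reader.getStylesTable
      val factory = SAXParserFactory.newInstance()
      factory.setNamespaceAware(true) // the POI handler matches on namespace-qualified names
      val sheets = reader.getSheetsData
      while (sheets.hasNext) {
        val sheet = sheets.next()
        try {
          val xmlReader = factory.newSAXParser().getXMLReader
          xmlReader.setContentHandler(
            new XSSFSheetXMLHandler(styles, strings, new CsvRowHandler(out), false))
          xmlReader.parse(new InputSource(sheet))
        } finally sheet.close()
      }
    } finally pkg.revert() // close without attempting to save anything
  }
}
```

Because rows are handed to the handler one at a time, the sheet itself is never held in memory as a whole. Opening the package from a stream still buffers the file contents, so for very large workbooks it may be worth copying the file to local disk first and opening the package from a File instead.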