
Spark iterate HDFS directory


You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true)

And with Spark...

FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)

Edit

It's worth noting that it's good practice to get the FileSystem associated with the Path's scheme:

path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
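
Note that listFiles returns a Hadoop RemoteIterator[LocatedFileStatus] rather than a Scala collection, so you drain it with a while loop. A minimal sketch, assuming sc is a live SparkContext and using hdfs:///tmp purely as an example path:

    import org.apache.hadoop.fs.{LocatedFileStatus, Path, RemoteIterator}

    val path = new Path("hdfs:///tmp")
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    // The second argument makes the listing recursive over the whole subtree
    val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(path, true)
    while (it.hasNext) {
      val status = it.next()
      println(s"${status.getPath} ${status.getLen}")
    }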


Here's the PySpark version, if someone is interested:

    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem
    conf = hadoop.conf.Configuration()
    path = hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/')
    # listStatus yields a FileStatus for each direct child of the directory
    for f in fs.get(conf).listStatus(path):
        print(f.getPath(), f.getLen())

In this particular case I get a list of all files that make up the disc_mrt.unified_fact Hive table.

Other methods of the FileStatus object, such as getLen() to get the file size, are described here:

Class FileStatus
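
The same accessors are available from Scala; a rough sketch along the same lines (again assuming sc is a live SparkContext, with hdfs:///tmp as a placeholder path):

    import org.apache.hadoop.fs.Path

    val path = new Path("hdfs:///tmp")
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    // listStatus returns an Array[FileStatus] for the directory's direct children
    fs.listStatus(path).foreach { status =>
      println(s"${status.getPath} len=${status.getLen} " +
        s"dir=${status.isDirectory} modified=${status.getModificationTime}")
    }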


    import org.apache.hadoop.fs.{FileSystem, Path}

    FileSystem.get(sc.hadoopConfiguration)
      .listStatus(new Path("hdfs:///tmp"))
      .foreach(x => println(x.getPath))

This worked for me.

Spark version 1.5.0-cdh5.5.2