Spark iterate HDFS directory
You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true).
And with Spark...
FileSystem.get(sc.hadoopConfiguration).listFiles(..., true)
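A minimal end-to-end sketch (assuming an existing SparkContext sc and a hypothetical directory /some/dir). listFiles returns a RemoteIterator[LocatedFileStatus], so you walk it with hasNext/next:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// second argument = true: recurse into subdirectories
val it = fs.listFiles(new Path("/some/dir"), true)
while (it.hasNext) {
  val status = it.next()
  println(s"${status.getPath} ${status.getLen}")
}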
Edit
It's worth noting that it is good practice to get the FileSystem that is associated with the Path's scheme:
path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)
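For example, with a path that carries an explicit scheme, the right FileSystem implementation is resolved from the path itself, so the same code works for hdfs://, s3a://, file://, etc. (sketch with a hypothetical URI):

import org.apache.hadoop.fs.Path

val path = new Path("hdfs:///user/hive/warehouse")  // hypothetical location
val fs = path.getFileSystem(sc.hadoopConfiguration)
val it = fs.listFiles(path, true)
while (it.hasNext) println(it.next().getPath)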
Here's a PySpark version, if someone is interested:
hadoop = sc._jvm.org.apache.hadoop   # Py4J gateway to the Hadoop classes
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/hivewarehouse/disc_mrt.db/unified_fact/')

for f in fs.get(conf).listStatus(path):
    print(f.getPath(), f.getLen())
In this particular case I get a list of all the files that make up the disc_mrt.unified_fact Hive table.
Other methods of the FileStatus object, like getLen() to get the file size, are described in the Hadoop FileStatus API documentation.
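For reference, a quick Scala sketch touching a few of those accessors (assuming sc and a hypothetical directory):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
for (status <- fs.listStatus(new Path("/some/dir"))) {
  // getPath, getLen, isDirectory, getModificationTime are all FileStatus methods
  println(s"${status.getPath}  ${status.getLen} bytes  dir=${status.isDirectory}")
}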
import org.apache.hadoop.fs.{FileSystem, Path}

FileSystem.get(sc.hadoopConfiguration)
  .listStatus(new Path("hdfs:///tmp"))
  .foreach(x => println(x.getPath))
This worked for me.
Spark version 1.5.0-cdh5.5.2