Pyspark: get list of files/directories on HDFS path


Using the JVM gateway may not be the most elegant approach, but in some cases the code below can be helpful:

```python
URI           = sc._gateway.jvm.java.net.URI
Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())
status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))
for fileStatus in status:
    print(fileStatus.getPath())
```


I believe it's helpful to think of Spark purely as a data processing tool, with a domain that begins at loading the data. It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS. However, it doesn't have a built-in facility that I'm aware of for traversing directories or files, nor utilities specific to interacting with Hadoop or HDFS.
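As a local illustration of what those glob expressions buy you (the paths and pattern below are made up, and Python's `fnmatch` is only a stand-in for Hadoop's matcher — notably, fnmatch's `*` also crosses `/`, while Hadoop applies globs per path component):

```python
from fnmatch import fnmatch

# Hypothetical HDFS paths; in Spark you would hand the glob straight to the
# reader, e.g. spark.read.text('hdfs://somehost:8020/logs/2021-*/part-*')
paths = [
    "/logs/2021-01/part-0000",
    "/logs/2021-02/part-0001",
    "/logs/archive/part-0002",
]

# Select only the 2021 partitions, as the glob above would
matched = [p for p in paths if fnmatch(p, "/logs/2021-*/part-*")]
print(matched)  # the two 2021 paths
```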

There are a few tools available to do what you want, including esutil and hdfs. The hdfs lib supports both a CLI and a Python API; you can jump straight to 'how do I list HDFS files in Python' in its documentation. It looks like this:

```python
from hdfs import Config

client = Config().get_client('dev')
files = client.list('the_dir_path')
```


If you use PySpark, you can run HDFS shell commands interactively from Python:


List all files from a chosen directory:

`hdfs dfs -ls <path>`, e.g.: `hdfs dfs -ls /user/path`:

```python
import subprocess

cmd = 'hdfs dfs -ls /user/path'
# check_output returns bytes, so decode before splitting into lines
files = subprocess.check_output(cmd, shell=True).decode().strip().split('\n')
for path in files:
    print(path)
```
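One caveat when parsing `hdfs dfs -ls` output: the first line is a `Found N items` header, and each entry carries permission, owner, size and date columns before the path. A small helper can pull out just the paths (the function name and sample output below are illustrative; the column layout matches what `-ls` prints):

```python
def parse_hdfs_ls(output):
    """Extract the path (last column) from each `hdfs dfs -ls` output line,
    skipping the 'Found N items' header."""
    paths = []
    for line in output.strip().split('\n'):
        if line.startswith('Found'):
            continue
        paths.append(line.split()[-1])
    return paths

# Illustrative sample of `hdfs dfs -ls` output
sample = (
    "Found 2 items\n"
    "drwxr-xr-x   - user group          0 2021-01-01 00:00 /user/path/dir1\n"
    "-rw-r--r--   3 user group       1024 2021-01-01 00:00 /user/path/file.txt\n"
)
print(parse_hdfs_ls(sample))
```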

Or search files in a chosen directory:

`hdfs dfs -find <path> -name <expression>`, e.g.: `hdfs dfs -find /user/path -name '*.txt'` (quote the pattern so the local shell doesn't expand it):

```python
import subprocess

source_dir = '/user/path'
# Quote the pattern so the local shell doesn't expand *.txt before hdfs sees it
cmd = "hdfs dfs -find {} -name '*.txt'".format(source_dir)
files = subprocess.check_output(cmd, shell=True).decode().strip().split('\n')
for path in files:
    # HDFS paths always use '/', regardless of the local OS separator
    filename = path.rsplit('/', 1)[-1].rsplit('.txt', 1)[0]
    print(path, filename)
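Since HDFS paths always use `/` regardless of the client OS, `posixpath` is a safer way to split them than `os.path`, whose separator is platform-dependent. A small sketch — the helper name here is made up:

```python
import posixpath

def stem(hdfs_path):
    """Return the filename of an HDFS path without its last extension."""
    name = posixpath.basename(hdfs_path)   # strip the directory part
    root, _ext = posixpath.splitext(name)  # strip the extension
    return root

print(stem("/user/path/report.txt"))  # report
```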