pyspark: how to check if a file exists in HDFS
Right, as Tristan Reid says:
...(Spark) can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a built-in facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path
Once you have the list of files in a directory, it is easy to check if a particular file exists.
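For example, a minimal sketch of that check (the directory and file names here are made up, and it reaches Hadoop's FileSystem API through Spark's py4j gateway, which is an internal rather than official PySpark interface):

hadoop = sc._jvm.org.apache.hadoop                       # assumes an existing SparkContext `sc`
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
dir_path = hadoop.fs.Path("/user/someone/data")          # hypothetical HDFS directory
names = [status.getPath().getName() for status in fs.listStatus(dir_path)]
print("part-00000" in names)                             # is this file in the listing?
print(fs.exists(hadoop.fs.Path("/user/someone/data/part-00000")))  # or ask HDFS directly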
I hope it can help somehow.
One possibility is that you can use hadoop fs -lsr your_path to get all the paths, and then check if the paths you're interested in are in that set.
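As a rough sketch of that idea (the paths are made up, and it assumes the hadoop CLI is on your PATH):

import subprocess

listing = subprocess.check_output("hadoop fs -lsr /user/someone", shell=True)
all_paths = set(line.split()[-1] for line in listing.decode().splitlines() if line.strip())
print("/user/someone/data/part-00000" in all_paths)      # the path is the last field of each line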
Regarding your crash, it's possible it was a result of all the calls to os.system, rather than being specific to the hadoop command. Sometimes calling an external process can result in issues related to buffers that never get released, in particular I/O buffers (stdin/stdout).
One solution would be to make a single call to a bash script that loops over all the paths. You can create the script using a string template in your code, fill in the array of paths in the script, write it out, and then execute it.
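Loosely sketching that (the paths are made up, and I'm using hadoop fs -test -e, which exits with 0 when a path exists):

import os

paths = ["/user/someone/data/a", "/user/someone/data/b"]    # hypothetical HDFS paths
script = "#!/bin/bash\n"
script += "for p in {0}; do\n".format(" ".join(paths))
script += '  hadoop fs -test -e "$p" && echo "$p"\n'        # print the paths that exist
script += "done\n"
with open("check_paths.sh", "w") as f:
    f.write(script)
os.system("bash check_paths.sh")                            # one external call instead of one per path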
It may also be a good idea to switch to Python's subprocess module, which gives you more granular control over handling subprocesses. Here's the equivalent of the os.system call:
from subprocess import PIPE, Popen

process = Popen(your_script, stdout=PIPE, shell=True)       # run the script, capturing its stdout
output, _ = process.communicate()
Note that you can switch stdout to something like a file handle if that helps you with debugging or making the process more robust. Also, you can switch that shell=True argument to False unless you're going to call an actual script or use shell-specific things like pipes or redirection.
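For instance, with shell=False the command is passed as a list of arguments instead of a single string (the path here is made up):

import subprocess

# hadoop fs -test -e exits with 0 when the path exists
process = subprocess.Popen(["hadoop", "fs", "-test", "-e", "/user/someone/data/part-00000"])
exists = (process.wait() == 0)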