pyspark : how to check if a file exists in hdfs

hadoop


Right, as Tristan Reid says:

...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
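For what it's worth, the glob part of that quote looks roughly like this in practice (a minimal illustration, not taken from his answer; sc is assumed to be an existing SparkContext and the path pattern is made up):

    # one RDD over every path matching the Hadoop glob expression
    rdd = sc.textFile("hdfs:///user/someone/logs/2015-*/part-*")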

Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path

Once you have the list of files in a directory, it is easy to check whether a particular file exists.
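As a rough sketch of that idea (this is not the linked answer verbatim; it reaches the Hadoop FileSystem API through Spark's JVM gateway, and the sc SparkContext and the paths are placeholders):

    Path = sc._jvm.org.apache.hadoop.fs.Path
    fs = Path("/").getFileSystem(sc._jsc.hadoopConfiguration())

    # list one directory and check for the file name
    names = [status.getPath().getName() for status in fs.listStatus(Path("/user/someone/data"))]
    print("part-00000" in names)

    # or ask HDFS about the exact path directly
    print(fs.exists(Path("/user/someone/data/part-00000")))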

I hope it can help somehow.


Have you tried using pydoop? The exists function should work.
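Something along these lines, assuming pydoop is installed and configured against your cluster (the path is a placeholder):

    import pydoop.hdfs as hdfs

    print(hdfs.path.exists("/user/someone/some_file.csv"))  # True if the path exists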


One possibility is that you can use hadoop fs -lsr your_path to get all the paths, and then check if the paths you're interested in are in that set.
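A hedged sketch of that approach: run the recursive listing once, parse it into a set, and test membership (the paths below are placeholders; the last whitespace-separated field of each output line is the path):

    import subprocess

    listing = subprocess.check_output(["hadoop", "fs", "-lsr", "/user/someone"])

    existing_paths = set()
    for line in listing.decode().splitlines():
        fields = line.split()
        if fields:
            existing_paths.add(fields[-1])  # last field of each line is the full path

    print("/user/someone/data/part-00000" in existing_paths)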

Regarding your crash, it's possible it was a result of all the calls to os.system rather than anything specific to the hadoop command. Calling an external process can sometimes cause issues with buffers that never get released, in particular I/O buffers (stdin/stdout).

One solution would be to make a single call to a bash script that loops over all the paths. You can create the script using a string template in your code, fill in the array of paths in the script, write it, then execute.
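For instance, a sketch of generating such a script (the file name is arbitrary, paths is assumed to be your list of HDFS paths, and it uses hadoop fs -test -e, which returns 0 when a path exists, rather than parsing a listing):

    script_template = """#!/bin/bash
    for p in {paths}; do
        if hadoop fs -test -e "$p"; then
            echo "EXISTS $p"
        else
            echo "MISSING $p"
        fi
    done
    """

    with open("check_paths.sh", "w") as f:
        f.write(script_template.format(paths=" ".join(paths)))

Running that script would then be the single external call, e.g. your_script = "bash check_paths.sh" in the snippet below.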

It may also be a good idea to switch to Python's subprocess module, which gives you more granular control over handling subprocesses. Here is roughly the equivalent of the os.system call:

    import subprocess

    process = subprocess.Popen(
        args=your_script,
        stdout=subprocess.PIPE,
        shell=True,
    )
    output, _ = process.communicate()  # wait for the script to finish and collect its stdout

Note that you can point stdout at something like a file handle if that helps with debugging or with making the process more robust. You can also switch the shell=True argument to False unless you're going to call an actual script or use shell-specific things like pipes or redirection.
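Putting those two suggestions together might look like this (the file names are placeholders):

    import subprocess

    with open("check_paths.log", "w") as log_file:
        process = subprocess.Popen(["bash", "check_paths.sh"], stdout=log_file)  # list args, no shell
        process.wait()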