pyspark: how to check if a file exists in HDFS
Right, as Tristan Reid says:
...(Spark) can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a built-in facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path
Once you have the list of files in a directory, it is easy to check if a particular file exists.
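For example, a minimal sketch of that check (the directory and file names here are made up, and it reaches Hadoop's FileSystem API through Spark's py4j gateway, which is an internal rather than official PySpark interface):

hadoop = sc._jvm.org.apache.hadoop                       # assumes an existing SparkContext `sc`
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
dir_path = hadoop.fs.Path("/user/someone/data")          # hypothetical HDFS directory
names = [status.getPath().getName() for status in fs.listStatus(dir_path)]
print("part-00000" in names)                             # is this file in the listing?
print(fs.exists(hadoop.fs.Path("/user/someone/data/part-00000")))  # or ask HDFS directly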
I hope it can help somehow.
One possibility is that you can use hadoop fs -lsr your_path to get all the paths, and then check if the paths you're interested in are in that set.
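As a rough sketch of that idea (the paths are made up, and it assumes the hadoop CLI is on your PATH):

import subprocess

listing = subprocess.check_output("hadoop fs -lsr /user/someone", shell=True)
all_paths = set(line.split()[-1] for line in listing.decode().splitlines() if line.strip())
print("/user/someone/data/part-00000" in all_paths)      # the path is the last field of each line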
Regarding your crash, it's possible it was a result of all the calls to os.system, rather than being specific to the hadoop command. Sometimes calling an external process can result in issues related to buffers that never get released, in particular I/O buffers (stdin/stdout).
One solution would be to make a single call to a bash script that loops over all the paths. You can create the script using a string template in your code, fill in the array of paths in the script, write it out, and then execute it.
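Loosely sketching that (the paths are made up, and I'm using hadoop fs -test -e, which exits with 0 when a path exists):

import os

paths = ["/user/someone/data/a", "/user/someone/data/b"]    # hypothetical HDFS paths
script = "#!/bin/bash\n"
script += "for p in {0}; do\n".format(" ".join(paths))
script += '  hadoop fs -test -e "$p" && echo "$p"\n'        # print the paths that exist
script += "done\n"
with open("check_paths.sh", "w") as f:
    f.write(script)
os.system("bash check_paths.sh")                            # one external call instead of one per path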
It may also be a good idea to switch to Python's subprocess module, which gives you more granular control over handling subprocesses. Here's the equivalent of the os.system call:
from subprocess import PIPE, Popen

process = Popen(your_script, stdout=PIPE, shell=True)       # run the script, capturing its stdout
output, _ = process.communicate()
Note that you can switch stdout to something like a file handle if that helps you with debugging or making the process more robust. Also, you can switch that shell=True argument to False unless you're going to call an actual script or use shell-specific things like pipes or redirection.
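For instance, with shell=False the command is passed as a list of arguments instead of a single string (the path here is made up):

import subprocess

# hadoop fs -test -e exits with 0 when the path exists
process = subprocess.Popen(["hadoop", "fs", "-test", "-e", "/user/someone/data/part-00000"])
exists = (process.wait() == 0)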