
Pyspark can't find csv in docker


My Solution:

I had to use an Ubuntu image as the Docker base image. On this image I installed Python, PySpark, and Spark. Dockerfile:

FROM ubuntu:latest
RUN apt-get update
RUN apt-get install -y openjdk-8-jdk
RUN apt-get update
RUN apt-get install git -y
RUN apt-get update
RUN apt-get install wget -y
COPY handler.py /
COPY Crimes.csv /
RUN wget 'https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz'
RUN tar -xzvf spark-3.0.1-bin-hadoop2.7.tgz
RUN rm spark-3.0.1-bin-hadoop2.7.tgz
RUN apt-get install -y python3-pip python3-dev python3
RUN apt-get update
RUN pip3 install --upgrade pip
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN pip install pyspark
RUN sed -i.py 's/\r$//' handler.py
CMD ./spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://spark-master:7077 --files Crimes.csv ./handler.py

The spark-submit command with --files uploads the CSV to the master and to all workers. After this I was able to read the CSV file with the following code:

from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.appName("pysparkapp").config("spark.executor.memory", "512m").getOrCreate()
sc = spark.sparkContext
df = sc.textFile(SparkFiles.get('Crimes.csv'))

SparkFiles.get('fileName') returns the path at which Spark stored the file in its working directory after it was uploaded with the spark-submit --files option.
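
Note that sc.textFile returns an RDD of raw text lines, so you still have to split and parse each line yourself. If you want a DataFrame instead, the same SparkFiles.get path can be passed to spark.read.csv. The sketch below is only an illustration of that idea, not part of the original handler.py: the header and inferSchema options and the file:// prefix are assumptions on my side, and they rely on the CSV having a header row and on the file being available at the same local path on the driver and the workers (which the --files upload arranges in this setup).

from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.appName("pysparkapp").config("spark.executor.memory", "512m").getOrCreate()

# --files placed a local copy of Crimes.csv in Spark's working directory;
# SparkFiles.get resolves its absolute path, and the file:// prefix marks it
# as a local filesystem path rather than an HDFS path.
csv_path = "file://" + SparkFiles.get("Crimes.csv")

# header/inferSchema are assumptions about the CSV layout, not taken from the original post.
df = spark.read.csv(csv_path, header=True, inferSchema=True)
df.printSchema()
print("row count:", df.count())

spark.stop()

The advantage of the DataFrame reader is that column names and types are available immediately, whereas the textFile variant gives you one string per line that you still need to parse.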