
Pyspark can't find csv in docker


My Solution:

I had to use an Ubuntu image as the Docker base image. On this image I installed Python, PySpark, and Spark. Dockerfile:

FROM ubuntu:latest
RUN apt-get update
RUN apt-get install -y openjdk-8-jdk
RUN apt-get update
RUN apt-get install git -y
RUN apt-get update
RUN apt-get install wget -y
COPY handler.py /
COPY Crimes.csv /
RUN wget 'https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz'
RUN tar -xzvf spark-3.0.1-bin-hadoop2.7.tgz
RUN rm spark-3.0.1-bin-hadoop2.7.tgz
RUN apt-get install -y python3-pip python3-dev python3
RUN apt-get update
RUN pip3 install --upgrade pip
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN pip install pyspark
RUN sed -i.py 's/\r$//' handler.py
CMD ./spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://spark-master:7077 --files Crimes.csv ./handler.py

The spark-submit command with --files uploads the CSV to the master and to all workers. After this I was able to read the CSV file with the following code:

from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.appName("pysparkapp").config("spark.executor.memory", "512m").getOrCreate()
sc = spark.sparkContext
df = sc.textFile(SparkFiles.get('Crimes.csv'))

SparkFiles.get('fileName') returns the path at which Spark stored the file in its working directory after it was uploaded with the spark-submit --files option.
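
Note that sc.textFile returns an RDD of raw text lines, so you still have to split and parse each line yourself. If you want a DataFrame instead, the same SparkFiles.get path can be passed to spark.read.csv. The sketch below is only an illustration of that idea, not part of the original handler.py: the header and inferSchema options and the file:// prefix are assumptions on my side, and they rely on the CSV having a header row and on the file being available at the same local path on the driver and the workers (which the --files upload arranges in this setup).

from pyspark.sql import SparkSession
from pyspark import SparkFiles

spark = SparkSession.builder.appName("pysparkapp").config("spark.executor.memory", "512m").getOrCreate()

# --files placed a local copy of Crimes.csv in Spark's working directory;
# SparkFiles.get resolves its absolute path, and the file:// prefix marks it
# as a local filesystem path rather than an HDFS path.
csv_path = "file://" + SparkFiles.get("Crimes.csv")

# header/inferSchema are assumptions about the CSV layout, not taken from the original post.
df = spark.read.csv(csv_path, header=True, inferSchema=True)
df.printSchema()
print("row count:", df.count())

spark.stop()

The advantage of the DataFrame reader is that column names and types are available immediately, whereas the textFile variant gives you one string per line that you still need to parse.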