How to convert Spark RDD to pandas dataframe in ipython?
You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.
For example, let's say I have a text file, flights.csv, that has been read into an RDD:
flights = sc.textFile('flights.csv')
You can check the type:
type(flights)
<class 'pyspark.rdd.RDD'>
If you just call toPandas() on the RDD, it won't work: toPandas() is a method of Spark DataFrames, not of RDDs. Depending on the format of the objects in your RDD, some processing may be necessary to get to a Spark DataFrame first. In the case of this example, this code does the job:
# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()

# Spark DataFrame to Pandas DataFrame
pdsDF = sparkDF.toPandas()
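To see what the two map steps are doing, here is a plain-Python sketch of the same parsing logic (the sample lines are made up, since the actual contents of flights.csv are not shown; pandas stands in for the Spark DataFrame at the end):

```python
import pandas as pd

# Hypothetical sample lines, standing in for rows of flights.csv
lines = [
    "AA,JFK,LAX,2015-01-01",
    "DL,ATL,SFO,2015-01-02",
]

# Equivalent of .map(lambda x: str(x)).map(lambda w: w.split(','))
rows = [str(line).split(',') for line in lines]

# toDF() builds a Spark DataFrame from these lists; pandas can do the
# same locally for illustration
df = pd.DataFrame(rows)
print(df.shape)  # (2, 4)
```

Each CSV line becomes a list of strings, and the DataFrame is built from those row lists.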
You can check the type:
type(pdsDF)
<class 'pandas.core.frame.DataFrame'>
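One caveat: because toDF() was called without column names, the columns come through with Spark's default labels (_1, _2, ...). You can assign meaningful names on the pandas side; the names below are hypothetical, chosen just for illustration:

```python
import pandas as pd

# Stand-in for pdsDF as it arrives, with Spark's default column labels
pdsDF = pd.DataFrame([["AA", "JFK"], ["DL", "ATL"]], columns=["_1", "_2"])

# Replace the default labels with meaningful ones (hypothetical names)
pdsDF.columns = ["carrier", "origin"]
```

Alternatively, pass the names to toDF() itself, e.g. toDF(['carrier', 'origin', ...]), so the Spark DataFrame is labelled before conversion.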
I recommend joshlk's fast version of toPandas:
import pandas as pd

def _map_to_pandas(rdds):
    """ Needs to be here due to pickling issues """
    return [pd.DataFrame(list(rdds))]

def toPandas(df, n_partitions=None):
    """
    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
    repartitioned if `n_partitions` is passed.
    :param df:            pyspark.sql.DataFrame
    :param n_partitions:  int or None
    :return:              pandas.DataFrame
    """
    if n_partitions is not None:
        df = df.repartition(n_partitions)
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand
Gist: https://gist.github.com/joshlk/871d58e01417478176e7
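The speedup comes from building one pandas DataFrame per partition inside the workers and concatenating the resulting frames locally, rather than collecting individual Row objects. Since pyspark may not be available here, this pandas-only sketch mimics what mapPartitions plus pd.concat do, with plain lists of tuples standing in for RDD partitions:

```python
import pandas as pd

def _map_to_pandas(rows):
    # One pandas DataFrame per "partition", as in joshlk's helper
    return [pd.DataFrame(list(rows))]

# Hypothetical data split into three chunks, standing in for RDD partitions
partitions = [
    [(1, "a"), (2, "b")],
    [(3, "c")],
    [(4, "d"), (5, "e")],
]

# Equivalent of df.rdd.mapPartitions(_map_to_pandas).collect()
frames = [frame for part in partitions for frame in _map_to_pandas(part)]

# Equivalent of pd.concat(df_pand) followed by restoring the column names
result = pd.concat(frames, ignore_index=True)
result.columns = ["id", "letter"]
print(len(result))  # 5
```

Concatenating a handful of per-partition frames is much cheaper than constructing one pandas row at a time, which is where the speed gain over the stock toPandas() comes from.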