How to convert Spark RDD to pandas dataframe in ipython?
You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.
For example, let's say I have a text file, flights.csv, that has been read into an RDD:
flights = sc.textFile('flights.csv')
You can check the type:
type(flights)
<class 'pyspark.rdd.RDD'>
If you just call toPandas() on the RDD, it won't work: toPandas() is a method of Spark DataFrames, not of RDDs. Depending on the format of the objects in your RDD, some processing may be necessary to get to a Spark DataFrame first. In the case of this example, this code does the job:
# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()

# Spark DataFrame to Pandas DataFrame
pdsDF = sparkDF.toPandas()
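To see what the two map steps are doing, here is a plain-Python sketch of the same parsing logic (the sample lines are made up, since the actual contents of flights.csv are not shown; pandas stands in for the Spark DataFrame at the end):

```python
import pandas as pd

# Hypothetical sample lines, standing in for rows of flights.csv
lines = [
    "AA,JFK,LAX,2015-01-01",
    "DL,ATL,SFO,2015-01-02",
]

# Equivalent of .map(lambda x: str(x)).map(lambda w: w.split(','))
rows = [str(line).split(',') for line in lines]

# toDF() builds a Spark DataFrame from these lists; pandas can do the
# same locally for illustration
df = pd.DataFrame(rows)
print(df.shape)  # (2, 4)
```

Each CSV line becomes a list of strings, and the DataFrame is built from those row lists.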
You can check the type:
type(pdsDF)
<class 'pandas.core.frame.DataFrame'>
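One caveat: because toDF() was called without column names, the columns come through with Spark's default labels (_1, _2, ...). You can assign meaningful names on the pandas side; the names below are hypothetical, chosen just for illustration:

```python
import pandas as pd

# Stand-in for pdsDF as it arrives, with Spark's default column labels
pdsDF = pd.DataFrame([["AA", "JFK"], ["DL", "ATL"]], columns=["_1", "_2"])

# Replace the default labels with meaningful ones (hypothetical names)
pdsDF.columns = ["carrier", "origin"]
```

Alternatively, pass the names to toDF() itself, e.g. toDF(['carrier', 'origin', ...]), so the Spark DataFrame is labelled before conversion.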
I recommend joshlk's fast version of toPandas:
import pandas as pd

def _map_to_pandas(rdds):
    """ Needs to be here due to pickling issues """
    return [pd.DataFrame(list(rdds))]

def toPandas(df, n_partitions=None):
    """
    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is
    repartitioned if `n_partitions` is passed.
    :param df:            pyspark.sql.DataFrame
    :param n_partitions:  int or None
    :return:              pandas.DataFrame
    """
    if n_partitions is not None:
        df = df.repartition(n_partitions)
    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()
    df_pand = pd.concat(df_pand)
    df_pand.columns = df.columns
    return df_pand
Gist: https://gist.github.com/joshlk/871d58e01417478176e7
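The speedup comes from building one pandas DataFrame per partition inside the workers and concatenating the resulting frames locally, rather than collecting individual Row objects. Since pyspark may not be available here, this pandas-only sketch mimics what mapPartitions plus pd.concat do, with plain lists of tuples standing in for RDD partitions:

```python
import pandas as pd

def _map_to_pandas(rows):
    # One pandas DataFrame per "partition", as in joshlk's helper
    return [pd.DataFrame(list(rows))]

# Hypothetical data split into three chunks, standing in for RDD partitions
partitions = [
    [(1, "a"), (2, "b")],
    [(3, "c")],
    [(4, "d"), (5, "e")],
]

# Equivalent of df.rdd.mapPartitions(_map_to_pandas).collect()
frames = [frame for part in partitions for frame in _map_to_pandas(part)]

# Equivalent of pd.concat(df_pand) followed by restoring the column names
result = pd.concat(frames, ignore_index=True)
result.columns = ["id", "letter"]
print(len(result))  # 5
```

Concatenating a handful of per-partition frames is much cheaper than constructing one pandas row at a time, which is where the speed gain over the stock toPandas() comes from.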