How to convert Spark RDD to pandas dataframe in ipython? How to convert Spark RDD to pandas dataframe in ipython? python python

How to convert Spark RDD to pandas dataframe in ipython?


You can use function toPandas():

Returns the contents of this DataFrame as Pandas pandas.DataFrame.

This is only available if Pandas is installed and available.

>>> df.toPandas()     age   name0    2  Alice1    5    Bob


You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.

For example, let's say I have a text file, flights.csv, that has been read in to an RDD:

flights = sc.textFile('flights.csv')

You can check the type:

type(flights)<class 'pyspark.rdd.RDD'>

If you just use toPandas() on the RDD, it won't work. Depending on the format of the objects in your RDD, some processing may be necessary to go to a Spark DataFrame first. In the case of this example, this code does the job:

# RDD to Spark DataFramesparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()#Spark DataFrame to Pandas DataFramepdsDF = sparkDF.toPandas()

You can check the type:

type(pdsDF)<class 'pandas.core.frame.DataFrame'>


I recommend a fast version of toPandas by joshlk

import pandas as pddef _map_to_pandas(rdds):    """ Needs to be here due to pickling issues """    return [pd.DataFrame(list(rdds))]def toPandas(df, n_partitions=None):    """    Returns the contents of `df` as a local `pandas.DataFrame` in a speedy fashion. The DataFrame is    repartitioned if `n_partitions` is passed.    :param df:              pyspark.sql.DataFrame    :param n_partitions:    int or None    :return:                pandas.DataFrame    """    if n_partitions is not None: df = df.repartition(n_partitions)    df_pand = df.rdd.mapPartitions(_map_to_pandas).collect()    df_pand = pd.concat(df_pand)    df_pand.columns = df.columns    return df_pand