Pyspark: display a spark data frame in a table format

The show method does what you're looking for.

For example, given the following dataframe of 3 rows, I can print just the first two rows like this:

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
df.show(n=2)

which yields:

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
+---+---+
only showing top 2 rows

python pandas pyspark spark-dataframe

As mentioned by @Brent in the comment of @maxymoo's answer, you can try

df.limit(10).toPandas()

to get a prettier table in Jupyter. This can take some time to run if the Spark DataFrame is not cached, because Spark has to recompute it. Also note that .limit() does not guarantee that the rows come back in the order of the original Spark DataFrame.
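The ordering caveat above can be handled by sorting before limiting. Below is a minimal sketch of that idea; `preview_as_pandas` and its parameters are hypothetical names made up for illustration, not part of PySpark. It relies only on the `orderBy`, `limit` and `toPandas` DataFrame methods.

```python
def preview_as_pandas(df, n=10, order_by=None):
    """Return up to `n` rows of a Spark DataFrame as a pandas DataFrame.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if order_by is not None:
        # Sort first so that "the first n rows" is well defined.
        # Without an explicit sort, limit() may return any n rows.
        df = df.orderBy(order_by)
    # limit() keeps the collected data small; toPandas() then brings
    # those rows to the driver (pandas must be installed).
    return df.limit(n).toPandas()
```

In a Jupyter cell, `preview_as_pandas(df, n=10, order_by="k")` as the last expression should render the usual pandas HTML table. Rows that tie on the sort column can still come back in any order.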


Let's say we have the following Spark DataFrame:

df = sqlContext.createDataFrame(
    [
        (1, "Mark", "Brown"),
        (2, "Tom", "Anderson"),
        (3, "Joshua", "Peterson"),
    ],
    ('id', 'firstName', 'lastName'),
)

There are three common ways to print the contents of the DataFrame:

Print Spark DataFrame

The most common way is to use the show() method:

>>> df.show()
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+

Print Spark DataFrame vertically

Say that you have a fairly large number of columns and your DataFrame doesn't fit on the screen. In that case you can print the rows vertically. For example, the following command prints the top two rows vertically, without any truncation.

>>> df.show(n=2, truncate=False, vertical=True)
-RECORD 0-------------
 id        | 1
 firstName | Mark
 lastName  | Brown
-RECORD 1-------------
 id        | 2
 firstName | Tom
 lastName  | Anderson
only showing top 2 rows
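To make clear what truncate=False switches off: by default show() shortens string cells longer than 20 characters. The sketch below imitates that rule in plain Python as I understand it. `truncate_cell` is a hypothetical name and this is not Spark's own code, so treat it as an approximation for string values only.

```python
def truncate_cell(value, truncate=20):
    """Approximate how show() shortens a long string cell.

    Hypothetical illustration; not Spark's implementation.
    """
    text = str(value)
    if truncate > 0 and len(text) > truncate:
        if truncate < 4:
            # Too short to fit an ellipsis: just cut the string.
            return text[:truncate]
        # Keep truncate - 3 characters and mark the cut with "...".
        return text[:truncate - 3] + "..."
    # truncate <= 0 (or a short enough value) leaves the cell unchanged.
    return text
```

None of the values in the example DataFrame is longer than 20 characters, so truncate=False makes no visible difference there. It matters once a column holds longer strings.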

Convert to Pandas and print Pandas DataFrame

Alternatively, you can convert your Spark DataFrame into a pandas DataFrame using .toPandas() and then print() it.

>>> df_pd = df.toPandas()
>>> print(df_pd)
   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

Note that this is not recommended for fairly large DataFrames, because toPandas() collects all the data into the driver's memory. If you do need to convert a large Spark DataFrame to a pandas one, the following configuration speeds up the conversion by using Apache Arrow:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
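The key above is the Spark 3.x name. To my knowledge, Spark 2.x used `spark.sql.execution.arrow.enabled` instead, so a version-aware sketch may help if the same notebook runs on both. `arrow_config_key` and `enable_arrow` are hypothetical helper names, and `spark` is assumed to be an existing SparkSession.

```python
def arrow_config_key(spark_version):
    """Pick the Arrow config key for a Spark version string such as "3.4.1".

    Hypothetical helper; the 2.x key name is stated from memory.
    """
    major = int(spark_version.split(".")[0])
    if major >= 3:
        return "spark.sql.execution.arrow.pyspark.enabled"
    return "spark.sql.execution.arrow.enabled"


def enable_arrow(spark):
    """Turn on Arrow-based conversion for this session and return the key used."""
    key = arrow_config_key(spark.version)
    spark.conf.set(key, "true")
    return key
```

After calling `enable_arrow(spark)`, a later `df.toPandas()` should use Arrow, provided PyArrow is installed. Arrow makes the transfer faster, but the result still has to fit in the driver's memory.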

For more details, you can refer to my blog post "Speeding up the conversion between PySpark and Pandas DataFrames".
