
Spark DataFrame limit function takes too much time to show


Your configuration is fine. The huge difference in duration comes from the underlying implementation. limit() reads all of the 70 million rows before it creates a DataFrame with 30 rows, whereas show() just takes the first 20 rows of the existing DataFrame and therefore only has to read those 20 rows. If you are only interested in displaying 30 rows instead of the default 20, call show() with 30 as the row-count parameter:

df.show(30, truncate=False)


Spark copies the value you pass to limit() to each partition, so in your case it tries to read 30 rows per partition. My guess is that your DataFrame happens to have a huge number of partitions (which is not a good thing in any case). Try df.coalesce(1).limit(30).show() and it should run about as fast as df.show().
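
A minimal sketch of that suggestion, assuming df is the 70-million-row DataFrame from the question: check the partition count first, then collapse the partitions before limiting.

# A very large number here is the usual culprit behind a slow limit().
print(df.rdd.getNumPartitions())

# Collapse everything into a single partition so only that one partition
# has to be read when limit() and show() run.
df.coalesce(1).limit(30).show()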


As you've already experienced, limit() on large data has terrible performance. I wanted to share a workaround for anyone else hitting this problem. If the limit count doesn't have to be exact, sort the data on a column with sort() or orderBy() and use filter() to grab roughly the top k% of the rows.
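
A minimal sketch of that workaround, assuming df has a numeric column named score; both the column name and the quantile probability are placeholders to adapt to your data and target row count.

from pyspark.sql import functions as F

# Find an approximate cutoff near the top of the "score" column. Tune the
# probability (here 0.999999) so the filter keeps roughly as many rows as
# you want; approxQuantile runs as an aggregate, no full sort required.
threshold = df.approxQuantile("score", [0.999999], 0.001)[0]

# filter() is a plain per-partition predicate, so it avoids the bottleneck
# of limit(); sorting only the small surviving set afterwards is cheap.
df.filter(F.col("score") >= threshold) \
  .orderBy(F.col("score").desc()) \
  .show(30, truncate=False)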