Convert Spark DataFrame column to Python list
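
The snippets below all refer to a DataFrame named mvv_count_df with an mvv column and a count column, as in the question. A minimal sketch to reproduce the examples follows; the sample values here are made up for illustration.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Hypothetical sample data standing in for the question's DataFrame
mvv_count_df = spark.createDataFrame(
    [(1, 5), (2, 9), (3, 3), (4, 1)],
    ["mvv", "count"],
)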


First, let's see why the approach you are using does not work: you are trying to get an integer out of a Row type. The output of your collect() looks like this:

>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)

If you take something like this:

>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1

You will get the mvv value. If you want all of the values as a list, you can use a comprehension like this:

>>> mvv_array = [int(row.mvv) for row in mvv_count_df.collect()]
>>> mvv_array
Out: [1,2,3,4]

But if you try the same for the other column, you get:

>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'

This happens because count is a built-in method of Row, and the column has the same name. One workaround is to rename the count column to _count:

>>> mvv_list = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in mvv_list.collect()]

But this workaround is not needed, as you can access the column using the dictionary syntax:

>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]

And it will finally work!
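
Putting it together, here is a minimal sketch of the dictionary-syntax approach that collects only once and builds both lists; the variable name rows is mine, and the values assume the sample DataFrame above.

rows = mvv_count_df.collect()  # single driver-side collect for both columns
mvv_array = [int(row['mvv']) for row in rows]
mvv_count = [int(row['count']) for row in rows]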


The following one-liner gives the list you want:

mvv = mvv_count_df.select("mvv").rdd.flatMap(lambda x: x).collect()
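
Here flatMap(lambda x: x) flattens each single-field Row into its value. An equivalent sketch using map with positional indexing, assuming the same sample DataFrame (giving [1, 2, 3, 4]):

# Each Row has one field, so row[0] is the mvv value
mvv = mvv_count_df.select("mvv").rdd.map(lambda row: row[0]).collect()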


This will give you all the elements as a list.

mvv_list = list(
    mvv_count_df.select('mvv').toPandas()['mvv']
)
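
Note that toPandas() requires pandas on the driver and pulls the whole column into driver memory, and list(...) over a pandas Series keeps NumPy scalar types (e.g. numpy.int64). If you want plain Python ints, a variant sketch uses Series.tolist():

# .tolist() converts the pandas Series to native Python ints
mvv_list = mvv_count_df.select('mvv').toPandas()['mvv'].tolist()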