
PySpark DataFrames - way to enumerate without converting to Pandas?


It doesn't work because:

  1. the second argument of withColumn should be a Column, not a collection, so np.array won't work here
  2. when you pass "index in indexes" as a SQL expression to where, indexes is out of scope and is not resolved as a valid identifier (both failure modes are sketched below)
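
For reference, the failing calls look roughly like this (a hypothetical reconstruction, not the exact code from the question; df is a DataFrame and indexes a Python list of ints):

import numpy as np

# 1. fails: withColumn expects a Column expression, not a NumPy array
df.withColumn("index", np.array(indexes))

# 2. fails: indexes is a local Python name, not a column the SQL parser can resolve
df.where("index in indexes")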

PySpark >= 1.4.0

You can add row numbers using the respective window function and query using the Column.isin method or a properly formatted query string:

from pyspark.sql.functions import col, rowNumber
from pyspark.sql.window import Window

w = Window.orderBy()
indexed = df.withColumn("index", rowNumber().over(w))

# Using DSL
indexed.where(col("index").isin(set(indexes)))

# Using SQL expression
indexed.where("index in ({0})".format(",".join(str(x) for x in indexes)))

It looks like window functions called without a PARTITION BY clause move all data to a single partition, so the above may not be the best solution after all.
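
As a quick sanity check (a sketch, assuming the indexed DataFrame from the snippet above), you can look at the partition count after the empty window is applied:

# The empty window forces a shuffle into a single partition
indexed.rdd.getNumPartitions()  # typically 1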

Any faster and simpler way to deal with it?

Not really. Spark DataFrames don't support random row access.

A PairRDD can be accessed using the lookup method, which is relatively fast if the data is partitioned with a HashPartitioner. There is also the indexed-rdd project, which supports efficient lookups.
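
For illustration, a minimal lookup sketch (assuming an existing SparkContext sc; the keys, values and partition count here are made up):

# Build a hash-partitioned pair RDD so lookup only has to scan one partition
pairs = sc.parallelize([(i, chr(97 + i)) for i in range(15)])
partitioned = pairs.partitionBy(4)

partitioned.lookup(3)  # ['d']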

Edit:

Independent of the PySpark version, you can try something like this:

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType

row = Row("char")
row_with_index = Row("char", "index")

df = sc.parallelize(row(chr(x)) for x in range(97, 112)).toDF()
df.show(5)

## +----+
## |char|
## +----+
## |   a|
## |   b|
## |   c|
## |   d|
## |   e|
## +----+
## only showing top 5 rows

# This part is not tested but should work and save some work later
schema = StructType(
    df.schema.fields[:] + [StructField("index", LongType(), False)])

indexed = (df.rdd  # Extract rdd
    .zipWithIndex()  # Add index
    .map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]]))  # Map to rows
    .toDF(schema))  # It will work without schema but will be more expensive

# inSet in Spark < 1.3
indexed.where(col("index").isin(indexes))


If you want a number range that's guaranteed not to collide but does not require a .over(partitionBy()) then you can use monotonicallyIncreasingId().

from pyspark.sql.functions import monotonicallyIncreasingId

df.select(monotonicallyIncreasingId().alias("rowId"), "*")

Note though that the values are not particularly "neat". Each partition is given a value range and the output will not be contiguous. E.g. 0, 1, 2, 8589934592, 8589934593, 8589934594.
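
For reference, an ID like 8589934592 decomposes into a partition id (upper 31 bits) and an in-partition record number (lower 33 bits):

# 8589934592 == 1 << 33, i.e. the first record of partition 1
partition_id = 8589934592 >> 33               # 1
record_number = 8589934592 & ((1 << 33) - 1)  # 0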

This was added to Spark on Apr 28, 2015 here: https://github.com/apache/spark/commit/d94cd1a733d5715792e6c4eac87f0d5c81aebbe2


You certainly can add an array for indexing, an array of your choice indeed. In Scala, first we need to create an indexing Array:

val index_array = (1 to df.count.toInt).toArray

index_array: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

You can now append this column to your DF. For that, you need to open up your DF and get it as an array, then zip it with your index_array, and then convert the new array back into an RDD. The final step is to get it as a DF:

val final_df = sc.parallelize(
    (df.collect.map(x => (x(0), x(1))) zip index_array)
      .map(x => (x._1._1.toString, x._1._2.toString, x._2)))
  .toDF("col1", "col2", "index")  // column names are illustrative

The indexing will be clearer after that.