Is it possible to store a numpy array in a Spark DataFrame Column?



The source of the problem is that the object returned from the UDF doesn't conform to the declared type. create_vector must not only be returning a numpy.ndarray, but also converting the numerics to the corresponding NumPy scalar types, which are not compatible with the DataFrame API.

The only option is to use something like this:

udf(lambda x: create_vector(x).tolist(), ArrayType(FloatType()))
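The reason `.tolist()` helps can be seen without Spark at all: NumPy arrays hold NumPy scalar types such as `numpy.float64`, while `tolist()` produces plain Python floats, which is what `FloatType()` expects. A minimal sketch, with `create_vector` as a hypothetical stand-in for the question's function:

```python
import numpy as np

# Hypothetical stand-in for the create_vector from the question:
# it returns a NumPy array, so its elements are numpy.float64.
def create_vector(x):
    return np.array([x, x * 2.0, x * 3.0])

vec = create_vector(1.5)
print(type(vec[0]))           # numpy.float64 -- not accepted by Spark's FloatType
print(type(vec.tolist()[0]))  # float -- a plain Python float Spark accepts
```

Wrapping the call in `lambda x: create_vector(x).tolist()` therefore hands Spark a `list` of Python floats, which serializes cleanly into `ArrayType(FloatType())`.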


One way to do that is to convert each row of the numpy array column in the DataFrame to a list of integers.

df['col_2'] = df['col_2'].map(lambda x: [int(e) for e in x])

Then convert it to a Spark DataFrame directly:

from pyspark.sql.functions import col, explode

df_spark = spark.createDataFrame(df)
df_spark.select('col_1', explode(col('col_2')).alias('col_2')).show(14)
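The pandas half of this approach can be sketched on its own. The column and data below are illustrative, not from the question; the point is that after the `map`, each cell is a plain Python `list` of `int`, which `spark.createDataFrame` can infer as `ArrayType(LongType())`:

```python
import numpy as np
import pandas as pd

# Illustrative pandas DataFrame with a numpy-array column
df = pd.DataFrame({
    "col_1": ["a", "b"],
    "col_2": [np.array([1, 2]), np.array([3, 4])],
})

# Convert each ndarray cell to a plain list of Python ints
df["col_2"] = df["col_2"].map(lambda x: [int(e) for e in x])
print(df["col_2"].tolist())  # [[1, 2], [3, 4]]
```

From here, `spark.createDataFrame(df)` accepts the frame, and `explode` flattens each list into one row per element.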