pyspark: ValueError: Some of types cannot be determined after inferring


In order to infer each field's type, PySpark looks at the non-None records in that field. If a field contains only None records, PySpark cannot infer the type and will raise that error.
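
As a quick illustration (assuming an active SparkSession named spark), inference succeeds as long as every field has at least one non-None value somewhere in the data:

>>> # each column has at least one non-None value, so both types can be inferred
>>> df = spark.createDataFrame([("a", None), (None, 1.0)], ["name", "score"])
>>> df.dtypes
[('name', 'string'), ('score', 'double')]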

Manually defining a schema will resolve the issue:

>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
| foo|
+----+
|null|
+----+
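
As a side note (an assumption on my part, not part of the original answer), Spark 2.3 and later also accept a DDL-formatted schema string, which expresses the same schema more compactly:

>>> # DDL string equivalent of the StructType above (Spark 2.3+)
>>> df = spark.createDataFrame([[None]], schema="foo string")
>>> df.show()
+----+
| foo|
+----+
|null|
+----+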


To fix this problem, you can provide your own schema.

For example:

To reproduce the error:

>>> df = spark.createDataFrame([[None, None]], ["name", "score"])
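
Running this raises the error from the question (traceback abbreviated):

Traceback (most recent call last):
  ...
ValueError: Some of types cannot be determined after inferring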

To fix the error:

>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+


If you are using the monkey-patched RDD[Row].toDF() method, you can increase the sample ratio so that more than the first 100 records are checked when inferring types:

# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()

Assuming there are non-null rows in all fields of your RDD, increasing sampleRatio towards 1.0 makes it more likely that they will be found.
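
To sketch why this helps (hypothetical data, and again assuming an active SparkSession named spark): by default, inference inspects only the leading rows of the RDD, so a long run of leading None values breaks it, while sampling a larger fraction of the data reaches the non-null values:

>>> from pyspark.sql import Row
>>> # hypothetical RDD where the first 100 rows all have score=None
>>> rows = [Row(name="a", score=None)] * 100 + [Row(name="b", score=1.0)] * 900
>>> rdd = spark.sparkContext.parallelize(rows)
>>> rdd.toDF()                        # inspects only the leading rows -> ValueError
>>> rdd.toDF(sampleRatio=0.5).dtypes  # samples roughly half of all rows instead
[('name', 'string'), ('score', 'double')]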