pyspark: ValueError: Some of types cannot be determined after inferring
To infer a field's type, PySpark looks at the non-None records in that field. If a field contains only None records, PySpark cannot infer the type and raises this error.
Manually defining a schema resolves the issue:
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("foo", StringType(), True)])
>>> df = spark.createDataFrame([[None]], schema=schema)
>>> df.show()
+----+
|foo |
+----+
|null|
+----+
To fix this problem, you can provide your own schema. For example:
To reproduce the error:
>>> df = spark.createDataFrame([[None, None]], ["name", "score"])
To fix the error:
>>> from pyspark.sql.types import StructType, StructField, StringType, DoubleType
>>> schema = StructType([StructField("name", StringType(), True), StructField("score", DoubleType(), True)])
>>> df = spark.createDataFrame([[None, None]], schema=schema)
>>> df.show()
+----+-----+
|name|score|
+----+-----+
|null| null|
+----+-----+
If you are using the monkey-patched RDD[Row].toDF() method, you can increase the sample ratio so that more than the first 100 records are checked when inferring types:
# Set sampleRatio smaller as the data size increases
my_df = my_rdd.toDF(sampleRatio=0.01)
my_df.show()
Assuming every field in your RDD has at least some non-null rows, increasing sampleRatio towards 1.0 makes it more likely that they will be found during inference.
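The effect of sampleRatio can be sketched in plain Python: inference only succeeds if at least one sampled row carries a non-None value for each field. The function below is illustrative, not PySpark's actual sampling code:

```python
import random

def infer_with_sampling(rows, sample_ratio, seed=0):
    # Keep roughly sample_ratio of the rows (hypothetical stand-in for
    # Spark's sampling), then look for a non-None value per field.
    rng = random.Random(seed)
    sampled = [r for r in rows if rng.random() < sample_ratio]
    n_fields = len(rows[0])
    types = [None] * n_fields
    for row in sampled:
        for i, v in enumerate(row):
            if types[i] is None and v is not None:
                types[i] = type(v)
    if any(t is None for t in types):
        raise ValueError("Some of types cannot be determined after inferring")
    return types

rows = [(None,)] * 99 + [("x",)]
infer_with_sampling(rows, 1.0)   # sampling everything always finds the str
```

With a low sample_ratio the lone non-None row is easily missed, which is why pushing the ratio towards 1.0 helps.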