pandasUDF and pyarrow 0.15.0
It's not a bug. We made an important protocol change in 0.15.0 that makes the default behavior of pyarrow incompatible with older versions of Arrow in Java -- your Spark environment seems to be using an older version.
Your options are
- Set the environment variable
ARROW_PRE_0_15_IPC_FORMAT=1
from where you are using Python - Downgrade to pyarrow < 0.15.0 for now.
Hopefully the Spark community will be able to upgrade to 0.15.0 in Java soon so this issue goes away.
This is discussed in http://arrow.apache.org/blog/2019/10/06/0.15.0-release/
In Spark try the following appendix:
spark-submit --deploy-mode cluster --conf spark.yarn.appExecutorEnv.ARROW_PRE_0_15_IPC_FORMAT=1 --conf spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT=1 --conf spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT=1