Spark can access Hive table from pyspark but not from spark-submit
Spark 2.x
The same problem may occur in Spark 2.x if the SparkSession has been created without enabling Hive support.
Spark 1.x
It is pretty simple. When you use the PySpark shell, and Spark has been built with Hive support, the default SQLContext implementation (the one available as sqlContext) is a HiveContext.
In your standalone application you use a plain SQLContext, which doesn't provide Hive capabilities.
Assuming the rest of the configuration is correct, just replace:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
with
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
In Spark 2.x (Amazon EMR 5+) you will run into this issue with spark-submit
if you don't enable Hive support like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("yarn") \
    .appName("my app") \
    .enableHiveSupport() \
    .getOrCreate()
Your problem may be related to your Hive configuration. If your configuration uses a local metastore, the metastore_db directory gets created in the directory that you started your Hive server from.
Since spark-submit is launched from a different directory, it creates a new metastore_db in that directory, which does not contain information about your previous tables.
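The directory dependence can be sketched in plain Python (no Spark needed; the two directories below are hypothetical, chosen only to illustrate the point):

```python
import os

# Derby's default metastore path, "metastore_db", is relative, so it resolves
# against the current working directory. Processes launched from different
# directories therefore end up with different metastores.
relative = "metastore_db"

hive_cwd = "/home/youruser/hive"    # hypothetical: where the Hive server was started
submit_cwd = "/home/youruser/jobs"  # hypothetical: where spark-submit was run

print(os.path.join(hive_cwd, relative))    # /home/youruser/hive/metastore_db
print(os.path.join(submit_cwd, relative))  # /home/youruser/jobs/metastore_db
```

Two different paths, hence two different (and disjoint) sets of tables.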
A quick fix would be to start the Hive
server from the same directory as spark-submit
and re-create your tables.
A more permanent fix is referenced in this SO post.
You need to change your configuration in $HIVE_HOME/conf/hive-site.xml:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/home/youruser/hive_metadata/metastore_db;create=true</value>
</property>
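To see why this works: the databaseName in that connection URL is an absolute path, so the metastore location no longer depends on the working directory. A quick plain-Python check (the URL is the one from the example above):

```python
# Parse the Derby JDBC URL from the hive-site.xml example. Everything after
# "jdbc:derby:" is a ;-separated list of key=value attributes.
url = "jdbc:derby:;databaseName=/home/youruser/hive_metadata/metastore_db;create=true"

params = dict(part.split("=", 1) for part in url.split(";")[1:])

print(params["databaseName"])                  # /home/youruser/hive_metadata/metastore_db
print(params["databaseName"].startswith("/"))  # True -> absolute, cwd-independent
```

Because the path starts with "/", every process resolves it to the same location regardless of where it was launched.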
You should now be able to run hive from any location and still find your tables.