
Create an external table in Azure Databricks


You can perform this operation once the Azure Data Lake Storage is configured.

If you want all users in the Databricks workspace to have access to the mounted Azure Data Lake Storage Gen2 account, create a mount point using the method described below. The service principal that you use to access the Azure Data Lake Storage Gen2 account should be granted access only to that account; it should not be granted access to other resources in Azure.

Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, users must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.
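For example, a minimal sketch of what a user on another running cluster would run (nothing here is specific to any particular mount):

# Refresh this cluster's view of the mount table so newly created mount points become visible
dbutils.fs.refreshMounts()

# Optionally list all mounts to confirm the new mount point is now available
display(dbutils.fs.mounts())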

There are three primary ways of accessing Azure Data Lake Storage Gen2 from a Databricks cluster:

  1. Mounting an Azure Data Lake Storage Gen2 filesystem to DBFS using a service principal with delegated permissions and OAuth 2.0 (see the sketch after this list).
  2. Using a service principal directly.
  3. Using the Azure Data Lake Storage Gen2 storage account access key directly.
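As an illustration of the first option, here is a minimal sketch of mounting an Azure Data Lake Storage Gen2 filesystem with a service principal and OAuth 2.0. The container, storage account, secret scope, key names, and tenant ID below are placeholders, not values from this question:

# OAuth settings for the service principal (all angle-bracketed values are placeholders)
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "<scope-name>", key = "<secret-name>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

# Mount the filesystem to DBFS so all workspace users can reach it under /mnt/<mount-name>
dbutils.fs.mount(
  source = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)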

For more details, refer to the "Azure Data Lake Storage Gen2" documentation.

Hope this helps.


Sort of, if you can use Python (or Scala).

Start by making the connection:

TenantID = "blah"

# Set the Spark configuration needed to authenticate to the lake with a service principal
def connectLake():
  spark.conf.set("fs.azure.account.auth.type", "OAuth")
  spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
  spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
  spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
  spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/" + TenantID + "/oauth2/token")

connectLake()

# Root path of the lake filesystem
lakePath = "abfss://liquix@mystorageaccount.dfs.core.windows.net/"

Using Python, you can register a table with:

spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"/PRESENTED/DIMDATE/V1'")

You can now query that table if you have executed the connectLake() function - which is fine in your current session/notebook.
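For example, in the same notebook that ran connectLake() and registered the table (reusing the DimDate example above):

# Read the external table that was just registered
spark.table("DimDate").show(5)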

The problem now is that if a new session comes in and tries to run select * from that table, it will fail unless connectLake() has been run first. There is no way around that limitation, as you have to provide credentials to access the lake.
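So, as a sketch, every new session has to re-establish credentials before the query works; this assumes connectLake() is defined in that session (for example by keeping it in a shared notebook that each user runs first):

# In a fresh session/notebook: set the lake credentials first...
connectLake()

# ...and only then query the external table
spark.sql("SELECT * FROM DimDate").show()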

You may want to consider ADLS Gen2 credential passthrough: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html

Note that this requires using a High Concurrency cluster.
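As a rough sketch of what that looks like once credential passthrough is enabled in the cluster configuration (the path simply reuses the example lake path from above; your own path will differ), users read the lake with their own Azure AD identity and no connectLake() call is needed:

# With ADLS Gen2 credential passthrough enabled on a High Concurrency cluster,
# reads are authorised with the signed-in user's Azure AD identity.
df = spark.read.parquet("abfss://liquix@mystorageaccount.dfs.core.windows.net/PRESENTED/DIMDATE/V1")
df.show()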