
Configure standalone Spark for Azure storage access


I figured this out and decided to post a working project since that is always what I look for. It is hosted here:

azure-spark-local-sample

The crux of it, though, is as @Shankar Koirala suggested:

For WASB, set the property to allow the url scheme to be recognized:

config.set("spark.hadoop.fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");

Then set the property which authorizes access to the account. You will need one of these for each account you need to access. These are generated through the Azure Portal under the Access Keys section of the Storage Account blade.

    config.set("fs.azure.account.key.[storage-account-name].blob.core.windows.net", "[access-key]");

Now for ADL, assign the file system implementation for the scheme, just as with WASB:

    config.set("spark.hadoop.fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem");    // I don't know why this would be needed, but I saw it    // on an otherwise very helpful page . . .    config.set("spark.fs.AbstractFileSystem.adl.impl", "org.apache.hadoop.fs.adl.Adl");

. . . and finally, set the client access keys in these properties, again for each different account you need to access:

    config.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");    /* Client ID is generally the application ID from the azure portal app registrations*/    config.set("fs.adl.oauth2.client.id", "[client-id]");    /*The client secret is the key generated through the portal*/    config.set("fs.adl.oauth2.credential", "[client-secret]");    /*This is the OAUTH 2.0 TOKEN ENDPOINT under the ENDPOINTS section of the app registrations under Azure Active Directory*/    config.set("fs.adl.oauth2.refresh.url", "[oauth-2.0-token-endpoint]");

I hope this is helpful, and I wish I could give credit to Shankar for the answer, but I also wanted to get the exact details out there.


I am not sure about ADL since I haven't tested it, but for WASB you need to define the file system to be used in the underlying Hadoop configuration.

Since you are using Spark 2.3, you can use the SparkSession to create an entry point:

val spark = SparkSession.builder().appName("read from azure storage").master("local[*]").getOrCreate()

Now define the file system:

spark.sparkContext.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.azure.account.key.yourAccount.blob.core.windows.net", "yourKey")

Now read the parquet file:

val baseDir = "wasb[s]://BlobStorageContainer@yourAccount.blob.core.windows.net/"
val dfParquet = spark.read.parquet(baseDir + "pathToParquetFile")
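For example, with hypothetical names filled in (a container named mycontainer in a storage account named myaccount), it might look like this; wasbs is the TLS variant of the wasb scheme:

// Hypothetical container, account, and path; wasbs:// encrypts traffic with TLS.
val baseDir = "wasbs://mycontainer@myaccount.blob.core.windows.net/"
val dfParquet = spark.read.parquet(baseDir + "events/2018/01.parquet")
dfParquet.printSchema()
dfParquet.show(10)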

Hope this helps!