Data Governance solution for Databricks, Synapse and ADLS gen2

azure architecture databricks data-lake azure-data-catalog

To better understand option #2 that you cited for data governance on Azure, here is a how-to tutorial demonstrating the experience of applying RLS on Databricks; a related Databricks video demo; and other data governance tutorials.

Full disclosure: My team produces content for data engineers at Immuta and I hope this helps save you some time in your research.

I am currently exploring Immuta and Privacera, so I can't yet comment in detail on the differences between the two. So far, Immuta has given me a better impression with its elegant policy-based setup.

Still, there are ways to solve some of the issues you mentioned above without buying an external component:

1. Security

For RLS, consider using Table ACLs, and giving access only to certain Hive views.
For access to data inside ADLS, look at enabling credential passthrough on clusters. Unfortunately, you then lose Scala on those clusters.
You still need to set up permissions on Azure Data Lake Gen 2, which is an awful experience when you have to apply permissions to existing child items.
Please avoid creating dataset copies with columns/rows subsets, as data duplication is never a good idea.
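To make the view-based approach a bit more concrete, here is a minimal sketch that generates the SQL for a row-filtering view. It assumes the Databricks SQL function `is_member()` is available on your cluster; the view, table, column and group names are all hypothetical, and the names are treated as trusted configuration rather than user input.

```python
def row_filter_view_sql(view, table, column, group_to_value, admin_group):
    """Build a CREATE VIEW statement that filters rows by group membership.

    group_to_value maps a group name to the single column value its members may see.
    All names are assumed to come from trusted configuration.
    """
    conditions = [f"is_member('{admin_group}')"]  # admins see every row
    for group, value in group_to_value.items():
        conditions.append(f"(is_member('{group}') AND {column} = '{value}')")
    where = "\n   OR ".join(conditions)
    return f"CREATE OR REPLACE VIEW {view} AS\nSELECT * FROM {table}\nWHERE {where}"


sql = row_filter_view_sql(
    view="sales_secure",             # hypothetical view name
    table="sales_raw",               # hypothetical source table
    column="region",                 # hypothetical column used for filtering
    group_to_value={"emea_analysts": "EMEA", "apac_analysts": "APAC"},
    admin_group="data_admins",       # hypothetical group that sees everything
)
print(sql)
```

On a cluster you would run the statement with `spark.sql(sql)` and then use Table ACLs to grant users access to the view only, not to the underlying table.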

2. Lineage

One option would be to look into Apache Atlas & Spline. Here is one example of how to set this up: https://medium.com/@reenugrewal/data-lineage-tracking-using-spline-on-atlas-via-event-hub-6816be0fd5c7
Unfortunately, Spline is still under development, and even reproducing the setup mentioned in the article is not straightforward. The good news is that Apache Atlas 3.0 has many definitions available for Azure Data Lake Gen 2 and other sources.
In a few projects, I ended up creating custom logging of reads/writes (it seems you went down this path too). Based on these logs, I created a Power BI report to visualize the lineage.
Consider using Azure Data Factory for orchestration. With a proper ADF pipeline structure, you get high-level lineage, which helps you see dependencies and rerun failed activities. You can read a bit more here: https://mrpaulandrew.com/2020/07/01/adf-procfwk-v1-8-complete-pipeline-dependency-chains-for-failure-handling/
Take a look at Marquez (https://marquezproject.github.io/marquez/), a small open-source project with some nice features, including data lineage.
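As an illustration of the custom-logging idea, here is a rough sketch of how read/write log entries could be turned into lineage edges for a report. The `AccessEvent` structure and the job and dataset names are my own assumptions, not a fixed format; in practice the events would come from whatever log table you write to.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessEvent:
    job: str       # notebook or pipeline that touched the dataset
    dataset: str   # dataset name or path
    action: str    # "read" or "write"


def lineage_edges(events):
    """Return sorted (source, target, job) edges.

    Assumption: every dataset a job read feeds every dataset that job wrote.
    """
    reads, writes = defaultdict(set), defaultdict(set)
    for event in events:
        if event.action == "read":
            reads[event.job].add(event.dataset)
        elif event.action == "write":
            writes[event.job].add(event.dataset)
    return sorted(
        (source, target, job)
        for job in writes
        for source in reads.get(job, ())
        for target in writes[job]
    )


events = [
    AccessEvent("clean_orders", "bronze/orders", "read"),
    AccessEvent("clean_orders", "silver/orders", "write"),
    AccessEvent("report", "silver/orders", "read"),
    AccessEvent("report", "silver/customers", "read"),
    AccessEvent("report", "gold/revenue", "write"),
]
edges = lineage_edges(events)
for source, target, job in edges:
    print(f"{source} -> {target} (via {job})")
```

The resulting edge list is exactly the shape a Power BI (or any graph) visual needs: one row per source/target pair.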

3. Data quality

Investigate Amazon Deequ. It is Scala-only so far, but it has some nice predefined data quality functions.
In many projects, we ended up writing integration tests that check data quality when moving data from bronze (raw) to silver (standardized). Nothing fancy, pure PySpark.
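To give an idea of what such checks look like, here is a simplified sketch. It works on plain lists of dicts so that it runs anywhere; in a real pipeline the same checks would be expressed against DataFrames. The function name, the key column and the sample rows are hypothetical.

```python
def check_bronze_to_silver(bronze_rows, silver_rows, key, required):
    """Return a list of human-readable failures; an empty list means all checks passed."""
    failures = []
    silver_keys = [row[key] for row in silver_rows]
    # Check 1: the key must be unique in silver.
    if len(silver_keys) != len(set(silver_keys)):
        failures.append(f"duplicate {key} values in silver")
    # Check 2: no bronze record may be lost on the way to silver.
    missing = {row[key] for row in bronze_rows} - set(silver_keys)
    if missing:
        failures.append(f"{len(missing)} bronze keys missing from silver")
    # Check 3: required columns must not contain nulls.
    for column in required:
        nulls = sum(1 for row in silver_rows if row.get(column) is None)
        if nulls:
            failures.append(f"{nulls} null values in silver.{column}")
    return failures


bronze = [{"id": 1, "amount": "10"}, {"id": 2, "amount": None}]
silver = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
failures = check_bronze_to_silver(bronze, silver, key="id", required=["amount"])
print(failures)
```

Returning a list of failures instead of raising on the first one makes it easier to report everything that is wrong with a load in one go.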

4. Data life cycle management

One option is to use native data lake storage lifecycle management. However, that's not a viable option for data stored in Delta/Parquet formats.
If you use the Delta format, you can more easily apply retention or pseudonymize data.
As a second option, imagine that you have a table with information about all datasets (dataset_friendly_name, path, retention time, zone, sensitive_columns, owner, etc.). Your Databricks users use a small wrapper to read/write:
DataWrapper.Read("dataset_friendly_name")
DataWrapper.Write("destination_dataset_friendly_name")

It's then up to you to implement the logging and data loading behind the scenes. In addition, you can skip sensitive_columns and act based on retention time (both are available in the dataset info table). This requires quite some effort.

You can always expand this table to a more advanced schema and add extra information about pipelines, dependencies, etc. (see 2.4)
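Here is one way the wrapper idea could be sketched. Everything in it is an assumption made for illustration: the `DatasetInfo` fields mirror the dataset info table described above, an in-memory dict stands in for the data lake, rows are plain dicts with a hypothetical `loaded_on` date column, and `write` takes the rows as an extra argument.

```python
import logging
from dataclasses import dataclass
from datetime import date, timedelta

logger = logging.getLogger("data_wrapper")


@dataclass
class DatasetInfo:
    path: str
    retention_days: int
    sensitive_columns: tuple = ()


class DataWrapper:
    def __init__(self, registry, storage):
        self.registry = registry  # dataset_friendly_name -> DatasetInfo
        self.storage = storage    # path -> list of row dicts (stand-in for the lake)

    def read(self, name, today=None):
        info = self.registry[name]
        cutoff = (today or date.today()) - timedelta(days=info.retention_days)
        logger.info("read %s (%s)", name, info.path)
        # Drop rows past retention and strip sensitive columns before returning.
        return [
            {k: v for k, v in row.items() if k not in info.sensitive_columns}
            for row in self.storage.get(info.path, [])
            if row["loaded_on"] >= cutoff
        ]

    def write(self, name, rows):
        info = self.registry[name]
        logger.info("write %s (%s)", name, info.path)
        self.storage.setdefault(info.path, []).extend(rows)


registry = {
    "customers": DatasetInfo("silver/customers", retention_days=30,
                             sensitive_columns=("email",)),
}
wrapper = DataWrapper(registry, storage={})
wrapper.write("customers", [
    {"id": 1, "email": "a@example.com", "loaded_on": date(2020, 12, 1)},
    {"id": 2, "email": "b@example.com", "loaded_on": date(2020, 10, 1)},
])
rows = wrapper.read("customers", today=date(2020, 12, 4))
print(rows)
```

Because every read and write goes through one place, the log calls double as the custom read/write logging mentioned in the lineage section.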

Hopefully you find something useful in my answer. It would be interesting to know which path you took.

Azure Purview is a new service and it would fit your data governance needs well. It is currently (2020-12-04) in public preview. It contains the features you are looking for in your question, e.g. data lineage, and works well with the Azure services you are using (Synapse, Databricks, ADLSg2).

Purview is not a cloud-agnostic solution. It exposes the Apache Atlas API, so some core capabilities and integrations could be run in any cloud. I would still categorize Purview as an Azure-specific solution.

Purview can manage hybrid data, e.g. data on-premises or in other clouds. In this sense it is agnostic about where your data lives. If you need to keep some data or use cases outside Azure, Purview will be able to manage these data assets too.

I saw that data quality features are on the Purview roadmap and will be available later. Also other governance topics will be covered later, e.g. policies.

More info on Purview here: https://azure.microsoft.com/en-us/services/purview/
