What is a good Databricks workflow

Great question. Definitely don't modify your production code in place.

One recommended pattern is to keep separate folders in your workspace for dev, staging, and prod. Do your dev work, then run tests in staging before finally promoting to production.

You can use the Databricks CLI to pull and push a notebook from one folder to another without breaking existing code. Going one step further, you can incorporate this pattern with git to sync with version control. In either case, the CLI gives you programmatic access to the workspace and that should make it easier to update code for production jobs.
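As a rough illustration of that programmatic access, here is a minimal sketch that copies a notebook from one workspace folder to another. It calls the workspace REST endpoints (`/api/2.0/workspace/export` and `/api/2.0/workspace/import`) directly rather than the CLI itself. The host, token, and folder paths are placeholders, and the helper names are made up for this example.

```python
import json
import urllib.request
from urllib.parse import urlencode


def export_request(host: str, token: str, path: str) -> urllib.request.Request:
    """Build the GET request that exports a notebook as source code."""
    query = urlencode({"path": path, "format": "SOURCE"})
    return urllib.request.Request(
        f"{host}/api/2.0/workspace/export?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )


def import_payload(path: str, content_b64: str, language: str = "PYTHON") -> dict:
    """Build the JSON body that imports base64-encoded source at `path`."""
    return {
        "path": path,
        "format": "SOURCE",
        "language": language,
        "content": content_b64,
        "overwrite": True,  # replace the target notebook if it already exists
    }


def promote_notebook(host: str, token: str, src_path: str, dst_path: str) -> None:
    """Copy one notebook, e.g. from a staging folder to a prod folder."""
    with urllib.request.urlopen(export_request(host, token, src_path)) as resp:
        exported = json.load(resp)  # the response carries base64 "content"
    body = json.dumps(import_payload(dst_path, exported["content"])).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/2.0/workspace/import",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    urllib.request.urlopen(request).close()


# Hypothetical usage (placeholder host, token, and paths):
# promote_notebook("https://<your-workspace-host>", "<token>", "/staging/etl", "/prod/etl")
```

Because the import overwrites the target in a single call, the production notebook is never left half-edited.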

Regarding your second point about IDEs: Databricks offers Databricks Connect, which lets you use your IDE while running commands on a cluster. Based on your pain points, I think this is a great solution for you, as it will give you more visibility into the functions you have defined and so on. You can also write and run your unit tests this way.

Once you have your scripts ready to go, you can always import them into the workspace as notebooks and run them as jobs. You can also run .py scripts as jobs using the REST API.
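For the REST API route, a one-time run of a .py script can be described with a small JSON body. The sketch below builds such a body for `POST /api/2.0/jobs/runs/submit`, using the field names from Jobs API 2.0 as I understand them (newer API versions nest the task inside a `tasks` list, so check the docs for your version). The script path, cluster ID, and run name are placeholders.

```python
import json
from typing import Sequence


def python_run_payload(
    python_file: str,
    parameters: Sequence[str],
    cluster_id: str,
    run_name: str = "adhoc-python-run",
) -> dict:
    """Build a runs/submit body that runs a .py file on an existing cluster."""
    return {
        "run_name": run_name,
        "existing_cluster_id": cluster_id,
        "spark_python_task": {
            "python_file": python_file,      # e.g. a script stored on DBFS
            "parameters": list(parameters),  # passed to the script as argv
        },
    }


# Placeholder values, for illustration only.
payload = python_run_payload(
    python_file="dbfs:/scripts/etl.py",
    parameters=["2020-01-31"],
    cluster_id="<cluster-id>",
)
print(json.dumps(payload, indent=2))
```

The printed JSON is what you would send, with a bearer token, to the runs/submit endpoint.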


I personally prefer to package my code, and copy the *.whl package to DBFS, where I can install the tested package and import it.

Edit: To be more explicit.

The notebook used in production can't be modified without breaking the production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.

This can be solved either by having separate DEV/TST/PRD environments or by having versioned packages that can be modified in isolation. I'll clarify later on.

My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks; if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined. Is there a way to do efficient and systematic testing?

Yes, using the versioned packages method I mentioned in combination with databricks-connect, you are totally able to use your IDE, implement tests, and have proper git integration.

Git integration is very simple, but this is not my main concern.

Built-in git integration is actually very poor when working in bigger teams. You can't develop in the same notebook simultaneously, as there's a flat and linear accumulation of changes that are shared with your colleagues. Besides that, you have to link and unlink repositories, which is prone to human error: notebooks end up synchronized in the wrong folders, and runs break because notebooks can't be imported. I advise you to also use my packaging solution.

The packaging solution works as follows (reference):

-On your desktop, install pyspark

-Download some anonymized data to work with

-Develop your code with small bits of data, writing unit tests

-When ready to test on big data, uninstall pyspark, install databricks-connect

-When performance and integration are sufficient, push code to your remote repo

-Create a build pipeline that runs automated tests, and builds the versioned package

-Create a release pipeline that copies the versioned package to DBFS

-In a "runner notebook" accept "process_date" and "data folder/filepath" as arguments, and import modules from your versioned package

-Pass the arguments to your module to run your tested code
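To make the last two steps concrete, here is a minimal sketch of what the packaged side could look like. The names (`RunArgs`, `parse_run_args`, `input_path`), the YYYY-MM-DD date format, and the year/month/day folder layout are all assumptions for illustration, not part of the original setup. The point is that argument handling lives in the tested package, so the runner notebook stays a few lines long.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class RunArgs:
    """Validated arguments handed over by the runner notebook."""
    process_date: date
    data_path: str


def parse_run_args(process_date: str, data_path: str) -> RunArgs:
    """Validate the raw string arguments; raises ValueError on bad input."""
    parsed = datetime.strptime(process_date, "%Y-%m-%d").date()
    if not data_path:
        raise ValueError("data_path must not be empty")
    return RunArgs(process_date=parsed, data_path=data_path.rstrip("/"))


def input_path(args: RunArgs) -> str:
    """Folder to read for this run (assumed year/month/day layout)."""
    return f"{args.data_path}/{args.process_date:%Y/%m/%d}"


# In the runner notebook, after installing the versioned package, the glue
# could look like this (dbutils and spark only exist on Databricks):
#
#   args = parse_run_args(dbutils.widgets.get("process_date"),
#                         dbutils.widgets.get("data_path"))
#   df = spark.read.parquet(input_path(args))
```

Because these functions take plain strings, they can be unit tested on your desktop without a cluster.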


The way we are doing it -

-Integrate the Dev notebooks with Azure DevOps.

-Create custom Build and Deployment tasks for Notebook, Jobs, package and cluster deployments. This is fairly easy to do with the Databricks REST API:

https://docs.databricks.com/dev-tools/api/latest/index.html

-Create Release pipeline for Test, Staging and Production deployments.

-Deploy on Test and test.

-Deploy on Staging and test.

-Deploy on production.
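The release sequence above is essentially a gated loop: deploy to a stage, test it, and only continue if the tests pass. A minimal sketch of that logic follows. The `deploy` and `verify` callables are hypothetical stand-ins for whatever your pipeline tasks actually do, and verifying production as well (for example with a smoke test) is my assumption.

```python
from typing import Callable, List, Sequence


def promote(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    verify: Callable[[str], bool],
) -> List[str]:
    """Deploy stage by stage, stopping after the first stage that fails verification."""
    deployed: List[str] = []
    for stage in stages:
        deploy(stage)
        deployed.append(stage)
        if not verify(stage):
            break  # do not promote any further
    return deployed


# Illustration with stand-in callables: Staging fails, so Production is never touched.
reached = promote(
    ["Test", "Staging", "Production"],
    deploy=lambda stage: None,
    verify=lambda stage: stage != "Staging",
)
print(reached)
```

In a hosted pipeline tool the same gating is usually configured rather than coded, but the control flow is the same.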

Hope this can help.
