Lessons learnt from migrating Notebooks-only Databricks Jobs to Databricks Asset Bundles

Jacek Laskowski
3 min read · Aug 11, 2024

--

I remember the days I used terraform to deploy Databricks jobs and pipelines using the Databricks provider. You know, it’s that feeling when you’re still programming, but in just another tiny domain-specific language (DSL) that comes with a tool you hear so much about lately, and you can adjust your mindset fairly easily. It was OKish.

I could live with it, but it was neither Databricks-centric nor developer-focused.

It was the only tool I could use to have some form of automation to deploy Databricks jobs. There was no other viable choice 🤷

Until Databricks Asset Bundles (DAB) hit the shelves! 🥳

Alongside all the other neat features of Databricks CLI, Databricks engineers finally got great tooling.

Databricks CLI gives you auto-completion, with DAB-related commands baked in (as the bundle command), to name just one feature.

Auto-completion in Databricks CLI (bundle command)

I finally could leave the point-and-click way of deploying jobs and pipelines using Databricks UI behind and be back on the command line again! 😎

With all this new excitement, I needed more real-life production deployments to give DAB a serious try and unlock its full potential.

Being patient paid off real soon.

What was clear from my recent gigs is that there’s so much to digest under such a simple-looking fancy name, Databricks Data Intelligence Platform, with lots of intelligence baked in (and tons of buzzwords). Just focusing on the foundational aspects of the Databricks platform can cause a headache.

I’ve been in this camp myself, too.

And then there are the high-level closed-source APIs.

And that’s just a very narrow view of what’s possible using Databricks (I don’t want to bother you with Databricks SQL, Delta Sharing, Mosaic AI, etc.) Databricks Inc. has been introducing products so quickly that it’s not hard to notice they have been struggling with proper naming, consistency and…well…unifying them all in one simple-to-use platform.

Back to Databricks Asset Bundles.

This is a relatively new tool, too. Just after Databricks DevOps teams accepted terraform, they’re now faced with a choice: learn yet another terraform-like utility or stick with the status quo of terraform (which is just good enough). Either way, there’s quite a lot to learn. And to apply properly.

tl;dr Databricks Asset Bundles is for Databricks developers as much as terraform is for cloud DevOps. The focus is different, and so are the tools.

If you want a CI/CD setup to validate, deploy, and run Databricks assets, use DAB, and you won’t look back.
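As a sketch of what that setup looks like in a bundle’s configuration file (the bundle name and workspace URLs below are my own placeholder assumptions, not from any real project):

```yaml
# databricks.yml — a minimal bundle with two deployment targets
bundle:
  name: my_jobs_bundle        # hypothetical bundle name

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://my-dev-workspace.cloud.databricks.com   # assumed URL
  prod:
    mode: production
    workspace:
      host: https://my-prod-workspace.cloud.databricks.com  # assumed URL
```

With that in place, the whole loop is three commands: `databricks bundle validate`, `databricks bundle deploy -t dev`, and `databricks bundle run -t dev <job_key>`.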

Here are the lessons learnt from a couple of my recent Databricks Asset Bundles gigs.

  1. When dealing with a single notebook-based Databricks job, start by applying what I used to call a “dbt mindset”: there are as many tasks in the job as there are tables created. That leaves you with a multi-task job with one task per table. Here, tables are also the results of transformations (aka the DataFrames, or dfs, in your Python code).
  2. Move off Python toward more SQL (for better collaboration with non-programmers) and let the higher-level language help you understand data transformations better. It’s worth the extra mental effort.
  3. Any Python code should be part of a library attached to a cluster. Avoid %pip installs at the top of your notebooks (for better visibility of the dependencies).
  4. Use DAB’s Custom variables and Substitutions early and often.
  5. Make every effort to use Delta Live Tables as early as possible (but don’t overstretch your imagination, or you won’t be able to deliver on time). Aim at DLT as the ultimate goal, but keep taking baby steps with Databricks Workflows (jobs) until the jobs are ready for an “upgrade” to DLT pipelines.
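To sketch lessons 1, 3, and 4 together — one task per table, a library declared on the job cluster instead of %pip installs, and a custom variable with substitutions — here’s what the resources section of a databricks.yml might look like (the table names, notebook paths, and wheel file are hypothetical):

```yaml
# databricks.yml (fragment) — hypothetical names throughout
variables:
  catalog:
    description: Target catalog for all tables
    default: dev_catalog

resources:
  jobs:
    orders_etl:
      name: orders_etl (${bundle.target})
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            num_workers: 2
      tasks:
        # One task per table created (the "dbt mindset")
        - task_key: bronze_orders
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: ./notebooks/bronze_orders.py
            base_parameters:
              catalog: ${var.catalog}   # substitution of a custom variable
          libraries:
            # Python code ships as a library, not %pip in the notebook
            - whl: ./dist/my_transforms-0.1.0-py3-none-any.whl
        - task_key: silver_orders
          depends_on:
            - task_key: bronze_orders
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: ./notebooks/silver_orders.py
            base_parameters:
              catalog: ${var.catalog}
```

The `${var.catalog}` and `${bundle.target}` substitutions keep the same job definition deployable per target with no copy-pasting.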

More thoughts to come soon as the discoveries settle down a bit in my mind (and I receive feedback from my dear Medium readers ❤️)

Email me at jacek@japila.pl should you want your Databricks development team to jump on the best of Databricks Asset Bundles.

DAB is not hard to learn, but some help from a practitioner can go a long way 😎


Jacek Laskowski

Freelance Data(bricks) Engineer | #ApacheSpark #DeltaLake #Databricks #ApacheKafka #KafkaStreams | Java Champion | @theASF | #DatabricksBeacons