How to do CI/CD for Databricks
March 13, 2026
A practical guide to CI/CD for Databricks, comparing databricks sync for fast iteration with Databricks Asset Bundles for production-grade deployments.
Last Updated: March 2026
CI/CD for Databricks uses two tools: databricks sync to push local file changes to your workspace in real time during development, and Databricks Asset Bundles (DABs) to deploy notebooks, jobs, pipelines, and cluster configurations through automated CI/CD pipelines across dev, staging, and production environments. Most teams use both.
This guide covers when to use each, how to set them up, and how they compare.
databricks sync: the inner development loop
databricks sync is a CLI command that syncs files from a local directory to your Databricks workspace. Think of it like rsync for Databricks. There are two ways to run it:
- Without flags, it does a one-time push of your files.
- With --watch, it stays running, automatically pushing changes every time you save a file locally.
When to use it
If you're writing a notebook or a Python module and want to test it against a running cluster, databricks sync gets your code there without manually uploading through the UI or pushing a branch and waiting for a pipeline. This is most useful when iterating on notebook logic against real data, or developing Python libraries that get imported by other notebooks. It's also good for prototyping new jobs before you formalize them into a bundle.
How it works
Install the Databricks CLI (v0.200+) and authenticate:
databricks auth login --host https://<workspace>.cloud.databricks.com
One-time sync:
databricks sync . /Repos/<user>/<project>
Continuous sync (stays running, pushes on every file save):
databricks sync . /Repos/your-user/your-project --watch
You can point either at a Repos path or a workspace path depending on your setup.
Limitations
databricks sync only handles files. Moving Python modules and notebooks is trivially easy, but it won't help you with jobs, pipelines, or dashboards. It also doesn't handle multiple environments. There's no concept of "deploy this to staging" with sync. It's a 1:1 mapping from your local directory to a single workspace path.
That's fine for development, but we need something a bit stronger for production.
Databricks Asset Bundles: the outer deployment loop
Databricks Asset Bundles (DABs) are the production deployment tool. Databricks introduced them in 2023 to replace dbx (now deprecated). A bundle is a directory with your source code and YAML config files that define your jobs, pipelines, clusters, permissions, and environment-specific overrides.
What goes in a bundle
A typical bundle looks like this:
my-project/
├── databricks.yml       # bundle definition
├── resources/
│   ├── jobs.yml         # job definitions
│   └── pipelines.yml    # DLT pipeline definitions
├── src/
│   ├── notebooks/
│   │   └── transform.py
│   └── libraries/
│       └── utils.py
└── tests/
    └── test_utils.py
databricks.yml is where you define your bundle name, workspace mappings, and which resource files to include. A simple version:
bundle:
  name: my-etl-project

workspace:
  host: https://your-workspace.cloud.databricks.com

targets:
  dev:
    mode: development
    default: true
    workspace:
      root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev
  staging:
    workspace:
      host: https://<staging-workspace>.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/staging
  prod:
    workspace:
      host: https://<prod-workspace>.cloud.databricks.com
      root_path: /Shared/.bundle/${bundle.name}/prod
    permissions:
      - level: CAN_MANAGE
        group_name: production-admins

include:
  - resources/*.yml
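The resource files picked up by that include glob define the jobs themselves. Here is a minimal sketch of what resources/jobs.yml might contain; the job name, task, notebook path, and cluster settings are all hypothetical placeholders, not values from this project:

```yaml
# resources/jobs.yml -- hypothetical job definition (illustrative names)
resources:
  jobs:
    nightly_etl:
      name: nightly-etl-${bundle.target}
      tasks:
        - task_key: transform
          notebook_task:
            notebook_path: ../src/notebooks/transform.py
          job_cluster_key: etl_cluster
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2
```

Using a job cluster (new_cluster inside job_clusters) rather than an all-purpose cluster means the cluster terminates when the job finishes, which matters for CI costs later.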
The deployment workflow
DABs have three commands:
- databricks bundle validate checks your YAML for errors and resolves variable references. Run this in CI before anything else.
- databricks bundle deploy -t <target> pushes your code and resource definitions to the target workspace. It creates or updates jobs, pipelines, and other resources to match your config.
- databricks bundle run -t <target> <resource> triggers a specific job or pipeline after deployment.
A CI/CD pipeline using DABs typically looks like this:
# .github/workflows/deploy.yml
name: Deploy Databricks Bundle

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle validate -t staging

  deploy-staging:
    needs: validate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy -t staging
      - run: databricks bundle run -t staging integration_tests

  deploy-prod:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: databricks/setup-cli@main
      - run: databricks bundle deploy -t prod
Why DABs beat the alternatives
Before DABs, people used a mix of approaches: custom Python scripts calling the Databricks REST API, Terraform for workspace resources, dbx for job deployment, or just clicking around in the UI. Each had problems.
Terraform can manage Databricks resources, and I still recommend it for workspace-level infrastructure (Unity Catalog, metastores, workspace configuration). But Terraform treats notebooks and job definitions as opaque blobs. It doesn't understand the relationship between a notebook and the job that runs it. DABs do. They were built for this.
dbx worked but was maintained by Databricks Labs, not the core team. It used its own config format that didn't align well with the REST API. When Databricks shipped DABs, they deprecated dbx and told everyone to migrate. If you're still on dbx, you should move.
The UI is fine for exploration. It's a disaster for production. No version control, no review process, no rollback, no audit trail. If someone edits a production job through the UI at 2am, good luck figuring out what changed when it breaks the next morning.
databricks sync vs Databricks Asset Bundles
Side by side:
| | databricks sync | Databricks Asset Bundles |
|---|---|---|
| Purpose | Fast local development | Production deployment |
| What it deploys | Files only | Files, jobs, pipelines, clusters, permissions |
| Environment support | Single workspace path | Multiple targets (dev/staging/prod) |
| CI/CD integration | Not designed for it | Built for it |
| Config format | None | YAML (databricks.yml) |
| State management | None | Tracks deployed resources |
| Rollback | Revert your files | Redeploy previous commit |
| Learning curve | 5 minutes | A few hours |
They're complementary. databricks sync is the tool you use while writing code. DABs are the tool you use to ship it.
Practical setup for a team
If I were setting up CI/CD for a Databricks project from scratch, here's the workflow I'd build.
Local development
Each developer installs the Databricks CLI and authenticates to the dev workspace. They use databricks sync to push code to their personal directory under /Repos/username/project. Some teams share dev clusters; others let individuals spin up their own. Either works, just keep an eye on the bill.
Branch workflow
Your repo has a databricks.yml with at least two targets: dev and prod. Developers work on feature branches. When they open a pull request, CI runs databricks bundle validate to catch config errors. Code review happens in the PR like any other project.
Deployment pipeline
When a PR merges to main, CI deploys to staging using databricks bundle deploy -t staging and runs integration tests. If tests pass, a manual approval gate triggers the production deployment. Some teams automate this fully; others want a human click before prod.
Testing
Testing is the hard part. Unit tests for pure Python functions are easy. But testing notebook logic that queries Delta tables or calls Spark means you need a running cluster, and that costs money. Some options:
- Run unit tests locally with pytest for logic that doesn't depend on Spark
- Use databricks bundle run to trigger a test job on a cluster in CI
- Use Databricks Connect to run Spark code from your local machine against a remote cluster
- For DLT pipelines, deploy to a staging environment and validate output tables
I lean towards option 2, using databricks bundle run and testing on a cluster in dev. Maybe you've got a high-end MacBook and would rather test locally. There's no one right answer: pick what works best for you.
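For the first option, the bundle layout shown earlier (src/libraries/utils.py with tests/test_utils.py) keeps pure logic testable without a cluster. A minimal sketch, assuming a hypothetical column-name-normalizing helper; the function and its behavior are illustrative, not part of any real project:

```python
# src/libraries/utils.py -- hypothetical pure-Python helper, no Spark required
import re

def normalize_column_name(name: str) -> str:
    """Lowercase a column name and collapse runs of non-alphanumerics to '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# tests/test_utils.py -- runs locally with plain pytest, no cluster involved
def test_normalize_column_name():
    assert normalize_column_name("Order Date") == "order_date"
    assert normalize_column_name("  Total ($)  ") == "total"
```

Anything that touches Spark or Delta tables stays out of these tests and gets exercised by the bundle-run integration job instead.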
Common mistakes
Skipping validation in CI. databricks bundle validate catches typos, missing references, and schema errors before you waste time on a failed deployment. Always run it.
One giant job definition. If your resources/jobs.yml is 500 lines long, split it up. DABs let you use multiple YAML files and include them. Keep each job or pipeline in its own file.
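Because databricks.yml includes resources/*.yml as a glob, splitting is purely a matter of file layout: each new file under resources/ is picked up automatically. A sketch with a hypothetical job name:

```yaml
# resources/ingest_orders.yml -- one job per file (illustrative names)
resources:
  jobs:
    ingest_orders:
      name: ingest-orders
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ../src/notebooks/ingest.py
```

Each additional job or pipeline gets its own small file alongside it, which keeps diffs in code review focused on one resource at a time.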
No environment separation. I've seen teams deploy directly to production with no staging environment. It works until it doesn't. Set up at least two targets from day one.
Editing production through the UI. This undermines your entire CI/CD setup. If someone changes a job in the UI, the next bundle deploy will overwrite those changes. Use workspace permissions to restrict who can edit production resources directly.
Ignoring cluster costs in CI. Every bundle run in CI starts a cluster. If your integration tests take 20 minutes and you're running them on every push, that adds up fast. Use job clusters (they terminate after the job finishes) and consider running expensive tests only on merges to main.
Conclusion
databricks sync gets your code to the workspace fast. DABs get it to production safely. Start with a minimal databricks.yml, get bundle deploy running in CI, and add staging environments and approval gates as your team needs them.