databrickssnowflakemigrationlakehouseunity-catalog

Migration from Snowflake to Databricks — Lessons from a 2-Month Project

How we moved a client's Airflow + Snowflake data platform onto Databricks Lakehouse — the problems we hit, the trade-offs we made, and what we learned about governance, cost, and operations along the way.

Data DynamicsMay 8, 202611 min read

This post is a hands-on retrospective of a roughly two-month project in which we migrated a client's Airflow + Snowflake data platform onto a Databricks Lakehouse. It was not a "lift and shift" of queries — we had to redesign governance, architecture, and the cost model along the way. The notes below are organized in roughly the order we tackled them.

This article is an English rewrite of an original Medium post written by one of our engineers (Migration from Snowflake to Databricks), republished here on the company blog with the author's consent.

1. Project Overview

Starting Point

Existing stack: Airflow (orchestration) + Snowflake (DW) + assorted BI/ML tooling
Target stack: Databricks Lakehouse (Unity Catalog + Delta Lake + Workflows)
Duration: ~2 months
Team: 2–3 data engineers

Why Migrate at All

The motivation was not purely about cost. Several issues had been building up:

Data engineering, ML, and BI tools were fragmented, leading to multiple copies of the same data.
Fine-grained access control was awkward, and governance policies were inconsistent across business units.
Compute spend was trending upward, and ML workloads required additional GPU/Spark infrastructure on the side.
The engineering team wanted more direct control over Spark-based processing.

So the goal was not "swap Snowflake for Databricks." It was "consolidate scattered data assets into a single Lakehouse and redesign governance in the process."

2. Governance First — Unity Catalog and the Medallion Layout

Unity Catalog Before Anything Else

Before touching a single line of code, we designed the Unity Catalog catalog / schema / group structure.

In the old environment, Snowflake database/schema names were inconsistent across teams, and permissions had been granted to individual users in an ad-hoc way. Carrying that forward unchanged would have reproduced the same chaos in Databricks. So we set ground rules first:

Catalogs are per environment (prod, stg, dev).
Schemas are per business domain (sales, marketing, finance, …).
Permissions are granted to groups, never to individual users.
Account-level groups are synchronized via SCIM with the corporate IdP (Azure AD / Okta).

-- Catalog and schema setup
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA  IF NOT EXISTS prod.sales;
 
-- Group-level grants
GRANT USE CATALOG ON CATALOG prod TO `data-platform-readers`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.sales TO `sales-analysts`;

Medallion Architecture (Bronze / Silver / Gold)

Tables inside each schema were laid out following the medallion architecture:

Layer	Role	Shape of data
Bronze	Raw data as it arrived from source systems	append-only, schema drift allowed
Silver	Cleaned, standardized business entities	unified types, PII masking, dedup
Gold	Aggregations and business metrics for analytics/BI	consumed directly by dashboards / ML features

Splitting work along these layers let us migrate layer by layer and made it easy to track which tables had landed where.

3. Bronze — `COPY INTO` First, Auto Loader in the End

Starting With the Obvious Tool

We had hundreds of source tables to bring across. The first attempt was the most direct approach we could think of: unload from Snowflake into object storage, then load from object storage into Databricks.

-- Unload from Snowflake to external storage (S3 / ADLS)
COPY INTO @ext_stage/sales/
FROM raw.sales
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE;

-- Ingest on the Databricks side
COPY INTO prod.bronze.sales
FROM 's3://landing/sales/'
FILEFORMAT = PARQUET;

What Went Wrong — Type Mismatches and Operational Overhead

This worked well enough for the initial full load but ran into two real problems:

Type mismatches. Snowflake types like NUMBER(38,0), VARIANT, and TIMESTAMP_TZ came out of Parquet unload in shapes we did not want. We ended up writing per-table casting code on the Databricks side, and that code multiplied with every new table.
Awkward incremental ingestion. COPY INTO is partially idempotent, but tracking exactly which files had been processed safely required state we did not want to maintain across hundreds of tables.

The Decision — `COPY INTO` for Full Load, Auto Loader for Incremental

We split the responsibilities:

Initial full load → COPY INTO (a one-shot operation; simplicity wins)
Incremental ingestion → Auto Loader (cloudFiles)

The biggest win from Auto Loader was schema inference and evolution:

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/Volumes/prod/_schemas/sales")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("s3://landing/sales/")
)
 
(
    df.writeStream
    .option("checkpointLocation", "/Volumes/prod/_checkpoints/sales")
    .trigger(availableNow=True)
    .toTable("prod.bronze.sales")
)

Because we had to apply the same pattern to hundreds of tables, we wrapped the snippet above in a metadata-driven generator that produced one ingestion pipeline per table. After that, almost no per-table code had to be written by hand, and new columns added to the source were picked up by Bronze automatically.

Lesson: full load and incremental load are easier to reason about as two separate tools. Trying to do both with a single mechanism leaves both halves messy.

4. Silver / Gold — Translating Legacy SQL and Procedures

We Gave Up on Automated Conversion

The Silver and Gold layers had years of accumulated Snowflake SQL, views, and JavaScript-based stored procedures. We initially planned to use a converter such as BladeBridge, but switched to manual translation for three reasons:

The license cost would have blown the budget.
The verification effort for auto-converted output ended up comparable to writing the code by hand anyway.
The migration was a rare opportunity to clean up legacy logic rather than preserve it.

Common Snowflake → Databricks SQL Differences

Area	Snowflake	Databricks (Spark SQL)
Semi-structured	`VARIANT`, `OBJECT`, `ARRAY`	`STRUCT`, `MAP`, `ARRAY` + `from_json`
Time zones	`TIMESTAMP_TZ`	`TIMESTAMP` (UTC recommended)
Merge	`MERGE` (row-level)	`MERGE INTO` (Delta, same semantics)
Time travel	`AT (TIMESTAMP => ...)`	`TIMESTAMP AS OF` / `VERSION AS OF`
Generated columns	`DEFAULT ... AS ...`	`GENERATED ALWAYS AS (...)`
Stored procedures	JavaScript / SQL Scripting	Python notebook + Workflows

Tables that leaned heavily on VARIANT got special attention. We analyzed which JSON keys were actually queried downstream, flattened those into typed columns, and kept the rest as STRING (or VARIANT). That single exercise visibly reduced query cost on the consuming side.

We Did Not Force Stored Procedures Back Into SQL

The trickiest piece was the JavaScript-based stored procedures. We tried porting them to Databricks SQL Scripting first, but quickly switched to PySpark + Workflows because:

Branching, looping, and error handling are far clearer in Python than in SQL.
Unit tests become possible (we used pytest).
Each unit of logic maps cleanly onto a Workflow task.

# A representative Snowflake stored procedure rewritten in PySpark
from pyspark.sql import functions as F
 
def build_daily_sales(spark, run_date: str) -> None:
    src = (
        spark.table("prod.silver.orders")
        .where(F.col("order_date") == F.lit(run_date))
    )
 
    agg = (
        src.groupBy("store_id", "product_id")
           .agg(
               F.sum("amount").alias("gmv"),
               F.countDistinct("order_id").alias("orders"),
           )
           .withColumn("run_date", F.lit(run_date))
    )
 
    (
        agg.write
           .format("delta")
           .mode("overwrite")
           .option("replaceWhere", f"run_date = '{run_date}'")
           .saveAsTable("prod.gold.daily_sales")
    )

The use of replaceWhere made the job idempotent on re-run — running it any number of times for the same date produces the same result. That alone made operations much calmer.

5. Rethinking the Cost Model

Snowflake and Databricks Bill Differently

In Snowflake, the mental model is simple: scale a warehouse up to go faster, suspend it to stop paying. Databricks separates cluster, job, data, and storage dimensions, and applying the Snowflake mental model unchanged tends to leak money.

We adopted the following defaults:

Interactive (development / exploration) → All-purpose cluster or SQL Warehouse (Serverless)
Scheduled batch → Job cluster (created on job start, terminated on completion)
Ad-hoc analytics → Serverless SQL Warehouse with auto-stop

Job Cluster vs. Serverless

For each daily ETL job we measured both:

Criterion	Job Cluster	Serverless
Startup time	1–3 minutes	seconds
Unit price	Lower	Higher
Short, frequent jobs	Startup cost dominates	Better fit
Long, heavy jobs	Better fit	Unit price piles up

The conclusion: mix the two based on each job's average duration and frequency. Pushing every workload onto a single option was the wrong move financially.

Cluster Metrics Drove Instance-Type Decisions

We started with default instance types and revisited them after a few weeks using Databricks cluster metrics (CPU / memory / GC / shuffle):

Memory-pressured jobs → memory-optimized instances
CPU-bound jobs → compute-optimized
Heavy-shuffle jobs → instance families with stronger disk I/O

This step alone reduced cost on some jobs by 30–50%.

6. Lakehouse Federation for External Systems

Not every dataset needed to move. Some OLTP systems (Oracle, PostgreSQL) remained owned by their respective operations teams, and analytics only needed read access.

For those systems we used Lakehouse Federation (now Lakeflow Connect) — no physical data movement required.

-- Mount an external Oracle as a foreign catalog
CREATE CONNECTION oracle_prod TYPE oracle
OPTIONS (
  host '...', port '1521', user '...', password '...'
);
 
CREATE FOREIGN CATALOG oracle_prod
USING CONNECTION oracle_prod
OPTIONS (database 'ORCL');
 
-- Then query it like any other catalog
SELECT *
FROM oracle_prod.app.customer c
JOIN prod.silver.orders     o
  ON o.customer_id = c.id;

Used judiciously — for lightweight lookups and joins that do not strain the source — Federation gives you an integrated view inside the Lakehouse without having to build yet another ETL pipeline.

7. Pitfalls We Saw Repeatedly

A few anti-patterns showed up over and over again. Teams new to Databricks tend to hit these in roughly the same order.

7.1. Don't Slice Job Dependencies Too Finely

Out of Airflow habit, every step had been turned into a separate Databricks job. As soon as job-to-job dependencies grow, you pay for it twice:

Each job pays its own cluster startup cost.
Failure surface area grows, making operations harder to monitor.

Rule of thumb: if a sequence of steps is fine to run on the same cluster in order, model them as tasks inside one Workflow, not as separate jobs.

7.2. The `display`, `collect`, `toPandas` Trap

Notebook authors reach for display(df) or df.collect() to inspect results during development, and these calls then survive into production code.

collect / toPandas pull all data to the driver — a classic OOM source.
display is meaningless cost outside an interactive context.

Strip these from production code, or replace them with a LIMIT-bounded show.

7.3. Single-Node pandas

When porting Python code unchanged, the heavy transformation step is often still pandas-on-the-driver. It works fine on small data and explodes at some point later.

Rewrite into the PySpark DataFrame API where possible.
If that is impractical, fall back to pandas API on Spark (pyspark.pandas).

7.4. Secrets in Notebooks

The most common security mistake early in the migration was pasting passwords directly into notebook cells. Always go through dbutils.secrets and a secret scope:

jdbc_password = dbutils.secrets.get(scope="prod", key="oracle_password")

8. Recommended Team Composition

For a project of this size (hundreds of source tables, two months), we recommend:

Role	Headcount	Responsibility
Cloud infrastructure engineer	1	Networking, IAM, storage, Databricks workspace setup
PySpark-fluent data engineer	1–2	Silver/Gold transformations, job authoring
Databricks-experienced engineer	1+	Unity Catalog design, cluster policies, cost guardrails
(Optional) Analytics / BI lead	1	Gold-layer validation, dashboard migration

In particular, we do not recommend starting a project of this scope without at least one Databricks-experienced engineer on the team. Wrong early decisions about catalogs, permissions, or cluster policies are expensive to undo later.

9. After the Migration

The numbers measured shortly after going live:

Area	Change
Infrastructure cost	~25% of the previous spend
Workflow integration	Data engineering / ML / BI on the same platform
Governance	Consistent group- and domain-level policies via Unity Catalog
Data quality	Result consistency improved through dedup and pipeline cleanup

The Most Important Lesson

The biggest lesson from this project: a migration is not a system relocation, it is an opportunity to redesign your data assets. Had we lifted Snowflake tables into Databricks one-for-one, we might have saved some cost, but we would have carried over every governance and quality problem along with the data.

While migrating, we deliberately:

retired tables that no one had used in months,
collapsed redundant copies of the same dataset into a single Silver table, and
rewrote inefficient SQL and stored-procedure logic.

That cleanup ended up creating more value than the cost savings did.