Migration from Snowflake to Databricks — Lessons from a 2-Month Project
How we moved a client's Airflow + Snowflake data platform onto Databricks Lakehouse — the problems we hit, the trade-offs we made, and what we learned about governance, cost, and operations along the way.
This post is a hands-on retrospective of a roughly two-month project in which we migrated a client's Airflow + Snowflake data platform onto a Databricks Lakehouse. It was not a "lift and shift" of queries — we had to redesign governance, architecture, and the cost model along the way. The notes below are organized in roughly the order we tackled them.
This article is an English rewrite of an original Medium post written by one of our engineers (Migration from Snowflake to Databricks), republished here on the company blog with the author's consent.
1. Project Overview
Starting Point
- Existing stack: Airflow (orchestration) + Snowflake (DW) + assorted BI/ML tooling
- Target stack: Databricks Lakehouse (Unity Catalog + Delta Lake + Workflows)
- Duration: ~2 months
- Team: 2–3 data engineers
Why Migrate at All
The motivation was not purely about cost. Several issues had been building up:
- Data engineering, ML, and BI tools were fragmented, leading to multiple copies of the same data.
- Fine-grained access control was awkward, and governance policies were inconsistent across business units.
- Compute spend was trending upward, and ML workloads required additional GPU/Spark infrastructure on the side.
- The engineering team wanted more direct control over Spark-based processing.
So the goal was not "swap Snowflake for Databricks." It was "consolidate scattered data assets into a single Lakehouse and redesign governance in the process."
2. Governance First — Unity Catalog and the Medallion Layout
Unity Catalog Before Anything Else
Before touching a single line of code, we designed the Unity Catalog catalog / schema / group structure.
In the old environment, Snowflake database/schema names were inconsistent across teams, and permissions had been granted to individual users in an ad-hoc way. Carrying that forward unchanged would have reproduced the same chaos in Databricks. So we set ground rules first:
- Catalogs are per environment (
prod,stg,dev). - Schemas are per business domain (
sales,marketing,finance, …). - Permissions are granted to groups, never to individual users.
- Account-level groups are synchronized via SCIM with the corporate IdP (Azure AD / Okta).
-- Catalog and schema setup
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA IF NOT EXISTS prod.sales;
-- Group-level grants
GRANT USE CATALOG ON CATALOG prod TO `data-platform-readers`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.sales TO `sales-analysts`;Medallion Architecture (Bronze / Silver / Gold)
Tables inside each schema were laid out following the medallion architecture:
| Layer | Role | Shape of data |
|---|---|---|
| Bronze | Raw data as it arrived from source systems | append-only, schema drift allowed |
| Silver | Cleaned, standardized business entities | unified types, PII masking, dedup |
| Gold | Aggregations and business metrics for analytics/BI | consumed directly by dashboards / ML features |
Splitting work along these layers let us migrate layer by layer and made it easy to track which tables had landed where.
3. Bronze — COPY INTO First, Auto Loader in the End
Starting With the Obvious Tool
We had hundreds of source tables to bring across. The first attempt was the most direct approach we could think of: unload from Snowflake into object storage, then load from object storage into Databricks.
-- Unload from Snowflake to external storage (S3 / ADLS)
COPY INTO @ext_stage/sales/
FROM raw.sales
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE;-- Ingest on the Databricks side
COPY INTO prod.bronze.sales
FROM 's3://landing/sales/'
FILEFORMAT = PARQUET;What Went Wrong — Type Mismatches and Operational Overhead
This worked well enough for the initial full load but ran into two real problems:
- Type mismatches. Snowflake types like
NUMBER(38,0),VARIANT, andTIMESTAMP_TZcame out of Parquet unload in shapes we did not want. We ended up writing per-table casting code on the Databricks side, and that code multiplied with every new table. - Awkward incremental ingestion.
COPY INTOis partially idempotent, but tracking exactly which files had been processed safely required state we did not want to maintain across hundreds of tables.
The Decision — COPY INTO for Full Load, Auto Loader for Incremental
We split the responsibilities:
- Initial full load →
COPY INTO(a one-shot operation; simplicity wins) - Incremental ingestion → Auto Loader (
cloudFiles)
The biggest win from Auto Loader was schema inference and evolution:
df = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "parquet")
.option("cloudFiles.schemaLocation", "/Volumes/prod/_schemas/sales")
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.load("s3://landing/sales/")
)
(
df.writeStream
.option("checkpointLocation", "/Volumes/prod/_checkpoints/sales")
.trigger(availableNow=True)
.toTable("prod.bronze.sales")
)Because we had to apply the same pattern to hundreds of tables, we wrapped the snippet above in a metadata-driven generator that produced one ingestion pipeline per table. After that, almost no per-table code had to be written by hand, and new columns added to the source were picked up by Bronze automatically.
Lesson: full load and incremental load are easier to reason about as two separate tools. Trying to do both with a single mechanism leaves both halves messy.
4. Silver / Gold — Translating Legacy SQL and Procedures
We Gave Up on Automated Conversion
The Silver and Gold layers had years of accumulated Snowflake SQL, views, and JavaScript-based stored procedures. We initially planned to use a converter such as BladeBridge, but switched to manual translation for three reasons:
- The license cost would have blown the budget.
- The verification effort for auto-converted output ended up comparable to writing the code by hand anyway.
- The migration was a rare opportunity to clean up legacy logic rather than preserve it.
Common Snowflake → Databricks SQL Differences
| Area | Snowflake | Databricks (Spark SQL) |
|---|---|---|
| Semi-structured | VARIANT, OBJECT, ARRAY | STRUCT, MAP, ARRAY + from_json |
| Time zones | TIMESTAMP_TZ | TIMESTAMP (UTC recommended) |
| Merge | MERGE (row-level) | MERGE INTO (Delta, same semantics) |
| Time travel | AT (TIMESTAMP => ...) | TIMESTAMP AS OF / VERSION AS OF |
| Generated columns | DEFAULT ... AS ... | GENERATED ALWAYS AS (...) |
| Stored procedures | JavaScript / SQL Scripting | Python notebook + Workflows |
Tables that leaned heavily on VARIANT got special attention. We analyzed which JSON keys were actually queried downstream, flattened those into typed columns, and kept the rest as STRING (or VARIANT). That single exercise visibly reduced query cost on the consuming side.
We Did Not Force Stored Procedures Back Into SQL
The trickiest piece was the JavaScript-based stored procedures. We tried porting them to Databricks SQL Scripting first, but quickly switched to PySpark + Workflows because:
- Branching, looping, and error handling are far clearer in Python than in SQL.
- Unit tests become possible (we used pytest).
- Each unit of logic maps cleanly onto a Workflow task.
# A representative Snowflake stored procedure rewritten in PySpark
from pyspark.sql import functions as F
def build_daily_sales(spark, run_date: str) -> None:
src = (
spark.table("prod.silver.orders")
.where(F.col("order_date") == F.lit(run_date))
)
agg = (
src.groupBy("store_id", "product_id")
.agg(
F.sum("amount").alias("gmv"),
F.countDistinct("order_id").alias("orders"),
)
.withColumn("run_date", F.lit(run_date))
)
(
agg.write
.format("delta")
.mode("overwrite")
.option("replaceWhere", f"run_date = '{run_date}'")
.saveAsTable("prod.gold.daily_sales")
)The use of replaceWhere made the job idempotent on re-run — running it any number of times for the same date produces the same result. That alone made operations much calmer.
5. Rethinking the Cost Model
Snowflake and Databricks Bill Differently
In Snowflake, the mental model is simple: scale a warehouse up to go faster, suspend it to stop paying. Databricks separates cluster, job, data, and storage dimensions, and applying the Snowflake mental model unchanged tends to leak money.
We adopted the following defaults:
- Interactive (development / exploration) → All-purpose cluster or SQL Warehouse (Serverless)
- Scheduled batch → Job cluster (created on job start, terminated on completion)
- Ad-hoc analytics → Serverless SQL Warehouse with auto-stop
Job Cluster vs. Serverless
For each daily ETL job we measured both:
| Criterion | Job Cluster | Serverless |
|---|---|---|
| Startup time | 1–3 minutes | seconds |
| Unit price | Lower | Higher |
| Short, frequent jobs | Startup cost dominates | Better fit |
| Long, heavy jobs | Better fit | Unit price piles up |
The conclusion: mix the two based on each job's average duration and frequency. Pushing every workload onto a single option was the wrong move financially.
Cluster Metrics Drove Instance-Type Decisions
We started with default instance types and revisited them after a few weeks using Databricks cluster metrics (CPU / memory / GC / shuffle):
- Memory-pressured jobs → memory-optimized instances
- CPU-bound jobs → compute-optimized
- Heavy-shuffle jobs → instance families with stronger disk I/O
This step alone reduced cost on some jobs by 30–50%.
6. Lakehouse Federation for External Systems
Not every dataset needed to move. Some OLTP systems (Oracle, PostgreSQL) remained owned by their respective operations teams, and analytics only needed read access.
For those systems we used Lakehouse Federation (now Lakeflow Connect) — no physical data movement required.
-- Mount an external Oracle as a foreign catalog
CREATE CONNECTION oracle_prod TYPE oracle
OPTIONS (
host '...', port '1521', user '...', password '...'
);
CREATE FOREIGN CATALOG oracle_prod
USING CONNECTION oracle_prod
OPTIONS (database 'ORCL');
-- Then query it like any other catalog
SELECT *
FROM oracle_prod.app.customer c
JOIN prod.silver.orders o
ON o.customer_id = c.id;Used judiciously — for lightweight lookups and joins that do not strain the source — Federation gives you an integrated view inside the Lakehouse without having to build yet another ETL pipeline.
7. Pitfalls We Saw Repeatedly
A few anti-patterns showed up over and over again. Teams new to Databricks tend to hit these in roughly the same order.
7.1. Don't Slice Job Dependencies Too Finely
Out of Airflow habit, every step had been turned into a separate Databricks job. As soon as job-to-job dependencies grow, you pay for it twice:
- Each job pays its own cluster startup cost.
- Failure surface area grows, making operations harder to monitor.
Rule of thumb: if a sequence of steps is fine to run on the same cluster in order, model them as tasks inside one Workflow, not as separate jobs.
7.2. The display, collect, toPandas Trap
Notebook authors reach for display(df) or df.collect() to inspect results during development, and these calls then survive into production code.
collect/toPandaspull all data to the driver — a classic OOM source.displayis meaningless cost outside an interactive context.
Strip these from production code, or replace them with a LIMIT-bounded show.
7.3. Single-Node pandas
When porting Python code unchanged, the heavy transformation step is often still pandas-on-the-driver. It works fine on small data and explodes at some point later.
- Rewrite into the PySpark DataFrame API where possible.
- If that is impractical, fall back to pandas API on Spark (
pyspark.pandas).
7.4. Secrets in Notebooks
The most common security mistake early in the migration was pasting passwords directly into notebook cells. Always go through dbutils.secrets and a secret scope:
jdbc_password = dbutils.secrets.get(scope="prod", key="oracle_password")8. Recommended Team Composition
For a project of this size (hundreds of source tables, two months), we recommend:
| Role | Headcount | Responsibility |
|---|---|---|
| Cloud infrastructure engineer | 1 | Networking, IAM, storage, Databricks workspace setup |
| PySpark-fluent data engineer | 1–2 | Silver/Gold transformations, job authoring |
| Databricks-experienced engineer | 1+ | Unity Catalog design, cluster policies, cost guardrails |
| (Optional) Analytics / BI lead | 1 | Gold-layer validation, dashboard migration |
In particular, we do not recommend starting a project of this scope without at least one Databricks-experienced engineer on the team. Wrong early decisions about catalogs, permissions, or cluster policies are expensive to undo later.
9. After the Migration
The numbers measured shortly after going live:
| Area | Change |
|---|---|
| Infrastructure cost | ~25% of the previous spend |
| Workflow integration | Data engineering / ML / BI on the same platform |
| Governance | Consistent group- and domain-level policies via Unity Catalog |
| Data quality | Result consistency improved through dedup and pipeline cleanup |
The Most Important Lesson
The biggest lesson from this project: a migration is not a system relocation, it is an opportunity to redesign your data assets. Had we lifted Snowflake tables into Databricks one-for-one, we might have saved some cost, but we would have carried over every governance and quality problem along with the data.
While migrating, we deliberately:
- retired tables that no one had used in months,
- collapsed redundant copies of the same dataset into a single Silver table, and
- rewrote inefficient SQL and stored-procedure logic.
That cleanup ended up creating more value than the cost savings did.