
Glacierbase: Managing Iceberg Schema Migrations at Scale

7 minute read
Jack Leitch

Glacierbase brings version control and consistency to schema migrations across open table formats, such as Iceberg. It helps WHOOP safely evolve petabyte-scale tables through reviewable, environment-aware migrations that are integrated into our engineering workflow and designed for scale.

Introduction

The data platform at WHOOP operates at a massive scale, powering streaming, batch, and analytical workloads across hundreds of terabytes, and in some cases petabytes, of Iceberg tables. Ensuring schema consistency and table performance across this environment is an ongoing challenge. At this scale, even small inconsistencies in partition specs, table properties, or schema evolution can ripple through downstream systems, affecting everything from analytics to model training. As our Iceberg footprint expanded across Spark streaming, EMR batch pipelines, and Snowflake queries, the team needed a more disciplined way to manage how tables evolve. This led to the creation of Glacierbase, a framework for managing schema and table configuration migrations across any open table format, including Iceberg, Delta Lake, and Apache Hudi.

Why We Built Glacierbase

As the data lake at WHOOP grew, schema management became increasingly complex. Some tables required refined partitioning strategies for performance, and others evolved to support new features or ML pipelines. At this scale, a single misconfigured partition or inconsistent schema update could result in hours of additional compute time or terabytes of wasted reads. Historically, these changes were applied through manual Spark jobs or SQL scripts, which made it hard to answer questions like:

  • What’s the current schema for this table, and when did it change?
  • Have all environments applied the same updates?
  • Who made a particular change and why?

We needed a framework that would standardize and version control schema evolution: something lightweight and reliable, like Liquibase, but built for open table formats.

It’s worth noting that not every Iceberg table at WHOOP uses Glacierbase. Some catalogs are deliberately managed outside it, such as:

  • Ingestion tables that use Iceberg's schema merge where controlled schema drift is expected
  • CDC tables streaming from Postgres into Iceberg, where schemas are stable and automatically maintained by connectors

Glacierbase focuses on everything else: the high-value analytical and model-training datasets that live in our silver and gold layers, where schema change reproducibility, reliability, and traceability are critical. These include curated feature tables, model training datasets, and downstream metrics tables that power analytics and data science workloads across the organization.

How Glacierbase Works

Glacierbase treats table schema changes as immutable, versioned migrations. Each migration exists as a .sql file with metadata headers that describe the author and purpose of the change. An example migration can be seen below:

-- MIGRATION_DESCRIPTION: Increase commit resilience during high-concurrency backfills
-- MIGRATION_AUTHOR: Data Platform Team

ALTER TABLE catalog.namespace.table SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',
  'commit.retry.min-wait-ms' = '1000',
  'write.merge.isolation-level' = 'snapshot'
);

This metadata makes migrations self-documenting and reviewable, allowing reviewers to immediately see why a change was made and how it impacts performance.
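To make the header convention concrete, here is a minimal Python sketch that extracts the `MIGRATION_*` metadata from a migration file's text. The function name and return shape are illustrative assumptions, not Glacierbase's actual API.

```python
import re

def parse_migration_headers(sql_text: str) -> dict:
    """Collect '-- MIGRATION_<NAME>: value' header comments into a dict."""
    headers = {}
    for line in sql_text.splitlines():
        match = re.match(r"--\s*MIGRATION_(\w+):\s*(.+)", line)
        if match:
            # Lowercase the header name, e.g. DESCRIPTION -> description.
            headers[match.group(1).lower()] = match.group(2).strip()
    return headers

sql = """\
-- MIGRATION_DESCRIPTION: Increase commit resilience during high-concurrency backfills
-- MIGRATION_AUTHOR: Data Platform Team

ALTER TABLE catalog.namespace.table SET TBLPROPERTIES ('commit.retry.num-retries'='10');
"""

print(parse_migration_headers(sql))
```

A review tool or CLI could surface these fields directly in a pull request summary.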

Each catalog in a Glacierbase project has its own configuration file that defines how migrations in that catalog are executed. At the catalog level, you specify Spark configuration and dependencies (other backend runtimes are also supported, e.g., Snowflake). This ensures every migration in that catalog runs with the correct execution context. Below is example config for an Iceberg Spark migration executor:

migrationExecutor:
  type: spark
  conf:
    sparkConf:
      "spark.sql.catalog.glue": "org.apache.iceberg.spark.SparkCatalog"
      ...
    dependencies:
      - "org.apache.iceberg:iceberg-spark-runtime-x.x_x:y.y.y"
      ...

This model allows Glacierbase to support any open table format simply by swapping the Spark configuration at the catalog level.

Executing Migrations

Glacierbase provides both a Python API and a CLI, enabling teams to run migrations consistently across environments.

  • Run all pending migrations: glacierbase migrate
  • Run migrations for a single catalog: glacierbase migrate --catalog analytics
  • List pending migrations: glacierbase pending --catalog analytics

Each migration run logs execution metadata and ensures ordering. Glacierbase also acquires a lock before starting a migration, preventing concurrent schema updates on the same catalog. If another process is running, Glacierbase raises a clear concurrency error to guarantee atomic, isolated schema evolution.
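The locking behavior described above can be sketched with an atomic lock-file create, where a second process entering the same catalog fails fast with a concurrency error. This is an illustrative mechanism only; Glacierbase's actual lock implementation is internal and may differ (for example, a database-backed catalog lock).

```python
import os
from contextlib import contextmanager

class ConcurrentMigrationError(RuntimeError):
    """Raised when another migration run already holds the catalog lock."""

@contextmanager
def catalog_lock(lock_path: str):
    try:
        # O_CREAT | O_EXCL fails atomically if the lock file already exists,
        # so only one process can hold the lock at a time.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise ConcurrentMigrationError(
            f"Another migration run holds the lock at {lock_path}"
        )
    try:
        yield
    finally:
        # Always release the lock, even if a migration fails mid-run.
        os.close(fd)
        os.remove(lock_path)

# Usage: apply pending migrations only while holding the catalog's lock.
with catalog_lock(os.path.join(os.getcwd(), "analytics.glacierbase.lock")):
    pass  # run pending migrations in order here
```

Because the lock is released in a `finally` block, a failed migration never leaves the catalog permanently locked.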

Key Features

  • Support for all open table formats: Glacierbase isn’t limited to Iceberg. The catalog configuration controls which backend runtime engine and dependencies are used for migrations, making the tool format-agnostic.
  • Environment-based variable injection: Migrations can include environment-dependent constants using templated variables. For example, Iceberg’s hidden bucket partition size can vary across environments:

    partitioned by (
      day(timestamp),
      bucket({{ .variables.catalog.namespace.tableName.bucketSize }}, id)
    )

    This lets us tune partitioning for each environment without maintaining separate migration files.

  • Immutable migrations: Once a migration has been executed, Glacierbase stores its file hash. If that hash changes, the system raises an error. Any update must be introduced as a new migration file, which preserves auditability and reproducibility.
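As a rough illustration of variable injection, a minimal renderer for the `{{ .variables... }}` placeholders might look like the following, assuming a flat lookup table keyed by the dotted path inside the braces. Glacierbase's actual template engine and variable resolution may work differently.

```python
import re

def render_migration(sql: str, variables: dict) -> str:
    """Replace '{{ .dotted.path }}' placeholders with values from `variables`."""
    def replace(match: re.Match) -> str:
        key = match.group(1)  # e.g. "variables.catalog.namespace.tableName.bucketSize"
        return str(variables[key])
    return re.sub(r"\{\{\s*\.(\S+)\s*\}\}", replace, sql)

# Hypothetical per-environment values: smaller buckets in dev, larger in prod.
dev_vars = {"variables.catalog.namespace.tableName.bucketSize": 4}
sql = "bucket({{ .variables.catalog.namespace.tableName.bucketSize }}, id)"
print(render_migration(sql, dev_vars))  # bucket(4, id)
```

Swapping in a prod variables file would yield the same migration text with a different bucket count, which is exactly why one migration file can serve every environment.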
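The immutability check above can be sketched as: record a content hash the first time a migration executes, then refuse to run if the file's hash no longer matches. The SHA-256 choice and the in-memory store here are assumptions for illustration; Glacierbase persists its hashes internally.

```python
import hashlib

def migration_hash(sql_text: str) -> str:
    """Content hash of a migration file (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(sql_text.encode("utf-8")).hexdigest()

# Stand-in for durable storage: migration id -> hash recorded at first execution.
applied_hashes = {}

def verify_immutable(migration_id: str, sql_text: str) -> None:
    current = migration_hash(sql_text)
    recorded = applied_hashes.setdefault(migration_id, current)
    if recorded != current:
        raise RuntimeError(
            f"Migration {migration_id} was modified after execution; "
            "introduce the change as a new migration file instead."
        )

# First execution records the hash; re-running identical content is fine.
verify_immutable("V001__tune_commit_retries.sql", "ALTER TABLE t SET TBLPROPERTIES (...);")
```

Editing an already-applied file would trip the check, forcing the change into a fresh, reviewable migration.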

Integrating Glacierbase into Our Workflow

Glacierbase fits seamlessly into our existing engineering processes. A typical lifecycle looks like this:

  1. A new feature or model requires a schema update.
  2. The engineer creates a migration file alongside the code change.
  3. The pull request includes both the migration and the related code, ensuring alignment between logic and data structure.
  4. Once approved, the PR is merged and deployed to dev, then to prod (with future CI/CD automation planned).

This process ensures all schema changes are peer-reviewed, tested, and consistently deployed with the same rigor applied to our application code.

Why This Matters for Iceberg

At the scale that WHOOP operates, schema and partition correctness directly determine performance and reliability. Even minor misconfigurations can lead to:

  • Full table scans instead of predicate pushdowns
  • Ballooning metadata file sizes due to misconfigured write.metadata.metrics.* properties
  • Schema drift between environments that breaks downstream consumers

Glacierbase helps prevent these issues by enforcing version control and consistency. Combined with Iceberg’s atomic commits, it provides a safe, auditable, and automated way to evolve tables across environments.

The Builder Mindset

At WHOOP, we build tools that make complex systems simpler and safer. Glacierbase embodies that mindset as a disciplined, auditable framework that transforms schema management into a version-controlled process.

By treating table migrations as code, Glacierbase brings reliability and scalability to our Iceberg ecosystem.

Future Directions

We plan to continue expanding Glacierbase as a foundational part of the data lake ecosystem at WHOOP. Two major areas of focus are in progress:

  1. CI/CD Integration for Migrations: Fully automate the migration lifecycle through CI/CD. The goal is to have migrations executed automatically as part of the deployment pipeline, validated in test environments, and then applied to production when changes are merged to main. This will make schema evolution completely continuous, reducing operational overhead and ensuring that environments stay perfectly aligned.

  2. Migration to the Polaris REST Iceberg Catalog: We plan to move our Iceberg catalogs to Snowflake Open Catalog, which is built on Apache Polaris. Polaris will allow us to enforce more granular RBAC controls across the data lake. For example, we’ll be able to ensure that only Glacierbase can perform structural changes, aside from designated break-glass roles. This separation of privileges will strengthen governance, improve safety, and make schema management even more robust as our data platform continues to scale.


Love working with huge amounts of data and building systems that push technical limits? The Data Platform team at WHOOP is always exploring new ways to scale, optimize, and innovate — check out our open positions to be part of it.