2 posts tagged with "data lake"


Sam Bart · 8 min read

Introduction

At WHOOP, we process large amounts of data to deliver on our mission to unlock human performance and Healthspan. Despite staying well below documented rate limits, one of our highest throughput S3 buckets experienced millions of 503 "slow down" errors per month—failures that we wouldn’t have expected to see consistently.

This post details how we reduced these errors by more than 99.99% through a simple yet counterintuitive solution: reversing the IDs in our S3 key partitioning strategy. Our write success rate went from ~99.9% (three nines) to ~99.999999% (eight nines)—a more than 10,000x improvement in error rate.

Background: Data Pipeline at WHOOP

WHOOP securely stores petabytes of data in S3, serving as a source of truth for the member experience, analytics, and ML models. Historically, slight delays in data availability were acceptable due to our hot storage layers handling real-time needs. However, as part of a major architecture redesign, we needed this S3-backed data to be both highly reliable and as current as possible, making even the 0.1% error rate we were seeing unacceptable for our growing scale.

The Problem: Millions of 503 Slow Down Errors

Our original S3 key structure prioritized temporal organization, which seemed logical for our analytics queries:

date=YYYY-MM-DD/id=12345/file_specifics

This pattern worked well initially, but as we set out to rearchitect other aspects of our system, we ran into a critical issue: relatively frequent 503 slow down errors during data ingestion. Fewer than 0.1% of our write requests hit these S3 server errors, but at our scale that still translated to millions of failures and retries every month.

This was puzzling. According to AWS documentation, S3 supports at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second (RPS) per partitioned prefix. We included IDs in our S3 prefix and were definitely making fewer than 3,500 RPS per ID. So why were we consistently hitting these errors?

Understanding Dynamic Partitioning in S3

The root cause lies in how S3 handles partitioning. The documented 3,500 RPS limit is per partitioned prefix. However, S3 doesn't create every possible partition immediately—instead, it dynamically creates new partitions character by character based on observed traffic patterns.

Here's what happens throughout a day with our date-first key pattern:

Start:        date=YYYY-MM-DD/*          ← Single partition

Later:        date=YYYY-MM-DD/id=1*      ← Split on first digit
              date=YYYY-MM-DD/id=2*
              ...

Even later:   date=YYYY-MM-DD/id=12*     ← Further splits
              ...

The critical issue: when the date changes at midnight, all the fine-grained partitions S3 created throughout the previous day are no longer relevant. The new day starts with a single partition, and S3 has to dynamically split it all over again.

During high-traffic periods before sufficient partitions existed, we'd hit rate limits and receive 503 errors. We were stuck in a cycle: S3 would eventually partition deeply enough to handle our load, but we'd lose all that partitioning at midnight and start over, hitting 503 errors again once the traffic got high enough.

Evaluating Solutions

The goal was clear: maintain stable partitions across day boundaries while preserving the ability to scale to thousands of requests per second per ID, guaranteeing we never hit a bottleneck there. We considered several approaches; the major ones are outlined below.

S3 Express / Directory Buckets

AWS launched S3 Express One Zone (directory buckets) roughly two years ago as of this writing. These buckets solve the dynamic partitioning problem by provisioning full throughput from the start: 200k read TPS / 100k write TPS immediately available, with no gradual scaling. Additionally, they offer lower latency and cheaper per-request pricing.

Tradeoffs:

  1. Storage costs: Significantly more expensive per GB than Standard S3
  2. Single-AZ durability: Directory buckets only support S3 Express One Zone storage class, storing data in a single availability zone. Our standard storage class provides redundancy across a minimum of 3 AZs.

Much like we never compromise on security or privacy for our members, we also couldn’t afford to compromise on durability. There are ways to mitigate that durability risk with additional architectural complexity, but it wasn't a tradeoff we wanted to make, and ultimately we identified better options for our use case.

ID First Partitioning

The most obvious solution: swap the prefix order to put IDs first.

id=12345/date=YYYY-MM-DD/file_specifics

This would preserve partitions across day boundaries, allowing S3 to maintain per-ID partitions indefinitely and scale up to 3,500/5,500 RPS per ID per day as needed.

Tradeoffs:

  1. Query pattern impact: This changes which queries are efficient. Instead of "show all IDs that have data for a given day" (efficient with date-first), we optimize for "show all days that a given ID has data" (efficient with ID-first). For our primary workload, this was acceptable—most queries specify both ID and date, and for ad-hoc analytics we have other systems. (Check out our Glacierbase blog post for a look into some of those other systems.)
  2. Benford's Law problem: The more serious issue. According to Benford's law, naturally occurring datasets often exhibit a logarithmic distribution of leading digits. Approximately 30% of numbers start with "1" while only ~4.5% start with "9".

Figure: Expected traffic distribution for data split by the first digit, according to Benford's law.

During normal operation, this imbalance is manageable due to the dynamic S3 partitioning. However, during high-load events—backfills after schema changes, recovery from outages, or seasonal spikes in member activity—the lower-digit partitions would likely hit rate limits while higher-digit partitions remained underutilized. We needed a more balanced distribution.
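
For a concrete sense of the skew, Benford's law gives the expected share of IDs whose first digit is d as log10(1 + 1/d). Below is a minimal Python sketch (illustrative only, not WHOOP code) that prints the expected traffic share for each first-digit prefix:

import math

# Benford's law: P(leading digit = d) = log10(1 + 1/d)
for d in range(1, 10):
    share = math.log10(1 + 1 / d)
    print(f"id={d}*  expected share of traffic: {share:.1%}")

# Prints ~30.1% for id=1* down to ~4.6% for id=9*, so a split on the
# first digit of the ID leaves the partitions badly unbalanced.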

Hash-Based Partitioning

To solve the Benford's law imbalance, we considered hashing IDs to randomize the leading digit:

id_hashed=a1b2c3d4/date=YYYY-MM-DD/file_specifics

Benefits:

  • Perfectly balanced distribution across partitions
  • Maintains stable per-ID partitions across days
  • Scales to full 3,500/5,500 RPS per ID per day
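
For concreteness, here is a rough sketch of hash-based key construction (the helper name and the 8-character prefix length are illustrative assumptions, not our production code):

import hashlib

def hashed_key(record_id: int, date: str, filename: str) -> str:
    # Derive a short, uniformly distributed prefix from the ID.
    prefix = hashlib.sha256(str(record_id).encode()).hexdigest()[:8]
    return f"id_hashed={prefix}/date={date}/{filename}"

# hashed_key(12345, "2024-01-01", "file_specifics")
# -> "id_hashed=<8 hex chars>/date=2024-01-01/file_specifics"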

Downside: a relatively minor one, but worth noting. Hashing makes human traversal of the data nearly impossible; to locate a specific piece of data, you'd first need to compute its hash. While WHOOP engineers rarely need to manually browse S3 keys, we wondered if there was a solution that preserved all the benefits of hashing while keeping keys human-computable.

Reversed ID Partitioning (The Solution)

Our final approach is remarkably simple: reverse the ID digits.

id_reversed=54321/date=YYYY-MM-DD/file_specifics

Why this works:

Unlike leading digits, the final digit of numbers in naturally occurring datasets is uniformly distributed—it doesn't follow Benford's law. By reversing IDs, we place this uniformly distributed digit first, achieving:

  • Balanced distribution - Each starting digit receives ~10% of traffic
  • Stable partitions - ID-first ordering preserves partitions across days
  • Full scalability - Maintains 3,500/5,500 RPS per ID per day capacity
  • Human-readable keys - Engineers can still navigate the data structure

To find the data for ID 12345, simply look under id_reversed=54321/. No hash lookups, no complex calculations—just basic digit reversal that any engineer can perform mentally during debugging.
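
A minimal sketch of the key construction and the reverse lookup (hypothetical helper names; the production pipeline naturally does more than this):

def reversed_id_key(record_id: int, date: str, filename: str) -> str:
    # Reverse the decimal digits so the uniformly distributed final
    # digit becomes the leading character of the prefix.
    return f"id_reversed={str(record_id)[::-1]}/date={date}/{filename}"

def original_id(reversed_id: str) -> int:
    # Reversal is its own inverse, so recovering the ID is trivial.
    return int(reversed_id[::-1])

# reversed_id_key(12345, "2024-01-01", "file_specifics")
# -> "id_reversed=54321/date=2024-01-01/file_specifics"
# original_id("54321") -> 12345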

Implementation and Migration

Migrating petabytes of production data to a new key structure required careful planning and execution:

  1. Dual-write period: We began writing new data to both the old and new key patterns simultaneously
  2. Backfill pipeline: Utilized Lambda and S3 Batch Operations to move and transform existing object keys (a sketch of the key transform follows this list)
  3. Validation: Verified data integrity and completeness throughout the migration
  4. Gradual cutover: Transitioned read traffic incrementally to build confidence
  5. Legacy cleanup: Removed old keys only after confirming all systems operated correctly with the new pattern
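
As a rough illustration of the key rewrite in step 2 (hypothetical layout and helper, not the actual Lambda code), the per-object transform is essentially:

def transform_key(old_key: str) -> str:
    # Old layout: date=YYYY-MM-DD/id=12345/file_specifics
    # New layout: id_reversed=54321/date=YYYY-MM-DD/file_specifics
    date_part, id_part, rest = old_key.split("/", 2)
    record_id = id_part.split("=", 1)[1]
    return f"id_reversed={record_id[::-1]}/{date_part}/{rest}"

# transform_key("date=2024-01-01/id=12345/file_specifics")
# -> "id_reversed=54321/date=2024-01-01/file_specifics"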

The entire migration completed successfully with full data integrity, continuous uptime, an uninterrupted member experience, and minimal operational overhead.

Results

The impact exceeded our expectations:

  • Error reduction: Millions of monthly 503 errors → ~50 errors per month
  • Reliability improvement: ~99.9% → ~99.999999% success rate (three nines to eight nines)
  • Burst handling: A 10x backfill spike caused zero additional 503 errors

Figure: Request distribution across the ten starting digits (0-9) after implementing reversed ID partitioning.

As shown in the graph above, our request balancing works exceptionally well. Each line represents requests to one starting digit, and their proximity shows balanced traffic. No single partition is overwhelmed, and S3's dynamic partitioning can scale each prefix independently if required.

Conclusion

By understanding S3's dynamic partitioning behavior and applying a simple key reversal strategy, WHOOP achieved a more than 10,000x improvement in error rates—from three nines to eight nines reliability. This required no additional infrastructure costs, no compromise on multi-AZ data durability, and immediately resolved the issue once the migration completed.

The lesson: sometimes the most elegant engineering solutions come from deeply understanding your infrastructure's behavior rather than adding layers of complexity. When facing scale challenges, invest time in understanding the "why" before jumping to expensive or complex solutions.


Interested in solving challenging infrastructure problems at scale? Join the WHOOP engineering team! Check out our open positions.

Jack Leitch · 7 min read

Glacierbase brings version control and consistency to schema migrations across open table formats, such as Iceberg. It helps WHOOP safely evolve petabyte-scale tables through reviewable, environment-aware migrations that are integrated into our engineering workflow and designed for scale.

Introduction

The data platform at WHOOP operates at a massive scale, powering streaming, batch, and analytical workloads across hundreds of terabytes, and in some cases petabytes, of Iceberg tables. Ensuring schema consistency and table performance across this environment is an ongoing challenge. At this scale, even small inconsistencies in partition specs, table properties, or schema evolution can ripple through downstream systems, affecting everything from analytics to model training. As our Iceberg footprint expanded across Spark streaming, EMR batch pipelines, and Snowflake queries, the team needed a more disciplined way to manage how tables evolve. This led to the creation of Glacierbase, a framework for managing schema and table configuration migrations across any open table format.

Why We Built Glacierbase

As the data lake at WHOOP grew, schema management became increasingly complex. Some tables required refined partitioning strategies for performance, and others evolved to support new features or ML pipelines. At this scale, a single misconfigured partition or inconsistent schema update could result in hours of additional compute time or terabytes of wasted reads. Historically, these changes were applied through manual Spark jobs or SQL scripts, which made it hard to answer questions like:

  • What’s the current schema for this table, and when did it change?
  • Have all environments applied the same updates?
  • Who made a particular change and why?

We needed a framework that would standardize and version control schema evolution, something lightweight but reliable, like Liquibase, but for open table formats.

It’s worth noting that not every Iceberg table at WHOOP uses Glacierbase. Some catalogs are deliberately managed outside it, such as:

  • Ingestion tables that use Iceberg's schema merge where controlled schema drift is expected
  • CDC tables streaming from Postgres into Iceberg, where schemas are stable and automatically maintained by connectors

Glacierbase focuses on everything else: the high-value analytical and model-training datasets that live in our silver and gold layers, where schema change reproducibility, reliability, and traceability are critical. These include curated feature tables, model training datasets, and downstream metrics tables that power analytics and data science workloads across the organization.

How Glacierbase Works

Glacierbase treats table schema changes as immutable, versioned migrations. Each migration exists as a .sql file with metadata headers that describe the author and purpose of the change. An example migration can be seen below:

-- MIGRATION_DESCRIPTION: Increase commit resilience during high-concurrency backfills
-- MIGRATION_AUTHOR: Data Platform Team

ALTER TABLE catalog.namespace.table SET TBLPROPERTIES (
'commit.retry.num-retries'='10',
'commit.retry.min-wait-ms'='1000',
'write.merge.isolation-level'='snapshot'
);

This metadata makes migrations self-documenting and reviewable, allowing reviewers to immediately see why a change was made and how it impacts performance.

Each catalog in a Glacierbase project has its own configuration file that defines how migrations in that catalog are executed. At the catalog level, you specify the Spark configuration and dependencies (other backend runtimes, such as Snowflake, are also supported). This ensures every migration in that catalog runs with the correct execution context. Below is an example config for an Iceberg Spark migration executor:

migrationExecutor:
  type: spark
  conf:
    sparkConf:
      "spark.sql.catalog.glue": "org.apache.iceberg.spark.SparkCatalog"
      ...
    dependencies:
      - "org.apache.iceberg:iceberg-spark-runtime-x.x_x:y.y.y"
      ...

This model allows Glacierbase to support any open table format simply by swapping the Spark configuration at the catalog level.

Executing Migrations

Glacierbase provides both a Python API and a CLI, enabling teams to run migrations consistently across environments.

  • Run all pending migrations: glacierbase migrate
  • Run migrations for a single catalog: glacierbase migrate --catalog analytics
  • List pending migrations: glacierbase pending --catalog analytics

Each migration run logs execution metadata and ensures ordering. Glacierbase also acquires a lock before starting a migration, preventing concurrent schema updates on the same catalog. If another process is running, Glacierbase raises a clear concurrency error to guarantee atomic, isolated schema evolution.
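
Conceptually, a migration run follows a lock-then-apply pattern. The sketch below is a generic illustration with hypothetical objects (catalog, lock, executor), not Glacierbase's actual implementation:

class ConcurrentMigrationError(RuntimeError):
    """Raised when another process already holds the catalog lock."""

def run_pending_migrations(catalog, lock, executor):
    # Refuse to run if another migration is already in flight.
    if not lock.try_acquire(catalog.name):
        raise ConcurrentMigrationError(
            f"Another migration is already running against '{catalog.name}'"
        )
    try:
        # Apply pending migrations strictly in order, recording each run.
        for migration in sorted(catalog.pending_migrations(), key=lambda m: m.version):
            executor.execute(migration)
            catalog.record_execution(migration)
    finally:
        lock.release(catalog.name)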

Key Features

  • Support for all open table formats: Glacierbase isn’t limited to Iceberg. The catalog configuration controls which backend runtime engine and dependencies are used for migrations, making the tool format-agnostic.
  • Environment-based variable injection: Migrations can include environment-dependent constants using templated variables. For example, Iceberg’s hidden bucket partition size can vary across environments:

    partitioned by (
      day(timestamp),
      bucket({{ .variables.catalog.namespace.tableName.bucketSize }}, id)
    )

    This lets us tune partitioning for each environment without maintaining separate migration files.

  • Immutable migrations: Once a migration has been executed, Glacierbase stores its file hash. If that hash changes, the system raises an error. Any update must be introduced as a new migration file, which preserves auditability and reproducibility.
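
The immutability check itself is simple. Here is a generic sketch of the idea (not Glacierbase's code; where the recorded hash is stored and looked up is out of scope here):

import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    # Hash the migration file exactly as it exists on disk.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_immutable(path: Path, recorded_hash: str) -> None:
    # An already-executed migration file must never change; any edit
    # has to ship as a new migration file instead.
    if file_sha256(path) != recorded_hash:
        raise ValueError(f"Migration {path.name} was modified after execution")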

Integrating Glacierbase into Our Workflow

Glacierbase fits seamlessly into our existing engineering processes. A typical lifecycle looks like this:

  1. A new feature or model requires a schema update.
  2. The engineer creates a migration file alongside the code change.
  3. The pull request includes both the migration and the related code, ensuring alignment between logic and data structure.
  4. Once approved, the PR is merged and deployed to dev, then to prod (with future CI/CD automation planned).

This process ensures all schema changes are peer-reviewed, tested, and consistently deployed with the same rigor applied to our application code.

Why This Matters for Iceberg

At the scale that WHOOP operates, schema and partition correctness directly determine performance and reliability. Even minor misconfigurations can lead to:

  • Full table scans instead of predicate pushdowns
  • Ballooning of metadata file sizes due to incorrectly configured write.metadata.metrics
  • Schema drift between environments that breaks downstream consumers

Glacierbase helps prevent these issues by enforcing version control and consistency. Combined with Iceberg’s atomic commits, it provides a safe, auditable, and automated way to evolve tables across environments.

The Builder Mindset

At WHOOP, we build tools that make complex systems simpler and safer. Glacierbase embodies that mindset as a disciplined, auditable framework that transforms schema management into a version-controlled process.

By treating table migrations as code, Glacierbase brings reliability and scalability to our Iceberg ecosystem.

Future Directions

We hope to continue to expand Glacierbase as a foundational part of the data lake ecosystem at WHOOP. Two major areas of focus are in progress:

  1. CI/CD Integration for Migrations: Fully automate the migration lifecycle through CI/CD. The goal is to have migrations executed automatically as part of the deployment pipeline, validated in test environments, and then applied to production when changes are merged to main. This will make schema evolution completely continuous, reducing operational overhead and ensuring that environments stay perfectly aligned.

  2. Migration to a REST Iceberg Catalog: We plan to move our Iceberg catalogs to a REST-based Iceberg catalog. This will allow us to enforce more granular RBAC controls across the data lake. For example, we'll be able to ensure that only Glacierbase can perform structural changes (aside from designated break-glass roles). This separation of privileges will strengthen governance, improve safety, and make schema management even more robust as our data platform continues to scale.


Love working with huge amounts of data and building systems that push technical limits? The Data Platform team at WHOOP is always exploring new ways to scale, optimize, and innovate — check out our open positions to be part of it.