WHOOP Developer Platform - Designing a resilient webhook system

This post in the WHOOP Developer Platform series covers the design decisions behind a resilient, multitenant webhook solution. Check out the other posts in the series: “Designing the WHOOP API” and “Your body doesn’t know what day it is”.

At WHOOP, near real-time data is core to our systems. We process a stream of physiological data, such as heart rate, and distill it into insights. Providing time-relevant recommendations fits poorly into a standard API flow because it requires frequent polling of the endpoints for changes in the data. Many of those polling calls return data that was already known, because data like sleeps and workouts change only a few times per day for each user. The result is wasted work, both for your app that integrates with WHOOP and for our systems.

To address this problem for our API platform, we elected to implement the industry-standard pattern of webhooks. Resiliency was at the forefront of our minds while designing our webhook solution, and a few of the major aspects we considered are outlined below.

Multitenancy

There are a few different definitions of multitenancy, but at WHOOP we prefer to think of it as:

A user experiences the system as if they were the only user, regardless of how many use the system.

We wanted to design the solution such that one webhook listener could not negatively impact any other webhook listener. We could give each webhook listener its own dedicated infrastructure, but then costs would grow linearly with the number of listeners and quickly become prohibitively expensive.

So how do we meet this multitenancy requirement without dedicated infrastructure for every listener? At WHOOP we use Kafka streams for much of our data processing, so we can consume from one of our existing data streams to send webhooks. However, if we naively sent the webhooks directly from that stream consumer, we would need to send several different webhook POST requests for each event we consume. In that design, one webhook listener being down would block the processing of the rest of the stream, because streams require in-order processing.

Instead, for this application, we elected to enqueue the messages we need to send, which lets us handle each webhook request independently. We also set a relatively low maximum request duration, so that a system that takes a long time to respond does not block the processing of other events, further reinforcing our multitenant posture.
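The enqueue-then-deliver idea can be sketched as follows. This is a minimal illustration under stated assumptions, not our production code: the `Delivery` shape, the `fan_out` and `drain` functions, the in-memory `queue.Queue`, and the 5-second timeout are all hypothetical stand-ins. The point it shows is that the stream consumer only enqueues work, and a failing or slow listener affects only its own delivery.

```python
import queue
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Delivery:
    """One webhook POST owed to one listener (hypothetical shape)."""
    listener_url: str
    payload: dict


def fan_out(event: dict, listener_urls: List[str], deliveries: queue.Queue) -> None:
    """Stream consumer step: enqueue one job per listener, then move on.

    No network call happens here, so a slow listener cannot stall the stream.
    """
    for url in listener_urls:
        deliveries.put(Delivery(url, event))


def drain(
    deliveries: queue.Queue,
    send: Callable[[Delivery, float], bool],
    timeout_s: float = 5.0,  # assumed value; keep it low so slow listeners fail fast
) -> List[Delivery]:
    """Worker step: attempt each job on its own and collect the failures.

    `send` is an injected callable that performs the POST with the given
    timeout and returns True on success. Failed jobs are returned so that
    a retry policy can reschedule them.
    """
    failed = []
    while True:
        try:
            job = deliveries.get_nowait()
        except queue.Empty:
            return failed
        try:
            ok = send(job, timeout_s)
        except Exception:
            ok = False  # a timeout or connection error only fails this job
        if not ok:
            failed.append(job)
```

In a real deployment the queue would be a durable message queue and many workers would drain it concurrently; the single-threaded loop here only demonstrates that each delivery succeeds or fails independently.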

Retries

While designing the system, we kept in mind that at some point a webhook we send is going to fail. We need to prepare for a network blip, a third party being down, or any number of other issues. To that end, we knew from the start that we wanted to implement a reasonable retry policy.

We could have chosen to retry immediately upon encountering a failure, but for services struggling to keep up with the load, this would cause a retry storm and exacerbate the issue. By using exponential backoff with jitter instead, we avoid causing further issues for services attempting to recover from a period of downtime.
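As a rough illustration of exponential backoff with jitter, here is a minimal sketch of the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name `backoff_delay` and the `base_s` and `cap_s` values are assumptions made for the example, not the parameters our system uses.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base_s: float = 1.0,   # assumed base delay
    cap_s: float = 60.0,   # assumed upper bound on any single delay
    rng: Optional[random.Random] = None,
) -> float:
    """Return how long to wait before retry number `attempt` (0-based).

    The ceiling doubles with every attempt until it reaches `cap_s`.
    Drawing the delay uniformly from [0, ceiling] spreads retries out, so
    many failed deliveries do not all hit a recovering service at once.
    """
    generator = rng if rng is not None else random.Random()
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return generator.uniform(0.0, ceiling)
```

Without the random draw, every delivery that failed at the same moment would retry at the same moment too, which recreates the spike the backoff was meant to avoid.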

Our team elected to use 5 retries over the course of 1 hour to emphasize that our webhook solution is about data freshness rather than replacing the API entirely. For us, this seemed a sensible balance: it tolerates periods of lower availability in third-party apps without causing excessive work for our systems when a third-party app is down for an extended period. Retries are not indefinite, however. Build a process to handle missed events so that, even if your app is down for an extended time, you can recover the data you missed.
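One way an integrating app might recover missed events is a periodic reconciliation pass that asks the API for anything changed since the last successful sync. The sketch below assumes a hypothetical `fetch_updated_since` callable that wraps whichever API endpoint your app reads from, and assumes each record carries `id` and `updated_at` fields; neither is a documented WHOOP interface.

```python
from datetime import datetime
from typing import Callable, Dict, Iterable


def reconcile(
    store: Dict[str, dict],
    last_synced: datetime,
    fetch_updated_since: Callable[[datetime], Iterable[dict]],
) -> datetime:
    """Backfill records changed since `last_synced` and return the new watermark.

    `store` maps record id to the latest copy this app has seen. Records
    that are missing or older than the fetched copy are overwritten, so
    running the pass repeatedly is safe.
    """
    newest = last_synced
    for record in fetch_updated_since(last_synced):
        current = store.get(record["id"])
        if current is None or record["updated_at"] > current["updated_at"]:
            store[record["id"]] = record
        newest = max(newest, record["updated_at"])
    return newest
```

Running a pass like this on startup and on a slow schedule means webhooks keep the data fresh, while the API remains the source of truth after an outage.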

Conclusion

While covering every aspect of designing a system like this would be too much for one post, the considerations above were some of the major pieces that shaped the decisions behind our webhook system. Hopefully, this sheds some light on our decisions and highlights some of the major pieces you may want to consider when designing your own webhook system.

If you would like to utilize our webhook system for yourself, you can! Just log in to our developer dashboard to get started creating your app, and check out our webhook documentation for more info on our webhook system specifically.

If you are interested in building infrastructure at WHOOP, check out our open positions.
