
9 posts tagged "ai"


· 9 min read
Ferguson Watkins
Danny Grimmig
Viviano Cantu

A few months ago, we wanted to build a context-aware AI coach, one that could tailor advice based on what screen you're viewing in the WHOOP app. Sleep screen? Recovery tips. Activity screen? Training guidance. Sounds simple, but prompt management was becoming a bottleneck.

The prompts couldn't resolve logic on their own, so logic had to either be wired up in code or left up to the model to figure out: "You are WHOOP, a personal wellness and fitness assistant. If the member is looking at Sleep, do X. If they're looking at Strain, do Y..."

What if the prompt itself could be dynamic? What if we could write modular, reusable prompt components that assembled themselves at runtime?

That's HPML (Hyper Prompt Markup Language), a templating language purpose-built for dynamic, composable LLM prompts.
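The idea of prompts that assemble themselves at runtime can be sketched in plain Python (this is not HPML itself, whose syntax isn't shown in this excerpt; the component names and screens below are illustrative):

```python
# A minimal sketch of runtime prompt composition: modular, reusable prompt
# components selected based on app context. Names here are hypothetical,
# not WHOOP's actual prompts.

BASE = "You are WHOOP, a personal wellness and fitness assistant."

SCREEN_COMPONENTS = {
    "sleep": "The member is viewing their Sleep screen. Focus on recovery tips.",
    "strain": "The member is viewing their Strain screen. Focus on training guidance.",
}

def assemble_prompt(screen: str) -> str:
    """Compose the system prompt from reusable parts at runtime."""
    parts = [BASE]
    component = SCREEN_COMPONENTS.get(screen)
    if component:
        parts.append(component)
    return "\n\n".join(parts)

print(assemble_prompt("sleep"))
```

The point of a dedicated templating language is to move this kind of branching out of application code and into the prompt itself.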

· 10 min read
Geoff Cisler
Douglas Schonholtz
Mihir Walvekar
Ferguson Watkins

Across the industry, we see AI features shipped on hope alone. At WHOOP, we ship with data and security in mind. In order to support over 500 unique agents, we built an evaluation framework that treats LLMs like the statistical, noisy systems they are. Here's exactly how it works.

The Problem

We built AI Studio to enable anyone at WHOOP to develop and interact with our homegrown Agents, resulting in an explosion of more than 500 of them across virtually every screen in the app. But as we reduced the friction to build Agents, the new bottleneck became testing them. Manual dogfooding turns into an endless game of whack-a-mole: a new Agent might work well for most cases, but for some percentage of events it might save incorrect dates or store trivial information that doesn't add member value. And once we identify a problem, prompt tweaking doesn't tell us whether it's truly fixed or only fixed in the cases we happened to test manually.

The underlying truth is that an LLM is a statistical, non-deterministic machine: you cannot test it like traditional software. You MUST track true and false positives, error rates, and acceptable failure thresholds at scale. That's what we built.
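As a sketch of that mindset, here's how binary agent decisions (e.g. "save a memory or not") might be scored against labeled ground truth and gated on a failure threshold. The threshold is a made-up number, not WHOOP's:

```python
# Illustrative: treat the Agent as a statistical system and measure
# true/false positives, error rate, and an acceptable failure threshold.

def score(predictions: list[bool], ground_truth: list[bool]) -> dict:
    pairs = list(zip(predictions, ground_truth))
    tp = sum(p and g for p, g in pairs)      # true positives
    fp = sum(p and not g for p, g in pairs)  # false positives
    fn = sum(g and not p for p, g in pairs)  # false negatives
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "error_rate": (fp + fn) / len(pairs),
    }

MAX_ERROR_RATE = 0.30  # hypothetical acceptable failure threshold

metrics = score([True, True, False, True], [True, False, False, True])
print(metrics, "ship" if metrics["error_rate"] <= MAX_ERROR_RATE else "iterate")
```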

Starting Small, Then Scaling

We started humbly two years ago in spreadsheets, creating thousands of test questions, expected answers, and synthetic "Personas" (reproducible synthetic data with profiles like an IN_THE_GREEN member with 15 Green Recoveries above 80%). This spreadsheet-driven framework supported our team and our single main Agent for over a year, but it grew painful as we hit the functional limits of spreadsheets and the inflexibility of our hardcoded metrics. And as the number of Agents exploded, we needed something that could scale.

The Next Generation

To solve these scaling problems, we built a dedicated evaluation platform directly into AI Studio, our internal tool for building and managing secure Agents.

This new framework moved us away from static spreadsheets and into a dynamic, integrated workflow. Now, instead of copy-pasting rows, we can:

  • Define Input Sets: Simulating real member interactions requires complex inputs, so we built a service that creates synthetic member templates called "Personas," each with a different data profile, and a tool that simulates back-and-forth conversations with an agent. Inputs can also be plain text and images, optionally paired with an expected ground-truth answer. Combined, these form a reusable Input Set that can be shared across different Agents.
  • Run Evaluations on Demand: With a single click, an Engineer or Product Manager can run a full suite of tests with a new prompt version.
  • Customize Metrics: Anyone can easily design a new metric from a robust list of metric types, both LLM-based and traditional text analysis. These metrics range from validating that a response includes a follow-up question, to checking that the data in a response is correct, to verifying that a sub-agent or tool was called in creating the response. All of them help maintain a high level of quality and security across our Agents!
  • Analyze Results in Real-Time: Aggregated metrics results and individual trace-level details are available immediately, allowing for rapid iteration cycles. The results can be filtered and parsed to identify exactly where the agent fell short and where it excelled!

This unlocked fast, repeatable evaluations for every Agent we ship, ensuring that we weren't just "feeling good" about a change, but actually measuring its impact and enforcing testing and security gates before anything reaches members. This was crucial for our big feature rollout: Memory.

An Example Evaluation: Memory

We have an LLM Agent which saves memory "nuggets" as a member interacts with WHOOP in the app. The goal is to accumulate personalized context for WHOOP to reference wherever it may be valuable. On every message from the member, the Agent asks, "Is this worth remembering? If so, remember it." The Agent will store the memory, along with any 'start' and 'end' dates it was able to derive. We then pull any generated context into WHOOP for every conversation in the app. We filter based on 'start' and 'end' dates, only pulling in 90 days' worth of context, and if there is no date associated, we pull it in "just in case."
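The retrieval filter described above might look like the following sketch. The nugget shape (dicts with optional 'start'/'end' dates) and the exact window semantics are assumptions for illustration:

```python
from datetime import date, timedelta

# Keep nuggets from the last 90 days; keep undated nuggets "just in case".

WINDOW = timedelta(days=90)

def relevant_nuggets(nuggets: list[dict], today: date) -> list[dict]:
    cutoff = today - WINDOW
    keep = []
    for n in nuggets:
        start, end = n.get("start"), n.get("end")
        if start is None and end is None:
            keep.append(n)       # no dates: pull in just in case
        elif end is not None and end < cutoff:
            continue             # ended before the 90-day window: skip
        else:
            keep.append(n)
    return keep
```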

Before rolling this out to members, we used our new Evaluation platform to stress-test the Agent and make sure it was ready.

Discovering the Issue

During pre-launch testing, we discovered the Agent was too ambitious. Left unchecked, it would have:

  • Saved memories on nearly every interaction, most without 'start' or 'end' dates -- meaning they'd accumulate indefinitely.
  • Stored context nuggets that weren't remembering anything important -- not useful as personalization.

Given these problems, we decided to spin up an Evaluation.

Creating Metrics

We used a recall-style metric (how often we saved what we should have saved) to compare the ground truth of what we expected to be saved against what was actually saved. We also created new LLM-as-a-judge and tool-calling verification metrics to confirm that this Agent's results matched our expectations.
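An LLM-as-a-judge metric can be sketched as follows. Here `call_llm` is a placeholder for whatever model client is in use, and the rubric and PASS/FAIL output format are assumptions, not WHOOP's actual metric definitions:

```python
# Minimal LLM-as-a-judge sketch: ask a model to grade the agent's output
# against the ground truth and parse a PASS/FAIL verdict.

JUDGE_PROMPT = """You are grading a memory agent.
Ground truth (what should be saved): {expected}
What the agent actually saved: {actual}
Answer PASS if they match in substance, otherwise FAIL."""

def judge(expected: str, actual: str, call_llm) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(expected=expected, actual=actual))
    return verdict.strip().upper().startswith("PASS")
```

Injecting `call_llm` as a parameter also makes the metric itself testable with a stubbed model.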

Example metric creation page. We support an LLM-as-a-judge metric and a variety of pattern matching across different message types.

Input Sets

An anonymized selection of real questions from our team (expanded with AI assistance) gave us hundreds of questions covering member queries in three categories:

  • Definitely should be remembered ("I am training for a half-marathon in November")
  • Definitely should NOT be remembered ("What was my Recovery score this morning?")
  • General questions ("Is WHOOP waterproof?")

We then went through and created expected behaviors (ground truths) for each of these questions. This allowed us to evaluate Should Save Context Rate: Should the Agent save something? What should the Agent have saved?

Getting a Baseline

With some metrics created, we ran an initial baseline against the unreleased Agent and confirmed what we suspected: Memory wasn't ready to ship.


Pre-launch baseline: the Agent was remembering something 100% of the time, and not saving dates -- so memories would never expire!

Metric                   | Rate  | Notes
Did Save Context Rate    | 99%   | Far too aggressive! It's saving on nearly every interaction
Should Save Context Rate | 34%   | Only 34% alignment with expected behavior
Include Start Date Rate  | 43.4% | Room for improvement on temporal context
Include End Date Rate    | 8.1%  | Almost never setting expiration; memories accumulate indefinitely

Iterating: When "Better" Wasn't Better

Here's where the framework proved its value. After some prompt tweaks and local tests, things felt like they were improving. Many teams would ship at this point, run an A/B test, and wait the requisite few days to see whether metrics improve.

At WHOOP, we run an Evaluation, and this time it caught a critical regression before it reached a single member:

Metric                   | Rate  | Notes
Did Save Context Rate    | 100%  | Even more aggressive after changes
Should Save Context Rate | 31%   | Worse than baseline
Include Start Date Rate  | 29.0% | Regressed
Include End Date Rate    | 15.0% | One bright spot; this improved

We thought we were improving, but the data showed otherwise. This is exactly the kind of silent regression that plagues AI deployments across the industry. Teams could ship changes that "feel right" only to discover problems weeks later through member feedback, and we want to avoid that.

Our Evaluation framework caught this in minutes rather than days or weeks, and we kept iterating until we actually improved. After a few more iterations, we got to these numbers:

Metric                   | Rate  | Notes
Did Save Context Rate    | 46%   | Much more selective, saving only when meaningful
Should Save Context Rate | 84%   | Much stronger alignment with expected behavior
Include Start Date Rate  | 37.0% | 37% of all interactions include a start date; dividing by the 46% save rate means ~80% of saved nuggets include start dates
Include End Date Rate    | 12.0% | 12% of all interactions include an end date; dividing by the 46% save rate means ~26% of saved nuggets include end dates
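The conditional-rate arithmetic from those notes, spelled out (the date-inclusion rates are measured over all interactions, so dividing by the save rate gives the share of saved nuggets that carry a date):

```python
# Converting overall rates into rates conditional on a memory being saved.
did_save = 0.46            # Did Save Context Rate
start_date_overall = 0.37  # Include Start Date Rate (over all interactions)
end_date_overall = 0.12    # Include End Date Rate (over all interactions)

start_given_saved = start_date_overall / did_save  # ~0.80
end_given_saved = end_date_overall / did_save      # ~0.26

print(f"{start_given_saved:.0%} of saved nuggets include a start date")
print(f"{end_given_saved:.0%} of saved nuggets include an end date")
```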

Beyond the numbers, careful review of individual traces showed that the content of the context nuggets themselves got smarter, remembering meaningful details rather than noise.

The Outcome

We never shipped these issues to production: the Evaluation framework caught them all during development, and only after we were confident in the numbers did we roll Memory out. We've since added several more layers that make it even more robust.

Beyond Test Suites: Agent Observability

Evaluations tell you how an Agent performs when you intentionally test it. But what about the other 99% of interactions happening in production?

We extended the framework to close that gap. Our Agent Observability extends the framework beyond test suites to monitor production quality over time -- no separate tooling, no duplicated logic. This means the metrics we trust in testing are the exact same metrics monitoring production traffic. When traffic patterns shift in ways our test sets didn't anticipate, we see it in hours rather than discovering it through feedback days later. We can compare quality metrics across Agent versions before and after a rollout.

Confidence at Scale

From spreadsheets to an integrated evaluation platform to production observability, each layer compounds on the last. Evaluations catch regressions before they ship. Observability catches the ones that only surface under real-world conditions. Together, they give us something rare in AI development: confidence.

Quality is only half of it -- security is non-negotiable. Our Evaluation framework includes metrics that verify Agents correctly refuse dangerous or privacy-violating requests, and we gate deployments on those results. An Agent that gives great answers but leaks member data or ignores safety boundaries doesn't ship. Period.

This infrastructure is a strategic asset. When a new model drops, we can validate quality and safety in hours, not weeks. When a prompt change improves one metric, we catch regressions in others before they reach production. This is the discipline that separates teams who ship AI reliably from those constantly firefighting.

There's More

Evaluations and observability are just one piece of what we've built. AI Studio also includes automated Evaluation CI gates that block deployments when they indicate a potential regression. Beyond that, we have RAG quality metrics, a custom prompt templating language (HPML), reusable Snippets, Sub Agents, Goals, and more. We'll dive deeper into these in future posts.

The bottom line: Unit tests aren't enough for non-deterministic systems, but you can evaluate them rigorously. We built the infrastructure to do that at scale: 500+ Agents, 2000+ evaluations and growing. That's how we ship with confidence.


Want to build the future of health and performance with AI? WHOOP is hiring engineers, product managers, and AI researchers who are passionate about using technology to unlock human performance.

· 5 min read
Arshia Mathur

When I joined WHOOP as a Backend Software Engineer co-op, I was excited about AI but mostly in the abstract. I had seen demos and read about emerging models. What I had not experienced yet was what it takes to bring AI into a real product in a way that feels trustworthy, secure, supportive, and genuinely helpful to members. At WHOOP, I had the chance to work on exactly that. Over my co-op, I contributed to three major areas of the AI experience: expanding AI across the app, designing onboarding for new members, and developing proactive insights after workouts. Along the way, I learned that the hard part is not just getting AI to “work”, it is making it feel human, responsible, and aligned with the mission to unlock human performance.

· 7 min read
Douglas Schonholtz

The latest model isn't always the best model for every use case.

Before we can chat about GPT-5.1, we have to talk about GPT-5. When GPT-5 dropped, we were excited. Our evals showed GPT-5 was clearly more intelligent and we quickly rolled it out for our high intelligence agents and high-reasoning tasks where that capability shines. But for our low-latency chat, GPT-5's minimal reasoning mode actually underperformed GPT-4.1 on our evals. Different use cases, different results. Our per use case evaluations allowed us to immediately make these determinations.

We shared those findings directly with OpenAI in our weekly call and at DevDay, where we spent time with their engineers walking through our eval results and discussing what we needed for latency-sensitive applications like chat. That collaboration mattered.

GPT-5.1 shipped with a new "none" reasoning mode that addressed exactly what we'd flagged. We ran our evals again, and this time the results were different. Within a week, we had validated it against over 4,000 test cases, A/B tested it in production, and rolled it out to everyone.

22% faster responses. 24% more positive feedback. 42% lower costs.

Here's exactly what we evaluated, what we found, and why we shipped.

Evals First: Validating GPT-5.1

Before touching production, we ran our full evaluation suite against GPT-5.1, comparing it to our GPT-4.1 baseline for our core chat use case. Here's what we found:


Our internal evaluation dashboard comparing GPT-4.1 (baseline) against GPT-5.1 across over 4,000 test cases

Metric              | GPT-4.1 Baseline | GPT-5.1 | Change
Answer Relevance    | 0.58             | 0.63    | +0.05
Completeness        | 0.97             | 0.98    | +0.01
Data Driven         | 42.6%            | 56.6%   | +14.0 pp
Recall              | 0.70             | 0.72    | +0.02
Follow-up Question  | 98.8%            | 98.8%   | 0.00
Precision           | 0.28             | 0.25    | -0.03
Response Word Count | 148.23           | 243.95  | +95.72

The headline numbers were impressive: Answer Relevance up 0.05 (0.58 → 0.63), Data-Driven responses up 14 percentage points, and Recall up 0.02. GPT-5.1 was consistently surfacing more personalized health insights. The "Data Driven" metric measures how often responses incorporate actual individualized data rather than generic advice; jumping from 42.6% to 56.6% is a game-changer for personalized coaching.

What's driving that Data Driven jump? Significantly higher (and more accurate) tool call usage, primarily focused on AIQL queries—our internal query language for pulling individualized data like sleep, strain, and recovery. More data queries mean more personalized context in every response.

We measured answer relevance, recall, data usage, tool accuracy, and conversation dynamics. There are trade-offs: GPT-5.1 shows a slightly lower precision score (0.25 vs 0.28) and produces somewhat longer responses, even after light prompt tuning to reduce verbosity. But those extra tokens are doing useful work. The model infers intent better and explains why, not just what. The added detail shows up primarily in complex coaching conversations that deserve deeper explanation, not in simple one-off questions. When we reviewed individual traces, the trade-off was clearly worth it.

The Real Test: Production

Evals told us GPT-5.1 was ready. But evals only tell part of the story—the real test is always production.

We rolled GPT-5.1 out to roughly 10% of production load as a controlled A/B test, monitoring latency, tool usage, cache hit rates, user feedback, and token spend. The live data told an even better story than our evals predicted:

22% Faster Responses

Here's what surprised us: despite providing more comprehensive answers and using tools more thoroughly, GPT-5.1's time to first token dropped significantly. This defies the usual trend where "smarter" models are slower.

Percentile   | GPT-4.1 | GPT-5.1 | Improvement
p50 (median) | 1.53s   | 0.98s   | 36% faster
p90          | 2.88s   | 1.96s   | 32% faster
p99          | 5.83s   | 4.55s   | 22% faster

The median response now starts streaming in under a second. The model itself is faster: 34% higher token velocity (98/s → 131/s) and 18% faster time-to-first-token (841ms → 688ms). That's not our optimization. It's a better model on better infrastructure.


GPT-4.1 baseline: 98 tokens/sec, 841ms time to first token


GPT-5.1: 131 tokens/sec, 688ms time to first token

Amplified by Better Caching

On top of the model improvements, we saw dramatically improved caching efficiency.


Cache hit rates improved by 50% with GPT-5.1

OpenAI's improved GPT-5.1 infrastructure meant a 50% improvement in cache hit rates. We didn't change anything on our end; the model just automatically has a higher cache hit rate for us.

Cached tokens are 10x cheaper, and our coaching experience has an extremely high input-to-output token ratio—lots of context for every answer. That combination makes caching efficiency a massive cost lever.
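A back-of-envelope sketch of why the cache hit rate is such a cost lever when input tokens dominate. The prices, token counts, and hit rates below are made-up illustrations, not OpenAI's actual rates:

```python
# Blended input-token cost: cached tokens are billed at a fraction
# (here 10x cheaper) of the uncached price.

def blended_input_cost(tokens: int, price_per_token: float,
                       cache_hit_rate: float, cache_discount: float = 0.1) -> float:
    cached = tokens * cache_hit_rate
    uncached = tokens - cached
    return uncached * price_per_token + cached * price_per_token * cache_discount

price = 2e-6  # hypothetical $/input token
before = blended_input_cost(10_000, price, cache_hit_rate=0.4)
after = blended_input_cost(10_000, price, cache_hit_rate=0.6)  # 50% better hit rate
print(f"input cost drops {1 - after / before:.0%}")
```

With these made-up numbers, a 50% improvement in hit rate (0.4 → 0.6) cuts input cost by roughly a quarter, before even counting cheaper base prices.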

24% More Positive Feedback

Users prefer 5.1 over 4.1 with a 24% increase in positive feedback (thumbs up) and a corresponding decrease in negative feedback. Faster responses plus better answers equals deeper engagement.

42% Lower Token Costs

If GPT-5.1 produces wordier responses (+96 words on average), how did costs go down? That 50% caching improvement. Serving significantly more requests from cache at 10x lower cost, combined with cheaper overall prices than 4.1, drove lower total cost.

At scale, these efficiency gains translate to significant savings we can reinvest into building better AI experiences.

From 10% to 100% Rollout

Based on these results, we rolled GPT-5.1 out to 100% of production traffic. The metrics have held steady, confirming what our evals and A/B test predicted.

What We Learned

A year ago, evaluating a new model would have taken weeks. Now we can eval in hours and ship the same day.

When GPT-5 came out, we knew within hours it wasn't the right fit for low-latency chat. When GPT-5.1 came out, we knew within hours it was. That speed comes from investing in use-case-specific evaluation infrastructure. As model release cadence accelerates, this kind of rapid validation isn't optional. It's how you lead.

Our partnership with OpenAI made this launch better and helped make 5.1 the model we needed. Our in-house eval framework gave us the ability to validate any new model against our specific use cases, catch regressions before they hit production, and ship with confidence.

The result: 56.6% data-driven responses (up from 42.6%), faster responses that make coaching feel like a conversation, and lower costs we can reinvest into building better AI experiences.

If you want more context on how we build and ship agents at WHOOP, check out our earlier post on AI Studio, "From Idea To Agent In Less Than Ten Minutes".

What's Next

We're not done. The ultimate metric isn't eval scores or even user feedback. It's whether better AI responses lead to better health outcomes. That's where we're headed.

The bottom line: We shipped a new model in a week, validated by our custom eval framework. 22% faster responses, 24% more positive feedback, 42% lower costs. That's what rapid iteration looks like.


Want to build the future of health and performance with AI? WHOOP is hiring engineers, product managers, and AI researchers who are passionate about using technology to unlock human performance.

· 7 min read
Justin Coon

At WHOOP, we've been at the forefront of applied AI not by competing in the model race, but by building on top of it with our unmatched physiological data and domain expertise. While others were still debating the potential of LLMs, we had already spent years understanding how to integrate these models with real-world systems and data pipelines.

This early investment paid dividends. By the time LLMs started to become truly powerful, we weren't starting from scratch. We had years of hard-won expertise in prompt engineering, evals, model selection, and most critically, applying AI to physiological insights with robust privacy and security guardrails.

AI Studio

In the years we spent building LLM agents, we built all kinds of one-off scripts and tools that stayed flexible while we invented new ways of doing things and kept up with the gen-AI-model-of-the-month release cadence. Even so, integrating new data sources and rearranging logic to provide the right context to the LLMs meant that even minor changes to our core agents could take weeks.

When coding agents like Cursor and Claude Code arrived and the coding model race really heated up, we realized we could finally build the LLM Agent IDE we'd always dreamed of. We already understood the core abstractable pieces of an LLM agent (system instructions, model, and tools); we just needed a simple backend to store them and a simple frontend to manage them. In about a week, we had built the first iteration of AI Studio.


The AI Studio agent editor showing system prompts, model selection, and tool configuration

It has a lot more features today than it did back then, but the original version nailed the core need: we could now go from "what if we build an agent that does X" to actually testing that agent in less than 10 minutes. Oftentimes someone will mention a great idea at the start of standup, and by the end of it they've built a working prototype and sent it to all of our phones to try out. This level of rapid iteration has unlocked a creativity and pace we have never seen before. More importantly, it capitalizes on the nature of generative AI: 95% of the value can come from the first 5% of effort, and the last 5% of polish will take 95% of the effort. So we prioritize trying a lot of ideas and failing fast, never wasting time polishing something that won't work.


Growth of agent iterations over 6 months

After 6 months, we've created and tested over 2500 iterations of different agents, and safely deployed 235 of them to production across 41 live agents like WHOOP Coach or Day In Review.

The key word here is "safely." While AI Studio makes internal iteration quick and frictionless, going from idea to prototype in minutes, we haven't compromised on security or privacy when it comes to production. Every deployment goes through a built-in diff, approval, and deployment flow that ensures proper review of changes, adherence to our strict privacy policies, and validation of security guardrails. Most critically, the platform ensures that PII is never sent to model providers. This dual approach lets our teams move fast where it matters (experimentation and prototyping) while maintaining enterprise-grade security where it counts.

What Makes AI Studio Different

Traditional AI development requires deep technical expertise at every layer of the stack. With AI Studio, we've abstracted the complexity while preserving flexibility:

  • Visual Agent Builder: Define your agent's system prompt, select a model, and configure tools, all through a clean web interface
  • Integrated Testing Environment: Chat with your agent in real-time, debug interactions, and iterate on prompts without deploying anything
  • One-Click Tool Access: Connect to WHOOP's data ecosystem through pre-built tools for fetching or writing to things like weekly plan, healthspan, and activities with no API wrangling required
  • Built-in Evaluation Framework: Test agent performance systematically with our integrated eval system

The result has fundamentally changed how we think about AI at WHOOP. It's no longer a specialized capability reserved for our AI team: anyone from product managers to data scientists to health coaches can prototype agents in minutes and deploy production-ready versions within a day. In fact, with the entire development cycle now being no-code, our product managers are quickly becoming our strongest prompt engineers (shoutout Anjali Ahuja, Mahek Modi, Alexi Coffey, Camerin Rawson!).

The Game Changer: Inline Tools

As we pushed deeper into making agent iteration ultra-fast, we invented something we call inline tools, a breakthrough that's transformed how we build agents.

Traditional agent architectures separate the prompt from tool invocations. The LLM has to explicitly decide when to call a tool, format the request correctly, wait for the response, and then continue. This creates latency, complexity, and countless edge cases.

Inline tools flip this approach. We've created a markup language that allows us to trigger agent tools directly inside the system prompt itself. Here's what this looks like in practice:

Current time: {{@tool1}}
Today's recovery score: {{@tool2}}

would become

Current time: <result of tool1>
Today's recovery score: <result of tool2>

When the agent loads, these inline tool calls execute in parallel, injecting real-time, personalized data directly into the context. The LLM doesn't need to "decide" to fetch this data. It's already there, reducing latency and making the interaction feel instantaneous.
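The render-time substitution described above can be sketched in a few lines of Python. This is not the HPML implementation; the tool registry and placeholder handling are illustrative:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Find {{@tool}} placeholders, run the referenced tools in parallel, and
# splice their results into the prompt before the LLM ever sees it.

PLACEHOLDER = re.compile(r"\{\{@(\w+)\}\}")

def render(prompt: str, tools: dict) -> str:
    names = PLACEHOLDER.findall(prompt)
    with ThreadPoolExecutor() as pool:
        results = dict(zip(names, pool.map(lambda n: tools[n](), names)))
    return PLACEHOLDER.sub(lambda m: results[m.group(1)], prompt)

# Hypothetical tools returning real-time, personalized data:
tools = {
    "tool1": lambda: "2025-06-01 07:30",
    "tool2": lambda: "85%",
}
prompt = "Current time: {{@tool1}}\nToday's recovery score: {{@tool2}}"
print(render(prompt, tools))
```

Because the tool calls run before generation, the model receives a fully populated context and never has to decide whether to fetch the data.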

This seemingly simple innovation has profound implications:

  • Faster Response Times: No round-trip tool calls means near-instant responses
  • Simpler Mental Model: Developers think about data as part of the prompt, not as external API calls
  • Better Consistency: Data is always present in the expected format
  • Easier Testing: Debug prompts with real data injected, not placeholder variables

The Compound Effect of Internal Tools

There's a broader lesson here about the compound value of internal developer tools. Every hour we invest in making AI Studio better pays dividends across every team, every agent, and every customer interaction. When you make something 10x easier, you don't just get 10x more of it. You unlock entirely new categories of innovation from people who couldn't participate before.

As LLMs continue to evolve and new models emerge, AI Studio ensures we can adopt them instantly. When a new model drops, every agent in our system can be upgraded with a dropdown selection. When we identify a new pattern that works well, it becomes a reusable component available to everyone.

This kind of leverage is now accessible to any team. Tools like Cursor and Claude Code have made it incredibly easy and low friction to spin up internal tools tailor-made to your company's stack and needs. The value compounds quickly: build once, benefit everywhere. At WHOOP, AI Studio has become the foundation for how we ship AI features—and it all started with asking what would make our own lives easier.

The Future is Already Here

While the tech world debates the future of AI agents, we're already living in it at WHOOP. Our members interact with AI agents dozens of times per day. They just experience them as helpful features, not "AI." Coach Chat provides personalized guidance. Daily Outlook surfaces insights from their data. Recovery recommendations adapt to their unique physiology.

Behind each of these experiences is an agent built in AI Studio, many of them created by people who had never worked with LLMs before. That's the real revolution: not the models themselves, but the platforms that make them accessible to everyone.

What's Next

What we've shared today is just the tip of the iceberg. The most powerful capabilities we've built into AI Studio are still under wraps—innovations that have fundamentally changed how we think about AI agents and what they're capable of.

Stay tuned. We'll be pulling back the curtain on these breakthroughs soon. And if you can't wait to see what we're building behind the scenes, join us!


Want to build the future of health and performance with AI? WHOOP is hiring engineers, product managers, and AI researchers who are passionate about using technology to unlock human performance.

· 9 min read
Arna Bhattacharya
Jacob Friedman

Internships and Co-ops at WHOOP are more than working on temporary projects - they give students the chance to ship real work, explore new technologies, and see their impact firsthand. In the reflections that follow, two interns share their journeys: one as a college SWE intern building core product features, and another as a high school SWE intern who transformed an idea into a redesigned engineering blog. Together, their stories highlight how WHOOP empowers interns and co-ops of all backgrounds to learn, build, and belong.

· 13 min read
Vinay Raghu

MCP! It's everywhere and it's confusing. In my opinion, the usual explanations of "it's like a USB-C adapter" assume a certain level of proficiency in this space. If, like me, you have been struggling to keep up with the AI firehose of information and hype, this article is for you. It is a gentle introduction to MCP and all things AI, so you can understand the key terminology and architectures.

· 6 min read
Mark Flores

Last June, we hosted our biggest hackathon at WHOOP yet, and our first one focused on using AI. The event was open to the entire company: product managers, designers, operations, scientists, engineers, marketing, and more. The results were beyond what I could have imagined.

· 7 min read
Viviano Cantu

Agentic coding is on the rise, and the productivity gains from it are real. There are tons of success stories of non-programmers "vibe coding" SaaS apps to $50k MRR. But, there are also precautionary tales of those same SaaS apps leaking API keys, racking up unnecessary server costs, and growing into giant spaghetti code that even Cursor can't fix.