When I joined WHOOP as a Backend Software Engineer co-op, I was excited about AI, but mostly in the abstract. I had seen demos and read about emerging models. What I had not experienced yet was what it takes to bring AI into a real product in a way that feels trustworthy, secure, supportive, and genuinely helpful to members. At WHOOP, I had the chance to work on exactly that. Over my co-op, I contributed to three major areas of the AI experience: expanding AI across the app, designing onboarding for new members, and developing proactive insights after workouts. Along the way, I learned that the hard part is not just getting AI to “work”; it is making it feel human, responsible, and aligned with the mission to unlock human performance.
We Shipped GPT-5.1 in a Week. Here's How We Validated It.
The latest model isn't always the best model for every use case.
Before we can chat about GPT-5.1, we have to talk about GPT-5. When GPT-5 dropped, we were excited. Our evals showed GPT-5 was clearly more intelligent, and we quickly rolled it out for our high-intelligence agents and high-reasoning tasks where that capability shines. But for our low-latency chat, GPT-5's minimal reasoning mode actually underperformed GPT-4.1 on our evals. Different use cases, different results. Our per-use-case evaluations let us make that call immediately.
We shared those findings directly with OpenAI in our weekly call and again at DevDay, where we walked their engineers through our eval results and discussed what we needed for latency-sensitive applications like chat. That collaboration mattered.
GPT-5.1 shipped with a new "none" reasoning mode that addressed exactly what we'd flagged. We ran our evals again, and this time the results were different. Within a week, we had validated it against over 4,000 test cases, A/B tested it in production, and rolled it out to everyone.
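For readers less familiar with reasoning modes: picking one per use case is just a request parameter. Here's a minimal sketch using the OpenAI Python SDK; the model IDs, the use-case split, and exact parameter support per model are assumptions for illustration, not a reproduction of our production setup.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical per-use-case configuration: high-reasoning agents get full
# reasoning effort, while latency-sensitive chat pins effort to "none"
# (the GPT-5.1 mode discussed above). Model IDs are illustrative.
USE_CASE_CONFIG = {
    "deep_analysis": {"model": "gpt-5.1", "reasoning_effort": "high"},
    "coach_chat": {"model": "gpt-5.1", "reasoning_effort": "none"},
}

def ask(use_case: str, user_message: str) -> str:
    cfg = USE_CASE_CONFIG[use_case]
    response = client.chat.completions.create(
        model=cfg["model"],
        reasoning_effort=cfg["reasoning_effort"],
        messages=[{"role": "user", "content": user_message}],
    )
    return response.choices[0].message.content
```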
22% faster responses. 24% more positive feedback. 42% lower costs.
Here's exactly what we evaluated, what we found, and why we shipped.
Evals First: Validating GPT-5.1
Before touching production, we ran our full evaluation suite against GPT-5.1, comparing it to our GPT-4.1 baseline for our core chat use case. Here's what we found:

Our internal evaluation dashboard comparing GPT-4.1 (baseline) against GPT-5.1 across over 4,000 test cases
| Metric | GPT-4.1 Baseline | GPT-5.1 | Change |
|---|---|---|---|
| Answer Relevance | 0.58 | 0.63 | +0.05 |
| Completeness | 0.97 | 0.98 | +0.01 |
| Data Driven | 42.6% | 56.6% | +14.0 pp |
| Recall | 0.70 | 0.72 | +0.02 |
| Follow-up Question Rate | 98.8% | 98.8% | 0.0 pp |
| Precision | 0.28 | 0.25 | -0.03 |
| Response Word Count | 148.23 | 243.95 | +95.72 |
The headline numbers were impressive: Answer Relevance up 5 points (0.58 → 0.63), Data-Driven responses up 14 percentage points, and Recall up 2 points. GPT-5.1 was consistently surfacing more personalized health insights. The "Data Driven" metric measures how often responses incorporate actual individualized data rather than generic advice; jumping from 42.6% to 56.6% is a game-changer for personalized coaching.
What's driving that Data Driven jump? Significantly higher (and more accurate) tool call usage, primarily focused on AIQL queries—our internal query language for pulling individualized data like sleep, strain, and recovery. More data queries mean more personalized context in every response.
We measured answer relevance, recall, data usage, tool accuracy, and conversation dynamics. There are trade-offs: GPT-5.1 shows a slightly lower precision score (0.25 vs 0.28) and produces somewhat longer responses, even after light prompt tuning to reduce verbosity. But those extra tokens are doing useful work. The model infers intent better and explains why, not just what. The added detail shows up primarily in complex coaching conversations that deserve deeper explanation, not in simple one-off questions. When we reviewed individual traces, the trade-off was clearly worth it.
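As a rough illustration of the shape of such a suite (not our actual eval framework), a per-use-case comparison boils down to running both models over the same test cases and aggregating per metric; the scorer functions here are placeholders that could be LLM-as-judge graders or programmatic checks.

```python
from statistics import mean
from typing import Callable

# A scorer takes (test_case, model_response) and returns a score in [0, 1].
Scorer = Callable[[dict, str], float]

def run_eval(
    test_cases: list[dict],
    generate: Callable[[dict], str],  # calls the candidate model for one case
    scorers: dict[str, Scorer],       # e.g. answer relevance, recall, data usage
) -> dict[str, float]:
    """Run every scorer over every test case; return the mean score per metric."""
    totals: dict[str, list[float]] = {name: [] for name in scorers}
    for case in test_cases:
        response = generate(case)
        for name, scorer in scorers.items():
            totals[name].append(scorer(case, response))
    return {name: mean(scores) for name, scores in totals.items()}

# Run the same suite against the baseline and the candidate, then diff the
# aggregates before deciding whether to promote the new model:
# baseline = run_eval(cases, generate_with_gpt41, scorers)
# candidate = run_eval(cases, generate_with_gpt51, scorers)
```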
The Real Test: Production
Evals told us GPT-5.1 was ready. But evals only tell part of the story—the real test is always production.
We rolled GPT-5.1 out to roughly 10% of production load as a controlled A/B test, monitoring latency, tool usage, cache hit rates, user feedback, and token spend. The live data told an even better story than our evals predicted.
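Mechanically, a controlled rollout like this can be as simple as deterministic bucketing on a hashed user ID, so every member consistently sees the same model for the duration of the test. A minimal sketch, with the 10% threshold from our test and everything else illustrative:

```python
import hashlib

ROLLOUT_PERCENT = 10  # share of traffic routed to the candidate model

def assign_model(user_id: str) -> str:
    """Deterministically bucket a user so they always get the same model."""
    digest = hashlib.sha256(f"gpt51-rollout:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "gpt-5.1" if bucket < ROLLOUT_PERCENT else "gpt-4.1"
```

Because the bucketing is deterministic, the same member never flips between models mid-conversation, which keeps feedback comparisons clean.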
22% Faster Responses
Here's what surprised us: despite providing more comprehensive answers and using tools more thoroughly, GPT-5.1's time to first token dropped significantly. This defies the usual trend where "smarter" models are slower.
| Percentile | GPT-4.1 | GPT-5.1 | Improvement |
|---|---|---|---|
| p50 (median) | 1.53s | 0.98s | 36% faster |
| p90 | 2.88s | 1.96s | 32% faster |
| p99 | 5.83s | 4.55s | 22% faster |
The median response now starts streaming in under a second. The model itself is faster: 34% higher token velocity (98/s → 131/s) and 18% faster time-to-first-token (841ms → 688ms). That's not our optimization. It's a better model on better infrastructure.

GPT-4.1 baseline: 98 tokens/sec, 841ms time to first token

GPT-5.1: 131 tokens/sec, 688ms time to first token
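Both of those numbers are easy to reproduce from a streaming response. Here's a rough sketch of measuring time to first token and token velocity with the OpenAI Python SDK's streaming interface; counting content deltas as a token proxy is an approximation, and the details are illustrative rather than our production instrumentation.

```python
import time
from openai import OpenAI

client = OpenAI()

def measure_stream(model: str, prompt: str) -> dict:
    """Stream a completion and record time-to-first-token and rough tokens/sec."""
    start = time.monotonic()
    first_token_at = None
    deltas = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content if chunk.choices else None
        if content:
            if first_token_at is None:
                first_token_at = time.monotonic()
            deltas += 1  # roughly one small token chunk per content delta
    end = time.monotonic()
    ttft = (first_token_at or end) - start
    generation_time = max(end - (first_token_at or start), 1e-6)
    return {"ttft_ms": ttft * 1000, "tokens_per_sec": deltas / generation_time}
```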
Amplified by Better Caching
On top of the model improvements, we saw dramatically improved caching efficiency.

Cache hit rates improved by 50% with GPT-5.1
OpenAI's improved infrastructure for GPT-5.1 delivered a 50% improvement in cache hit rates. We didn't change anything on our end; the higher hit rate came entirely from the provider side.
Cached tokens are 10x cheaper, and our coaching experience has an extremely high input-to-output token ratio—lots of context for every answer. That combination makes caching efficiency a massive cost lever.
24% More Positive Feedback
Users prefer GPT-5.1 over GPT-4.1: positive feedback (thumbs up) increased 24%, with a corresponding drop in negative feedback. Faster responses plus better answers equals deeper engagement.
42% Lower Token Costs
If GPT-5.1 produces wordier responses (+96 words on average), how did costs go down? That 50% caching improvement. Serving significantly more requests from cache at 10x lower cost, combined with GPT-5.1's lower overall pricing compared to GPT-4.1, drove total cost down.
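A back-of-the-envelope cost model makes the mechanics concrete: with cached input tokens priced at roughly a tenth of uncached ones and an input-dominated workload, the cache hit rate moves most of the bill. The prices in this sketch are placeholders, not actual rates.

```python
def blended_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cache_hit_rate: float,         # fraction of input tokens served from cache
    input_price: float,            # $ per 1M uncached input tokens (placeholder)
    output_price: float,           # $ per 1M output tokens (placeholder)
    cached_discount: float = 0.1,  # cached input tokens cost ~10% of uncached
) -> float:
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (
        uncached * input_price
        + cached * input_price * cached_discount
        + output_tokens * output_price
    ) / 1_000_000

# Example: for an input-heavy request (20k tokens in, ~250 out), raising the
# cache hit rate meaningfully cuts total cost even if responses get wordier.
# blended_cost_usd(20_000, 250, cache_hit_rate=0.6, input_price=2.0, output_price=8.0)
```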
At scale, these efficiency gains translate to significant savings we can reinvest into building better AI experiences.
From 10% to 100% Rollout
Based on these results, we rolled GPT-5.1 out to 100% of production traffic. The metrics have held steady, confirming what our evals and A/B test predicted.
What We Learned
A year ago, evaluating a new model would have taken weeks. Now we can eval in hours and ship the same day.
When GPT-5 came out, we knew within hours it wasn't the right fit for low-latency chat. When GPT-5.1 came out, we knew within hours it was. That speed comes from investing in use-case-specific evaluation infrastructure. As model release cadence accelerates, this kind of rapid validation isn't optional. It's how you lead.
Our partnership with OpenAI made this launch better and helped make 5.1 the model we needed. Our in-house eval framework gave us the ability to validate any new model against our specific use cases, catch regressions before they hit production, and ship with confidence.
The result: 56.6% data-driven responses (up from 42.6%), faster replies that make coaching feel like a conversation, and lower costs we can reinvest into building better AI experiences.
If you want more context on how we build and ship agents at WHOOP, check out our earlier post on AI Studio, "From Idea To Agent In Less Than Ten Minutes".
What's Next
We're not done. The ultimate metric isn't eval scores or even user feedback. It's whether better AI responses lead to better health outcomes. That's where we're headed.
The bottom line: We shipped a new model in a week, validated by our custom eval framework. 22% faster responses, 24% more positive feedback, 42% lower costs. That's what rapid iteration looks like.
Want to build the future of health and performance with AI? WHOOP is hiring engineers, product managers, and AI researchers who are passionate about using technology to unlock human performance.
From Idea To Agent In Less Than Ten Minutes
At WHOOP, we've been at the forefront of applied AI not by competing in the model race, but by building on top of it with our unmatched physiological data and domain expertise. While others were still debating the potential of LLMs, we had already spent years understanding how to integrate these models with real-world systems and data pipelines.
This early investment paid dividends. By the time LLMs started to become truly powerful, we weren't starting from scratch. We had years of hard-won expertise in prompt engineering, evals, model selection, and most critically, applying AI to physiological insights with robust privacy and security guardrails.
AI Studio
Over the years we spent building LLM agents, we accumulated all kinds of one-off scripts and tools, flexible enough to let us invent new ways of doing things and keep up with the gen-AI-model-of-the-month release cadence. Still, integrating new data sources and rearranging logic to provide the right context to the LLMs meant that even minor changes to the way our core agents work could take weeks.
When coding agents like Cursor and Claude Code arrived and the coding-model race really heated up, we realized we could finally build the LLM agent IDE we'd always dreamed of. We already had a strong understanding of the core abstractable pieces of an LLM agent (system instructions, model, and tools); we just needed a simple backend to store them and a simple frontend to manage them. In a week or so, we had built the first iteration of AI Studio.

The AI Studio agent editor showing system prompts, model selection, and tool configuration
It has a lot more features today than it did back then, but the original version nailed the core need: we could now go from "what if we built an agent that does X" to actually testing that agent in less than 10 minutes. Often someone will mention a great idea at the start of standup, and by the end of it they've built a working prototype and sent it to all of our phones to try out. This level of rapid iteration has unlocked a creativity and pace we have never seen before. More importantly, it capitalizes on the nature of generative AI: 95% of the value can come from the first 5% of effort, while the last 5% of polish takes 95% of the effort. So we prioritize trying a lot of ideas and failing fast, never wasting time polishing something that won't work.

Growth of agent iterations over 6 months
After 6 months, we've created and tested over 2500 iterations of different agents, and safely deployed 235 of them to production across 41 live agents like WHOOP Coach or Day In Review.
The key word here is "safely." While AI Studio makes internal iteration quick and frictionless, going from idea to prototype in minutes, we haven't compromised on security or privacy when it comes to production. Every deployment goes through a built-in diff, approval, and deployment flow that ensures proper review of changes, adherence to our strict privacy policies, and validation of security guardrails. Most critically, the platform ensures that PII is never sent to model providers. This dual approach lets our teams move fast where it matters (experimentation and prototyping) while maintaining enterprise-grade security where it counts.
What Makes AI Studio Different
Traditional AI development requires deep technical expertise at every layer of the stack. With AI Studio, we've abstracted the complexity while preserving flexibility:
- Visual Agent Builder: Define your agent's system prompt, select a model, and configure tools, all through a clean web interface
- Integrated Testing Environment: Chat with your agent in real-time, debug interactions, and iterate on prompts without deploying anything
- One-Click Tool Access: Connect to WHOOP's data ecosystem through pre-built tools for fetching or writing to things like weekly plan, healthspan, and activities with no API wrangling required
- Built-in Evaluation Framework: Test agent performance systematically with our integrated eval system
The result has fundamentally changed how we think about AI at WHOOP. It's no longer a specialized capability reserved for our AI team—anyone from product managers to data scientists to health coaches can prototype agents in minutes and deploy production-ready versions within a day. In fact, with the entire development cycle now no-code, members of our product team are quickly becoming our strongest prompt engineers (shoutout Anjali Ahuja, Mahek Modi, Alexi Coffey, Camerin Rawson!).
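To make the abstraction concrete, an agent definition in a platform like this reduces to a small declarative record of prompt, model, and tools. The schema below is a hypothetical sketch for illustration, not AI Studio's actual format, and the tool names are made up.

```python
from dataclasses import dataclass, field

@dataclass
class AgentDefinition:
    """Hypothetical shape of an agent record: prompt, model, and tools."""
    name: str
    system_prompt: str
    model: str
    tools: list[str] = field(default_factory=list)

day_in_review = AgentDefinition(
    name="day-in-review",
    system_prompt="You are a recovery-focused coach. Summarize today's data...",
    model="gpt-5.1",
    tools=["get_sleep", "get_strain", "get_recovery"],  # illustrative tool names
)
```

In a design like this, swapping the model or adding a tool is a one-field change, which is what makes dropdown-style upgrades across agents cheap.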
The Game Changer: Inline Tools
As we pushed deeper into making agent iteration ultra-fast, we invented something we call inline tools, a breakthrough that's transformed how we build agents.
Traditional agent architectures separate the prompt from tool invocations. The LLM has to explicitly decide when to call a tool, format the request correctly, wait for the response, and then continue. This creates latency, complexity, and countless edge cases.
Inline tools flip this approach. We've created a markup language that allows us to trigger agent tools directly inside the system prompt itself. Here's what this looks like in practice:
Current time: {{@tool1}}
Today's recovery score: {{@tool2}}
would become
Current time: <result of tool1>
Today's recovery score: <result of tool2>
When the agent loads, these inline tool calls execute in parallel, injecting real-time, personalized data directly into the context. The LLM doesn't need to "decide" to fetch this data. It's already there, reducing latency and making the interaction feel instantaneous.
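Here's a minimal sketch of how that substitution could be implemented; the {{@tool}} syntax follows the example above, while the rendering code itself is illustrative rather than WHOOP's implementation.

```python
import asyncio
import re
from typing import Awaitable, Callable

# Matches inline tool markers like {{@tool1}} in a system prompt template.
INLINE_TOOL_PATTERN = re.compile(r"\{\{@(\w+)\}\}")

async def render_prompt(
    template: str,
    tools: dict[str, Callable[[], Awaitable[str]]],
) -> str:
    """Run every referenced tool in parallel and substitute results into the prompt."""
    names = list(dict.fromkeys(INLINE_TOOL_PATTERN.findall(template)))  # dedupe, keep order
    results = await asyncio.gather(*(tools[name]() for name in names))
    values = dict(zip(names, results))
    return INLINE_TOOL_PATTERN.sub(lambda m: values[m.group(1)], template)

# Usage with the example above (tool implementations are placeholders):
# async def tool1() -> str: return "2025-06-12 07:30"
# async def tool2() -> str: return "78%"
# prompt = asyncio.run(render_prompt(
#     "Current time: {{@tool1}}\nToday's recovery score: {{@tool2}}",
#     {"tool1": tool1, "tool2": tool2},
# ))
```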
This seemingly simple innovation has profound implications:
- Faster Response Times: No round-trip tool calls means near-instant responses
- Simpler Mental Model: Developers think about data as part of the prompt, not as external API calls
- Better Consistency: Data is always present in the expected format
- Easier Testing: Debug prompts with real data injected, not placeholder variables
The Compound Effect of Internal Tools
There's a broader lesson here about the compound value of internal developer tools. Every hour we invest in making AI Studio better pays dividends across every team, every agent, and every customer interaction. When you make something 10x easier, you don't just get 10x more of it. You unlock entirely new categories of innovation from people who couldn't participate before.
As LLMs continue to evolve and new models emerge, AI Studio ensures we can adopt them instantly. When a new model drops, every agent in our system can be upgraded with a dropdown selection. When we identify a new pattern that works well, it becomes a reusable component available to everyone.
This kind of leverage is now accessible to any team. Tools like Cursor and Claude Code have made it incredibly easy and low friction to spin up internal tools tailor-made to your company's stack and needs. The value compounds quickly: build once, benefit everywhere. At WHOOP, AI Studio has become the foundation for how we ship AI features—and it all started with asking what would make our own lives easier.
The Future is Already Here
While the tech world debates the future of AI agents, we're already living in it at WHOOP. Our members interact with AI agents dozens of times per day. They just experience them as helpful features, not "AI." Coach Chat provides personalized guidance. Daily Outlook surfaces insights from their data. Recovery recommendations adapt to their unique physiology.
Behind each of these experiences is an agent built in AI Studio, many of them created by people who had never worked with LLMs before. That's the real revolution: not the models themselves, but the platforms that make them accessible to everyone.
What's Next
What we've shared today is just the tip of the iceberg. The most powerful capabilities we've built into AI Studio are still under wraps—innovations that have fundamentally changed how we think about AI agents and what they're capable of.
Stay tuned. We'll be pulling back the curtain on these breakthroughs soon. And if you can't wait to see what we're building behind the scenes, join us!
Want to build the future of health and performance with AI? WHOOP is hiring engineers, product managers, and AI researchers who are passionate about using technology to unlock human performance.
Internships and Co-ops at WHOOP
Internships and co-ops at WHOOP are more than temporary projects - they give students the chance to ship real work, explore new technologies, and see their impact firsthand. In the reflections that follow, two interns share their journeys: one a college SWE intern building core product features, the other a high school SWE intern who transformed an idea into a redesigned engineering blog. Together, their stories highlight how WHOOP empowers interns and co-ops of all backgrounds to learn, build, and belong.
What the heck is MCP?
MCP! It's everywhere and it's confusing. In my opinion, the usual explanation of "it's like a USB-C adapter" assumes a certain level of proficiency in this space. If, like me, you have been struggling to keep up with the AI firehose of information and hype, this article is for you. It is a gentle introduction to MCP and all things AI so you can understand some key terminology and architectures.
Hack Week 2025 - Build with AI

Last June, we hosted our biggest hackathon at WHOOP yet, and our first one focused on using AI. The event was open to the entire company: product managers, designers, operations, scientists, engineers, marketing, and more. The results were beyond what I could have imagined.
Agentic Coding Without Taking Down Prod
Agentic coding is on the rise, and the productivity gains from it are real. There are tons of success stories of non-programmers "vibe coding" SaaS apps to $50k MRR. But there are also cautionary tales of those same SaaS apps leaking API keys, racking up unnecessary server costs, and growing into giant spaghetti code that even Cursor can't fix.
