Wearable Assistant Context Benchmark
An AI wearable assistant the user is actively using for advice or coaching sees what the user sees and hears what they say. When the user's situation changes (they swap tools, walk into a new room), does the assistant follow along, or stay stuck on what was happening before? This benchmark measures context tracking: whether the model's answer reflects the user's current situation or the previous one.
How the test works
Picture an assistant that sees what you see and hears what you say. You ask it questions and it answers.
In the middle of a conversation, your situation changes. You picked up a different tool. You walked into another room. The thing on screen switched. You didn't say it out loud. It just happened in front of the device. Then you ask a follow-up. Does the assistant answer about what's happening now, or about what was happening a minute ago?
Each scenario is structured around three turns (a turn is one user message plus the model's response). Turn 1 sets the scene. Between Turn 1 and Turn 2, the context shifts (something visible changes) without being announced. Turn 2 is a question that only makes sense if the model noticed the shift. Turn 3 fires only when the model misses Turn 2: the user clarifies and the model gets one more chance, scored as the repair rate.
Each scenario carries a target label that says what the right answer should refer to: current (the new situation), prior (an earlier situation the user is referring back to), clarify (the assistant should ask which thing is meant), or abstain (the assistant should say it can't tell).

The 50 canonical scenarios are spread across eight kinds of context shift: object switched in hand, same object in a different state, the next step of a sequential task, a change of location, a new object brought into view, an absent referent (the thing the question is about is no longer visible), the content of a screen changing, and recall of something the device saw before the conversation began.
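To make that structure concrete, here is a sketch of what a single scenario record could look like. The field names and content below are illustrative assumptions for exposition, not the benchmark's actual schema; the published scenario bank is the source of truth.

```python
# Illustrative only: field names and values are assumptions, not the real schema.
example_scenario = {
    "id": "object-switch-example",
    "shift_type": "object switched in hand",   # one of the eight shift kinds
    "target": "current",                       # current | prior | clarify | abstain
    "turn1": {
        "scene": "User holds a screwdriver over a half-assembled shelf.",
        "user": "Am I driving these screws in the right order?",
    },
    "shift": {
        # The context change between turns: visible to the device, never announced.
        "scene": "User puts the screwdriver down and picks up a hammer.",
    },
    "turn2": {
        "user": "Is this the right way to hold it?",
        # A correct answer refers to the hammer (current), not the screwdriver.
    },
    "turn3_clarifications": {
        "named": "I mean the hammer I'm holding now, not the screwdriver from before.",
        "deictic": "No, this, what I'm holding now.",
    },
}
```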
One disclosure up front. The camera input is short text descriptions of the scene, not actual video frames. The audio input is text transcripts, not raw audio. Both are deliberate proxies that isolate context tracking from perceptual-front-end noise. That keeps the benchmark cheap and reproducible, but it means a model that does well here might not do well on real video or raw audio. More on that in Out of scope.
Six published runs
Each row below is an isolated experiment, not a ranked leaderboard. The score measures how often the assistant correctly realizes the situation has changed versus getting stuck on what happened before. The primary score is balanced Turn 2 accuracy: the average of accuracy on current-target scenarios and accuracy on prior-target scenarios under the neutral system prompt, weighted equally so the larger class doesn't dominate the headline.
Five runs use the canonical 50-scenario bank; one uses a separate 20-scenario adversarial pack designed to discriminate at the top of the score range. CIs are 95% Wilson intervals per class, 95% normal-approximation on the balanced mean.
| Run | What it shows | Candidate | Judge | Pack (scenarios) | Primary score (95% CI) |
|---|---|---|---|---|---|
| baseline | Same-family Gemini canonical | gemini-2.5-flash-lite | gemini-2.5-flash-lite | 50 | 60.6% (54.1–67.1) |
| baseline-alt | Bigger Gemini sibling | gemini-2.5-flash | gemini-2.5-flash-lite | 50 | 77.7% (71.3–84.0) |
| ablation-no-camera | Camera channel stripped | gemini-2.5-flash-lite | gemini-2.5-flash-lite | 50 | 14.4% (9.1–19.7) |
| baseline-qwen-cross-family | Cross-family integrity reference | qwen3-vl-plus | gemini-2.5-flash-lite | 50 | 54.2% (50.7–57.7) |
| baseline-deictic-repair | Deictic vs named repair | gemini-2.5-flash-lite | gemini-2.5-flash-lite | 50 | 60.6% (54.1–67.1) |
| adversarial | Distractor-rich pack | gemini-2.5-flash-lite | gpt-4o-mini | 20 | 67.3% (55.5–79.1) |
The five canonical runs share the same 50-scenario bank, so their primary scores compare directly (apples-to-apples). The fifth canonical run (baseline-deictic-repair) shares the same Turn 2 setup as baseline, so its primary score is identical; the difference shows up in the repair rate (Turn 3 recovery), discussed below.
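For concreteness, a minimal sketch of how the primary score and its intervals can be computed from per-class counts. The function names and the normal-approximation combination on the balanced mean are assumptions for illustration, not a copy of scripts/analyze_runs.py; that script remains the source of truth.

```python
import math

def wilson_ci(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for one class's Turn 2 accuracy."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def balanced_turn2(cur_hits: int, cur_n: int, pri_hits: int, pri_n: int):
    """Balanced Turn 2 accuracy: equal-weight mean of current- and prior-target accuracy."""
    acc_cur, acc_pri = cur_hits / cur_n, pri_hits / pri_n
    balanced = (acc_cur + acc_pri) / 2
    # Normal-approximation CI on the balanced mean (assumed combination rule):
    se = 0.5 * math.sqrt(acc_cur * (1 - acc_cur) / cur_n + acc_pri * (1 - acc_pri) / pri_n)
    return balanced, (balanced - 1.96 * se, balanced + 1.96 * se)
```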
Without the camera input, the model can't answer
The camera ablation takes the same candidate model and the same judge and runs them twice: once with the scene description included in the prompt (baseline), once with it stripped (ablation-no-camera). The 46.2 percentage-point drop rules out one alternative reading of the headline numbers, namely that the model is solving the task by guessing from question phrasing alone. It can't. It needs the camera input. (This isn't on its own a proof of deeper context tracking; the per-class pattern below fills in the rest.)
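A minimal sketch of the two prompt conditions, assuming a simple text-concatenation harness; the tags and layout below are illustrative, not the harness's actual format.

```python
def build_turn_prompt(scene_description, transcript_so_far, user_question):
    """Assemble one candidate prompt. In baseline the text stand-in for the
    camera is included; in ablation-no-camera it is stripped and everything
    else is held constant."""
    parts = []
    if scene_description is not None:          # None in the ablation run
        parts.append(f"[CAMERA] {scene_description}")
    parts.append(f"[CONVERSATION SO FAR]\n{transcript_so_far}")
    parts.append(f"[USER] {user_question}")
    return "\n\n".join(parts)
```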
The model handles "current," but stumbles on "prior"
Across all six runs the model is much better when the right answer is about the most recent frame than when the right answer is about an earlier frame. The baseline-qwen-cross-family run is the clearest example: 100% accuracy when the target is current, 8.3% when the target is prior. The model grounds in the latest visual input and struggles to refer back. Together with the camera ablation, this is the capability gap the benchmark targets: a strong read on what's in front of the model right now, paired with a weak read on what the user is referring back to.
When the user clarifies, what recovers?
If the model misses Turn 2, the user gets one clarifying follow-up. The repair rate is how often the model gets it right after that clarification. v1 ships two clarification styles. The named anchor spells out both objects ("I mean the hammer I'm holding now, not the screwdriver from before"); it's the floor metric. The deictic anchor uses gesture-style language only ("no, this, what I'm holding now"); it's the realistic-recovery signal. On scenarios where a pointing gesture can resolve the reference, deictic recovery is perfect. Elsewhere (when the user is referring back to something not currently visible), verbal clarification rarely helps.
Two models, the same labels: how often do they agree?
Each Turn 2 answer is read by a second model (the judge) that emits one of the four target labels. The default is cross-family judging: the judge comes from a different model maker than the candidate, which avoids self-preference bias (the tendency of a model to rate its own family's outputs more favorably). The fixed ranking judge is a second judge held constant across runs so candidates can be compared apples-to-apples. On the adversarial pack, both judges labeled the same 300 trials and agreed on 190 of them. A Cohen's κ (a standard measure of inter-rater agreement that corrects for chance agreement) of 0.443 is moderate: the labels are not idiosyncratic to one model family, but they aren't perfectly aligned either. v1 reports this cross-LLM agreement as a substitute for human inter-annotator agreement (planned v2 work).
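As a quick unpacking of that number (algebra only, using the figures already reported above, not benchmark code):

```python
# Cohen's kappa corrects raw agreement for chance agreement:
#     kappa = (p_o - p_e) / (1 - p_e)
# Solving for the implied chance-agreement rate p_e from the reported values:
p_o = 190 / 300                       # observed agreement ≈ 0.633
kappa = 0.443
p_e = (p_o - kappa) / (1 - kappa)     # ≈ 0.34, set by the two judges' label marginals
```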
Which differences are real, and which are noise?
At the v1 sample size (50 scenarios × 5 trials = 250 paired observations under baseline), the minimum detectable effect at 80% power is approximately 13 percentage points. Two runs less than ~13 pp apart are not reliably distinguishable from sampling noise. The 17.1-point Gemini Flash vs Flash-Lite gap clears the bar; the 6.4-point gap between Gemini Flash-Lite and Qwen3-VL-Plus does not. The standard paired test for binary outcomes on the same items is McNemar's test: of the 250 paired observations between baseline and baseline-alt, 55 disagreed (40 favoring Flash, 15 favoring Flash-Lite), χ² = 10.47, p = 0.0012. The bigger sibling wins decisively on this bank. scripts/analyze_runs.py regenerates these numbers from the published JSONL transcripts.
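A minimal sketch of the McNemar computation on the disagreement counts above (continuity-corrected, which is what the reported χ² implies); it reproduces the statistic but is not the code in scripts/analyze_runs.py.

```python
from scipy.stats import chi2

b = 40  # items baseline-alt (Flash) got right and baseline (Flash-Lite) got wrong
c = 15  # items baseline (Flash-Lite) got right and baseline-alt got wrong

stat = (abs(b - c) - 1) ** 2 / (b + c)   # continuity-corrected: 24**2 / 55 ≈ 10.47
p_value = chi2.sf(stat, df=1)            # ≈ 0.0012
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
```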
One pattern the analysis surfaces is a ceiling effect: of the 50 scenarios, most pass at ≥ 80% on the cross-family integrity-reference run, and a smaller cluster fails at < 40%. Author difficulty and empirical difficulty agree on only 18 of 50 (36%). That motivates the 20-scenario adversarial pack and a 15-scenario harder-by-construction ceiling-test set (scenarios_v2_candidates.json) that fill cells the canonical bank under-covers. Re-running empirical difficulty across multiple cross-family candidates would tighten these labels; v1 reports the single available cross-family run.
What to keep in mind when reading the table
The API budget was exhausted across multiple providers mid-effort, leaving Gemini-direct as the only transport for the bulk of the canonical runs. Four of the five canonical runs ended up same-family (Gemini Flash-Lite judging either itself or Gemini Flash). Same-family judging admits self-preference bias, so those numbers may be inflated. baseline-qwen-cross-family is the cross-family integrity reference for the canonical bank: same scenarios, same scoring rules, but a non-Gemini candidate paired with the Gemini judge. The 6.4-point gap between same-family baseline (60.6%) and cross-family baseline-qwen-cross-family (54.2%) is the visible self-preference signal, though candidate quality differs between the two runs and explains some of the gap, and the gap sits below the ~13-point minimum detectable effect at this sample size.
What is out of scope
Inputs
- Real video frames. The camera input is text descriptions of the scene, not actual frames. A model that does well here might not do well on real video.
- Raw audio. Scoring is on text. The user's spoken turns are represented as text transcripts. Acoustic grounding, speaker attribution, and ambient audio cues are not exercised.
- Live audio and voice mode. A real wearable also needs to listen, talk back, and handle interruptions in real time. None of that is exercised here.
Answer characteristics
- Advice quality. The judge doesn't check whether the answer is correct, safe, or appropriate to the domain. A confidently wrong answer can pass.
- Domain depth. Whether the advice is expert-level for cooking, woodworking, or any other activity in the scenario.
- Proactive coaching. Whether the assistant volunteers help unprompted.
Conversation shape
- Multi-turn flow past Turn 2. Whether the conversation works naturally from Turn 3 onward.
- Long-horizon memory. Recall across days or weeks.
Engineering and statistics
- Latency and cost. Wall-clock response time and price per call. Not measured.
- Generalization beyond 5 trials per cell. v1 reports 95% CIs. Higher trial counts and seed sweeps are v2 work.
- Human inter-annotator agreement. v1 reports cross-LLM agreement only. A second human rater on a 25% sample is the highest-priority v2 follow-up.
Full discussion in benchmark_notes.md.