Wearable Assistant Context Benchmark
An AI wearable assistant the user is actively using for advice or coaching sees what the user sees and hears what they say. When the user's situation changes (they swap tools, walk into a new room), does the assistant follow along, or stay stuck on what was happening before? This benchmark measures context tracking: whether the model's answer reflects the user's current situation or the previous one.
How the test works
Picture an assistant that sees what you see and hears what you say. You ask it questions and it answers.
In the middle of a conversation, your situation changes. You picked up a different tool. You walked into another room. The thing on screen switched. You didn't say it out loud. It just happened in front of the device. Then you ask a follow-up. Does the assistant answer about what's happening now, or about what was happening a minute ago?
Each scenario is structured around three turns (a turn is one user message plus the model's response). Turn 1 sets the scene. Between Turn 1 and Turn 2, the context shifts (something visible changes) without being announced. Turn 2 is a question that only makes sense if the model noticed the shift. Turn 3 fires only when the model misses Turn 2: the user clarifies and the model gets one more chance, scored as the repair rate.
Each scenario carries a target label that says what the right answer should refer to: current (the new situation), prior (an earlier situation the user is referring back to), clarify (the assistant should ask which thing is meant), or abstain (the assistant should say it can't tell).

The 50 canonical scenarios are spread across eight kinds of context shift: object switched in hand, same object in a different state, the next step of a sequential task, a change of location, a new object brought into view, an absent referent (the thing the question is about is no longer visible), the content of a screen changing, and recall of something the device saw before the conversation began.
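To make that structure concrete, here is a sketch of what a single scenario record could look like. The field names and content below are illustrative assumptions for exposition, not the benchmark's actual schema; the published scenario bank is the source of truth.

```python
# Illustrative only: field names and values are assumptions, not the real schema.
example_scenario = {
    "id": "object-switch-example",
    "shift_type": "object switched in hand",   # one of the eight shift kinds
    "target": "current",                       # current | prior | clarify | abstain
    "turn1": {
        "scene": "User holds a screwdriver over a half-assembled shelf.",
        "user": "Am I driving these screws in the right order?",
    },
    "shift": {
        # The context change between turns: visible to the device, never announced.
        "scene": "User puts the screwdriver down and picks up a hammer.",
    },
    "turn2": {
        "user": "Is this the right way to hold it?",
        # A correct answer refers to the hammer (current), not the screwdriver.
    },
    "turn3_clarifications": {
        "named": "I mean the hammer I'm holding now, not the screwdriver from before.",
        "deictic": "No, this, what I'm holding now.",
    },
}
```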
One disclosure up front. The camera input is short text descriptions of the scene, not actual video frames. The audio input is text transcripts, not raw audio. Both are deliberate proxies that isolate context tracking from perceptual-front-end noise. That keeps the benchmark cheap and reproducible, but it means a model that does well here might not do well on real video or raw audio. More on that in Out of scope.
Six published runs
Each row below is an isolated experiment, not a ranked leaderboard. The score measures how often the assistant correctly realizes the situation has changed versus getting stuck on what happened before. The primary score is balanced Turn 2 accuracy: the average of accuracy on current-target scenarios and accuracy on prior-target scenarios under the neutral system prompt, weighted equally so the larger class doesn't dominate the headline.
Five runs use the canonical 50-scenario bank; one uses a separate 20-scenario adversarial pack designed to discriminate at the top of the score range. CIs are 95% Wilson intervals per class, 95% normal-approximation on the balanced mean.
| Run | What it shows | Candidate | Judge | Pack (scenarios) | Primary score (95% CI) |
|---|---|---|---|---|---|
| baseline | Same-family Gemini canonical | gemini-2.5-flash-lite | gemini-2.5-flash-lite | 50 | 60.6% (54.1–67.1) |
| baseline-alt | Bigger Gemini sibling | gemini-2.5-flash | gemini-2.5-flash-lite | 50 | 77.7% (71.3–84.0) |
| ablation-no-camera | Camera channel stripped | gemini-2.5-flash-lite | gemini-2.5-flash-lite | 50 | 14.4% (9.1–19.7) |
| baseline-qwen-cross-family | Cross-family integrity reference | qwen3-vl-plus | gemini-2.5-flash-lite | 50 | 54.2% (50.7–57.7) |
| baseline-deictic-repair | Deictic vs named repair | gemini-2.5-flash-lite | gemini-2.5-flash-lite | 50 | 60.6% (54.1–67.1) |
| adversarial | Distractor-rich pack | gemini-2.5-flash-lite | gpt-4o-mini | 20 | 67.3% (55.5–79.1) |
The five canonical runs share the same 50-scenario bank, so their primary scores compare directly (apples-to-apples). The fifth canonical run (baseline-deictic-repair) shares the same Turn 2 setup as baseline, so its primary score is identical; the difference shows up in the repair rate (Turn 3 recovery), discussed below.
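For concreteness, a minimal sketch of how the primary score and its intervals can be computed from per-class counts. The function names and the normal-approximation combination on the balanced mean are assumptions for illustration, not a copy of scripts/analyze_runs.py; that script remains the source of truth.

```python
import math

def wilson_ci(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for one class's Turn 2 accuracy."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def balanced_turn2(cur_hits: int, cur_n: int, pri_hits: int, pri_n: int):
    """Balanced Turn 2 accuracy: equal-weight mean of current- and prior-target accuracy."""
    acc_cur, acc_pri = cur_hits / cur_n, pri_hits / pri_n
    balanced = (acc_cur + acc_pri) / 2
    # Normal-approximation CI on the balanced mean (assumed combination rule):
    se = 0.5 * math.sqrt(acc_cur * (1 - acc_cur) / cur_n + acc_pri * (1 - acc_pri) / pri_n)
    return balanced, (balanced - 1.96 * se, balanced + 1.96 * se)
```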
Without the camera input, the model can't answer
The camera ablation takes the same candidate model and the same judge and runs them twice: once with the scene description included in the prompt (baseline), once with it stripped (ablation-no-camera). The 46.2 percentage-point drop rules out one alternative reading of the headline numbers, namely that the model is solving the task by guessing from question phrasing alone. It can't. It needs the camera input. (This isn't on its own a proof of deeper context tracking; the per-class pattern below fills in the rest.)
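A minimal sketch of the two prompt conditions, assuming a simple text-concatenation harness; the tags and layout below are illustrative, not the harness's actual format.

```python
def build_turn_prompt(scene_description, transcript_so_far, user_question):
    """Assemble one candidate prompt. In baseline the text stand-in for the
    camera is included; in ablation-no-camera it is stripped and everything
    else is held constant."""
    parts = []
    if scene_description is not None:          # None in the ablation run
        parts.append(f"[CAMERA] {scene_description}")
    parts.append(f"[CONVERSATION SO FAR]\n{transcript_so_far}")
    parts.append(f"[USER] {user_question}")
    return "\n\n".join(parts)
```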
The model handles "current," but stumbles on "prior"
Across all six runs the model is much better when the right answer is about the most recent frame than when the right answer is about an earlier frame. The baseline-qwen-cross-family run is the clearest example: 100% accuracy when the target is current, 8.3% when the target is prior. The model grounds in the latest visual input and struggles to refer back. Together with the camera ablation, this is the capability gap the benchmark targets: a strong read on what's in front of the model right now, paired with a weak read on what the user is referring back to.
When the user clarifies, what recovers?
If the model misses Turn 2, the user gets one clarifying follow-up. The repair rate is how often the model gets it right after that clarification. v1 ships two clarification styles. The named anchor spells out both objects ("I mean the hammer I'm holding now, not the screwdriver from before"); it's the floor metric. The deictic anchor uses gesture-style language only ("no, this, what I'm holding now"); it's the realistic-recovery signal. On scenarios where a pointing gesture can resolve the reference, deictic recovery is perfect. Elsewhere (when the user is referring back to something not currently visible), verbal clarification rarely helps.
Two models, the same labels: how often do they agree?
Each Turn 2 answer is read by a second model (the judge) that emits one of the four target labels. The default is cross-family judging: the judge comes from a different model maker than the candidate, which avoids self-preference bias (the tendency of a model to rate its own family's outputs more favorably). The fixed ranking judge is a second judge held constant across runs so candidates can be compared apples-to-apples. On the adversarial pack, both judges labeled the same 300 trials and agreed on 190 of them. A Cohen's κ (a standard measure of inter-rater agreement that corrects for chance agreement) of 0.443 is moderate: the labels are not idiosyncratic to one model family, but they aren't perfectly aligned either. v1 reports this cross-LLM agreement as a substitute for human inter-annotator agreement (planned v2 work).
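As a quick unpacking of that number (algebra only, using the figures already reported above, not benchmark code):

```python
# Cohen's kappa corrects raw agreement for chance agreement:
#     kappa = (p_o - p_e) / (1 - p_e)
# Solving for the implied chance-agreement rate p_e from the reported values:
p_o = 190 / 300                       # observed agreement ≈ 0.633
kappa = 0.443
p_e = (p_o - kappa) / (1 - kappa)     # ≈ 0.34, set by the two judges' label marginals
```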
Which differences are real, and which are noise?
At the v1 sample size (50 scenarios × 5 trials = 250 paired observations under baseline), the minimum detectable effect at 80% power is approximately 13 percentage points. Two runs less than ~13 pp apart are not reliably distinguishable from sampling noise. The 17.1-point Gemini Flash vs Flash-Lite gap clears the bar; the 6.4-point gap between Gemini Flash-Lite and Qwen3-VL-Plus does not. The standard paired test for binary outcomes on the same items is McNemar's test: of the 250 paired observations between baseline and baseline-alt, 55 disagreed (40 favoring Flash, 15 favoring Flash-Lite), χ² = 10.47, p = 0.0012. The bigger sibling wins decisively on this bank. scripts/analyze_runs.py regenerates these numbers from the published JSONL transcripts.
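A minimal sketch of the McNemar computation on the disagreement counts above (continuity-corrected, which is what the reported χ² implies); it reproduces the statistic but is not the code in scripts/analyze_runs.py.

```python
from scipy.stats import chi2

b = 40  # items baseline-alt (Flash) got right and baseline (Flash-Lite) got wrong
c = 15  # items baseline (Flash-Lite) got right and baseline-alt got wrong

stat = (abs(b - c) - 1) ** 2 / (b + c)   # continuity-corrected: 24**2 / 55 ≈ 10.47
p_value = chi2.sf(stat, df=1)            # ≈ 0.0012
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
```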
One pattern the analysis surfaces is a ceiling effect: of the 50 scenarios, most pass at ≥ 80% on the cross-family integrity-reference run, and a smaller cluster fails at < 40%. Author difficulty and empirical difficulty agree on only 18 of 50 (36%). That motivates the 20-scenario adversarial pack and a 15-scenario harder-by-construction ceiling-test set (scenarios_v2_candidates.json) that fill cells the canonical bank under-covers. Re-running empirical difficulty across multiple cross-family candidates would tighten these labels; v1 reports the single available cross-family run.
What to keep in mind when reading the table
The API budget was exhausted across multiple providers mid-effort, leaving Gemini-direct as the only transport for the bulk of the canonical runs. Four of the five canonical runs ended up same-family (Gemini Flash-Lite judging either itself or Gemini Flash). Same-family judging admits self-preference bias, so those numbers may be inflated. baseline-qwen-cross-family is the cross-family integrity reference for the canonical bank: same scenarios, same scoring rules, but a non-Gemini candidate paired with the Gemini judge. The 6.4-point gap between same-family baseline (60.6%) and cross-family baseline-qwen-cross-family (54.2%) is the visible self-preference signal, though candidate quality differs between the two runs and explains some of the gap, and the gap sits below the ~13-point minimum detectable effect at this sample size.
What is out of scope
Inputs
- Real video frames. The camera input is text descriptions of the scene, not actual frames. A model that does well here might not do well on real video.
- Raw audio. Scoring is on text. The user's spoken turns are represented as text transcripts. Acoustic grounding, speaker attribution, and ambient audio cues are not exercised.
- Live audio and voice mode. A real wearable also needs to listen, talk back, and handle interruptions in real time. None of that is exercised here.
Answer characteristics
- Advice quality. The judge doesn't check whether the answer is correct, safe, or appropriate to the domain. A confidently wrong answer can pass.
- Domain depth. Whether the advice is expert-level for cooking, woodworking, or any other activity in the scenario.
- Proactive coaching. Whether the assistant volunteers help unprompted.
Conversation shape
- Multi-turn flow past Turn 2. Whether the conversation works naturally from Turn 3 onward.
- Long-horizon memory. Recall across days or weeks.
Engineering and statistics
- Latency and cost. Wall-clock response time and price per call. Not measured.
- Generalization beyond 5 trials per cell. v1 reports 95% CIs. Higher trial counts and seed sweeps are v2 work.
- Human inter-annotator agreement. v1 reports cross-LLM agreement only. A second human rater on a 25% sample is the highest-priority v2 follow-up.
Full discussion in benchmark_notes.md.