
Case Study 03

Snapshot: March 2026

Multimodal perception for systems that need real peripheral vision.

The stack turns voice and video into structured signals a runtime can actually use. Instead of feeding raw audio and video into a model, it produces typed state, applies temporal smoothing, and flags dissonance between modalities. Those outputs can land in Memory Bridge today and in a dedicated Perception MCP surface next.

  • voice + video: fused inputs
  • typed state: runtime-ready output
  • temporal: smoothing layer
  • dissonance: conflict detection

Figure: Abstract systems diagram for the perception stack

Core principle

Specialized perception, structured signals, adaptive reasoning.

The point is not a fancy classifier. The point is a stable signal layer that gives Claude and CC meaningful context about comprehension, fatigue, engagement, and modality disagreement without drowning the runtime in noise.

Voice stack
  • WhisperX for aligned transcription.
  • openSMILE for pitch, energy, and hesitation features.
  • pyannote and SpeechBrain for turn-taking and emotion signals.
Video stack
  • MediaPipe for face mesh, gaze, and head pose.
  • ONNX emotion inference and temporal smoothing.
  • Behavioral cues like blink rate, micro-nods, and drift.
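
To make the temporal smoothing step concrete, here is a minimal sketch using an exponential moving average over per-frame emotion probabilities. The page does not say which smoothing method the stack uses, so the EMA, the class name `EmaSmoother`, the `alpha` value, and the label names are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EmaSmoother:
    """Exponential moving average over per-frame class probabilities.

    Hypothetical helper: the real stack may smooth differently.
    """

    alpha: float = 0.2  # assumed smoothing factor; higher reacts faster
    state: dict[str, float] = field(default_factory=dict)

    def update(self, frame_probs: dict[str, float]) -> dict[str, float]:
        for label, p in frame_probs.items():
            # Seed each label with its first observation.
            prev = self.state.get(label, p)
            self.state[label] = (1 - self.alpha) * prev + self.alpha * p
        return dict(self.state)


smoother = EmaSmoother(alpha=0.5)
smoother.update({"neutral": 0.8, "confused": 0.2})
# A single noisy frame moves the estimate only halfway toward the new value.
smoothed = smoother.update({"neutral": 0.0, "confused": 1.0})
```

A lower `alpha` trades responsiveness for stability, which is the tuning decision any smoothing layer of this kind has to make.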
Fusion layer
  • Normalized state vector with confidence and signal age.
  • Modality dissonance detection as a first-class output.
  • Joe-specific calibration profile planned as phase-two infrastructure.
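
The fusion outputs above can be sketched as a small typed record plus a disagreement check. This is an illustration under stated assumptions, not the stack's actual schema: the `Signal` fields, the `dissonance` function, and every threshold are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One normalized reading from a single modality (hypothetical schema)."""

    name: str           # e.g. "comprehension"
    value: float        # normalized to [0, 1]
    confidence: float   # [0, 1]
    observed_at: float  # timestamp in seconds

    def age(self, now: float) -> float:
        return max(0.0, now - self.observed_at)


def dissonance(voice: Signal, video: Signal, now: float,
               threshold: float = 0.4, max_age_s: float = 5.0,
               min_confidence: float = 0.5) -> bool:
    """Flag when two fresh, confident modalities disagree (assumed rule)."""
    fresh = voice.age(now) <= max_age_s and video.age(now) <= max_age_s
    confident = min(voice.confidence, video.confidence) >= min_confidence
    return fresh and confident and abs(voice.value - video.value) >= threshold


now = 1000.0
voice = Signal("comprehension", value=0.9, confidence=0.8, observed_at=999.0)
video = Signal("comprehension", value=0.3, confidence=0.7, observed_at=998.5)
stale_video = Signal("comprehension", value=0.3, confidence=0.7, observed_at=900.0)

flag_fresh = dissonance(voice, video, now)        # modalities disagree
flag_stale = dissonance(voice, stale_video, now)  # too old to trust
```

Carrying signal age on every reading lets the check ignore stale data rather than report a disagreement that may no longer exist.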
Delivery path
  • Option 1: system prompt injection for lightweight state handoff.
  • Option 2: Memory Bridge integration for graph-native session memory.
  • Option 3: dedicated Perception MCP server with pull-on-demand tools.
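
Option 1 is the simplest path to picture. A rough sketch, assuming the state arrives as a plain dictionary: serialize the fresh signals into a compact tagged block and append it to the system prompt. The tag name `perception_state`, the function `render_state_block`, and the field names are hypothetical, not part of any documented interface.

```python
import json


def render_state_block(state: dict, max_age_s: float = 10.0) -> str:
    """Serialize fresh signals into a compact block for a system prompt."""
    # Drop readings older than the assumed freshness window.
    fresh = {k: v for k, v in state.items() if v["age_s"] <= max_age_s}
    payload = json.dumps(fresh, sort_keys=True, separators=(",", ":"))
    return f"<perception_state>{payload}</perception_state>"


state = {
    "engagement": {"value": 0.42, "confidence": 0.9, "age_s": 1.5},
    "fatigue": {"value": 0.7, "confidence": 0.6, "age_s": 30.0},
}
block = render_state_block(state)
system_prompt = "You are a tutoring assistant.\n" + block
```

Sorting keys and stripping whitespace keeps the block small and stable between turns, which matters when it is re-sent with every request.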
Why it matters

When a runtime can distinguish verbal assent from rising confusion or fatigue, it can adapt depth, pacing, or intervention logic earlier. Plain chat context carries only the words, so it cannot trigger those adjustments on its own.
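
As an illustration of that kind of adaptation, here is a minimal sketch of a policy that maps perceived state to an adjustment. The `Adjustment` values, the function `choose_adjustment`, and the thresholds are all hypothetical. The page describes the goal, not a specific policy.

```python
from enum import Enum


class Adjustment(Enum):
    NONE = "none"
    SLOW_DOWN = "slow_down"
    SIMPLIFY = "simplify"
    SUGGEST_BREAK = "suggest_break"


def choose_adjustment(verbal_assent: bool, confusion: float,
                      fatigue: float) -> Adjustment:
    """Pick one adjustment from normalized [0, 1] signals (assumed thresholds)."""
    if fatigue >= 0.8:
        return Adjustment.SUGGEST_BREAK
    if verbal_assent and confusion >= 0.6:
        # The user says "yes" while nonverbal signals suggest they are lost.
        return Adjustment.SIMPLIFY
    if confusion >= 0.6:
        return Adjustment.SLOW_DOWN
    return Adjustment.NONE


says_yes_but_lost = choose_adjustment(verbal_assent=True, confusion=0.7, fatigue=0.2)
tired = choose_adjustment(verbal_assent=True, confusion=0.1, fatigue=0.9)
fine = choose_adjustment(verbal_assent=True, confusion=0.1, fatigue=0.1)
```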

Signal inventory

Designed for typed outputs, not vibes.

  • WhisperX
  • openSMILE
  • pyannote.audio
  • SpeechBrain
  • MediaPipe
  • ONNX emotion
  • Temporal smoothing
  • Dissonance flags
  • Signal age
  • Calibration profile
  • Memory Bridge writer
  • Perception MCP

Need structured perception that can plug into a real runtime?

That is a systems and state-design problem before it is a model-selection problem.

Email Rarity Index