Case Study 03
Snapshot: March 2026
Multimodal perception for systems that need real peripheral vision.
The stack turns voice and video into structured signals a runtime can actually use. Instead of feeding raw chaos into a model, it produces typed state with temporal smoothing and dissonance detection, output that can land in Memory Bridge today and in a dedicated Perception MCP surface next.
Core principle
Specialized perception, structured signals, adaptive reasoning.
The point is not a fancy classifier. The point is a stable signal layer that gives Claude and CC meaningful context about comprehension, fatigue, engagement, and modality disagreement without drowning the runtime in noise.
Audio pipeline
- WhisperX for aligned transcription (sketched below).
- openSMILE for pitch, energy, and hesitation features.
- pyannote.audio and SpeechBrain for turn-taking and emotion signals.
Video pipeline
- MediaPipe for face mesh, gaze, and head pose.
- ONNX emotion inference and temporal smoothing.
- Behavioral cues like blink rate, micro-nods, and drift.
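For the audio side, a minimal sketch of the WhisperX step, assuming the current whisperx Python API; model size, device, and file name are placeholders:

```python
import whisperx

device = "cuda"  # or "cpu"

# 1. Transcribe with batched Whisper.
audio = whisperx.load_audio("session.wav")
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Forced alignment adds word-level timestamps, which is what lets
#    prosody and hesitation features line up with the transcript.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```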
State layer
- Normalized state vector with confidence and signal age.
- Modality dissonance detection as a first-class output.
- Joe-specific calibration profile planned as phase-two infrastructure.
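What "typed state" might look like in practice, a minimal sketch; the class and field names are hypothetical, not the project's actual schema:

```python
from dataclasses import dataclass, field
import time

@dataclass
class PerceptionState:
    """Illustrative typed state vector; names are hypothetical."""
    comprehension: float   # all scores normalized to [0, 1]
    fatigue: float
    engagement: float
    confidence: float      # how much the fused estimate can be trusted
    dissonance: bool       # modalities disagree, e.g. verbal assent + confused face
    captured_at: float = field(default_factory=time.time)

    @property
    def age_s(self) -> float:
        # Signal age lets consumers discount or discard stale state.
        return time.time() - self.captured_at
```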
Integration paths
- Option 1: system prompt injection for lightweight state handoff.
- Option 2: Memory Bridge integration for graph-native session memory.
- Option 3: dedicated Perception MCP server with pull-on-demand tools.
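For Option 3, a sketch of the pull-on-demand surface using FastMCP from the official MCP Python SDK; the server name, the get_state tool, and the state payload are hypothetical:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("perception")

# Latest smoothed state, written by the perception loop elsewhere.
latest_state: dict = {"comprehension": 0.8, "fatigue": 0.2, "confidence": 0.9, "age_s": 0.4}

@mcp.tool()
def get_state() -> dict:
    """Return the most recent smoothed perception state, with signal age
    so the caller can decide whether it is fresh enough to act on."""
    return latest_state

if __name__ == "__main__":
    mcp.run()
```

Pull-on-demand matters here: the runtime asks when it is about to decide something, instead of having every frame pushed into context.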
When a runtime can distinguish verbal assent from rising confusion or fatigue, it can adapt depth, pacing, or intervention logic earlier. That changes behavior in a way plain chat context never can.
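To make that concrete, a hypothetical policy hook; the thresholds and the PerceptionState type (sketched above) are illustrative:

```python
def adapt_style(style: dict, state: PerceptionState) -> dict:
    """Hypothetical runtime hook: nudge depth and pacing from perception state."""
    # Stale or low-confidence signals: change nothing rather than guess.
    if state.age_s > 5.0 or state.confidence < 0.5:
        return style
    # Verbal assent plus visual confusion, or low comprehension: slow down
    # and add detail before the user has to ask.
    if state.dissonance or state.comprehension < 0.4:
        style["depth"] = "step-by-step"
    if state.fatigue > 0.7:
        style["pacing"] = "short-turns"
    return style
```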
Signal inventory
Designed for typed outputs, not vibes.
- WhisperX
- openSMILE
- pyannote.audio
- SpeechBrain
- MediaPipe
- ONNX emotion
- Temporal smoothing
- Dissonance flags
- Signal age
- Calibration profile
- Memory Bridge writer
- Perception MCP
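Two of those entries, temporal smoothing and dissonance flags, reduce to very little code; a sketch with illustrative alpha and tolerance values:

```python
class Ema:
    """Exponential moving average: raw per-frame scores are jittery,
    so the runtime only ever sees the smoothed trend."""
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha
        self.value: float | None = None

    def update(self, x: float) -> float:
        self.value = x if self.value is None else self.alpha * x + (1 - self.alpha) * self.value
        return self.value

def dissonant(audio_valence: float, video_valence: float, tol: float = 0.4) -> bool:
    # Dissonance flag: smoothed modalities disagree by more than a tolerance.
    return abs(audio_valence - video_valence) > tol
```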
Need structured perception that can plug into a real runtime?
That is a systems and state-design problem before it is a model-selection problem.