
System Deep Dive

Snapshot: March 2026

Astra is a cognitive AI platform deployed across a four-machine bare-metal fleet.

A production-grade system running on K3s across four machines with 84 CPU cores, 845GiB of RAM, 120GB of NVIDIA VRAM, a six-layer memory architecture, real-time audio and video perception, dream-cycle consolidation, LangGraph agent orchestration, and a full Prometheus and Grafana observability surface. Not a wrapper. Not a demo. A system that runs.

  • 4 machines
  • 84 CPU cores
  • 120GB NVIDIA VRAM
  • 845GiB system RAM
[Abstract systems diagram for Astra]

Hardware

Four machines, five GPUs, and a lot more than one box.

The current Astra hardware footprint spans node01, node02, node03, and node05. Together they provide 84 CPU cores, 845GiB of RAM, and 120GB of NVIDIA VRAM for orchestration, retrieval, inference, and support workloads.

node01
  • AMD Threadripper PRO 7975WX (32 cores / 64 threads)
  • 503GiB RAM
  • Dual NVIDIA GeForce RTX 5090 (32GB each)
  • 3x Samsung 990 PRO 2TB NVMe
node02
  • AMD Ryzen 9 7950X3D (16 cores / 32 threads)
  • 187GiB RAM
  • NVIDIA GeForce RTX 3090 (24GB)
  • Samsung 990 PRO with Heatsink 2TB NVMe
node03
  • AMD Ryzen 9 5900X (12 cores / 24 threads)
  • 125GiB RAM
  • NVIDIA GeForce RTX 5060 Ti (16GB)
  • Samsung 990 PRO 2TB NVMe
node05
  • Intel Core Ultra 9 275HX (24 cores / 24 threads)
  • 30GiB RAM
  • NVIDIA GeForce RTX 5080 Laptop GPU (16GB)
  • Samsung 990 PRO 2TB plus Samsung 1TB NVMe

Memory Architecture

Six layers from sub-millisecond to permanent.

Memory is not a feature bolted on. It is the infrastructure. Each layer has its own latency budget, retention policy, and query interface.

Layer 1 — Working Memory
  • Redis Sentinel HA (master + 2 replicas + 3 sentinels)
  • Sub-millisecond latency for session state and pub/sub
  • allkeys-lru eviction with AOF + RDB persistence
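Redis's allkeys-lru policy evicts the least-recently-used key when the store hits its memory limit, so hot session state survives while cold keys age out. A minimal sketch of that eviction behavior, with a toy in-process store standing in for the real Redis deployment (capacity and key names here are illustrative):

```python
from collections import OrderedDict

class WorkingMemory:
    """Toy LRU store mimicking Redis allkeys-lru eviction semantics."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store: "OrderedDict[str, str]" = OrderedDict()

    def set(self, key: str, value: str) -> None:
        if key in self._store:
            self._store.move_to_end(key)        # writes refresh recency
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)     # evict least-recently-used

    def get(self, key: str):
        if key not in self._store:
            return None
        self._store.move_to_end(key)            # reads also refresh recency
        return self._store[key]

wm = WorkingMemory(capacity=2)
wm.set("session:1", "a")
wm.set("session:2", "b")
wm.get("session:1")        # touch session:1 so it survives
wm.set("session:3", "c")   # evicts session:2, the coldest key
```

The real layer adds HA failover (Sentinel) and durability (AOF + RDB) on top of exactly this recency discipline.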
Layers 2–3 — Episodic and Semantic
  • Qdrant vectors with HNSW index across 100k+ memory capsules
  • Neo4j knowledge graph for long-term facts and Cypher queries
  • BGE embeddings and reranking with CUDA acceleration
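Episodic recall over the Qdrant layer is, at its core, nearest-neighbor search on embedding vectors; HNSW just makes it approximate and fast at 100k+ capsules. A brute-force cosine-similarity sketch with toy 3-dimensional vectors standing in for real BGE embeddings (capsule IDs and vectors are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "memory capsules": id -> embedding (real system: BGE vectors in Qdrant)
capsules = {
    "cap-1": [1.0, 0.0, 0.0],
    "cap-2": [0.9, 0.1, 0.0],
    "cap-3": [0.0, 1.0, 0.0],
}

def recall(query, k=2):
    """Return the k capsule ids most similar to the query embedding."""
    ranked = sorted(capsules, key=lambda cid: cosine(query, capsules[cid]),
                    reverse=True)
    return ranked[:k]

top = recall([1.0, 0.05, 0.0])  # -> ["cap-1", "cap-2"]
```

In production the reranker (BGE) re-scores this shortlist with a heavier cross-encoder pass before results reach the reasoning model.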
Layers 4–6 — Procedural, Graph, Archive
  • LangGraph checkpoints for learned skills and habits
  • Neo4j relational reasoning and knowledge graph traversal
  • Filesystem archive with daily backups and 7-day retention
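The archive layer's retention policy (daily backups, 7-day retention) amounts to deleting snapshot files that fall outside the window. A hedged sketch using a temporary directory and faked timestamps — the paths, naming, and `.bak` suffix are illustrative, not Astra's actual layout:

```python
import os
import tempfile
import time
from pathlib import Path

RETENTION_SECONDS = 7 * 24 * 3600  # 7-day retention window

def prune_archive(archive_dir: Path, now: float) -> list:
    """Remove backups whose mtime falls outside the retention window."""
    removed = []
    for snapshot in archive_dir.glob("*.bak"):
        if now - snapshot.stat().st_mtime > RETENTION_SECONDS:
            snapshot.unlink()
            removed.append(snapshot.name)
    return sorted(removed)

# Simulate one fresh and one stale daily backup
archive = Path(tempfile.mkdtemp())
now = time.time()
for name, age_days in [("fresh.bak", 1), ("stale.bak", 10)]:
    p = archive / name
    p.write_text("snapshot")
    mtime = now - age_days * 24 * 3600
    os.utime(p, (mtime, mtime))  # backdate the file's mtime

removed = prune_archive(archive, now)  # -> ["stale.bak"]
```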

Cognitive Engine

Dream cycles, curiosity, coherence, and synaptic pruning.

signal_core
  • Core logic with v2.6 ModelRegistry (lazy-load singleton, 145–415ms responses)
  • Dream processor for background memory consolidation cycles
  • Emotional analyzer (RoBERTa sentiment and emotion)
  • Fact extractor (GLiNER NER + Phi-3 graph extraction)
  • Synaptic pruning for memory decay and archival
  • Curiosity engine for gap detection and question generation
  • Coherence checker for contradiction detection and resolution
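The v2.6 ModelRegistry is described as a lazy-load singleton: models are initialized once, on first request, and every daemon shares the same instance instead of paying load cost per call. A minimal sketch of that pattern — the class shape and loader are hypothetical stand-ins, not signal_core's real API:

```python
import threading

class ModelRegistry:
    """Lazy-loading singleton: one shared registry, models loaded on first use."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:      # double-checked locking
                    inst = super().__new__(cls)
                    inst._models = {}
                    cls._instance = inst
        return cls._instance

    def get(self, name: str):
        if name not in self._models:           # load lazily, exactly once
            # Stand-in for real model initialization (weights, CUDA, etc.)
            self._models[name] = f"<loaded:{name}>"
        return self._models[name]

a = ModelRegistry()
b = ModelRegistry()   # same object as a
```

Keeping weights resident behind a singleton is what makes sub-second (145–415ms) response times plausible; reloading per request would dominate latency.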
Running services
  • Cognitive orchestrator coordinating all reasoning daemons
  • Consolidation service merging short-term into long-term memory
  • Ingestion daemon for continuous memory pipeline
  • Agent orchestrator with LangGraph-based agentic workflows
  • Document processor for PDF and structured ingestion
  • Health monitor with circuit breakers and diagnostics
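The health monitor's circuit breakers follow the standard pattern: after N consecutive failures the breaker opens and calls fail fast, sparing a struggling backend until a cool-down elapses. A compact sketch (threshold and timing values are illustrative, not Astra's configuration):

```python
import time

class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold   # consecutive failures before opening
        self.cooldown = cooldown     # seconds to stay open
        self.failures = 0
        self.opened_at = None

    def call(self, fn, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open")  # fail fast, skip backend
            self.opened_at = None                   # half-open: allow a probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now                # trip the breaker
            raise
        self.failures = 0                           # success resets the count
        return result

breaker = CircuitBreaker(threshold=2, cooldown=30.0)

def flaky():
    raise ValueError("backend down")
```

Two failures trip this breaker; the third call is rejected with "circuit open" without ever touching the backend.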

Perception and Media

Real-time audio, video, TTS, and vision captioning.

Audio and voice
  • Capture service with Silero VAD and Whisper STT
  • Kokoro-82M TTS with GPU-accelerated synthesis
  • Audio processor for transcription and feature extraction
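Silero VAD is a neural model, but the contract it fulfills is simple: classify each short audio frame as speech or silence so Whisper only transcribes the former. A toy RMS-energy gate standing in for the real detector (the production path uses Silero, not an energy threshold):

```python
import math

def rms(frame):
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def gate_frames(frames, threshold=0.1):
    """Return indices of frames whose energy crosses the speech threshold."""
    return [i for i, f in enumerate(frames) if rms(f) >= threshold]

frames = [
    [0.01, -0.02, 0.01],   # near-silence
    [0.5, -0.4, 0.3],      # speech-like energy
    [0.0, 0.0, 0.0],       # silence
]
speech_idx = gate_frames(frames)  # -> [1]
```

Gating before STT is what keeps a continuously-listening capture service from burning GPU time transcribing room tone.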
Video and vision
  • FFmpeg video processor (webcam at 1280x720 @ 30fps MJPEG)
  • Vision caption service (LLaVA INT8 image captioning)
  • Janus WebRTC gateway with monitoring daemon
  • CLIP embeddings for video frame analysis
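The video processor's capture settings (1280x720 @ 30fps, MJPEG) map directly onto an FFmpeg invocation. A sketch that only assembles the argument list — the device path and output target are hypothetical, and the real service's flags may differ:

```python
def build_capture_cmd(device="/dev/video0", out="pipe:1"):
    """Assemble (but do not run) an FFmpeg webcam-capture command line."""
    return [
        "ffmpeg",
        "-f", "v4l2",                # Linux webcam input device
        "-input_format", "mjpeg",    # ask the camera for MJPEG frames
        "-video_size", "1280x720",
        "-framerate", "30",
        "-i", device,
        "-c:v", "copy",              # pass MJPEG through without re-encoding
        "-f", "mjpeg", out,
    ]

cmd = build_capture_cmd()
```

Copying the camera's native MJPEG stream instead of re-encoding keeps CPU cost near zero on the capture node; decoding happens downstream where the frames are actually consumed.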

Inference

Ollama primary, vLLM secondary, GPU time-sliced.

GPU allocation
  • GPU 0: Ollama models (astra-qat 18.1GB, mem-agent, qwen2.5:3b)
  • GPU 1: HuggingFace (BGE embed, BGE reranker, RoBERTa, GLiNER)
  • Time-slicing: 4 replicas per GPU = 8 virtual GPU slices
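Time-slicing 4 replicas per GPU is typically configured through the NVIDIA k8s-device-plugin's sharing config, which advertises each physical GPU as multiple schedulable `nvidia.com/gpu` resources. A sketch of that ConfigMap payload, assuming the standard device-plugin mechanism (Astra's actual config may differ):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # each physical GPU appears as 4 slices; 2 GPUs -> 8 slices
```

Time-slicing shares compute without partitioning VRAM, so co-scheduled pods must still fit their models in the card's memory together.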
Model strategy
  • Dual sidecar architecture (qwen2.5:3b + mem-agent-turbo)
  • QAT Gemma3 27B as primary reasoning model
  • vLLM available with speculative decoding (disabled by default)
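A dual-sidecar split implies a routing rule: requests go to a model in priority order, and failures fall through to the next backend in the chain. A hedged sketch of such a fallback chain — the model names come from the stack above, but the routing logic itself is illustrative, not Astra's actual orchestration code:

```python
def generate_with_fallback(prompt, backends):
    """Try each backend in order; return the first successful completion."""
    errors = []
    for name, fn in backends:
        try:
            return name, fn(prompt)
        except Exception as exc:
            errors.append((name, exc))   # record and fall through
    raise RuntimeError(f"all backends failed: {errors}")

def primary(prompt):
    # Stand-in for QAT Gemma3 27B via Ollama; simulate an overloaded GPU
    raise TimeoutError("gpu busy")

def sidecar(prompt):
    # Stand-in for the qwen2.5:3b sidecar
    return f"echo: {prompt}"

used, reply = generate_with_fallback(
    "hello",
    [("gemma3-qat", primary), ("qwen2.5:3b", sidecar)],
)  # falls through to the sidecar
```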
Observability
  • Prometheus metrics with DCGM GPU exporter
  • Grafana dashboards (LLM performance, fact extraction, TTS, pruning, metacognition)
  • Pulse metrics API and health monitor with circuit breakers

Stack

Everything runs. Everything is deployed.

  • K3s
  • Longhorn
  • Ollama
  • vLLM
  • Neo4j
  • Qdrant
  • Redis Sentinel
  • Prometheus
  • Grafana
  • LangGraph
  • WhisperX
  • Kokoro TTS
  • LLaVA
  • CLIP
  • BGE
  • GLiNER
  • RoBERTa
  • Janus WebRTC

Need a cognitive system that actually runs on your own hardware?

That takes infrastructure design, not just a model API key.

Email Rarity Index