System Deep Dive
Snapshot: March 2026
Astra is a cognitive AI platform deployed across a four-machine bare-metal fleet.
A production-grade system running on K3s across four machines with 84 CPU cores, 845GiB of RAM, 120GB of NVIDIA VRAM, a 6-layer memory architecture, real-time audio and video perception, dream-cycle consolidation, LangGraph agent orchestration, and a full Prometheus and Grafana observability surface. Not a wrapper. Not a demo. A system that runs.
Hardware
Four machines, five GPUs, and a lot more than one box.
The current Astra hardware footprint spans node01, node02, node03, and node05. Together they provide 84 CPU cores, 845GiB of RAM, and 120GB of NVIDIA VRAM for orchestration, retrieval, inference, and support workloads.
- AMD Threadripper PRO 7975WX (32 cores / 64 threads)
- 503GiB RAM
- Dual NVIDIA GeForce RTX 5090 (32GB each)
- 3x Samsung 990 PRO 2TB NVMe

- AMD Ryzen 9 7950X3D (16 cores / 32 threads)
- 187GiB RAM
- NVIDIA GeForce RTX 3090 (24GB)
- Samsung 990 PRO with Heatsink 2TB NVMe

- AMD Ryzen 9 5900X (12 cores / 24 threads)
- 125GiB RAM
- NVIDIA GeForce RTX 5060 Ti (16GB)
- Samsung 990 PRO 2TB NVMe

- Intel Core Ultra 9 275HX (24 cores / 24 threads)
- 30GiB RAM
- NVIDIA GeForce RTX 5080 Laptop GPU (16GB)
- Samsung 990 PRO 2TB plus Samsung 1TB NVMe
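The fleet totals quoted above can be sanity-checked by summing the per-node specs. A minimal sketch (the tuple ordering below just follows the spec lists on this page; it makes no claim about which hostname maps to which box):

```python
# Sanity-check the fleet totals from the per-node specs listed above.
# Each entry is (cpu_cores, ram_gib, vram_gb); ordering is illustrative only.
FLEET = [
    (32, 503, 64),   # Threadripper PRO 7975WX, dual RTX 5090 (2 x 32GB)
    (16, 187, 24),   # Ryzen 9 7950X3D, RTX 3090
    (12, 125, 16),   # Ryzen 9 5900X, RTX 5060 Ti
    (24, 30, 16),    # Core Ultra 9 275HX, RTX 5080 Laptop GPU
]

cores = sum(c for c, _, _ in FLEET)
ram = sum(r for _, r, _ in FLEET)
vram = sum(v for _, _, v in FLEET)
print(cores, ram, vram)  # 84 845 120
```

The numbers line up exactly with the 84-core / 845GiB / 120GB figures claimed in the summary.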
Memory Architecture
Six layers from sub-millisecond to permanent.
Memory is not a feature bolted on. It is the infrastructure. Each layer has its own latency budget, retention policy, and query interface.
- Redis Sentinel HA (master + 2 replicas + 3 sentinels)
- Sub-millisecond latency for session state and pub/sub
- Allkeys-LRU eviction with AOF + RDB persistence
- Qdrant vectors with HNSW index across 100k+ memory capsules
- Neo4j knowledge graph for long-term facts and Cypher queries
- BGE embeddings and reranking with CUDA acceleration
- LangGraph checkpoints for learned skills and habits
- Neo4j relational reasoning and knowledge graph traversal
- Filesystem archive with daily backups and 7-day retention
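One way to read the six layers above is as a routing decision: every memory write lands in the store whose latency budget and retention policy match its tier. A minimal sketch, assuming a tier taxonomy and interface of my own invention (the backing-store names come from the list above; the tier names and this function are not Astra's actual API):

```python
# Illustrative router mapping a memory write to one of the six layers by
# retention tier. Store names come from the architecture list above; the
# tier names, mapping, and interface are assumptions for illustration.
LAYERS = {
    "session":    "redis",       # sub-millisecond session state and pub/sub
    "episodic":   "qdrant",      # vector capsules, HNSW-indexed
    "semantic":   "neo4j",       # long-term facts, Cypher-queryable
    "procedural": "langgraph",   # checkpointed skills and habits
    "relational": "neo4j",       # graph traversal and reasoning
    "archive":    "filesystem",  # daily backups, 7-day retention
}

def route_write(tier: str) -> str:
    """Pick the backing store for a memory capsule by retention tier."""
    try:
        return LAYERS[tier]
    except KeyError:
        raise ValueError(f"unknown memory tier: {tier!r}") from None

print(route_write("episodic"))  # qdrant
print(route_write("archive"))   # filesystem
```

The point of the sketch: each layer is addressed by policy, not by ad-hoc calls, which is what makes per-layer latency budgets enforceable.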
Cognitive Engine
Dream cycles, curiosity, coherence, and synaptic pruning.
- Core logic with v2.6 ModelRegistry (lazy-load singleton, 145–415ms responses)
- Dream processor for background memory consolidation cycles
- Emotional analyzer (RoBERTa sentiment and emotion)
- Fact extractor (GLiNER NER + Phi-3 graph extraction)
- Synaptic pruning for memory decay and archival
- Curiosity engine for gap detection and question generation
- Coherence checker for contradiction detection and resolution
- Cognitive orchestrator coordinating all reasoning daemons
- Consolidation service merging short-term into long-term memory
- Ingestion daemon for continuous memory pipeline
- Agent orchestrator with LangGraph-based agentic workflows
- Document processor for PDF and structured ingestion
- Health monitor with circuit breakers and diagnostics
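Synaptic pruning, as described above, decays memories toward archival over time. A minimal sketch of what such a decay rule could look like, assuming an exponential half-life boosted by access count; the formula, half-life, and threshold here are illustrative assumptions, not Astra's actual parameters:

```python
import math

# Sketch of a pruning score: memories decay exponentially with age,
# accesses nudge them back toward "keep", and anything below a threshold
# is archived. Half-life and threshold are made-up illustrative values.
HALF_LIFE_DAYS = 30.0
ARCHIVE_BELOW = 0.1

def retention_score(age_days: float, access_count: int) -> float:
    decay = math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)
    # Diminishing returns: each access helps, but logarithmically.
    boost = 1.0 + math.log1p(access_count)
    return min(1.0, decay * boost)

def should_archive(age_days: float, access_count: int) -> bool:
    return retention_score(age_days, access_count) < ARCHIVE_BELOW

print(should_archive(5, 3))    # False: fresh and frequently touched
print(should_archive(180, 0))  # True: six months stale, never accessed
```

The same score can drive both decay (lowering retrieval priority) and archival (moving the capsule to the filesystem layer).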
Perception and Media
Real-time audio, video, TTS, and vision captioning.
- Capture service with Silero VAD and Whisper STT
- Kokoro-82M TTS with GPU-accelerated synthesis
- Audio processor for transcription and feature extraction
- FFmpeg video processor (webcam at 1280x720 @ 30fps MJPEG)
- Vision caption service (LLaVA INT8 image captioning)
- Janus WebRTC gateway with monitoring daemon
- CLIP embeddings for video frame analysis
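The capture service's VAD stage makes a per-frame speech/silence decision that gates what gets forwarded to Whisper STT. Silero VAD is a neural model; the toy RMS-energy gate below only illustrates the shape of that decision, with a threshold that is entirely made up:

```python
import math

# Toy energy-based voice-activity gate. The real pipeline uses Silero VAD
# (a neural model); this just shows the per-frame speech/silence decision
# that gates what reaches Whisper STT. The threshold is illustrative.
FRAME_MS = 30
SAMPLE_RATE = 16_000
RMS_THRESHOLD = 0.05  # real VADs emit learned speech probabilities instead

def is_speech(frame: list[float]) -> bool:
    """Return True if the frame's RMS energy clears the threshold."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > RMS_THRESHOLD

n = SAMPLE_RATE * FRAME_MS // 1000  # 480 samples per 30ms frame at 16kHz
silence = [0.001] * n
tone = [0.2 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(n)]
print(is_speech(silence), is_speech(tone))  # False True
```

Only frames that pass the gate need to touch the GPU, which is what keeps always-on perception cheap.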
Inference
Ollama primary, vLLM secondary, GPU time-sliced.
- GPU 0: Ollama models (astra-qat 18.1GB, mem-agent, qwen2.5:3b)
- GPU 1: HuggingFace (BGE embed, BGE reranker, RoBERTa, GLiNER)
- Time-slicing: 4 replicas per GPU across both GPUs = 8 virtual GPU slices
- Dual sidecar architecture (qwen2.5:3b + mem-agent-turbo)
- QAT Gemma3 27B as primary reasoning model
- vLLM available with speculative decoding (disabled by default)
- Prometheus metrics with DCGM GPU exporter
- Grafana dashboards (LLM performance, fact extraction, TTS, pruning, metacognition)
- Pulse metrics API and health monitor with circuit breakers
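The "4 replicas per GPU" bullet corresponds to the time-slicing feature of the NVIDIA Kubernetes device plugin. A sketch of what that ConfigMap payload typically looks like, hedged: only the resource name and replica count are taken from this page, and your cluster's plugin version may expect a different config key layout:

```yaml
# NVIDIA k8s-device-plugin time-slicing config (sketch). With 4 replicas
# per physical GPU and two GPUs on the node, the K3s scheduler sees 8
# allocatable nvidia.com/gpu slices instead of 2.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Time-slicing gives no memory isolation between slices, so it suits this workload mix (many small models sharing a card) rather than hard multi-tenancy.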
Stack
Everything runs. Everything is deployed.
- K3s
- Longhorn
- Ollama
- vLLM
- Neo4j
- Qdrant
- Redis Sentinel
- Prometheus
- Grafana
- LangGraph
- WhisperX
- Kokoro TTS
- LLaVA
- CLIP
- BGE
- GLiNER
- RoBERTa
- Janus WebRTC
Need a cognitive system that actually runs on your own hardware?
That takes infrastructure design, not just a model API key.