Building a Predictive Maintenance System for Industrial Equipment: Our Reference Architecture
The Challenge
Industrial operations teams sit on two things at once: equipment that's expensive to let fail, and technical documentation that's hard to search when something goes wrong. A maintenance engineer facing a fault code at 2am doesn't have time to read 400 pages of manuals. The symptoms they're seeing need to match against specifications, fault-code tables, and historical maintenance logs quickly, and with reasoning they can verify. Most predictive maintenance tools either oversimplify ("this pump will fail in 14 days") or dump raw retrieval results on the user. Neither matches how engineers actually work. They need a system that retrieves the right context, reasons through it like a senior technician would, and flags severity clearly enough that a work order gets created automatically when it matters.
Our Solution
We built a reference implementation to prove out the architecture end-to-end before adapting it for a specific client environment. The system diagnoses equipment faults for 50 simulated industrial units: turbofans running on NASA CMAPSS degradation data, plus pumps and compressors with deterministic simulation.

The retrieval layer uses Hybrid RAG: vector search through ChromaDB for semantic matches, BM25 keyword search for exact fault codes and part numbers, fused via Reciprocal Rank Fusion. Fault-code queries get a priority boost because engineers searching for "F-237" need that exact match first, not a semantic approximation of it. A cross-encoder reranker trims the noise before context reaches the LLM.

The reasoning layer is a LangGraph agent: health check → retrieve context → diagnose → (if severity is HIGH or CRITICAL) create work order → respond. The diagnostic node expects strict JSON from the LLM and falls back gracefully when parsing fails, because LLM output in production is never as clean as it is in demos.

The system runs on Ollama locally by default, with graceful fallback to Anthropic Claude or OpenAI if local inference isn't available. This matters for clients with data-residency requirements who can't send equipment data to third-party APIs.
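The fusion step can be sketched in a few lines. This is an illustrative simplification, not the production code: the fault-code regex, the boost factor, and the document IDs are assumptions made for the example.

```python
import re

# Matches codes like "F-237" (illustrative pattern, not the real fault-code grammar).
FAULT_CODE = re.compile(r"\b[A-Z]-\d{3}\b")

def rrf_fuse(vector_hits, bm25_hits, query, k=60, fault_boost=2.0):
    """Fuse two ranked lists of doc IDs via Reciprocal Rank Fusion:
    each document scores sum(1 / (k + rank)) across the lists it appears in."""
    scores = {}
    for hits in (vector_hits, bm25_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # If the query contains an exact fault code, boost the keyword channel's
    # top hits so the exact match outranks semantic approximations.
    if FAULT_CODE.search(query):
        for doc_id in bm25_hits[:3]:
            scores[doc_id] *= fault_boost
    return sorted(scores, key=scores.get, reverse=True)
```

With a fault code in the query, a document that only the BM25 channel surfaced can still win the fused ranking, which is exactly the behavior an engineer searching "F-237" expects.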
What This Proves
- Hybrid retrieval outperforms vector-only for industrial queries. Engineers search with mixed intent — part numbers, fault codes, symptom descriptions — and each needs a different retrieval strategy. RRF fusion handles all three without manual query routing.
- Agentic workflows make AI decisions auditable. Every diagnostic session saves its agent steps to the database, so an engineer or auditor can see exactly which sources were retrieved, what the model concluded, and why a work order was or wasn't created.
- RAGAS evaluation runs locally. Faithfulness, Answer Relevancy, Context Precision, and Context Recall — all measurable without sending evaluation data to external APIs. Critical for regulated industries.
- The architecture holds up under real degradation data. NASA CMAPSS turbofan run-to-failure data tests the system on patterns that look like real equipment aging, not synthetic examples.
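The agent loop and its audit trail can be sketched as a plain state machine. The real build uses LangGraph; the node names, the severity default on parse failure, and the in-memory audit list here are illustrative assumptions.

```python
import json

SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

def diagnose(llm_output):
    """Parse the model's strict-JSON diagnosis, falling back gracefully."""
    try:
        result = json.loads(llm_output)
        severity = str(result.get("severity", "LOW")).upper()
        if severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {severity}")
        result["severity"] = severity
        return result
    except (json.JSONDecodeError, ValueError):
        # Production LLM output is never as clean as in demos: fall back
        # to a safe default and flag the session for human review.
        return {"severity": "MEDIUM", "summary": llm_output.strip(),
                "needs_review": True}

def run_session(unit_id, llm_output, audit_log):
    """health check -> retrieve -> diagnose -> maybe work order -> respond."""
    steps = [("health_check", unit_id),
             ("retrieve_context", f"docs for {unit_id}")]  # retrieval stubbed out
    diagnosis = diagnose(llm_output)
    steps.append(("diagnose", diagnosis["severity"]))
    if SEVERITY_ORDER.index(diagnosis["severity"]) >= SEVERITY_ORDER.index("HIGH"):
        steps.append(("create_work_order", unit_id))
    steps.append(("respond", diagnosis))
    audit_log.extend(steps)  # every step persisted so the session is auditable
    return diagnosis
```

Because every step lands in the audit log, "why was no work order created?" is answerable by reading the saved steps rather than re-running the model.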
What a Real Engagement Looks Like
For a production deployment, we adapt this reference build to the client's environment over 6–8 weeks:
Week 1–2: Ingest the client's equipment manuals, fault-code references, and historical maintenance logs into the RAG index. Tune chunk sizes and retrieval weights for their documentation style.
Week 3–4: Integrate with the client's sensor data pipeline. Calibrate the health score and RUL estimation for their specific equipment classes.
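As a minimal sketch of what "calibrate" means here: map a degradation sensor onto a 0–100 health score and extrapolate remaining useful life (RUL) from its recent trend. The linear model, function names, and thresholds below are illustrative assumptions; production deployments calibrate per equipment class.

```python
def health_score(sensor_value, healthy_baseline, failure_threshold):
    """Map a degradation reading onto a 0-100 health score (100 = healthy)."""
    frac = (sensor_value - healthy_baseline) / (failure_threshold - healthy_baseline)
    return round(100 * (1 - min(max(frac, 0.0), 1.0)), 1)

def rul_cycles(history, failure_threshold):
    """Naive RUL estimate: fit a straight line to recent readings and
    extrapolate to the failure threshold. Returns None if not degrading."""
    n = len(history)
    mean_x, mean_y = (n - 1) / 2, sum(history) / n
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in enumerate(history)) / sum((x - mean_x) ** 2
                                                      for x in range(n))
    if slope <= 0:
        return None  # flat or improving trend on this window
    return (failure_threshold - history[-1]) / slope
```

On CMAPSS-style run-to-failure data, even this naive baseline gives the agent a severity signal to reason against; the calibration work in weeks 3–4 is about replacing it with something that matches the client's sensors.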
Week 5–6: Configure work order routing to integrate with the client's CMMS (SAP PM, Maximo, or similar). Set up role-based access matching their ops structure.
Week 7–8: Evaluation round. RAGAS metrics plus engineer feedback. Tune, re-evaluate, ship.
The reference build exists so we're not prototyping architecture during a paid engagement — we're adapting a proven pattern.
Outcomes
- 5 nodes: LangGraph agentic workflow
- 2 retrieval modes: Hybrid RAG with RRF fusion
- 4 metrics: RAGAS evaluation, fully local
- 50 units: tested on NASA CMAPSS degradation data
Ready to achieve similar results?
Start with a focused PoC and see the value in your own operations.