A production-grade technical reference for Hierarchical Multi-Agent Orchestration — the architectural pattern Atsky uses to deploy AI-driven network log analytics and AIOps at enterprise scale. Covers the LangGraph supervisor, every agent component, the RAG/skills layer, state management, and security controls.
A full elaboration of how, when, and where the LangGraph Orchestrator, Atsky LLM Gateway, Specialist Agents, RAG Skills, and supporting infrastructure interact — and who drives each interaction — across UC1 (Anomaly Enrichment) and UC2 (Intelligent Log Analyzer), from event ingestion to final output surfacing in BHOM and the NOK Agent Interface.
The Enterprise Operator engagement operates across two distinct use case families — UC1 Anomaly Enrichment (reactive, driven by BHOM situations) and UC2 Intelligent Log Analyzer (proactive, driven by structured log ingestion from CMG-C/U, Kalix KPI reports, BHOM counters, and Cflowd). Both require deterministic, auditable, compliance-grade behaviour from day one — the core requirements the HMA pattern is designed to satisfy.
The Orchestrator does not execute domain logic. It is a pure routing and state machine — it decomposes incoming events into typed tasks, assigns them to agents with typed inputs, tracks their outputs in a shared graph state, and decides what to do next based on those outputs. Domain intelligence lives exclusively inside specialist agents and their attached skills. This separation is what makes the system testable, replaceable, and auditable.
The HMA architecture is composed of eight distinct layers. Each has a single responsibility, a defined interface to its neighbours, and a clear operational owner.
The central nervous system. A LangGraph state machine that owns the full execution graph. It receives all incoming triggers (BHOM anomaly push, log scheduler, user prompt), decomposes them into typed task nodes, dispatches agents in parallel or sequential order depending on dependency, tracks all intermediate state in a shared checkpoint, and decides termination or re-routing. It is the only component that calls the Policy Gate — no agent has direct access to the auth layer.
The single egress point for all LLM inference. Hosted as part of the ENTERPRISE AI framework and reached by all agents via a single URL: https://llmgateway.enterprise-ai.internal/v1 (OpenAI-compatible API). Every agent sends a structured HTTP POST with its system prompt, user context, and tool schema. The gateway applies token budgets, rate limits, model routing (GPT-4.1 → LLaMA3-70B → Mistral-7B fallback), prompt caching, and cost tagging per agent type before forwarding to the actual model host.
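A minimal sketch of the structured POST an agent sends to the gateway. The endpoint URL and the X-Agent-Type header are taken from this document; the Authorization placeholder and exact payload field names are illustrative (any real call would carry a Vault-injected key and the tool schema):

```python
import json
import urllib.request

GATEWAY_URL = "https://llmgateway.enterprise-ai.internal/v1/chat/completions"

def build_gateway_request(agent_type: str, system_prompt: str,
                          user_context: str, model: str) -> urllib.request.Request:
    """Build the structured HTTP POST every agent sends to the LLM Gateway.

    The gateway itself applies token budgets, rate limits, model routing,
    and cost tagging; the agent only declares who it is and what it wants.
    """
    payload = {
        "model": model,  # a routing hint -- the gateway may override per its policy
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_context},
        ],
    }
    headers = {
        "Content-Type": "application/json",
        "X-Agent-Type": agent_type,  # used by the gateway for routing and budgets
        "Authorization": "Bearer <vault-injected-key>",  # never hardcoded in real pods
    }
    return urllib.request.Request(GATEWAY_URL, data=json.dumps(payload).encode(),
                                  headers=headers, method="POST")

req = build_gateway_request("anomaly-enricher",
                            "You are a telecom anomaly analyst.",
                            "Situation 4711: N3 interface packet drops.",
                            "gpt-4o")
```

Because the interface is OpenAI-compatible, any OpenAI-style client pointed at the gateway URL would work equally well; the raw request is shown here only to make the headers explicit.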
Agent A — Anomaly Enricher: Receives BHOM annotated situations. Queries Qdrant (Network Infrastructure Platform KB + telecom standards). Constructs enriched prompt, calls LLM Gateway. Returns structured JSON: severity, context, probable root cause hints.
Agent B — RCA Reasoner: Takes enriched context from Agent A, applies chain-of-thought reasoning via LLM Gateway, produces ranked root-cause hypotheses with confidence scores. Result feeds directly into Helix GPT for Next Best Actions.
Agent C — Log Analyzer: Ingests CMG-C/U config, Kalix KPI reports, BHOM counters, Cflowd, cmd printouts. Applies log-pattern RAG skill (Network Infrastructure Platform log library vector store). Calls LLM Gateway to summarize, identify severity anomalies, flag capacity breach signals. Output: structured log analysis JSON.
Agent D — Capacity Planner: Triggered only if Agent C flags breach. Uses trend data + Network Infrastructure Platform planning knowledge base to call LLM Gateway for capacity enhancement proposals. Produces human-readable report for NOK Agent UI and periodic digest.
The domain knowledge backbone. Two namespaced vector stores in Qdrant: nok-kb-uc1 (Network Infrastructure Platform technical docs, telecom standards, anomaly resolution playbooks) and nok-kb-uc2 (Network Infrastructure Platform log pattern library, CMG-C/U documentation, capacity planning guides, Kalix metric definitions). Embeddings generated at ingestion time using a licensed embedding model. Retrieved via cosine similarity search with a cross-encoder reranker before being injected into the agent's LLM Gateway prompt as context.
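The retrieve-then-rerank flow can be sketched with toy vectors. Both scoring functions here are trivial stand-ins — in the real pipeline the embeddings come from the licensed embedding model and the second stage is a cross-encoder model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, corpus, top_k=5):
    """First stage: cosine similarity search (what Qdrant does server-side)."""
    scored = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return scored[:top_k]

def rerank(query_text, candidates, rerank_fn):
    """Second stage: a cross-encoder scores (query, chunk) pairs jointly --
    more accurate than cosine over independent embeddings, but too slow to
    run over the whole corpus, hence the two-stage design."""
    return sorted(candidates, key=lambda d: rerank_fn(query_text, d["text"]), reverse=True)

corpus = [
    {"text": "AMF restart procedure", "vec": [0.9, 0.1]},
    {"text": "N3 packet drop playbook", "vec": [0.2, 0.95]},
    {"text": "License renewal steps", "vec": [0.5, 0.5]},
]
hits = retrieve([0.1, 1.0], corpus, top_k=2)
best = rerank("N3 drops", hits, lambda q, t: float("N3" in t))
```

The reranked chunks are what get injected into the agent's LLM Gateway prompt as context.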
Valkey (Apache 2.0 Redis fork): In-flight agent working state, intermediate results between Agent A→B and C→D, agent liveness heartbeats.
PostgreSQL (CloudNativePG): LangGraph checkpoint persistence — the full graph execution state survives pod restarts. Immutable audit log (every LLM call, tool result, routing decision appended, never updated).
LangSmith: Full distributed trace per graph execution — every Orchestrator → Gateway → Agent hop with token counts, latency, and prompt/response captured.
The single authorization chokepoint. OPA (Open Policy Agent) evaluates every task dispatch from the Orchestrator: is this agent allowed to receive this task type? Is the requesting tenant/user JWT valid? Does the task match the RBAC policy for this graph run?
All LLM Gateway API keys are injected at pod startup via HashiCorp Vault (Vault Agent sidecar pattern) — never stored in environment variables or prompts. mTLS enforced between all services via Istio service mesh.
Three distinct output channels:
① BHOM / Helix GPT: UC1 enriched anomaly + ranked RCA pushed back to BMC Helix via REST API. Helix GPT uses this as context for Next Best Action recommendations to NOC operators.
② NOK Agent Interface: UC2 conversational bot — operator queries logs in natural language; Agent C+D respond on-demand through the interface.
③ Scheduled Digest: Curated periodic prompt runs (daily/weekly) → Agent C generates log summary report → delivered as email / dashboard widget.
The LangGraph Supervisor Agent is the only component that understands the full task graph. It is a Python process running a LangGraph StateGraph with typed nodes for each agent and typed edges representing conditional routing logic. It never executes business logic itself — it decomposes, delegates, collects, and decides.
The Orchestrator operates through four internal phases on every graph execution:
On receiving an input event (BHOM anomaly push, log trigger, or user prompt), the Orchestrator's entry node runs an initial classification LLM call through the Atsky LLM Gateway. This call uses a lightweight system prompt to determine the task type, priority, and routing plan.
This single gateway call uses the cheapest available model (Mistral-7B) since it is purely a classification task — no domain knowledge required. Output is a typed TaskPlan Pydantic object.
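A sketch of the typed TaskPlan. The source specifies Pydantic; stdlib dataclasses are used here for self-containment, and the exact field set is illustrative:

```python
from dataclasses import dataclass
from enum import Enum

class UseCase(Enum):
    UC1_ANOMALY_ENRICHMENT = "uc1"
    UC2_LOG_ANALYSIS = "uc2"

@dataclass(frozen=True)
class TaskPlan:
    """Output of the entry-node classification call (Mistral-7B via the gateway).

    Frozen: the plan is immutable once classified; re-routing produces a new plan.
    """
    use_case: UseCase
    task_type: str      # e.g. "anomaly_enrichment", "log_batch", "nl_query"
    priority: int       # 1 (highest) .. 5
    agents: tuple = ()  # ordered agent ids to dispatch, e.g. ("agent_a", "agent_b")

def parse_classifier_output(raw: dict) -> TaskPlan:
    """Validate the classifier's JSON before anything is dispatched."""
    return TaskPlan(
        use_case=UseCase(raw["use_case"]),
        task_type=raw["task_type"],
        priority=int(raw["priority"]),
        agents=tuple(raw.get("agents", ())),
    )

plan = parse_classifier_output(
    {"use_case": "uc1", "task_type": "anomaly_enrichment", "priority": 1,
     "agents": ["agent_a", "agent_b"]})
```

Typing the plan at the boundary is what lets the rest of the graph trust the classifier's output without re-validating it at every node.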
Before dispatching any agent, the Orchestrator calls the OPA Policy Gate with the TaskPlan and the requesting entity's JWT. OPA evaluates three policies: is this agent allowed to receive this task type, is the requesting tenant/user JWT valid, and does the task match the RBAC policy for this graph run.
If any policy fails, the Orchestrator terminates the graph with an auth-failure event written to the audit log. No agent is ever invoked before OPA clears the task.
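In production this is a call to OPA's local Data API with a Rego bundle; the following is a plain-Python mirror of the three checks, with illustrative policy data rather than the real policy bundle:

```python
ALLOWED_TASKS = {
    # which task types each agent may receive (illustrative policy data)
    "agent_a": {"anomaly_enrichment"},
    "agent_b": {"rca_reasoning"},
    "agent_c": {"log_batch", "nl_query"},
    "agent_d": {"capacity_planning"},
}

def policy_allow(agent: str, task_type: str, jwt_valid: bool,
                 tenant: str, allowed_tenants: set) -> bool:
    """Mirror of the three OPA policies: JWT validity, agent task
    authorization, and tenant/RBAC scope.

    Any single failure denies the whole dispatch -- the graph then terminates
    with an auth-failure audit record and no agent is invoked.
    """
    if not jwt_valid:
        return False                                   # invalid identity
    if task_type not in ALLOWED_TASKS.get(agent, set()):
        return False                                   # agent not authorized for task
    if tenant not in allowed_tenants:
        return False                                   # out of RBAC scope
    return True
```

Keeping the deny path fail-closed (unknown agent → empty task set → deny) mirrors how a default-deny Rego policy behaves.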
The Orchestrator uses LangGraph's Send API to dispatch agents in parallel wherever there are no data dependencies between them.
Each dispatch packages a typed AgentInput containing: task type, data payload, RAG namespace to use, token budget override (if any), and trace correlation ID linking to LangSmith.
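LangGraph's Send API handles this fan-out in the real system; the following is a plain-Python analogue of dependency-aware parallel dispatch (illustrative, not the actual Orchestrator code — real dispatch is a gRPC call to each agent service):

```python
from concurrent.futures import ThreadPoolExecutor

def run_wave(tasks, dependencies, results, call_agent):
    """Dispatch every task whose dependencies are already satisfied, in parallel.

    tasks: list of agent ids; dependencies: agent -> set of prerequisite agents;
    results: dict of completed agent outputs; call_agent: the actual dispatch fn.
    """
    ready = [t for t in tasks if dependencies.get(t, set()).issubset(results)]
    with ThreadPoolExecutor() as pool:
        for agent, out in zip(ready, pool.map(call_agent, ready)):
            results[agent] = out
    return ready

# UC1: Agent B depends on Agent A; UC2's Agent C has no dependency, so in a
# mixed run Agent A and Agent C would go out in the same wave.
deps = {"agent_b": {"agent_a"}}
results = {}
wave1 = run_wave(["agent_a", "agent_b"], deps, results, lambda a: f"{a}-done")
wave2 = run_wave(["agent_b"], deps, results, lambda a: f"{a}-done")
```

The sequential Agent A → Agent B edge falls out naturally: Agent B is simply never "ready" until Agent A's result is in the shared state.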
As agents return, the Orchestrator's aggregator node collects typed AgentResult objects. It then applies a confidence gate: if confidence_score < θ (default 0.72), the Orchestrator re-routes that agent's task with an augmented prompt (adding context from the State Store or retrieving similar historical cases from Qdrant). At most two re-runs are attempted before the result is surfaced with a low-confidence flag.

The Orchestrator graph also includes two interrupt nodes — points where execution pauses and waits for an operator approval callback before continuing.
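The confidence gate as a pure function — θ, the re-run cap, and the augmentation step are as described above, while the agent itself is a stand-in:

```python
THETA = 0.72     # default confidence threshold
MAX_RERUNS = 2   # cap before surfacing with a low-confidence flag

def confidence_gate(run_agent, augment, initial_input):
    """Re-run a below-threshold agent with augmented context, at most twice.

    run_agent(input) -> (result, confidence); augment(input, attempt) adds
    State Store context or similar historical cases from Qdrant.
    Returns (result, confidence, low_confidence_flag).
    """
    task_input = initial_input
    result, conf = run_agent(task_input)
    reruns = 0
    while conf < THETA and reruns < MAX_RERUNS:
        task_input = augment(task_input, reruns)
        result, conf = run_agent(task_input)
        reruns += 1
    return result, conf, conf < THETA  # still low after the cap -> surface flagged

# Stand-in agent that improves once extra context is attached.
attempts = []
def fake_agent(inp):
    attempts.append(inp)
    return ("rca", 0.9 if "extra" in inp else 0.5)

result, conf, flagged = confidence_gate(fake_agent, lambda i, n: i + "+extra", "ctx")
```

Bounding the loop is the important property: a persistently uncertain agent degrades gracefully into a flagged result instead of retrying forever.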
Execution resumes only on receiving an approved=true callback.

The relationship between the Orchestrator, its agents, and the Atsky LLM Gateway is the most architecturally critical interface in the system. No component calls an LLM directly. Every inference request flows through https://llmgateway.enterprise-ai.internal/v1 — an OpenAI-compatible REST endpoint hosted within the ENTERPRISE AI framework.
Every LLM invocation in this architecture involves exactly three parties in sequence: the calling agent, the Atsky LLM Gateway, and the model host.
The gateway implements a model routing policy that maps agent types to model tiers:
| Agent / Call Type | Trigger Condition | Model Selected | Why |
|---|---|---|---|
| Orchestrator — task classification | Every graph entry | Mistral-7B | Simple classification — cost-optimized |
| Agent A — Anomaly Enricher | Every UC1 event | GPT-4o | Needs broad telecom knowledge + structured JSON output |
| Agent B — RCA Reasoner | After Agent A returns | GPT-4.1 | Complex multi-step chain-of-thought reasoning required |
| Agent C — Log Analyzer | Every UC2 event | GPT-4o | Long-context log summarization + pattern recognition |
| Agent D — Capacity Planner | Conditional: breach detected | GPT-4.1 | Analytical reasoning over KPI trends + planning output |
| Confidence Evaluator (Orchestrator) | On low-confidence result | Mistral-7B | Simple scoring task — cost-optimized re-check |
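The routing table above, expressed as the kind of lookup the gateway applies per X-Agent-Type header, with the fallback chain taken from the gateway description (exact key names are illustrative):

```python
MODEL_ROUTING = {
    # agent type (X-Agent-Type header) -> primary model
    "orchestrator-classifier": "mistral-7b",
    "anomaly-enricher":        "gpt-4o",
    "rca-reasoner":            "gpt-4.1",
    "log-analyzer":            "gpt-4o",
    "capacity-planner":        "gpt-4.1",
    "confidence-evaluator":    "mistral-7b",
}
FALLBACK_CHAIN = ["gpt-4.1", "llama3-70b", "mistral-7b"]

def route_model(agent_type: str, unavailable: frozenset = frozenset()) -> str:
    """Pick the agent's primary model, falling back down the chain if it is
    unavailable. Swapping models is a change to this table only -- no agent
    code changes, which is the architectural point of the gateway pattern."""
    primary = MODEL_ROUTING[agent_type]
    if primary not in unavailable:
        return primary
    for model in FALLBACK_CHAIN:
        if model not in unavailable:
            return model
    raise RuntimeError("no model available")
```
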
"The LLM Gateway is not just a proxy — it is a governance layer. Swapping GPT-4.1 for a new model requires a change in the gateway routing table only — zero changes to any agent code. This is the architectural benefit of the gateway pattern in a multi-year engagement like Enterprise Operator."
Who: LangGraph Supervisor Agent (Orchestrator process)
When: On every incoming event — BHOM webhook push (UC1), cron-scheduled log batch (UC2), or user query via NOK Agent Interface (UC2 on-demand)
Where: Runs as the FastAPI application pod on Tanzu Kubernetes. Listens on internal service endpoint.
How: Deserializes incoming payload → validates schema → makes a lightweight classification call to LLM Gateway (Mistral-7B) → creates a typed GraphState object with task type, priority, and routing plan → starts the LangGraph execution loop. The full event is written to State Store (Valkey) with a TTL of 4 hours.
Who: OPA (Open Policy Agent) sidecar — called synchronously by the Orchestrator
When: Immediately after task classification, before any agent dispatch
Where: OPA runs as a sidecar container in the Orchestrator pod (sidecar pattern) — zero network hop
How: Orchestrator calls OPA's local HTTP endpoint (localhost:8181/v1/data/policy/allow) with the task plan as input data. OPA evaluates the Rego policy bundle (loaded from Vault at startup) and returns allow: true/false within <5ms. Policy covers agent authorization, data scope, and rate limits. On deny, graph terminates immediately with audit record.
Who: Dedicated Python microservice (agent-anomaly-enricher) running as a Kubernetes Deployment
When: Dispatched by Orchestrator via internal gRPC call for every UC1 event. Runs immediately after auth approval — no pre-conditions on other agents.
Where: Tanzu Kubernetes, namespace enterprise-operator-agents, separate pod from Orchestrator with its own resource quota (2 vCPU, 4Gi)
How (4-step internal loop):
1. Receives a typed BhomAnomaly input (situation ID, annotations, timestamp, affected nodes).
2. Queries nok-kb-uc1 (Network Infrastructure Platform technical docs + anomaly playbooks) with cosine similarity → applies cross-encoder reranker → retrieves top-5 relevant document chunks.
3. Calls llmgateway.enterprise-ai.internal/v1/chat/completions with the X-Agent-Type: anomaly-enricher header → GPT-4o returns structured JSON (severity, affected service, context summary, probable cause hints).
4. Returns a typed EnrichedAnomaly object back to the Orchestrator via gRPC response, including a confidence score derived from RAG retrieval quality + LLM certainty markers.

Who: Dedicated Python microservice (agent-rca-reasoner)
When: Dispatched by Orchestrator only after Agent A returns EnrichedAnomaly. This is a sequential dependency edge in the LangGraph — Agent B cannot start until Agent A completes. Typical trigger latency: 2–4 seconds after Agent A returns.
Where: Same Kubernetes namespace, separate pod (3 vCPU, 6Gi — larger due to long-context reasoning calls)
How:
1. Receives EnrichedAnomaly from Orchestrator state plus the original BHOM situation.
2. Applies multi-step chain-of-thought reasoning via the LLM Gateway (GPT-4.1, temperature 0.2).
3. Returns a typed RcaResult object: ranked root-cause hypotheses (max 3), each with probability score, supporting evidence citations from the Network Infrastructure Platform KB, and a recommended next best action category.

Who: Dedicated Python microservice (agent-log-analyzer)
When: Two trigger modes: (i) Scheduled — cron job fires at configured interval (e.g., every 6 hours), Orchestrator receives scheduled event from the Task Scheduler, dispatches Agent C with a batch of recent logs. (ii) On-demand — operator sends a natural language query via NOK Agent Interface ("Show me CMG-C errors in the last 24 hours"), Orchestrator receives and dispatches Agent C with query context. Both paths use the same agent code — only the input wrapper differs.
Where: Kubernetes pod, 4 vCPU / 8Gi — largest agent due to log context windows. Reads log data from a pre-staged object store (S3-compatible MinIO bucket, populated by the log ingestion pipeline).
How (5-step):
1. Fetches time-windowed log data (CMG-C/U config, Kalix KPI reports, BHOM counters, Cflowd, cmd printouts) from the pre-staged MinIO bucket.
2. Preprocesses the logs and flags error patterns.
3. For each flagged pattern, queries nok-kb-uc2 (Network Infrastructure Platform log pattern library, CMG documentation, known issue register) to retrieve relevant context and historical precedents.
4. Calls the LLM Gateway (GPT-4o, long-context) with the structured log summary plus RAG context.
5. Returns a typed LogAnalysisResult — issue list with severity scores, capacity breach flag (bool + breach percentage), top 5 recommended actions, anomaly correlation with BHOM counters.

Who: Dedicated Python microservice (agent-capacity-planner)
When: Conditionally dispatched — the Orchestrator's routing logic checks LogAnalysisResult.capacity_breach == true. If false, Agent D is never invoked and the graph proceeds directly to output. If true, Agent D is dispatched. This conditional edge is a core LangGraph routing pattern — if breach detected → route_to_agent_d else → route_to_output.
Where: Same Kubernetes namespace, lighter pod (2 vCPU, 4Gi)
How:
1. Receives LogAnalysisResult plus historical KPI trends from the State Store (Valkey cache of recent Kalix metrics).
2. Queries nok-kb-uc2 for Network Infrastructure Platform capacity planning guidelines, CMG-C scaling procedures, and prior capacity enhancement case studies from the Network Infrastructure Platform knowledge base.
3. Calls the LLM Gateway (GPT-4.1) and returns a typed CapacityPlan — 3–5 enhancement proposals, each with: recommendation text, affected nodes, priority (P1/P2/P3), estimated capacity gain percentage, implementation complexity.

Who: Orchestrator (LangGraph aggregator node)
When: Executes when all parallel/sequential agent branches for a given graph run have completed (or timed out with partial results)
How: Collects all AgentResult objects from Valkey state. Applies confidence gate (re-routes below-threshold agents, max 2 retries). Constructs final output payload. Routes to appropriate sink(s): UC1 → BHOM REST API, UC2 (on-demand) → NOK Agent Interface response, UC2 (scheduled) → digest queue. Writes full audit record to PostgreSQL. Finalizes LangSmith trace.
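The breach-conditional routing that gates Agent D can be sketched as the kind of function LangGraph's add_conditional_edges registers (the state shape shown here is illustrative — the real state is the checkpointed GraphState):

```python
def route_after_log_analysis(state: dict) -> str:
    """Conditional edge after Agent C: only a capacity breach pulls in Agent D.

    In LangGraph this function would be registered via add_conditional_edges;
    its return value names the next node to execute.
    """
    if state["log_analysis"]["capacity_breach"]:
        return "agent_d"   # breach detected -> capacity planning
    return "output"        # no breach -> straight to the output node

next_node = route_after_log_analysis({"log_analysis": {"capacity_breach": True}})
```

Because the router is a pure function of graph state, the breach threshold logic is unit-testable without standing up any agent.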
BHOM / Helix GPT (UC1): Orchestrator calls the BHOM REST API with the enriched anomaly + RCA result as structured JSON. Helix GPT ingests this as additional context for its Next Best Action generation. The NOC operator sees enriched incident details directly in the BMC Helix AIOps interface.
NOK Agent Interface (UC2 on-demand): The Orchestrator returns the LogAnalysisResult (and optionally CapacityPlan) to the FastAPI response stream that the operator's browser is connected to. Rendered as a conversational response in the NOK Agent bot interface.
Scheduled Digest (UC2 scheduled): Orchestrator publishes the analysis to a digest message queue. A lightweight notification service picks this up, formats it as a human-readable report (Markdown → HTML email), and dispatches to the configured distribution list or inserts into the NOC dashboard widget.
In the HMA architecture, Skills are the modular, reusable domain knowledge packages that agents load at invocation time. A Skill is the combination of: a Qdrant vector namespace (the knowledge corpus), an embedding configuration, a retrieval strategy (similarity threshold, top-k, reranker model), and a system-prompt fragment that tells the LLM Gateway how to use that knowledge in its response. Skills are defined as YAML configuration files and loaded by agents at startup — an agent can be re-skilled by updating its config without code changes.
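A Skill definition sketched as the structure an agent would hold after parsing its YAML config — the field names are illustrative, mirroring the four components listed above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    """One skill = corpus + embedding config + retrieval strategy + prompt fragment.

    Agents load these at startup, so re-skilling an agent is a config change,
    not a code change.
    """
    namespace: str               # Qdrant vector namespace holding the corpus
    embedding_model: str         # must match the model used at ingestion time
    top_k: int                   # first-stage retrieval depth
    similarity_threshold: float  # minimum cosine score to keep a chunk
    reranker: str                # cross-encoder applied to the top_k candidates
    prompt_fragment: str         # appended to the agent's system prompt

# As it might look after parsing Agent A's skill file (threshold illustrative):
anomaly_kb = Skill(
    namespace="nok-kb-uc1",
    embedding_model="bge-m3",
    top_k=5,
    similarity_threshold=0.65,
    reranker="cross-encoder-reranker",
    prompt_fragment="Cite retrieved playbook sections when proposing causes.",
)
```
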
Vectorized Network Infrastructure Platform technical documentation covering Packet Core anomaly patterns, resolution procedures, and known-issue registers. Loaded by Agent A (Anomaly Enricher).
A structured reasoning prompt template + historical RCA resolution corpus. Guides the LLM to reason step-by-step through root cause hypotheses using Network Infrastructure Platform domain vocabulary.
Vectorized CMG-C/U documentation, Kalix metric definitions, known error code catalogue, BHOM counter semantics, and Cflowd interpretation guides.
A structured LLM Gateway system prompt that configures the model as a Network Infrastructure Platform Core Network Log Analyst — governing tone, output schema, severity classification criteria, and citation format.
Network Infrastructure Platform CMG-C scaling procedures, capacity planning best practices, and historical enhancement case studies — used by Agent D to generate grounded, evidence-backed capacity proposals.
3GPP standards, ETSI NFV specifications, and Network Infrastructure Platform whitepaper content relevant to both UC1 (anomaly standards) and UC2 (KPI threshold definitions). Shared across both agent families.
Skills are built through an offline ingestion pipeline that runs periodically (or triggered by Network Infrastructure Platform KB updates):
Documents are chunked, embedded (with bge-m3 or e5-mistral) via the LLM Gateway embedding endpoint, and upserted into the target Qdrant namespace.

Every significant invocation in the system — LLM calls, agent dispatches, RAG queries, and infrastructure interactions — is catalogued below with the four key dimensions.
| Invocation | Who Invokes | How | When (Trigger) | Where (Infrastructure) |
|---|---|---|---|---|
| Task Classification LLM call | Orchestrator | HTTP POST to LLM GW · Mistral-7B · system: classifier prompt · returns typed TaskPlan JSON | Every graph entry — BHOM webhook, cron event, or user query received | Orchestrator pod → Gateway service · internal cluster DNS · sub-10ms routing |
| OPA Policy Gate check | Orchestrator | HTTP POST to localhost:8181/v1/data/policy/allow with task plan as input · OPA sidecar in same pod | After classification, before every agent dispatch | In-pod sidecar — zero network hop · <5ms latency |
| Agent A dispatch | Orchestrator | LangGraph Send() API → gRPC call to agent-anomaly-enricher service · typed AgentInput | Every UC1 event, immediately after OPA allow | Kubernetes service: agent-anomaly-enricher.enterprise-operator-agents.svc |
| RAG query — nok-kb-uc1 | Agent A | Qdrant gRPC client · embed anomaly desc → cosine search top-5 → cross-encoder rerank | Inside Agent A, after receiving input, before LLM Gateway call | Qdrant StatefulSet · Tanzu persistent volume · same namespace |
| Anomaly Enrichment LLM call | Agent A | HTTP POST to LLM GW · GPT-4o · Network Infrastructure Platform domain expert persona · RAG context + anomaly in user msg · structured JSON output schema | After RAG retrieval within Agent A | LLM Gateway service → OpenAI API (or OSS endpoint) · API key from Vault |
| Agent B dispatch | Orchestrator | LangGraph conditional edge: Agent A result received → dispatch Agent B with EnrichedAnomaly | After Agent A returns — sequential dependency (not parallel) | Kubernetes service: agent-rca-reasoner.enterprise-operator-agents.svc |
| RCA Reasoning LLM call | Agent B | HTTP POST to LLM GW · GPT-4.1 · multi-turn CoT prompt · temperature 0.2 · streaming response | Inside Agent B, with enriched context + optional RAG augmentation | LLM Gateway → OpenAI GPT-4.1 · Streaming response for <P95 latency |
| BHOM/Helix GPT push (UC1 output) | Orchestrator | REST API call to BHOM endpoint · enriched anomaly + RCA result as JSON body | After Orchestrator aggregator node receives RcaResult from Agent B | BHOM REST endpoint · ENTERPRISE AI perimeter · authenticated with service account token |
| Agent C dispatch (Scheduled) | Orchestrator | LangGraph entry triggered by cron event → dispatch Agent C with batch time window + log source config | Configured cron schedule (e.g., every 6h) · Kubernetes CronJob fires Orchestrator webhook | Kubernetes service: agent-log-analyzer.enterprise-operator-agents.svc |
| Agent C dispatch (On-demand) | Orchestrator | User query received via NOK Agent Interface FastAPI endpoint → Orchestrator parses NL query → dispatches Agent C with query context | When operator submits query in NOK Agent bot interface | FastAPI response stream held open · Agent C response streamed back to UI |
| Log data fetch (MinIO) | Agent C | S3 client GET · time-windowed log files (CMG-C/U config, Kalix KPI CSV, Cflowd, BHOM counters, cmd printouts) | First step inside Agent C · before any RAG or LLM calls | MinIO StatefulSet (S3-compatible) · Tanzu persistent volume · same namespace |
| RAG query — nok-kb-uc2 | Agent C | Per detected error code: embed → cosine search Qdrant uc2 namespace → top-8 results → rerank → inject into prompt context | After log preprocessing, for each flagged error pattern | Qdrant · log pattern library namespace · same cluster |
| Log Analysis LLM call | Agent C | HTTP POST to LLM GW · GPT-4o · Network Infrastructure Platform Core Log Analyst persona · structured log summary + RAG context · output schema: LogAnalysisResult | After data fetch + RAG retrieval within Agent C | LLM Gateway → GPT-4o · long-context window (up to 128K) · response time 4–8s |
| Agent D dispatch (conditional) | Orchestrator | LangGraph conditional edge: if LogAnalysisResult.capacity_breach == true → Send(Agent D), else → route directly to output node | Only when Agent C flags capacity breach threshold exceeded | Kubernetes service: agent-capacity-planner.enterprise-operator-agents.svc |
| Capacity Plan LLM call | Agent D | HTTP POST to LLM GW · GPT-4.1 · Network Infrastructure Platform Capacity Planning Engineer persona · breach metrics + KPI trends + RAG capacity guidelines | Inside Agent D after RAG retrieval of capacity planning guidelines | LLM Gateway → GPT-4.1 · response: CapacityPlan JSON with 3–5 ranked proposals |
| Confidence gate re-route | Orchestrator | Orchestrator checks confidence_score on each AgentResult · calls LLM GW (Mistral-7B) to evaluate quality · if < θ → re-dispatch agent with augmented context | On receiving any AgentResult with confidence_score < 0.72 · max 2 re-runs | Orchestrator logic node · additional LLM GW call to Mistral-7B for cheap evaluation |
| Audit log write | Orchestrator | Append-only INSERT to PostgreSQL audit table (never UPDATE/DELETE) · full execution record per graph run | On every LLM call completion, routing decision, and graph finalization | PostgreSQL (CloudNativePG) · Tanzu persistent volume · SIEM-forwarded via Fluent Bit |
| HiTL Interrupt (remediation) | Orchestrator | LangGraph interrupt node pauses graph · sends approval request to BHOM notification API · waits for async callback with approved=true/false | When Agent B's RCA result recommends a network configuration action | BHOM notification API · NOC operator sees approval prompt in Helix UI · graph resumes on callback |
All LLM Gateway API keys, BHOM service account tokens, PostgreSQL credentials, and Qdrant access keys are stored in HashiCorp Vault. Injected into pods at startup via Vault Agent sidecar (annotation-based injection on all pods). Keys are short-lived (24h TTL) and rotated automatically. No secrets appear in environment variables, pod specs, or prompt text — any attempt to include credentials in an LLM prompt would be caught by the Guardrail filter.
mTLS everywhere via Istio service mesh — all inter-pod communication is mutually authenticated and encrypted. Agents cannot communicate directly with each other (no east-west agent-to-agent traffic). All routing goes through the Orchestrator. The LLM Gateway is the only egress point outside the cluster — it is the sole component with outbound internet access (to OpenAI API). All other pods are network-policy restricted to cluster-internal only.
LangSmith: Full distributed trace per graph execution — every Orchestrator decision, agent invocation, LLM Gateway call (with token count, model, latency, prompt hash) captured in a single trace. Prometheus + Grafana: Per-agent metrics (invocation count, p95 latency, LLM call duration, confidence score distribution). OpenTelemetry: Distributed spans propagated via traceparent header through Orchestrator → Agent → LLM Gateway. Fluent Bit → SIEM: Audit log events streamed to the SIEM for security monitoring.
The LLM Gateway enforces per-agent-type token budgets: Agent A (8K input / 1K output max), Agent B (16K input / 2K output), Agent C (96K input / 4K output — long-context logs), Agent D (8K input / 2K output). If an agent exceeds its budget, the gateway returns a truncated context error — the Orchestrator logs this and re-submits with a summarized input. Monthly cost tracking per use case (UC1 vs UC2) via gateway cost-tag headers visible in ENTERPRISE AI billing dashboard.
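The per-agent budgets above, sketched as a gateway-side check. The truncated-context error behaviour follows the description; actual token counting is stood in by an already-computed count, and the exception name is illustrative:

```python
TOKEN_BUDGETS = {
    # agent type -> (max input tokens, max output tokens)
    "anomaly-enricher": (8_000, 1_000),
    "rca-reasoner":     (16_000, 2_000),
    "log-analyzer":     (96_000, 4_000),   # long-context logs
    "capacity-planner": (8_000, 2_000),
}

class TruncatedContextError(Exception):
    """Returned to the agent; the Orchestrator logs it and re-submits a summarized input."""

def enforce_budget(agent_type: str, input_tokens: int) -> int:
    """Gateway-side budget check: reject over-budget prompts and return the
    output-token ceiling to pass to the model as max_tokens."""
    max_in, max_out = TOKEN_BUDGETS[agent_type]
    if input_tokens > max_in:
        raise TruncatedContextError(
            f"{agent_type}: {input_tokens} input tokens exceeds budget {max_in}")
    return max_out

max_out = enforce_budget("log-analyzer", 80_000)  # within the 96K long-context budget
```
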
Build a single-agent simplified version of Agent C (Log Analyzer) running in Network Infrastructure Platform Labs environment. Demonstrates: reading logs with LLM models via LLM Gateway, summarizing core network logs using the Network Infrastructure Platform knowledge base (manual Qdrant ingestion of a curated subset), and identifying key problems + RCA hints. No Orchestrator yet — single FastAPI endpoint that calls the LLM Gateway directly. Goal: prove functional sufficiency of the log analysis approach to TEL stakeholders. Deliverables: Demoware, Design + Architecture doc, Demo video.
Deploy the full HMA architecture on ENTERPRISE AI framework using offline (batch) TEL data sets. Includes: LangGraph Orchestrator, Agent A+B (UC1), Agent C+D (UC2), Qdrant vector stores with full Network Infrastructure Platform KB ingestion, LLM Gateway integration (Atsky LLM Lite GW), Valkey state store, PostgreSQL audit, and both output channels (BHOM REST + NOK Agent Interface). Integration testing with TEL pre-prod. Prompt engineering and model tuning based on TEL feedback. HiTL interrupt nodes operational. Live BHOM integration wired but fed with offline replay data.
Data quality monitoring, agent output quality tracking (confidence score trends, RAG retrieval quality metrics), LLM response hallucination detection (fact-checking against Network Infrastructure Platform KB), prompt engineering refinements based on production feedback, and SOP documentation. Health Dashboard live. Support ticket SLA defined.
Wire live BHOM anomaly webhook, live log ingestion pipeline (real-time CMG-C/U, Kalix, Cflowd feeds), scale agent deployments for production load, activate Kafka-based event bus for high-volume log streaming (BP2 ERAM elements overlaid on the HMA graph), finalize Gateway API migration, activate full Vault + Istio production security posture. Operate RACI per SOW matrix with TEL owning data availability and Network Infrastructure Platform owning model development + deployment.
The HMA blueprint is the correct choice for Phase 1 through Phase 3 because it gives Network Infrastructure Platform and Enterprise Operator a single mental model that scales: start with a demo-grade single agent (Phase 1), grow into the full multi-agent graph (Phase 2), layer event-driven Kafka components on top when live throughput demands it (Phase 3+), and eventually evolve selected graph nodes into adaptive cognitive loops for deep autonomous RCA (future). The architecture grows with the engagement — it does not require a re-architecture at each phase boundary. The LangGraph state machine simply adds nodes and edges as capabilities mature.