Multi-Agent Framework
Evolution Architecture
A complete engineering dissection of Production Infrastructure's Multi-Agent Framework stack — its architecture, runtime model, multi-agent capability, the three production gaps that actually matter, and the four-phase path to a production-grade orchestration platform for TEL–NOK UC1 & UC2.
What Multi-Agent Framework Actually Is
Multi-Agent Framework is Production Infrastructure's internal "manager layer" wrapping the OpenAI Agents SDK. It exposes an OpenAI-compatible HTTP API and allows composing prompts, tools, MCP servers, and multi-agent handoff graphs entirely via YAML config — no code changes needed. A single FastAPI + Uvicorn process handles everything.
cmg_multi_agent.yaml is a working Production Infrastructure CMG anomaly detection multi-agent config (supervisor → AD agent → RCA agent) that maps directly to UC1 requirements. Additional proven configs: config_multi_agents.yaml (web research crew), config_tshark_multi_agent.yaml, config_coding_multi_agent.yaml. This is months of Production Infrastructure-specific domain work that would require full reconstruction in any greenfield alternative — the primary reason to evolve rather than replace.
Anatomy of the Stack
Sixteen distinct components across server, API, agent graph, session, storage, and observability layers — each with a clearly defined single responsibility.
| Component | Location | Role |
|---|---|---|
| CLI Entrypoint | server/main.py | Multi-Agent Framework serve CLI — builds config overlay, starts Uvicorn with provided args |
| App Factory + Lifespan | server/core/app/lifespan.py | Startup: load config, init LLM client, build all agents from YAML, init MCP connections. Shutdown: teardown sessions and storage backends |
| AppState | server/core/config.py | Process-scoped singleton: active agent, agents_by_id dict, sessions dict (in-memory LRU), parsed YAML config, engine registry. Not shared across processes. |
| Chat Completions | server/api/chat/main.py | Request entry: parse_chat_request(), passthrough check, slash command routing, stream / non-stream delegation |
| Streaming Handler | server/api/chat/streaming.py | SSE stream construction, session resolution, history compression trigger, SDK event → OpenAI SSE translation |
| Agent Graph Builder | server/agents/graph/main.py | Constructs all Agent objects from YAML: personas, tool grants, model overrides, MCP server selection per agent |
| Handoff Engine | server/agents/handoff.py | Multi-Agent Framework Handoff class: builds [AGENT SWITCH] + [HANDOFF TASK] transfer messages, captures via ContextVar, applies input filters to control target-agent context |
| Handoff Wiring | server/agents/graph/handoff_wiring.py | attach_handoffs(): builds directed edges between agents based on YAML handoffs: config |
| A2A Executor | server/a2a/executor.py | Adapter: translates Multi-Agent Framework streaming output into A2A Task events for cross-service agent calls via /a2a/v1 |
| Runtime Context | common/core/runtime.py | ContextVars for session ID, sink, call IDs. Module-level globals: _client, _settings, _tool_call_agent_map — not safe across processes |
| Session Factory | common/session/factory.py | Creates SQLAlchemy sessions (SQLite / PG / MySQL) per session_key. Manages connection pool lifecycle |
| Session Recovery | common/session/recovery/ | Checkpoint detection, anomaly detection (orphaned tool calls, truncated responses), rollback via pop_item(). Handles DB-level anomalies — not logic failures. |
| Delegation Helper | common/core/delegation.py | Tool inheritance for delegate / sub-agents. Enables run_agent tool for inline agent cloning and dynamic task delegation |
| YAML Config Loader | common/config/ | Multi-file import, environment variable substitution, ${ref} resolution across config files |
| Tools | tools/ | Filesystem, DB, Kubernetes, Network, Math, Web search, AI sub-agents — accessible via per-agent YAML tool grants |
| Langfuse Tracing | common/tracing/ | Logfire-based Langfuse integration: captures every LLM call, prompt, response, token count, latency, cost per agent type |
How a Request Flows
Two execution paths: a single-turn request and a multi-agent handoff sequence. Both run within the same single asyncio event loop inside one Python process.
// Single-Turn Request Flow
a. Resolve or create Session (in-memory dict or SQLAlchemy, keyed by session_key)
b. maybe_compress_session() — if history token count exceeds compress_threshold, a summarizer agent call runs in-band
c. Runner.run_streamed(agent, input, session, run_config) — Agent LLM call → parallel tool dispatch via asyncio.gather → results fed back → next LLM call → stream events emitted
d. consume_stream() — translates SDK events (text_delta, tool_call, tool_result, reasoning, handoff) to OpenAI SSE format
e. Session written back to DB; LRU eviction policy enforced on session_access_order
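Steps (a), (b), and (e) above can be sketched in a few lines. This is a minimal stand-in, not the framework's actual code: `compress_threshold`, `maybe_compress_session`, and `session_access_order` are named in this document, but the token counter, the summariser stub, and the LRU capacity below are hypothetical placeholders.

```python
from collections import OrderedDict

MAX_SESSIONS = 128          # hypothetical LRU capacity
COMPRESS_THRESHOLD = 8_000  # token count that triggers in-band summarisation

sessions: "OrderedDict[str, list[dict]]" = OrderedDict()  # stands in for AppState.sessions

def approx_tokens(history: list[dict]) -> int:
    # crude stand-in for a real tokenizer: ~4 characters per token
    return sum(len(m["content"]) for m in history) // 4

def maybe_compress_session(history: list[dict]) -> list[dict]:
    """If history exceeds the threshold, fold older turns into a summary turn."""
    if approx_tokens(history) <= COMPRESS_THRESHOLD:
        return history
    summary = {"role": "system", "content": "[summary of earlier turns]"}  # summariser-agent stand-in
    return [summary] + history[-2:]  # keep the most recent exchange verbatim

def resolve_session(session_key: str) -> list[dict]:
    """Step (a): fetch-or-create, refresh LRU order, evict the oldest."""
    history = sessions.setdefault(session_key, [])
    sessions.move_to_end(session_key)    # mirrors session_access_order
    while len(sessions) > MAX_SESSIONS:
        sessions.popitem(last=False)     # evict least-recently-used
    return history
```

The point of the sketch: both the compression trigger and the eviction policy live in process-local memory, which is exactly why they do not survive a pod restart.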
// Multi-Agent Handoff Flow
On handoff, a YAML-configured input filter controls how much of the source agent's context the target agent receives:
passthrough — full conversation history passed to target agent unchanged
strip_tools — tool call/result blocks removed before passing (reduces token usage)
last_turn — only the most recent turn passed (minimal context transfer)
nest_handoff_history — previous agent's full context nested as a structured block in the new session
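The four filter modes above reduce to simple transformations over the message history. A minimal sketch, assuming a simplified message shape (the real filter signatures in server/agents/handoff.py may differ):

```python
# Each filter takes the source agent's history and returns what the target sees.
def passthrough(history: list[dict]) -> list[dict]:
    return history  # full conversation, unchanged

def strip_tools(history: list[dict]) -> list[dict]:
    # drop tool call / tool result blocks to cut token usage
    return [m for m in history if m["role"] not in ("tool_call", "tool_result")]

def last_turn(history: list[dict]) -> list[dict]:
    # minimal transfer: only the most recent user/assistant exchange
    return history[-2:]

def nest_handoff_history(history: list[dict]) -> list[dict]:
    # previous agent's full context nested as one structured block
    return [{"role": "system", "content": "[handoff context]", "nested": history}]
```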
What Works — and What Doesn't
Multi-Agent Framework has real, production-tested multi-agent capability across sequential handoffs, supervisor patterns, per-agent isolation, and protocol exposure. Three patterns critical for TEL–NOK production are structurally absent.
What works today: sequential handoffs via the handoffs: config, proven in cmg_multi_agent.yaml (Production Infrastructure CMG AD → RCA); a per-agent llm: section with model, temperature, and max_tokens overrides; and the run_agent tool for inline agent cloning and dynamic task delegation within a session.
Present vs. Absent
A clear-eyed inventory: what the stack includes today, and what is missing for production-scale multi-instance deployment on TEL's TKG environment.
The absent items above interact: single-process AppState + no shared session store means you cannot run two Multi-Agent Framework pods and load-balance between them even if you add nginx. The session state would be split across processes. This must be resolved (Valkey shared state) before any other scaling infrastructure is added — it is the foundational blocker.
The Three Gaps That Actually Matter
Multi-Agent Framework has twelve documented production absences. For TEL–NOK Phase 2 MVP and Phase 3 production, only three create material delivery risk. The rest are Phase 3 infrastructure concerns. Resolving all twelve upfront adds 16+ weeks with no Phase 2 benefit.
Missing Redis, no API gateway, no WebSockets — these only become blockers at Phase 3 production scale with live data pipelines. For Phase 2 MVP on offline data, the three gaps below are the only ones that create immediate risk to delivery quality or operator safety in a Production Infrastructure NOC context.
Multi-agent flow in Multi-Agent Framework is entirely LLM-driven: the SDK Runner relies on the model calling the correct transfer_to_<target> tool at the right moment. There is no Python-level workflow state machine enforcing execution order. No code guarantee that "anomaly detected → always call RCA agent." It is a suggestion in the system prompt — not an architectural contract.
For a Production Infrastructure NOC environment, a missed RCA step or a capacity breach not flagged because the LLM chose not to hand off is a production incident. Session recovery (common/session/recovery/) handles DB-level anomalies like orphaned tool calls — it cannot recover a logic failure where the LLM simply didn't invoke the handoff.
Additionally: history compression via an in-band summarizer agent (triggered when history exceeds compress_threshold) is itself an unguarded LLM call — if the summarizer degrades context, the entire session reasoning quietly deteriorates with no alert surfaced.
The LLM must invoke transfer_to_rca_agent. If it miscategorises the anomaly, or context-window pressure causes it to skip the handoff tool, Agent B (RCA Reasoner) never runs. No retry. No audit record of the missed step. The operator receives an incomplete enrichment with no error surfaced at all.
LangGraph conditional edge: after Agent A returns an EnrichedAnomaly typed object, Python routing logic checks the result type and unconditionally routes to Agent B. Zero LLM compliance required. Confidence gate: if result.confidence < 0.72 → re-route with augmented RAG context, max 2 retries, then human escalation via interrupt().
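A dependency-free sketch of the routing contract such a conditional edge enforces. The EnrichedAnomaly type, the 0.72 gate, and the 2-retry cap come from the text above; the function and node names are illustrative, not the actual LangGraph API:

```python
from dataclasses import dataclass

CONFIDENCE_GATE = 0.72
MAX_RETRIES = 2

@dataclass
class EnrichedAnomaly:
    confidence: float

def route_after_agent_a(result: object, retries: int) -> str:
    """Deterministic Python routing: no LLM compliance required."""
    if not isinstance(result, EnrichedAnomaly):
        return "error_handler"            # malformed output never passes silently
    if result.confidence < CONFIDENCE_GATE:
        if retries < MAX_RETRIES:
            return "agent_a_with_rag"     # re-route with augmented RAG context
        return "human_escalation"         # interrupt() in the real graph
    return "rca_agent"                    # anomaly detected -> Agent B always runs
```

Because the decision is an `isinstance` check plus a float comparison, "anomaly detected → always call RCA agent" becomes an architectural contract rather than a system-prompt suggestion.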
AppState is a process-level singleton. The sessions dict, session_engines, and session_access_order are all in-memory Python dicts. Module-level globals in runtime.py — _client, _settings, _tool_call_agent_map — are not safe across OS processes. This is not a configuration limitation; it is a fundamental architectural constraint baked into how AppState is initialised.
You cannot run multiple Multi-Agent Framework instances behind a load balancer without sessions being lost or requests landing on the wrong instance. At Phase 3 production volumes — live BHOM anomaly streams plus live CMG-C/U log ingestion — a single pod will saturate. Pod restart = in-flight task loss. No checkpoint survives a process restart beyond the SQLAlchemy session write, which only captures completed turns.
Single FastAPI process. All session state in AppState.sessions in-memory dict. Kubernetes pod eviction during a 40s UC2 log analysis call = task permanently lost. SQLite default: single-writer, cannot be shared across pods even if memory were resolved. No KEDA autoscaling possible.
Valkey (Redis-compatible, Apache 2.0) as shared working state across all agent pods — replaces in-memory AppState.sessions. CloudNativePG as LangGraph checkpoint store: graph state survives pod restarts and Kubernetes evictions. KEDA scales Agent C pods on log.raw Kafka topic consumer lag independently of Agent A pods.
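The shape such a shared store might take, sketched with an in-memory fake standing in for a Valkey connection (the real implementation would use a Redis-compatible client issuing GET/SETEX on serialised session blobs; everything below is illustrative):

```python
import json
from typing import Protocol

class SessionStore(Protocol):
    def load(self, session_key: str) -> list[dict]: ...
    def save(self, session_key: str, history: list[dict]) -> None: ...

class FakeValkeyStore:
    """Stands in for a Valkey-backed store: every pod talking to the same
    server sees the same state, unlike the in-memory AppState.sessions dict."""
    def __init__(self) -> None:
        self._kv: dict[str, str] = {}   # real store: GET / SETEX with a TTL

    def load(self, session_key: str) -> list[dict]:
        raw = self._kv.get(f"session:{session_key}")
        return json.loads(raw) if raw else []

    def save(self, session_key: str, history: list[dict]) -> None:
        self._kv[f"session:{session_key}"] = json.dumps(history)
```

The design point is the interface boundary: once sessions go through a `SessionStore` rather than a process dict, a load balancer can route any request to any pod.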
The OpenAI Agents SDK Runner executes agents sequentially in the handoff chain: Agent A suspends; Agent B runs; Agent B suspends; Agent A resumes. There is no fork-join mechanism and no simultaneous parallel agent branches. A supervisor wanting two specialists to work concurrently must wait for them sequentially.
For combined UC1 + UC2 events: total processing time = UC1 chain time + UC2 chain time, not max(UC1, UC2). Agent C (Log Analyzer) processes 96K-token log batches and takes 20–40 seconds per LLM call. A P1 BHOM anomaly arriving during a UC2 log analysis run is queued behind that long-running call — it cannot be dispatched to a parallel Agent A branch.
Combined UC1+UC2: Agent A→B (~10s) finishes, then Agent C→D (~40s) runs. Total: ~50s per event. P1 anomalies cannot preempt in-progress UC2 log analysis. Single asyncio event loop serialises all agent transitions within the process.
LangGraph parallel Send(): UC1 and UC2 chains dispatched simultaneously to independent agent pod pools. Total: max(10s, 40s) = ~40s. Agent A and Agent C pod counts scale independently via KEDA on their respective event sources. P1 anomalies always get a free Agent A pod.
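The max-versus-sum property that Send() buys can be illustrated with plain asyncio; the sleeps below are scaled-down stand-ins for the ~10s and ~40s chain times, not real agent calls:

```python
import asyncio
import time

async def uc1_chain() -> str:
    await asyncio.sleep(0.10)   # stands in for the ~10s Agent A->B chain
    return "uc1 done"

async def uc2_chain() -> str:
    await asyncio.sleep(0.40)   # stands in for the ~40s Agent C->D chain
    return "uc2 done"

async def combined_event() -> tuple[list[str], float]:
    start = time.perf_counter()
    # Parallel dispatch: wall time ~= max(uc1, uc2), not their sum --
    # the same property Send() provides across independent pod pools.
    results = await asyncio.gather(uc1_chain(), uc2_chain())
    return list(results), time.perf_counter() - start
```

Sequential execution of the same two chains would take the sum (~0.50 here); concurrent dispatch finishes in roughly the longer chain's time.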
Multi-Agent Framework's built-in RAG uses sqlite-vec (TinySearch) — a local SQLite-based vector store on the pod's filesystem. It works for Phase 1 demo but is architecturally incompatible with Phase 2 multi-pod deployment: the file is pod-local, a second Agent A pod cannot query the same index, there is no namespace isolation between UC1 and UC2 knowledge bases, no cross-encoder reranker support, and a practical capacity ceiling around 100K chunks.
Pod-local file. Single-instance only. UC1 and UC2 Production Infrastructure KB share one search space with no isolation. No reranker. Phase 1 demo only. Immediately broken in any multi-pod deployment — second pod has an empty or stale index.
Standalone Qdrant service on TKG. Named collections: nok-kb-uc1, nok-kb-uc2, nok-kb-shared. Cross-encoder reranker pipeline. Metadata filtering by Production Infrastructure product version, doc type. Handles tens of millions of vectors. All agent pods query via gRPC client — fully multi-instance safe.
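The isolation model (named collections plus metadata filters) is the substantive change, and it can be shown without a vector engine. A toy stand-in, assuming the collection names above; a real deployment would use the qdrant-client search API with a query filter rather than this list scan:

```python
from dataclasses import dataclass, field

@dataclass
class Point:
    text: str
    payload: dict          # e.g. {"doc_type": "runbook", "product_version": "..."}

@dataclass
class CollectionStore:
    collections: dict[str, list[Point]] = field(default_factory=dict)

    def upsert(self, collection: str, point: Point) -> None:
        self.collections.setdefault(collection, []).append(point)

    def search(self, collection: str, payload_filter: dict) -> list[Point]:
        # Named collections give UC1/UC2 namespace isolation; the payload
        # filter mirrors Qdrant-style metadata filtering within a collection.
        return [
            p for p in self.collections.get(collection, [])
            if all(p.payload.get(k) == v for k, v in payload_filter.items())
        ]
```

Contrast with sqlite-vec: there is no collection argument at all, so UC1 and UC2 documents land in one shared search space on one pod's disk.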
Four-Phase Multi-Agent Framework Evolution
Each phase is independently deliverable and provides incremental production value. Strictly additive — no phase discards what was built before. The architecture grows with the engagement timeline.
■ Phase 0 (2–3 wks) Harden existing stack → Phase 1 Demo ready in Production Infrastructure Labs
■ Phase 1 (6–8 wks) LangGraph overlay + Qdrant + Valkey → Phase 2 MVP ready on TELAI
■ Phase 2 (8–10 wks) Agent isolation + KEDA + Kafka + Vault/Istio → Phase 3 production ready
■ Phase 3 (12+ wks) Adaptive cognition + episodic memory + MCP graph endpoints
LangGraph Routes. Multi-Agent Framework Executes.
The recommendation is neither "Multi-Agent Framework as-is" nor "rebuild in LangGraph from scratch." The hybrid principle: LangGraph controls what runs and when. Multi-Agent Framework controls how each agent runs and what it has access to. Production Infrastructure engineers never touch graph code.
LangGraph Supervisor treats each Multi-Agent Framework agent as a callable Python node in the StateGraph. Multi-Agent Framework YAML configs continue to define agent personas, tools, and MCP access — zero change for Production Infrastructure engineers. Routing logic, confidence gates, retry policies, and HiTL interrupts live in Python graph code owned by architects. The two layers are independently evolvable.
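A sketch of the node-wrapping boundary. The `call_agent` callable stands in for a POST to the framework's OpenAI-compatible chat endpoint and is injected so the node stays testable; the state keys are hypothetical, not a fixed schema:

```python
from typing import Callable

def make_agent_node(agent_id: str, call_agent: Callable[[str, str], str]):
    """Wrap one YAML-defined agent as a graph node: the graph owns routing
    state, while persona, tools, and MCP access stay in the agent's YAML."""
    def node(state: dict) -> dict:
        reply = call_agent(agent_id, state["input"])
        return {**state, "last_agent": agent_id, "output": reply}
    return node
```

Because the node only knows the agent's ID and transport, Production Infrastructure engineers can change anything inside the YAML persona without the graph code noticing, which is the independent-evolvability claim above.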
Ownership Boundary — YAML vs Graph Code
| Concern | Owned By | How Changed | Who Changes It |
|---|---|---|---|
| Agent persona / system prompt | Multi-Agent Framework YAML | Edit YAML, config reload — no pod redeploy needed | Production Infrastructure engineers, domain experts |
| Tool grants per agent | Multi-Agent Framework YAML | Add / remove tools in agent YAML config | Production Infrastructure engineers |
| MCP server selection per agent | Multi-Agent Framework YAML | Add MCP server to agent's YAML config section | Production Infrastructure engineers |
| LLM model override per agent | Multi-Agent Framework YAML | Change llm.model field in agent YAML | Production Infrastructure engineers |
| Which agent runs after which | LangGraph Python | Conditional edge function in graph code | Architects (KS / Vikas) |
| Confidence threshold value | LangGraph Python | Python constant in graph node — single line | Architects |
| HiTL interrupt points | LangGraph Python | interrupt() call placement in node function | Architects |
| Retry logic / max retries | LangGraph Python | Edge condition counter in graph state | Architects |
| Output routing (Helix vs NOK UI) | LangGraph Python | Conditional edge on output type / destination field | Architects |
MCP server exposure (/mcp): Multi-Agent Framework natively exposes itself as an MCP server. Any MCP-compatible client — Claude Desktop, VSCode Copilot, other Production Infrastructure tools — can consume Multi-Agent Framework capabilities out of the box. LangGraph has no equivalent.
A2A protocol (/a2a/v1): Built-in Agent-to-Agent protocol enables Production Infrastructure agents deployed in different clusters or services to call each other across network boundaries. Not in LangGraph.
YAML zero-code agent definition: Production Infrastructure engineers add a new agent persona by editing a YAML file with no Python changes and no redeployment. In a multi-OpCo rollout where each OpCo needs slightly different agent configs, this is commercially significant — it keeps agent customisation in Production Infrastructure's hands, not the architect's.
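A hypothetical config fragment illustrating the zero-code pattern. The key names (llm:, handoffs:, per-agent tool grants) echo those mentioned elsewhere in this document, but the exact schema shown here is illustrative, not the framework's authoritative format:

```yaml
agents:
  - id: ad_agent
    persona: "You are a CMG anomaly-detection specialist..."
    llm:
      model: gpt-4o          # per-agent model override
      temperature: 0.1
      max_tokens: 4096
    tools: [db, kubernetes]   # per-agent tool grants
    handoffs: [rca_agent]     # directed edge wired by attach_handoffs()
  - id: rca_agent
    persona: "You perform root-cause analysis on confirmed anomalies..."
    llm:
      model: gpt-4o
```

An OpCo-specific variant is a copy of this file with a different persona or tool list, applied via config reload rather than a redeploy.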
Scoring — Three Options Compared
13-dimension weighted comparison. Delivery speed dimensions are weighted Critical given the March-focused TEL–NOK engagement. Greenfield HMA LangGraph scores higher on orchestration and HA design — but collapses on every dimension that determines whether Phase 2 delivers on time.
| Dimension | Multi-Agent Framework Now | Evolved Multi-Agent Framework (Phase 1 complete) | Greenfield HMA (12 wks build) | Weight |
|---|---|---|---|---|
| Deterministic orchestration | 3/10 | 9/10 | 10/10 | HIGH |
| Phase 1 demo readiness | 9/10 | 10/10 | 1/10 | CRITICAL |
| Phase 2 MVP delivery speed | 8/10 | 9/10 | 3/10 | CRITICAL |
| Parallel agent execution | 2/10 | 8/10 | 10/10 | MEDIUM |
| Production HA (Phase 3) | 3/10 | 9/10 | 9/10 | HIGH |
| Production Infrastructure KB / RAG scale | 4/10 | 9/10 | 9/10 | HIGH |
| LLM cost governance | 7/10 | 9/10 | 10/10 | MEDIUM |
| MCP / A2A ecosystem | 10/10 | 10/10 | 2/10 | MEDIUM |
| YAML zero-code config | 10/10 | 10/10 | 1/10 | MEDIUM |
| Security / Vault / OPA | 5/10 | 9/10 | 9/10 | HIGH |
| Observability | 8/10 | 9/10 | 9/10 | MEDIUM |
| Reuse of Production Infrastructure work | 10/10 | 10/10 | 1/10 | CRITICAL |
| Production Infrastructure CMG domain configs | 9/10 | 10/10 | 1/10 | CRITICAL |
| Weighted Total | 6.1 / 10 | 9.2 / 10 ✓ RECOMMENDED | 6.7 / 10 | |
Do not replace Multi-Agent Framework. Evolve it.
Greenfield HMA LangGraph scores higher on orchestration and parallel execution — but only because it hasn't been built yet. When delivery speed is weighted appropriately for the March-focused engagement, the evolved Multi-Agent Framework path (9.2) outscores greenfield (6.7) by 2.5 points. The three production gaps — non-deterministic routing, single-process state, and no parallel execution — are all closable in phases without discarding the Production Infrastructure CMG domain work, LLM Gateway integration, MCP exposure, or YAML-driven agent definition that Multi-Agent Framework already provides.
Phase 0 (2–3 wks) · PG session store + Langfuse + LLM GW validation → Demo ready
Phase 1 (6–8 wks) · LangGraph overlay + Qdrant + Valkey → Phase 2 MVP ready
Phase 2 (8–10 wks) · Agent isolation + KEDA + Kafka + Vault/Istio → Phase 3 production ready
Phase 3 (12+ wks) · Adaptive cognition + episodic memory + MCP graph endpoints