RAG and Chain of Thought Implementation Details
This document explains how RAG (Retrieval-Augmented Generation) and Chain of Thought (CoT) are implemented in the LangGraph routing approach.
Overview
RAG (Retrieval-Augmented Generation)
Purpose: Retrieve relevant hierarchy nodes from the vector database before LLM reasoning
How it works:
- Query is embedded into a vector
- Vector similarity search finds top candidates from Qdrant
- Filtered by level, parent_id, tenant/app, and access control
- Top candidates are passed to LLM as context
Chain of Thought (CoT)
Purpose: Guide LLM to reason step-by-step before making routing decisions
How it works:
- LLM receives query + retrieved candidates
- Prompt instructs LLM to think through reasoning steps
- LLM outputs explicit reasoning chain
- Final decision based on reasoning
RAG Implementation
1. Vector Store Setup
The hierarchy nodes are stored in Qdrant as searchable documents:
```python
# When registering nodes, they're stored as documents
documents = []
for service in hierarchy.services:
    doc = {
        "id": service.id,
        "text": f"""
            Service: {service.name}
            Type: {service.root_type}
            Description: {service.description}
            Use Cases: {service.level_context.use_cases}
            Capabilities: {service.level_context.capabilities}
            Domain: {service.level_context.domain_context}
            Keywords: {service.level_context.routing_keywords.boost_keywords}
            Access Roles: {service.allowed_roles}
        """,
        "metadata": {
            "level": 0,
            "type": "service",
            "root_type": service.root_type,
            "tenant_id": service.tenant_id,
            "app_id": service.app_id,
            "allowed_roles": service.allowed_roles
        },
        "vector": service.embeddings.combined_embedding  # Pre-computed embedding
    }
    documents.append(doc)

# Store in Qdrant vector database
vector_store.add_documents(documents)
```
Current System: This is already implemented! Nodes are stored in Qdrant with embeddings.
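For concreteness, the generic vector_store.add_documents call above could map onto the raw qdrant-client API roughly as follows. This is a minimal sketch under assumptions not stated in this document (local Qdrant URL, a hierarchy_nodes collection name, 1536-dimensional embeddings):

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumed local Qdrant instance
COLLECTION = "hierarchy_nodes"                      # hypothetical collection name

# Create the collection once; 1536 dims is an assumption (depends on the embedding model).
client.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Each document becomes a point: the pre-computed embedding is the vector,
# and the text plus metadata go into the payload so Qdrant can filter on them.
points = [
    PointStruct(
        # Qdrant point IDs must be unsigned ints or UUIDs, so derive a stable UUID
        # from the node id and keep the original id in the payload.
        id=str(uuid.uuid5(uuid.NAMESPACE_URL, doc["id"])),
        vector=doc["vector"],
        payload={"id": doc["id"], "text": doc["text"], **doc["metadata"]},
    )
    for doc in documents
]
client.upsert(collection_name=COLLECTION, points=points)
```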
2. RAG Retrieval Function
```python
from typing import List, Optional

def retrieve_candidates(
    query: str,
    level: int,
    tenant_id: str,
    app_id: str,
    user_roles: List[str],
    parent_id: Optional[str] = None,
    top_k: int = 10
):
    """
    RAG retrieval: semantic similarity search with filters

    Steps:
    1. Embed the query
    2. Vector search in Qdrant
    3. Filter by level, parent, tenant/app, roles
    4. Return top candidates with similarity scores
    """
    # Step 1: Embed query
    query_embedding = embedding_service.embed(query)

    # Step 2: Build Qdrant filter
    filter_conditions = {
        "must": [
            {"key": "level", "match": {"value": level}},
            {"key": "tenant_id", "match": {"value": tenant_id}},
            {"key": "app_id", "match": {"value": app_id}},
            {"key": "status", "match": {"value": "active"}}
        ]
    }

    # Add parent filter if not root level
    if parent_id:
        filter_conditions["must"].append({
            "key": "parent_id",
            "match": {"value": parent_id}
        })

    # Add role-based access control filter: at least one role must match
    role_filter = {
        "key": "allowed_roles",
        "match": {"any": user_roles}  # Qdrant: match any role in array
    }
    filter_conditions["must"].append(role_filter)

    # Step 3: Vector similarity search
    results = vector_store.search(
        query_vector=query_embedding,
        top_k=top_k,
        filter=filter_conditions,
        with_payload=True  # Return full node data
    )

    # Step 4: Format results
    candidates = []
    for result in results:
        candidates.append({
            "id": result.payload["id"],
            "name": result.payload["name"],
            "description": result.payload["description"],
            "similarity": result.score,  # Cosine similarity (0-1)
            "metadata": result.payload["metadata"],
            "full_data": result.payload  # All node fields
        })

    return candidates
```
Current System Equivalent: This is similar to SemanticMatcher.matchNodes(), which already performs vector search with filters.
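For illustration, a service-level call for the query used in the examples below might look like this (tenant, app, and role values are hypothetical):

```python
candidates = retrieve_candidates(
    query="Show network device inventory",
    level=0,                       # service level
    tenant_id="acme",              # hypothetical tenant
    app_id="netops-portal",        # hypothetical app
    user_roles=["developer"],
    top_k=10,
)
# e.g. candidates[0] -> {"id": "sql-service", "similarity": 0.75, ...}
```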
3. RAG in LangGraph Nodes
Each LangGraph node uses RAG to retrieve candidates:
```python
def service_selection_node(state: RoutingState) -> RoutingState:
    """
    Service Selection Node:
    1. RAG Retrieval: get top service candidates
    2. LLM Reasoning: select best service with CoT
    """
    # RAG STEP: Retrieve candidates
    candidates = retrieve_candidates(
        query=state["query"],
        level=0,                                       # Services are level 0
        parent_id=None,                                # Services have no parent
        tenant_id=state["user_context"]["tenant_id"],
        app_id=state["user_context"]["app_id"],
        user_roles=state["user_context"]["roles"],
        top_k=10                                       # Get top 10 candidates
    )

    # Filter to top 5 for LLM context (reduce token usage)
    top_candidates = candidates[:5]

    # Build context string for LLM
    context = "\n".join([
        f"Service {i+1}: {c['name']}\n"
        f"  Type: {c['metadata']['root_type']}\n"
        f"  Description: {c['description']}\n"
        f"  Keywords: {c['metadata'].get('keywords', [])}\n"
        f"  Similarity: {c['similarity']:.2f}\n"
        for i, c in enumerate(top_candidates)
    ])

    # Pass to LLM for reasoning (see CoT section below)
    # ...

    return state
```
Key Points:
- RAG reduces search space: Instead of reasoning over all nodes, LLM only sees top candidates
- Fast retrieval: Vector search is fast (~10-50ms)
- Filtered results: Only relevant, accessible nodes are retrieved
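The RoutingState that these node functions read and write is not defined in this document; a minimal sketch, inferred from the fields used above and in the CoT section below:

```python
from typing import Any, Dict, List, Optional, TypedDict

class RoutingState(TypedDict):
    # Inputs
    query: str
    user_context: Dict[str, Any]             # expects tenant_id, app_id, roles
    # Accumulated routing results
    selected_node: Optional[Dict[str, Any]]  # node chosen at the current level
    confidence: float
    reasoning: List[str]                     # CoT steps collected across levels
    current_path: List[str]                  # node IDs selected so far (service → category → collection)
```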
Chain of Thought Implementation
1. CoT Prompt Structure
The LLM is instructed to reason step-by-step before making a decision:
```python
def build_cot_prompt(query: str, candidates: List[dict], level: str, user_roles: List[str]) -> str:
    """
    Build a Chain of Thought prompt that guides LLM reasoning
    """
    prompt = f"""
You are a routing assistant. Analyze the user query and select the best {level}.

User Query: "{query}"
User Roles: {user_roles}

Available {level}s:
{format_candidates(candidates)}

Use Chain of Thought reasoning. Think through each step:

Step 1: Analyze Query Intent
- What is the user trying to do?
- Is this a documentation search, data query, API call, or MCP configuration?
- Identify the primary intent: [documentation_search | data_query | api_call | mcp_config]

Step 2: Identify Keywords
- Extract key terms from the query
- List all relevant keywords: [keyword1, keyword2, ...]

Step 3: Match Keywords to Candidates
- For each candidate, check if keywords match boost_keywords
- Count keyword matches: Candidate X has Y keyword matches
- Identify penalty keywords: Candidate X has Z penalty keywords

Step 4: Evaluate Service Type Match
- Check if query intent matches candidate root_type
- SEARCH: documentation, knowledge base, help articles
- SQL: data queries, inventory, reports, analytics
- REST: API calls, service status, health checks
- MCP: tool configuration, agent setup

Step 5: Check Access Control
- Verify user roles match candidate allowed_roles
- User has roles: {user_roles}
- Candidate requires: [list of allowed roles]
- Access granted: [yes/no]

Step 6: Evaluate Semantic Similarity
- Compare query embedding to candidate embedding
- Similarity scores: Candidate X = 0.XX
- Higher similarity = better match

Step 7: Apply Keyword Boosts/Penalties
- For each candidate:
  - Base similarity: 0.XX
  - Keyword boost: +0.XX (if keywords match)
  - Penalty: -0.XX (if penalty keywords match)
  - Final score: 0.XX

Step 8: Make Final Decision
- Compare all candidates
- Select best match based on:
  * Intent match (most important)
  * Keyword matches
  * Semantic similarity
  * Access control
- Confidence: [0.0-1.0]
- Reasoning: [brief explanation]

Output your reasoning in this format:

REASONING:
Step 1: [your analysis]
Step 2: [your analysis]
...
Step 8: [your analysis]

DECISION:
Selected {level} ID: [id]
Confidence: [0.0-1.0]
Reasoning Summary: [1-2 sentence explanation]
"""
    return prompt
```
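The prompt above calls a format_candidates helper that is not defined in this document; one possible sketch that reuses the candidate dictionaries returned by retrieve_candidates:

```python
from typing import List

def format_candidates(candidates: List[dict]) -> str:
    """Render retrieved candidates as a numbered block for the CoT prompt."""
    lines = []
    for i, c in enumerate(candidates, start=1):
        lines.append(
            f"{i}. id: {c['id']}, name: {c['name']}\n"
            f"   type: {c['metadata'].get('root_type')}\n"
            f"   description: {c['description']}\n"
            f"   boost_keywords: {c['metadata'].get('keywords', [])}\n"
            f"   similarity: {c['similarity']:.2f}"
        )
    return "\n\n".join(lines)
```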
2. CoT in Service Selection Node
```python
def service_selection_node(state: RoutingState) -> RoutingState:
    """
    Service Selection with Chain of Thought
    """
    # RAG: Retrieve candidates
    candidates = retrieve_candidates(
        query=state["query"],
        level=0,
        tenant_id=state["user_context"]["tenant_id"],
        app_id=state["user_context"]["app_id"],
        user_roles=state["user_context"]["roles"]
    )

    # Build CoT prompt
    prompt = build_cot_prompt(
        query=state["query"],
        candidates=candidates[:5],  # Top 5 for LLM
        level="service",
        user_roles=state["user_context"]["roles"]
    )

    # LLM call with structured output
    response = llm.invoke(prompt)

    # Parse LLM response
    decision = parse_llm_response(response)
    # Returns: {
    #   "selected_id": "sql-service",
    #   "confidence": 0.90,
    #   "reasoning": ["Step 1: ...", "Step 2: ...", ...],
    #   "reasoning_summary": "Strong keyword match + intent match"
    # }

    # Update state
    selected_service = next(c for c in candidates if c["id"] == decision["selected_id"])
    state["selected_node"] = selected_service
    state["confidence"] = decision["confidence"]
    state["reasoning"].extend(decision["reasoning"])  # Add all reasoning steps
    state["current_path"].append(selected_service["id"])

    return state
```
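The llm object used above is left abstract; with LangChain it could be constructed along these lines (the model name and temperature are assumptions, and note that invoke returns a message object whose .content holds the text the parser expects):

```python
from langchain_openai import ChatOpenAI

# Temperature 0 keeps the step-by-step reasoning as reproducible as possible.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

response = llm.invoke(prompt)                    # returns an AIMessage
decision = parse_llm_response(response.content)  # the parser works on the raw text
```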
3. Example CoT Reasoning Output
For query: "Show network device inventory"
REASONING:
Step 1: Analyze Query Intent
- User wants to "show" data about "network device inventory"
- This is a data query (user wants to retrieve data, not search documents)
- Primary intent: data_query
Step 2: Identify Keywords
- Key terms: ["show", "network", "device", "inventory"]
- Important keywords: ["inventory", "device", "network"]
Step 3: Match Keywords to Candidates
- sql-service:
- Boost keywords: ["inventory", "server", "device", "network device"]
- Matches: "inventory" ✓, "device" ✓, "network device" ✓
- Total matches: 3
- document-search-service:
- Boost keywords: ["search", "find", "documentation"]
- Matches: "show" (partial, but not in boost list)
- Total matches: 0
- Penalty keywords: ["inventory", "server", "device"]
- Penalties: "inventory" ✗, "device" ✗
Step 4: Evaluate Service Type Match
- Query intent: data_query
- sql-service root_type: SQL → Matches data_query ✓
- document-search-service root_type: SEARCH → Doesn't match data_query ✗
Step 5: Check Access Control
- User roles: ["developer"]
- sql-service allowed_roles: ["admin", "developer", "analyst"] → Access granted ✓
- document-search-service allowed_roles: ["admin", "developer", "analyst"] → Access granted ✓
Step 6: Evaluate Semantic Similarity
- sql-service similarity: 0.75 (high match with "Server Inventory Database")
- document-search-service similarity: 0.65 (lower match)
Step 7: Apply Keyword Boosts/Penalties
- sql-service:
- Base similarity: 0.75
- Keyword boost: +0.6 (3 keyword matches × 0.2 each)
- Final score: min(0.75 + 0.6, 1.0) = 1.0
- document-search-service:
- Base similarity: 0.65
- Penalty: -0.4 (2 penalty keywords × -0.2 each)
- Final score: max(0.65 - 0.4, 0.0) = 0.25
Step 8: Make Final Decision
- sql-service: Score 1.0, Intent match ✓, Access ✓
- document-search-service: Score 0.25, Intent mismatch ✗, Access ✓
- Best match: sql-service
- Confidence: 0.95 (very high - clear match)
DECISION:
Selected service ID: sql-service
Confidence: 0.95
Reasoning Summary: Strong keyword match (3 matches) + intent match (data_query → SQL) + high semantic similarity (0.75) = clear winner
4. Structured Output Parsing
```python
import re

def parse_llm_response(response: str) -> dict:
    """
    Parse the LLM's CoT response into a structured format
    """
    # Extract reasoning steps
    reasoning_match = re.search(r'REASONING:\s*(.*?)(?=DECISION:)', response, re.DOTALL)
    reasoning_text = reasoning_match.group(1) if reasoning_match else ""

    # Split into steps
    reasoning_steps = []
    for line in reasoning_text.split('\n'):
        if line.strip().startswith('Step'):
            reasoning_steps.append(line.strip())

    # Extract decision
    decision_match = re.search(
        r'Selected.*?ID:\s*([^\n]+).*?Confidence:\s*([0-9.]+).*?Reasoning Summary:\s*([^\n]+)',
        response,
        re.DOTALL
    )
    if not decision_match:
        raise ValueError("Could not parse LLM response")

    return {
        "selected_id": decision_match.group(1).strip(),
        "confidence": float(decision_match.group(2)),
        "reasoning": reasoning_steps,
        "reasoning_summary": decision_match.group(3).strip()
    }
```
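Applied to the example reasoning output shown earlier (assuming example_response holds that raw text), the parser would return roughly:

```python
decision = parse_llm_response(example_response)
# {
#     "selected_id": "sql-service",
#     "confidence": 0.95,
#     "reasoning": ["Step 1: Analyze Query Intent", ..., "Step 8: Make Final Decision"],
#     "reasoning_summary": "Strong keyword match (3 matches) + intent match (data_query → SQL) + ..."
# }
```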
Complete Flow: RAG + CoT Together
Example: Routing "Show network device inventory"
┌─────────────────────────────────────────────────────────────┐
│ Step 1: RAG Retrieval (Service Level) │
├─────────────────────────────────────────────────────────────┤
│ Query: "Show network device inventory" │
│ → Embed query │
│ → Vector search in Qdrant (level=0, filters: tenant/app) │
│ → Returns top 10 candidates: │
│ 1. sql-service (similarity: 0.75) │
│ 2. document-search-service (similarity: 0.65) │
│ 3. rest-service (similarity: 0.45) │
│ ... │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 2: Chain of Thought Reasoning │
├─────────────────────────────────────────────────────────────┤
│ LLM receives: │
│ - Query: "Show network device inventory" │
│ - Top 5 candidates with metadata │
│ │
│ LLM reasons through 8 steps: │
│ Step 1: Intent = data_query │
│ Step 2: Keywords = ["inventory", "device", "network"] │
│ Step 3: sql-service has 3 keyword matches │
│ Step 4: SQL type matches data_query intent │
│ Step 5: User has access (developer role) │
│ Step 6: Semantic similarity: sql-service (0.75) │
│ Step 7: Final score: sql-service (1.0) │
│ Step 8: Decision: sql-service, confidence 0.95 │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 3: Update State │
├─────────────────────────────────────────────────────────────┤
│ state["selected_node"] = sql-service │
│ state["confidence"] = 0.95 │
│ state["reasoning"] = [all 8 steps] │
│ state["current_path"] = ["sql-service"] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 4: RAG Retrieval (Category Level) │
├─────────────────────────────────────────────────────────────┤
│ Query: "Show network device inventory" │
│ Parent: sql-service │
│ → Vector search (level=1, parent_id=sql-service) │
│ → Returns top 5 categories │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 5: Chain of Thought (Category Selection) │
├─────────────────────────────────────────────────────────────┤
│ LLM reasons through category selection │
│ (similar 8-step process) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 6: RAG Retrieval (Collection Level) │
├─────────────────────────────────────────────────────────────┤
│ Query: "Show network device inventory" │
│ Parent: infrastructure-data-category │
│ → Vector search (level=2, parent_id=category) │
│ → Returns top 3 collections │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 7: Chain of Thought (Collection Selection) │
├─────────────────────────────────────────────────────────────┤
│ LLM reasons through collection selection │
│ → Selects: server-inventory-db │
│ → Confidence: 0.95 │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Final Response │
├─────────────────────────────────────────────────────────────┤
│ { │
│ "route": { │
│ "source_id": "server-inventory-db", │
│ "hierarchy_path": [ │
│ "sql-service", │
│ "infrastructure-data-category", │
│ "server-inventory-db" │
│ ] │
│ }, │
│ "confidence": 0.95, │
│ "reasoning": [ │
│ "Step 1: Intent = data_query", │
│ "Step 2: Keywords = [inventory, device, network]", │
│ "Step 3: sql-service has 3 keyword matches", │
│ ... │
│ ], │
│ "metadata": { │
│ "routing_mode": "ai", │
│ "llm_calls": 3, // One per level │
│ "reasoning_steps": 24 // 8 steps × 3 levels │
│ } │
│ } │
└─────────────────────────────────────────────────────────────┘
Comparison: Current System vs RAG + CoT
Current System (Deterministic)
// Current: Pure algorithmic approach
1. Embed query
2. Vector search → Get candidates
3. Calculate similarity scores
4. Apply keyword boosts/penalties (rule-based)
5. Select highest score
6. No explicit reasoning
Pros:
- Fast (~50ms)
- No LLM costs
- Deterministic (same query = same result)
Cons:
- No explainability (just scores)
- Hard to adapt (requires code changes)
- Can't handle ambiguous queries well
RAG + CoT System
# RAG + CoT: AI-powered reasoning
1. RAG: Embed query → Vector search → Get candidates (same as current)
2. CoT: LLM reasons through 8 steps
3. LLM makes decision with explicit reasoning
4. Return route + reasoning chain
Pros:
- Explainable (full reasoning chain)
- Adaptive (handles edge cases)
- Better for ambiguous queries
- Can provide alternatives
Cons:
- Slower (~500-2000ms)
- Higher cost ($0.01-0.05 per query)
- Non-deterministic (may vary slightly)
Hybrid Approach (Best of Both)
Use RAG for Retrieval (Fast)
```python
# RAG: Fast vector search (same as current)
candidates = retrieve_candidates(query, level, ...)  # ~10-50ms
```
Use CoT for Reasoning (Intelligent)
```python
# CoT: LLM reasoning (only for ambiguous queries)
if is_ambiguous_query(query, candidates):
    decision = llm_cot_reasoning(query, candidates)   # ~500ms
else:
    decision = deterministic_selection(candidates)    # ~1ms
```
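The is_ambiguous_query check is not specified in this document; one simple heuristic sketch that works off the similarity scores RAG already returns (the thresholds are assumptions to be tuned):

```python
from typing import List

def is_ambiguous_query(query: str, candidates: List[dict]) -> bool:
    """Heuristic: fall back to CoT when RAG scores don't produce a clear winner."""
    if not candidates:
        return True                  # nothing retrieved: let the LLM reason or ask for clarification
    if len(candidates) == 1:
        return False                 # only one accessible option, no ambiguity
    top = candidates[0]["similarity"]
    runner_up = candidates[1]["similarity"]
    # Assumed thresholds: low absolute confidence or a narrow margin means "ambiguous"
    return top < 0.6 or (top - runner_up) < 0.1
```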
Benefits
- Fast for clear queries: Use deterministic (current system)
- Intelligent for ambiguous: Use RAG + CoT
- Best of both worlds: Speed + explainability
Implementation in Current Codebase
What's Already There (RAG Foundation)
- Vector Store: Qdrant with hierarchy nodes ✅
  src/services/hierarchy-registry/modules/node-repository.ts - Nodes stored with embeddings
- Vector Search: Semantic matching ✅
  src/services/semantic-matcher.service.ts - matchNodes() does vector search with filters
- Embedding Generation: Node embeddings ✅
  src/services/hierarchy-registry/modules/embedding-generator.ts - Generates embeddings for all node fields
What Needs to Be Added (CoT Layer)
- LLM Integration: Add LangChain/LangGraph
  npm install @langchain/langgraph @langchain/openai
- CoT Prompts: Build reasoning prompts
  - Create prompt templates
  - Define reasoning steps
- LangGraph Workflow: Orchestrate nodes (see the sketch after this list)
  - Service selection node
  - Category selection node
  - Collection selection node
- Response Parsing: Extract reasoning from LLM
  - Parse structured output
  - Extract reasoning steps
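As referenced in the list above, wiring the three selection nodes into a LangGraph workflow could look roughly like this. The sketch uses the Python LangGraph API to stay consistent with the pseudocode in this document (the actual codebase would use the @langchain/langgraph TypeScript package installed above), and it assumes the RoutingState and selection-node functions sketched earlier; the category and collection nodes are the analogues still to be written:

```python
from langgraph.graph import StateGraph, END

# Assumes RoutingState plus service/category/collection selection node functions.
workflow = StateGraph(RoutingState)
workflow.add_node("service_selection", service_selection_node)
workflow.add_node("category_selection", category_selection_node)
workflow.add_node("collection_selection", collection_selection_node)

workflow.set_entry_point("service_selection")
workflow.add_edge("service_selection", "category_selection")
workflow.add_edge("category_selection", "collection_selection")
workflow.add_edge("collection_selection", END)

graph = workflow.compile()

result = graph.invoke({
    "query": "Show network device inventory",
    "user_context": {"tenant_id": "acme", "app_id": "netops-portal", "roles": ["developer"]},
    "selected_node": None,
    "confidence": 0.0,
    "reasoning": [],
    "current_path": [],
})
# result["current_path"] -> ["sql-service", "infrastructure-data-category", "server-inventory-db"]
```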
Key Takeaways
- RAG = Fast Retrieval: Vector search finds relevant candidates quickly
- CoT = Intelligent Reasoning: LLM thinks through decision step-by-step
- Together: RAG narrows search space, CoT makes intelligent decision
- Current System: Already has RAG foundation (vector search)
- Missing: CoT layer (LLM reasoning) needs to be added
Next Steps
To implement RAG + CoT:
- Add LangGraph dependencies
- Create CoT prompt templates
- Build LangGraph workflow nodes
- Integrate with existing RAG (vector search)
- Add new /api/v1/route/ai endpoint
- Test and optimize
The RAG foundation is already there - we just need to add the CoT reasoning layer on top!