RAG and Chain of Thought Implementation Details
This document explains how RAG (Retrieval-Augmented Generation) and Chain of Thought (CoT) are implemented in the LangGraph routing approach.
Overview
RAG (Retrieval-Augmented Generation)
Purpose: Retrieve relevant hierarchy nodes from the vector database before LLM reasoning
How it works:
- Query is embedded into a vector
- Vector similarity search finds top candidates from Qdrant
- Filtered by level, parent_id, tenant/app, and access control
- Top candidates are passed to LLM as context
Chain of Thought (CoT)
Purpose: Guide LLM to reason step-by-step before making routing decisions
How it works:
- LLM receives query + retrieved candidates
- Prompt instructs LLM to think through reasoning steps
- LLM outputs explicit reasoning chain
- Final decision based on reasoning
RAG Implementation
1. Vector Store Setup
The hierarchy nodes are stored in Qdrant as searchable documents:
```python
# When registering nodes, they're stored as documents
documents = []
for service in hierarchy.services:
    doc = {
        "id": service.id,
        "text": f"""
            Service: {service.name}
            Type: {service.root_type}
            Description: {service.description}
            Use Cases: {service.level_context.use_cases}
            Capabilities: {service.level_context.capabilities}
            Domain: {service.level_context.domain_context}
            Keywords: {service.level_context.routing_keywords.boost_keywords}
            Access Roles: {service.allowed_roles}
        """,
        "metadata": {
            "level": 0,
            "type": "service",
            "root_type": service.root_type,
            "tenant_id": service.tenant_id,
            "app_id": service.app_id,
            "allowed_roles": service.allowed_roles
        },
        "vector": service.embeddings.combined_embedding  # Pre-computed embedding
    }
    documents.append(doc)

# Store in Qdrant vector database
vector_store.add_documents(documents)
```
Current System: This is already implemented! Nodes are stored in Qdrant with embeddings.
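For concreteness, the generic vector_store.add_documents call above could map onto the raw qdrant-client API roughly as follows. This is a minimal sketch under assumptions not stated in this document (local Qdrant URL, a hierarchy_nodes collection name, 1536-dimensional embeddings):

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumed local Qdrant instance
COLLECTION = "hierarchy_nodes"                      # hypothetical collection name

# Create the collection once; 1536 dims is an assumption (depends on the embedding model).
client.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Each document becomes a point: the pre-computed embedding is the vector,
# and the text plus metadata go into the payload so Qdrant can filter on them.
points = [
    PointStruct(
        # Qdrant point IDs must be unsigned ints or UUIDs, so derive a stable UUID
        # from the node id and keep the original id in the payload.
        id=str(uuid.uuid5(uuid.NAMESPACE_URL, doc["id"])),
        vector=doc["vector"],
        payload={"id": doc["id"], "text": doc["text"], **doc["metadata"]},
    )
    for doc in documents
]
client.upsert(collection_name=COLLECTION, points=points)
```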
2. RAG Retrieval Function
```python
from typing import List, Optional

def retrieve_candidates(
    query: str,
    level: int,
    tenant_id: str,
    app_id: str,
    user_roles: List[str],
    parent_id: Optional[str] = None,
    top_k: int = 10
):
    """
    RAG retrieval: semantic similarity search with filters

    Steps:
    1. Embed the query
    2. Vector search in Qdrant
    3. Filter by level, parent, tenant/app, roles
    4. Return top candidates with similarity scores
    """
    # Step 1: Embed query
    query_embedding = embedding_service.embed(query)

    # Step 2: Build Qdrant filter
    filter_conditions = {
        "must": [
            {"key": "level", "match": {"value": level}},
            {"key": "tenant_id", "match": {"value": tenant_id}},
            {"key": "app_id", "match": {"value": app_id}},
            {"key": "status", "match": {"value": "active"}}
        ]
    }

    # Add parent filter if not root level
    if parent_id:
        filter_conditions["must"].append({
            "key": "parent_id",
            "match": {"value": parent_id}
        })

    # Add role-based access control filter: at least one role must match
    role_filter = {
        "key": "allowed_roles",
        "match": {"any": user_roles}  # Qdrant: match any role in array
    }
    filter_conditions["must"].append(role_filter)

    # Step 3: Vector similarity search
    results = vector_store.search(
        query_vector=query_embedding,
        top_k=top_k,
        filter=filter_conditions,
        with_payload=True  # Return full node data
    )

    # Step 4: Format results
    candidates = []
    for result in results:
        candidates.append({
            "id": result.payload["id"],
            "name": result.payload["name"],
            "description": result.payload["description"],
            "similarity": result.score,  # Cosine similarity (0-1)
            "metadata": result.payload["metadata"],
            "full_data": result.payload  # All node fields
        })

    return candidates
```
Current System Equivalent: This is similar to SemanticMatcher.matchNodes(), which already performs vector search with filters.
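For illustration, a service-level call for the query used in the examples below might look like this (tenant, app, and role values are hypothetical):

```python
candidates = retrieve_candidates(
    query="Show network device inventory",
    level=0,                       # service level
    tenant_id="acme",              # hypothetical tenant
    app_id="netops-portal",        # hypothetical app
    user_roles=["developer"],
    top_k=10,
)
# e.g. candidates[0] -> {"id": "sql-service", "similarity": 0.75, ...}
```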
3. RAG in LangGraph Nodes
Each LangGraph node uses RAG to retrieve candidates:
```python
def service_selection_node(state: RoutingState) -> RoutingState:
    """
    Service Selection Node:
    1. RAG Retrieval: get top service candidates
    2. LLM Reasoning: select best service with CoT
    """
    # RAG STEP: Retrieve candidates
    candidates = retrieve_candidates(
        query=state["query"],
        level=0,                                       # Services are level 0
        parent_id=None,                                # Services have no parent
        tenant_id=state["user_context"]["tenant_id"],
        app_id=state["user_context"]["app_id"],
        user_roles=state["user_context"]["roles"],
        top_k=10                                       # Get top 10 candidates
    )

    # Filter to top 5 for LLM context (reduce token usage)
    top_candidates = candidates[:5]

    # Build context string for LLM
    context = "\n".join([
        f"Service {i+1}: {c['name']}\n"
        f"  Type: {c['metadata']['root_type']}\n"
        f"  Description: {c['description']}\n"
        f"  Keywords: {c['metadata'].get('keywords', [])}\n"
        f"  Similarity: {c['similarity']:.2f}\n"
        for i, c in enumerate(top_candidates)
    ])

    # Pass to LLM for reasoning (see CoT section below)
    # ...

    return state
```
Key Points:
- RAG reduces search space: Instead of reasoning over all nodes, LLM only sees top candidates
- Fast retrieval: Vector search is fast (~10-50ms)
- Filtered results: Only relevant, accessible nodes are retrieved
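The RoutingState that these node functions read and write is not defined in this document; a minimal sketch, inferred from the fields used above and in the CoT section below:

```python
from typing import Any, Dict, List, Optional, TypedDict

class RoutingState(TypedDict):
    # Inputs
    query: str
    user_context: Dict[str, Any]             # expects tenant_id, app_id, roles
    # Accumulated routing results
    selected_node: Optional[Dict[str, Any]]  # node chosen at the current level
    confidence: float
    reasoning: List[str]                     # CoT steps collected across levels
    current_path: List[str]                  # node IDs selected so far (service → category → collection)
```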
Chain of Thought Implementation
1. CoT Prompt Structure
The LLM is instructed to reason step-by-step before making a decision:
```python
def build_cot_prompt(query: str, candidates: List[dict], level: str, user_roles: List[str]) -> str:
    """
    Build a Chain of Thought prompt that guides LLM reasoning
    """
    prompt = f"""
You are a routing assistant. Analyze the user query and select the best {level}.

User Query: "{query}"
User Roles: {user_roles}

Available {level}s:
{format_candidates(candidates)}

Use Chain of Thought reasoning. Think through each step:

Step 1: Analyze Query Intent
- What is the user trying to do?
- Is this a documentation search, data query, API call, or MCP configuration?
- Identify the primary intent: [documentation_search | data_query | api_call | mcp_config]

Step 2: Identify Keywords
- Extract key terms from the query
- List all relevant keywords: [keyword1, keyword2, ...]

Step 3: Match Keywords to Candidates
- For each candidate, check if keywords match boost_keywords
- Count keyword matches: Candidate X has Y keyword matches
- Identify penalty keywords: Candidate X has Z penalty keywords

Step 4: Evaluate Service Type Match
- Check if query intent matches candidate root_type
- SEARCH: documentation, knowledge base, help articles
- SQL: data queries, inventory, reports, analytics
- REST: API calls, service status, health checks
- MCP: tool configuration, agent setup

Step 5: Check Access Control
- Verify user roles match candidate allowed_roles
- User has roles: {user_roles}
- Candidate requires: [list of allowed roles]
- Access granted: [yes/no]

Step 6: Evaluate Semantic Similarity
- Compare query embedding to candidate embedding
- Similarity scores: Candidate X = 0.XX
- Higher similarity = better match

Step 7: Apply Keyword Boosts/Penalties
- For each candidate:
  - Base similarity: 0.XX
  - Keyword boost: +0.XX (if keywords match)
  - Penalty: -0.XX (if penalty keywords match)
  - Final score: 0.XX

Step 8: Make Final Decision
- Compare all candidates
- Select best match based on:
  * Intent match (most important)
  * Keyword matches
  * Semantic similarity
  * Access control
- Confidence: [0.0-1.0]
- Reasoning: [brief explanation]

Output your reasoning in this format:

REASONING:
Step 1: [your analysis]
Step 2: [your analysis]
...
Step 8: [your analysis]

DECISION:
Selected {level} ID: [id]
Confidence: [0.0-1.0]
Reasoning Summary: [1-2 sentence explanation]
"""
    return prompt
```
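The prompt above calls a format_candidates helper that is not defined in this document; one possible sketch that reuses the candidate dictionaries returned by retrieve_candidates:

```python
from typing import List

def format_candidates(candidates: List[dict]) -> str:
    """Render retrieved candidates as a numbered block for the CoT prompt."""
    lines = []
    for i, c in enumerate(candidates, start=1):
        lines.append(
            f"{i}. id: {c['id']}, name: {c['name']}\n"
            f"   type: {c['metadata'].get('root_type')}\n"
            f"   description: {c['description']}\n"
            f"   boost_keywords: {c['metadata'].get('keywords', [])}\n"
            f"   similarity: {c['similarity']:.2f}"
        )
    return "\n\n".join(lines)
```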
2. CoT in Service Selection Node
```python
def service_selection_node(state: RoutingState) -> RoutingState:
    """
    Service Selection with Chain of Thought
    """
    # RAG: Retrieve candidates
    candidates = retrieve_candidates(
        query=state["query"],
        level=0,
        tenant_id=state["user_context"]["tenant_id"],
        app_id=state["user_context"]["app_id"],
        user_roles=state["user_context"]["roles"]
    )

    # Build CoT prompt
    prompt = build_cot_prompt(
        query=state["query"],
        candidates=candidates[:5],  # Top 5 for LLM
        level="service",
        user_roles=state["user_context"]["roles"]
    )

    # LLM call with structured output
    response = llm.invoke(prompt)

    # Parse LLM response
    decision = parse_llm_response(response)
    # Returns: {
    #   "selected_id": "sql-service",
    #   "confidence": 0.90,
    #   "reasoning": ["Step 1: ...", "Step 2: ...", ...],
    #   "reasoning_summary": "Strong keyword match + intent match"
    # }

    # Update state
    selected_service = next(c for c in candidates if c["id"] == decision["selected_id"])
    state["selected_node"] = selected_service
    state["confidence"] = decision["confidence"]
    state["reasoning"].extend(decision["reasoning"])  # Add all reasoning steps
    state["current_path"].append(selected_service["id"])

    return state
```
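The llm object used above is left abstract; with LangChain it could be constructed along these lines (the model name and temperature are assumptions, and note that invoke returns a message object whose .content holds the text the parser expects):

```python
from langchain_openai import ChatOpenAI

# Temperature 0 keeps the step-by-step reasoning as reproducible as possible.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

response = llm.invoke(prompt)                    # returns an AIMessage
decision = parse_llm_response(response.content)  # the parser works on the raw text
```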
3. Example CoT Reasoning Output
For query: "Show network device inventory"
REASONING:
Step 1: Analyze Query Intent
- User wants to "show" data about "network device inventory"
- This is a data query (user wants to retrieve data, not search documents)
- Primary intent: data_query
Step 2: Identify Keywords
- Key terms: ["show", "network", "device", "inventory"]
- Important keywords: ["inventory", "device", "network"]
Step 3: Match Keywords to Candidates
- sql-service:
- Boost keywords: ["inventory", "server", "device", "network device"]
- Matches: "inventory" ✓, "device" ✓, "network device" ✓
- Total matches: 3
- document-search-service:
- Boost keywords: ["search", "find", "documentation"]
- Matches: "show" (partial, but not in boost list)
- Total matches: 0
- Penalty keywords: ["inventory", "server", "device"]
- Penalties: "inventory" ✗, "device" ✗
Step 4: Evaluate Service Type Match
- Query intent: data_query
- sql-service root_type: SQL → Matches data_query ✓
- document-search-service root_type: SEARCH → Doesn't match data_query ✗
Step 5: Check Access Control
- User roles: ["developer"]
- sql-service allowed_roles: ["admin", "developer", "analyst"] → Access granted ✓
- document-search-service allowed_roles: ["admin", "developer", "analyst"] → Access granted ✓
Step 6: Evaluate Semantic Similarity
- sql-service similarity: 0.75 (high match with "Server Inventory Database")
- document-search-service similarity: 0.65 (lower match)
Step 7: Apply Keyword Boosts/Penalties
- sql-service:
- Base similarity: 0.75
- Keyword boost: +0.6 (3 keyword matches × 0.2 each)
- Final score: min(0.75 + 0.6, 1.0) = 1.0
- document-search-service:
- Base similarity: 0.65
- Penalty: -0.4 (2 penalty keywords × -0.2 each)
- Final score: max(0.65 - 0.4, 0.0) = 0.25
Step 8: Make Final Decision
- sql-service: Score 1.0, Intent match ✓, Access ✓
- document-search-service: Score 0.25, Intent mismatch ✗, Access ✓
- Best match: sql-service
- Confidence: 0.95 (very high - clear match)
DECISION:
Selected service ID: sql-service
Confidence: 0.95
Reasoning Summary: Strong keyword match (3 matches) + intent match (data_query → SQL) + high semantic similarity (0.75) = clear winner
4. Structured Output Parsing
```python
import re

def parse_llm_response(response: str) -> dict:
    """
    Parse the LLM's CoT response into a structured format
    """
    # Extract reasoning steps
    reasoning_match = re.search(r'REASONING:\s*(.*?)(?=DECISION:)', response, re.DOTALL)
    reasoning_text = reasoning_match.group(1) if reasoning_match else ""

    # Split into steps
    reasoning_steps = []
    for line in reasoning_text.split('\n'):
        if line.strip().startswith('Step'):
            reasoning_steps.append(line.strip())

    # Extract decision
    decision_match = re.search(
        r'Selected.*?ID:\s*([^\n]+).*?Confidence:\s*([0-9.]+).*?Reasoning Summary:\s*([^\n]+)',
        response,
        re.DOTALL
    )
    if not decision_match:
        raise ValueError("Could not parse LLM response")

    return {
        "selected_id": decision_match.group(1).strip(),
        "confidence": float(decision_match.group(2)),
        "reasoning": reasoning_steps,
        "reasoning_summary": decision_match.group(3).strip()
    }
```
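Applied to the example reasoning output shown earlier (assuming example_response holds that raw text), the parser would return roughly:

```python
decision = parse_llm_response(example_response)
# {
#     "selected_id": "sql-service",
#     "confidence": 0.95,
#     "reasoning": ["Step 1: Analyze Query Intent", ..., "Step 8: Make Final Decision"],
#     "reasoning_summary": "Strong keyword match (3 matches) + intent match (data_query → SQL) + ..."
# }
```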
Complete Flow: RAG + CoT Together
Example: Routing "Show network device inventory"
┌─────────────────────────────────────────────────────────────┐
│ Step 1: RAG Retrieval (Service Level) │
├─────────────────────────────────────────────────────────────┤
│ Query: "Show network device inventory" │
│ → Embed query │
│ → Vector search in Qdrant (level=0, filters: tenant/app) │
│ → Returns top 10 candidates: │
│ 1. sql-service (similarity: 0.75) │
│ 2. document-search-service (similarity: 0.65) │
│ 3. rest-service (similarity: 0.45) │
│ ... │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 2: Chain of Thought Reasoning │
├─────────────────────────────────────────────────────────────┤
│ LLM receives: │
│ - Query: "Show network device inventory" │
│ - Top 5 candidates with metadata │
│ │
│ LLM reasons through 8 steps: │
│ Step 1: Intent = data_query │
│ Step 2: Keywords = ["inventory", "device", "network"] │
│ Step 3: sql-service has 3 keyword matches │
│ Step 4: SQL type matches data_query intent │
│ Step 5: User has access (developer role) │
│ Step 6: Semantic similarity: sql-service (0.75) │
│ Step 7: Final score: sql-service (1.0) │
│ Step 8: Decision: sql-service, confidence 0.95 │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 3: Update State │
├─────────────────────────────────────────────────────────────┤
│ state["selected_node"] = sql-service │
│ state["confidence"] = 0.95 │
│ state["reasoning"] = [all 8 steps] │
│ state["current_path"] = ["sql-service"] │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 4: RAG Retrieval (Category Level) │
├─────────────────────────────────────────────────────────────┤
│ Query: "Show network device inventory" │
│ Parent: sql-service │
│ → Vector search (level=1, parent_id=sql-service) │
│ → Returns top 5 categories │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 5: Chain of Thought (Category Selection) │
├─────────────────────────────────────────────────────────────┤
│ LLM reasons through category selection │
│ (similar 8-step process) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 6: RAG Retrieval (Collection Level) │
├─────────────────────────────────────────────────────────────┤
│ Query: "Show network device inventory" │
│ Parent: infrastructure-data-category │
│ → Vector search (level=2, parent_id=category) │
│ → Returns top 3 collections │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Step 7: Chain of Thought (Collection Selection) │
├─────────────────────────────────────────────────────────────┤
│ LLM reasons through collection selection │
│ → Selects: server-inventory-db │
│ → Confidence: 0.95 │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Final Response │
├─────────────────────────────────────────────────────────────┤
│ { │
│ "route": { │
│ "source_id": "server-inventory-db", │
│ "hierarchy_path": [ │
│ "sql-service", │
│ "infrastructure-data-category", │
│ "server-inventory-db" │
│ ] │
│ }, │
│ "confidence": 0.95, │
│ "reasoning": [ │
│ "Step 1: Intent = data_query", │
│ "Step 2: Keywords = [inventory, device, network]", │
│ "Step 3: sql-service has 3 keyword matches", │
│ ... │
│ ], │
│ "metadata": { │
│ "routing_mode": "ai", │
│ "llm_calls": 3, // One per level │
│ "reasoning_steps": 24 // 8 steps × 3 levels │
│ } │
│ } │
└─────────────────────────────────────────────────────────────┘
Comparison: Current System vs RAG + CoT
Current System (Deterministic)
// Current: Pure algorithmic approach
1. Embed query
2. Vector search → Get candidates
3. Calculate similarity scores
4. Apply keyword boosts/penalties (rule-based)
5. Select highest score
6. No explicit reasoning
Pros:
- Fast (~50ms)
- No LLM costs
- Deterministic (same query = same result)
Cons:
- No explainability (just scores)
- Hard to adapt (requires code changes)
- Can't handle ambiguous queries well
RAG + CoT System
# RAG + CoT: AI-powered reasoning
1. RAG: Embed query → Vector search → Get candidates (same as current)
2. CoT: LLM reasons through 8 steps
3. LLM makes decision with explicit reasoning
4. Return route + reasoning chain
Pros:
- Explainable (full reasoning chain)
- Adaptive (handles edge cases)
- Better for ambiguous queries
- Can provide alternatives
Cons:
- Slower (~500-2000ms)
- Higher cost ($0.01-0.05 per query)
- Non-deterministic (may vary slightly)
Hybrid Approach (Best of Both)
Use RAG for Retrieval (Fast)
```python
# RAG: Fast vector search (same as current)
candidates = retrieve_candidates(query, level, ...)  # ~10-50ms
```
Use CoT for Reasoning (Intelligent)
```python
# CoT: LLM reasoning (only for ambiguous queries)
if is_ambiguous_query(query, candidates):
    decision = llm_cot_reasoning(query, candidates)   # ~500ms
else:
    decision = deterministic_selection(candidates)    # ~1ms
```
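The is_ambiguous_query check is not specified in this document; one simple heuristic sketch that works off the similarity scores RAG already returns (the thresholds are assumptions to be tuned):

```python
from typing import List

def is_ambiguous_query(query: str, candidates: List[dict]) -> bool:
    """Heuristic: fall back to CoT when RAG scores don't produce a clear winner."""
    if not candidates:
        return True                  # nothing retrieved: let the LLM reason or ask for clarification
    if len(candidates) == 1:
        return False                 # only one accessible option, no ambiguity
    top = candidates[0]["similarity"]
    runner_up = candidates[1]["similarity"]
    # Assumed thresholds: low absolute confidence or a narrow margin means "ambiguous"
    return top < 0.6 or (top - runner_up) < 0.1
```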
Benefits
- Fast for clear queries: Use deterministic (current system)
- Intelligent for ambiguous: Use RAG + CoT
- Best of both worlds: Speed + explainability
Implementation in Current Codebase
What's Already There (RAG Foundation)
- Vector Store: Qdrant with hierarchy nodes ✅
  src/services/hierarchy-registry/modules/node-repository.ts - Nodes stored with embeddings
- Vector Search: Semantic matching ✅
  src/services/semantic-matcher.service.ts - matchNodes() does vector search with filters
- Embedding Generation: Node embeddings ✅
  src/services/hierarchy-registry/modules/embedding-generator.ts - Generates embeddings for all node fields
What Needs to Be Added (CoT Layer)
- LLM Integration: Add LangChain/LangGraph
  npm install @langchain/langgraph @langchain/openai
- CoT Prompts: Build reasoning prompts
  - Create prompt templates
  - Define reasoning steps
- LangGraph Workflow: Orchestrate nodes (see the sketch after this list)
  - Service selection node
  - Category selection node
  - Collection selection node
- Response Parsing: Extract reasoning from LLM
  - Parse structured output
  - Extract reasoning steps
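As referenced in the list above, wiring the three selection nodes into a LangGraph workflow could look roughly like this. The sketch uses the Python LangGraph API to stay consistent with the pseudocode in this document (the actual codebase would use the @langchain/langgraph TypeScript package installed above), and it assumes the RoutingState and selection-node functions sketched earlier; the category and collection nodes are the analogues still to be written:

```python
from langgraph.graph import StateGraph, END

# Assumes RoutingState plus service/category/collection selection node functions.
workflow = StateGraph(RoutingState)
workflow.add_node("service_selection", service_selection_node)
workflow.add_node("category_selection", category_selection_node)
workflow.add_node("collection_selection", collection_selection_node)

workflow.set_entry_point("service_selection")
workflow.add_edge("service_selection", "category_selection")
workflow.add_edge("category_selection", "collection_selection")
workflow.add_edge("collection_selection", END)

graph = workflow.compile()

result = graph.invoke({
    "query": "Show network device inventory",
    "user_context": {"tenant_id": "acme", "app_id": "netops-portal", "roles": ["developer"]},
    "selected_node": None,
    "confidence": 0.0,
    "reasoning": [],
    "current_path": [],
})
# result["current_path"] -> ["sql-service", "infrastructure-data-category", "server-inventory-db"]
```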
Key Takeaways
- RAG = Fast Retrieval: Vector search finds relevant candidates quickly
- CoT = Intelligent Reasoning: LLM thinks through decision step-by-step
- Together: RAG narrows search space, CoT makes intelligent decision
- Current System: Already has RAG foundation (vector search)
- Missing: CoT layer (LLM reasoning) needs to be added
Next Steps
To implement RAG + CoT:
- Add LangGraph dependencies
- Create CoT prompt templates
- Build LangGraph workflow nodes
- Integrate with existing RAG (vector search)
- Add new /api/v1/route/ai endpoint
- Test and optimize
The RAG foundation is already there - we just need to add the CoT reasoning layer on top!