Our AI Agent Architecture for Customer Support
AI-powered customer support is the most requested feature we build right now. Every B2B SaaS founder wants one. Most implementations we've seen in the wild are either dangerously unreliable or expensive toys that deflect 10% of tickets at best.
Here's the architecture we've refined across four production deployments.
Why Most AI Support Bots Fail
The gap between a ChatGPT wrapper demo and a production support agent is enormous:
| Demo | Production |
| --- | --- |
| Answers from a single PDF | Answers from 500+ docs, changelogs, tickets, and internal wikis |
| No guardrails | Must never hallucinate billing info or make unauthorized promises |
| No escalation | Seamless handoff to human agent with full context |
| No tracking | Every response logged, scored, and reviewed |
| English only | Multi-language with consistent quality |
The Architecture
Our standard AI support agent has five layers:
```
┌─────────────────────────────────────────┐
│             User Interface              │
│    (Chat widget, Slack, Email, API)     │
├─────────────────────────────────────────┤
│           Orchestration Layer           │
│    (Intent classification + routing)    │
├─────────────────────────────────────────┤
│              RAG Pipeline               │
│  (Retrieval → Reranking → Generation)   │
├─────────────────────────────────────────┤
│             Guardrail Layer             │
│   (Output validation + safety checks)   │
├─────────────────────────────────────────┤
│              Feedback Loop              │
│   (Logging, scoring, drift detection)   │
└─────────────────────────────────────────┘
```
Layer 1: Intent Classification
Before any RAG retrieval happens, we classify the user's intent. This is a lightweight LLM call (GPT-3.5-turbo is sufficient) that routes to the right handler:
- FAQ / How-to → RAG pipeline (most queries)
- Account-specific → Check auth, fetch user data, then RAG with context
- Billing / Refund → Immediate escalation to human (never let AI handle money)
- Bug report → Structured extraction → Create ticket in Jira/Linear
- Angry / Frustrated → Empathetic response template + priority escalation
This classification step alone eliminates 30% of bad responses before they happen.
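The routing logic above can be sketched as a simple dispatch table. The handler names here are illustrative, and the classifier is stubbed with keyword heuristics; in production it would be the lightweight LLM call described above.

```python
# Intent routing sketch. Handler names are placeholders for the real
# pipeline stages; classify_intent() stands in for the cheap LLM call.

INTENT_HANDLERS = {
    "faq": "rag_pipeline",
    "account": "authed_rag",
    "billing": "human_escalation",   # never let AI handle money
    "bug_report": "ticket_extraction",
    "frustrated": "priority_escalation",
}

def classify_intent(message: str) -> str:
    """Stand-in for the lightweight LLM classifier."""
    text = message.lower()
    if any(w in text for w in ("refund", "charge", "invoice")):
        return "billing"
    if any(w in text for w in ("error", "crash", "broken")):
        return "bug_report"
    if any(w in text for w in ("angry", "ridiculous", "unacceptable")):
        return "frustrated"
    if "my account" in text:
        return "account"
    return "faq"

def route(message: str) -> str:
    """Map a user message to the handler that should process it."""
    return INTENT_HANDLERS[classify_intent(message)]
```

The payoff is that money-related messages never reach the generation step at all; they short-circuit straight to a human.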
Layer 2: RAG Pipeline
Our retrieval pipeline processes documents through three stages:
Ingestion
- Chunk documents by semantic boundaries (not fixed character counts)
- Overlap chunks by 10-15% for context continuity
- Embed using `text-embedding-3-small` (best cost/quality ratio we've found)
- Store in Pinecone with metadata: source, date, category, version
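The chunking step above can be sketched as follows. This is a minimal illustration that splits on paragraph boundaries and carries a ~12% tail forward for continuity; the size limit and overlap ratio are assumptions, not our exact production values.

```python
def chunk_by_paragraphs(text: str, max_chars: int = 1200,
                        overlap_ratio: float = 0.12) -> list[str]:
    """Split text on paragraph boundaries, overlapping chunks by ~12%."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            # carry the trailing slice of the previous chunk forward
            tail = current[-int(max_chars * overlap_ratio):]
            current = tail + "\n" + para
        else:
            current = (current + "\n\n" + para).strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then gets embedded and written to the vector store with its source metadata.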
Retrieval
- Hybrid search: vector similarity + BM25 keyword matching
- Retrieve top 20 candidates (over-fetch intentionally)
- Rerank using Cohere Rerank or a cross-encoder model
- Return top 5 chunks to the generation prompt
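One common way to merge the vector and BM25 result lists before reranking is reciprocal rank fusion. The article doesn't prescribe a fusion method, so treat this as one reasonable sketch rather than our exact implementation:

```python
def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60, top_n: int = 5) -> list[str]:
    """Fuse ranked doc-id lists (e.g. vector search + BM25) by RRF score.

    Each list contributes 1 / (k + rank + 1) per document; documents
    ranked highly by both retrievers float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The fused list is what goes to the reranker, which produces the final top 5.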
Generation
- System prompt with strict behavioral rules
- Retrieved chunks injected as context with source citations
- Temperature set to 0.1 (we want consistency, not creativity)
- Response must include source links
```python
system_prompt = """You are a support agent for {company_name}.
Answer ONLY from the provided context. If the context doesn't
contain the answer, say: 'I don't have that information. Let me
connect you with our team.'

NEVER:
- Make up features or capabilities
- Discuss pricing unless it's in the context
- Promise timelines or guarantees
- Share internal information

ALWAYS:
- Cite your source with [Source: document_name]
- Ask for clarification if the question is ambiguous
- Offer to escalate if you're not confident
"""
```
Layer 3: Guardrails
Every generated response passes through validation before reaching the user:
- Hallucination check — Does the response contain claims not grounded in the retrieved chunks?
- PII scan — Strip any personally identifiable information that leaked from training data
- Tone check — Is the response professional and empathetic?
- Forbidden topics — Block responses about competitors, legal advice, or billing modifications
- Confidence score — If the model's self-reported confidence is below 0.7, escalate to human
We implement these as a pipeline of lightweight checks, not a single monolithic validator. Total added latency: ~200ms.
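The pipeline shape is simple: an ordered list of small check functions, where the first failure escalates. The two checks below are toy stand-ins for the real validators (which involve model calls), but the control flow is the point:

```python
# Guardrail pipeline sketch. Real checks are heavier (hallucination
# grounding, PII scan, tone); these keyword versions just show the
# ordered, short-circuiting structure.

def cites_sources(response: str, context: str) -> bool:
    # grounded responses must carry at least one citation marker
    return "[Source:" in response

def avoids_forbidden_topics(response: str, context: str) -> bool:
    forbidden = ("competitor", "legal advice", "refund")
    return not any(term in response.lower() for term in forbidden)

CHECKS = [
    ("citation", cites_sources),
    ("forbidden_topics", avoids_forbidden_topics),
]

def run_guardrails(response: str, context: str, checks=CHECKS):
    """Run each lightweight check in order; first failure escalates."""
    for name, check in checks:
        if not check(response, context):
            return ("escalate", name)
    return ("send", None)
```

Because each check is independent, adding a new rule is a one-line change to the list rather than a rewrite of a monolithic validator.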
Layer 4: Human Escalation
The escalation handoff is where most AI support implementations fall apart. Our approach:
- Context transfer — The human agent receives the full conversation, the retrieved documents, and the AI's draft response
- Warm handoff — The AI says "I'm connecting you with a team member who can help with this directly" — not a cold redirect
- Escalation triggers — Anger detection, billing topics, repeated "I don't know" responses, explicit user request
- Agent assist mode — After escalation, AI suggests responses to the human agent (they approve before sending)
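The context-transfer payload can be as simple as a dict handed to the ticketing or chat backend. The field names here are illustrative, not a fixed schema:

```python
def build_handoff(conversation: list[dict], retrieved_docs: list[dict],
                  draft: str) -> dict:
    """Package everything the human agent needs at escalation time."""
    return {
        "transcript": conversation,            # full message history
        "retrieved_docs": [d["source"] for d in retrieved_docs],
        "ai_draft": draft,                     # agent-assist starting point
        "handoff_message": (
            "I'm connecting you with a team member who can help "
            "with this directly."
        ),
    }
```

Shipping the AI's draft alongside the transcript is what enables agent-assist mode: the human edits and approves instead of starting from a blank reply box.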
Layer 5: Feedback and Drift Detection
Production AI systems degrade silently. We monitor:
- Resolution rate — What % of conversations are resolved without escalation?
- Thumbs up/down ratio — Per-response user feedback
- Embedding drift — Are new queries clustering in regions with poor coverage?
- Response latency — P95 must stay under 3 seconds
- Escalation trends — Rising escalation rate = documentation gap or model issue
Weekly automated reports flag anomalies, and the lowest-scored conversations get a manual review each month.
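One cheap way to flag an escalation-rate anomaly is a z-score against a trailing window; a minimal sketch (the window size and threshold are illustrative, not our production tuning):

```python
from statistics import mean, stdev

def flag_anomalies(weekly_rates: list[float], window: int = 4,
                   z_threshold: float = 2.0) -> list[int]:
    """Return indexes of weeks whose escalation rate sits more than
    z_threshold standard deviations above the trailing window's mean."""
    flagged = []
    for i in range(window, len(weekly_rates)):
        history = weekly_rates[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (weekly_rates[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged
```

A flagged week kicks off the manual review: check whether new queries landed in a documentation gap or the model itself regressed.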
Production Results
Across our four deployments:
| Metric | Before AI Agent | After (Week 8) |
| --- | --- | --- |
| Tier-1 ticket volume | 100% manual | 65-75% automated |
| Average response time | 2-6 hours | Under 30 seconds |
| Customer satisfaction | 3.8/5 | 4.3/5 |
| Support team headcount needed | Fixed | Steady despite 2x user growth |
| Monthly support cost | $15K-25K | $8K-12K |
A typical build pays for itself within 3-4 months.
Tech Stack
Our standard stack for AI support agents:
- Orchestration: Python + FastAPI (or Next.js API routes for simpler cases)
- LLM: OpenAI GPT-4 (generation) + GPT-3.5-turbo (classification)
- Embeddings: OpenAI text-embedding-3-small
- Vector Store: Pinecone (managed) or pgvector (self-hosted)
- Reranking: Cohere Rerank API
- Observability: LangSmith or custom logging to PostgreSQL
- Frontend: React widget with WebSocket for streaming responses
Build vs. Buy
If your support volume is under 500 tickets/month, use Intercom's AI or Zendesk's Answer Bot. A custom build doesn't justify its cost until you're processing 1,000+ monthly tickets or have domain-specific requirements that off-the-shelf tools can't handle.
If you're past that threshold and want an agent that actually understands your product, let's build it.
Austin Coders
We build SaaS & AI apps that actually scale. React, Next.js, and AI-powered solutions for startups and enterprises.