Professional Project: Software Engineer · LTM · 2022-2024

RAG-Based Enterprise AI Chatbot

Production AI Support System with 60% Ticket Reduction on an Enterprise Insurance Platform

Stack

Java · Spring Boot · WebSocket · LangChain · OpenAI API · Redis · Transformer-based NLU

Domain

AI/LLM Pipelines · Enterprise Backend · Conversational AI · Production ML

60%

Support tickets eliminated

1M+

Platform users served

2 wks

Shadow-mode validation

Figure: RAG-Based Enterprise AI Chatbot system architecture

Overview

The enterprise insurance platform I worked on at LTM served over one million users across multiple product lines (policy management, premium calculation, claims processing, and billing). At that scale, the customer support function fielded a high volume of inbound inquiries daily. Most of them were the same questions: policy status, premium amounts, claim progress, coverage details, renewal dates, cancellation procedures. Repetitive, structured, answerable from data, and yet every one of them landed in a human agent's queue.

I built the AI chatbot that changed that. A Spring Boot orchestration layer with WebSocket-backed real-time conversations, a transformer-based NLU model for intent classification and entity extraction, a LangChain and OpenAI pipeline for complex policy inquiry workflows, Redis-backed session state, and a confidence-threshold escalation path to human agents that handed off full conversation context rather than dropping the user mid-interaction. Shadow-deployed against live production traffic for two weeks before cutover.

The result: 60% reduction in support ticket volume. Human agents freed from repetitive inquiry handling to focus on complex, judgment-dependent cases where they actually add value.

The Problem

The insurance platform's support function had a structural inefficiency at its core. The majority of inbound support volume consisted of questions with deterministic answers: inquiries like "what is my current premium?", "what's the status of my claim?", and "what does my policy cover for X?". The answer existed in the database, it was specific to the user's policy, and it required no human judgment to retrieve and communicate. But every one of these inquiries followed the same path: submitted as a ticket, assigned to an agent queue, answered by a human reading from the same database the system already had access to.

The consequence was twofold. Human agents spent the majority of their time on work that was mechanical and repetitive (lookup and respond) rather than on complex cases that genuinely required their expertise: disputed claims, policy exceptions, escalation situations, user complaints requiring judgment and empathy. And users waited for responses that a system with database access could have delivered instantly.

The business case for an AI chatbot was obvious. The engineering challenge was building one that was safe to deploy on a production insurance platform: an environment where a hallucinated policy detail or a confidently wrong answer about a claim status carries real consequences for the user.

Goals

Handle the high-volume repetitive inquiry categories autonomously (policy status, premium inquiries, claim status, coverage questions, billing queries) without human agent involvement.
Build on a transformer-based NLU layer to classify incoming intent and extract relevant entities (policy number, claim ID, coverage type, date range) before routing to the appropriate response pipeline.
Use LangChain and OpenAI for complex, multi-turn policy inquiry workflows where the answer requires reasoning across multiple documents, but deliberately avoid generative responses for factual, data-backed inquiry types where accuracy is non-negotiable.
Back conversation state with Redis so sessions survive connection interruptions, support multi-turn interactions, and carry full context into human agent handoffs.
Implement a confidence-threshold escalation system: below a defined confidence level, route to a human agent with the complete conversation history. No cold transfers that force users to repeat themselves.
Shadow-deploy against real production traffic before cutover to validate accuracy against ground truth before any user was exposed to the system.

Technical Architecture

System Overview

The chatbot system sits between the user-facing frontend and the platform's existing backend services. It does not replace the backend. It wraps it. All factual data (policy details, claim status, premium records, billing history) is retrieved from the platform's existing services through internal API calls. The chatbot's job is to handle the conversation, classify what the user is asking, retrieve the right data, compose a response, and decide whether to answer or escalate.

The system has four primary layers: the WebSocket conversation layer, the NLU classification layer, the response pipeline layer (split between templated retrieval and RAG), and the session and escalation layer.

WebSocket Conversation Layer

The conversation interface is backed by WebSocket connections managed through Spring Boot's WebSocket support (@EnableWebSocketMessageBroker). Persistent connections rather than HTTP request-response cycles were the right choice for a conversational interface for two reasons: latency and state.

WebSocket avoids repeating HTTP's request setup and header overhead on every message exchange, which matters when the NLU pipeline adds processing latency and the user is already waiting. More importantly, the connection itself is a natural session boundary. When a WebSocket connection closes, the session is over. When it's open, the session is active. This maps cleanly to conversation lifecycle management and simplifies the Redis session TTL logic: a session's Redis entry expires if it hasn't received a message within a configurable window, and the WebSocket disconnection event triggers a session finalization that writes the conversation summary to the audit log.

The WebSocket handler receives incoming messages, deserializes them into a ConversationMessage object, looks up or initializes the Redis session for that connection, and dispatches to the NLU layer. The response path is asynchronous: the handler returns immediately while the NLU and response pipelines run on a separate thread pool, and the final response is pushed back to the client through the WebSocket connection when ready. This prevents long-running inference calls from blocking the WebSocket handler thread.
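The asynchronous response path can be sketched in plain Java. This is a minimal illustration, not the production handler: `AsyncMessageDispatcher`, the `pipeline` function, and the `sendToClient` callback are hypothetical stand-ins for the Spring WebSocket handler, the NLU and response pipelines, and the outbound WebSocket send, and the fallback text is invented for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical stand-in for the WebSocket handler's dispatch logic. */
final class AsyncMessageDispatcher {
    private final ExecutorService pipelinePool;
    private final Function<String, String> pipeline;

    AsyncMessageDispatcher(ExecutorService pipelinePool, Function<String, String> pipeline) {
        this.pipelinePool = pipelinePool;
        this.pipeline = pipeline;
    }

    /**
     * Returns immediately. The pipeline runs on the dedicated pool and the
     * reply is pushed through sendToClient once it is ready.
     */
    CompletableFuture<Void> onMessage(String text, Consumer<String> sendToClient) {
        return CompletableFuture
                .supplyAsync(() -> pipeline.apply(text), pipelinePool)
                // Assumption: a pipeline failure becomes a fallback reply rather than silence.
                .exceptionally(err -> "Sorry, something went wrong.")
                .thenAccept(sendToClient);
    }
}
```

The caller's thread only schedules work; the slow inference call never runs on it, which is the property the handler design relies on.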

NLU Classification Layer: Transformer-Based Intent and Entity Extraction

Every incoming message passes through the NLU layer before any response logic runs. The NLU layer does two things: intent classification and entity extraction.

Figure: NLU classification layer. Every message is classified for intent and has entities extracted before any response logic runs.

Intent classification determines what the user is asking. The classifier is a fine-tuned model built on transformer-based embeddings, trained on the platform's historical support ticket data: a labeled dataset of real user inquiries mapped to intent categories. The intent taxonomy covers the platform's support volume:

Intent Category	Example Queries
`policy.status`	"Is my policy active?", "When does my policy expire?"
`policy.coverage`	"What does my plan cover?", "Am I covered for X?"
`premium.inquiry`	"What is my current monthly premium?", "Why did my premium change?"
`claim.status`	"What's the status of my claim?", "When will my claim be processed?"
`claim.submission`	"How do I file a claim?", "What documents do I need?"
`billing.inquiry`	"What is my outstanding balance?", "When is my next payment due?"
`policy.change`	"How do I update my coverage?", "Can I add a dependent?"
`policy.cancellation`	"How do I cancel my policy?", "What is the cancellation penalty?"
`out_of_scope`	Anything outside the platform's support domain

Training on historical ticket data (rather than synthetic examples) gave the classifier strong performance on the actual distribution of phrasing and vocabulary the platform's users produce. Users asking about claim status don't always say "claim status." They say "my claim," "the accident claim I submitted," "case number 1234," "the one from last March." The historical data covered that variance.

The classifier outputs a predicted intent and a confidence score. Both are used downstream: the intent routes the message to the correct response pipeline, and the confidence score feeds the escalation decision.

Entity extraction runs in parallel with intent classification. It identifies the specific entities referenced in the user's message: policy number, claim ID, coverage type, date range, dependent name, payment method. Entities are extracted using a combination of regex patterns for structured identifiers (policy numbers, claim IDs follow known formats) and NER over the transformer embeddings for semantic entities (coverage types, date expressions). The extracted entities are stored in the session's Redis record and used to parameterize the data retrieval calls in the response pipeline.
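The regex half of entity extraction might look like the sketch below. The identifier formats (`POL-` plus eight digits, `CLM-` plus six digits) are hypothetical, since the platform's real formats are not given here, and `StructuredEntityExtractor` is an illustrative name. The NER half for semantic entities is out of scope for this example.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StructuredEntityExtractor {
    // Hypothetical identifier formats; the real ones are platform-specific.
    private static final Pattern POLICY_NUMBER =
            Pattern.compile("\\bPOL-\\d{8}\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern CLAIM_ID =
            Pattern.compile("\\bCLM-\\d{6}\\b", Pattern.CASE_INSENSITIVE);

    /** Returns the first policy number and claim ID found in the message, if any. */
    static Map<String, String> extract(String message) {
        Map<String, String> entities = new LinkedHashMap<>();
        putFirstMatch(entities, "policyNumber", POLICY_NUMBER, message);
        putFirstMatch(entities, "claimId", CLAIM_ID, message);
        return entities;
    }

    private static void putFirstMatch(Map<String, String> out, String key, Pattern p, String text) {
        Matcher m = p.matcher(text);
        if (m.find()) {
            // Normalize case so downstream lookups see one canonical form.
            out.put(key, m.group().toUpperCase(Locale.ROOT));
        }
    }
}
```

An empty map for a required entity is what later feeds the entity-extraction-failure escalation path.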

Response Pipeline: Templated Retrieval vs. RAG

The response pipeline is deliberately split into two tracks based on intent type: templated retrieval for factual, data-backed inquiries, and LangChain/OpenAI RAG for complex, document-grounded policy inquiries.
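One way to encode the split is to attach the track to each intent, so routing is a lookup rather than scattered conditionals. The enum names below are illustrative. Two mappings are assumptions the text does not confirm: `claim.submission` is routed to RAG and `out_of_scope` is sent straight to a human agent.

```java
enum ResponseTrack { TEMPLATED_RETRIEVAL, RAG, HUMAN_AGENT }

enum SupportIntent {
    POLICY_STATUS("policy.status", ResponseTrack.TEMPLATED_RETRIEVAL),
    PREMIUM_INQUIRY("premium.inquiry", ResponseTrack.TEMPLATED_RETRIEVAL),
    CLAIM_STATUS("claim.status", ResponseTrack.TEMPLATED_RETRIEVAL),
    BILLING_INQUIRY("billing.inquiry", ResponseTrack.TEMPLATED_RETRIEVAL),
    POLICY_COVERAGE("policy.coverage", ResponseTrack.RAG),
    POLICY_CHANGE("policy.change", ResponseTrack.RAG),
    POLICY_CANCELLATION("policy.cancellation", ResponseTrack.RAG),
    // Assumption: the track for these two intents is not stated in the write-up.
    CLAIM_SUBMISSION("claim.submission", ResponseTrack.RAG),
    OUT_OF_SCOPE("out_of_scope", ResponseTrack.HUMAN_AGENT);

    private final String label;
    private final ResponseTrack track;

    SupportIntent(String label, ResponseTrack track) {
        this.label = label;
        this.track = track;
    }

    String label() { return label; }

    ResponseTrack track() { return track; }

    /** Maps a classifier label such as "premium.inquiry" to its intent. */
    static SupportIntent fromLabel(String label) {
        for (SupportIntent intent : values()) {
            if (intent.label.equals(label)) {
                return intent;
            }
        }
        throw new IllegalArgumentException("Unknown intent label: " + label);
    }
}
```

Keeping the track on the enum means a new intent cannot be added without an explicit decision about which pipeline is allowed to answer it.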

Track 1: Templated Retrieval (factual intents)

For intent categories where the answer is a deterministic lookup (policy status, premium amount, claim status, billing balance), the system retrieves the relevant data from the platform's internal APIs and populates a structured response template. The response for a premium.inquiry intent looks like:

"Your current monthly premium for Policy [POLICY_NUMBER] is $[AMOUNT], effective [DATE]. Your next payment of $[AMOUNT] is due on [DATE]."

This is explicitly not generative. The template is fixed. The values are pulled from the database. The model has no role in composing the response. It classified the intent and extracted the entities, and the rest is deterministic retrieval and string interpolation.
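A minimal sketch of that interpolation step, assuming the values have already been fetched from the backend. `PremiumInquiryTemplate` and the date format are illustrative choices, not the production code.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class PremiumInquiryTemplate {
    // Assumption: dates are rendered in a US long format.
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.US);

    static String render(String policyNumber, BigDecimal premium, LocalDate effective,
                         BigDecimal nextPayment, LocalDate dueDate) {
        return String.format(Locale.US,
                "Your current monthly premium for Policy %s is $%s, effective %s. "
                        + "Your next payment of $%s is due on %s.",
                policyNumber, money(premium), DATE.format(effective),
                money(nextPayment), DATE.format(dueDate));
    }

    private static String money(BigDecimal amount) {
        // RoundingMode.UNNECESSARY throws if the value has more than two decimals,
        // so a contractual amount is never silently rounded.
        return amount.setScale(2, RoundingMode.UNNECESSARY).toPlainString();
    }
}
```

Every value in the output string maps to one argument, which is what makes the response traceable field by field.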

This was a deliberate and contested design decision. The initial product direction favored a fully generative approach: let the LLM compose the response in natural language using the retrieved data as context. The argument was fluency: a generated response reads more naturally than a template.

The counter-argument, which I made and which ultimately prevailed, was accuracy and auditability. In an insurance context, the specificity of a response matters in a way it doesn't in a general-purpose assistant. If the model says "your premium is approximately $450" when the actual value is $447.23, that's not a fluency difference. It's a factually wrong response about a contractual financial obligation. Template-based responses eliminate that risk entirely. Every factual statement in the response comes directly from a database read, with no model inference involved in the value. And the response is fully auditable: you can trace every field in the output back to the exact API call that produced it.

The shadow mode data validated this decision. Templated retrieval responses for factual intents had zero factual errors, by construction. Generative responses sampled during the shadow period had a measurable hallucination rate on specific values (amounts, dates, policy terms) that would have been unacceptable in production.

Track 2: LangChain / OpenAI RAG (complex policy inquiries)

For intent categories where the answer requires reasoning across policy documents (coverage inquiries, policy change implications, cancellation terms), the system uses a RAG pipeline built with LangChain and OpenAI.

The retrieval layer indexes the platform's policy document corpus (product guides, terms and conditions, coverage schedules, FAQ documents) in a vector store. Incoming policy.coverage, policy.change, and policy.cancellation intents trigger a semantic search over this index, retrieving the most relevant document chunks for the user's specific question. The retrieved chunks, the user's question, and the extracted entities are composed into a structured prompt passed to the OpenAI model.

The LangChain chain is configured with explicit grounding instructions: the model is told to answer only from the retrieved document content, to cite the specific policy document section its answer draws from, and to flag when the retrieved content does not contain a clear answer rather than inferring beyond the documents. This grounding constraint (combined with the confidence threshold mechanism described below) keeps the RAG track safe for an insurance context while enabling genuinely useful responses to complex coverage questions that a template cannot handle.
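The grounding instructions amount to a prompt shape, which can be illustrated without any LangChain or OpenAI call. The sketch below only assembles the prompt text; the wording of the instructions, the `INSUFFICIENT_CONTEXT` sentinel, and the `GroundedPromptBuilder` and `Chunk` names are assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

final class GroundedPromptBuilder {
    /** A retrieved document chunk with the section label used for citation. */
    static final class Chunk {
        final String section;
        final String text;

        Chunk(String section, String text) {
            this.section = section;
            this.text = text;
        }
    }

    static String build(String question, Map<String, String> entities, List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        // Hypothetical instruction wording; the production prompt is not reproduced here.
        sb.append("Answer only from the policy excerpts below.\n");
        sb.append("Cite the section label of every excerpt you rely on.\n");
        sb.append("If the excerpts do not clearly answer the question, reply INSUFFICIENT_CONTEXT.\n\n");
        for (Chunk c : chunks) {
            sb.append('[').append(c.section).append("]\n").append(c.text).append("\n\n");
        }
        sb.append("Known entities: ").append(entities).append('\n');
        sb.append("Question: ").append(question).append('\n');
        return sb.toString();
    }
}
```

A fixed sentinel for "no clear answer" gives the calling code something deterministic to check before deciding whether to answer or escalate.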

Redis Session State

Every active conversation is backed by a Redis session keyed on the WebSocket connection ID. The session record stores:

Conversation history. The full message thread, user and system turns, in order.
Extracted entities. All entities identified across the conversation, accumulated across turns.
Classified intents. The intent classification for each user turn.
Escalation state. Whether the session has been routed to a human agent.
Agent context. If escalated, which agent received the handoff and when.

Redis was chosen over in-memory session state for two reasons: connection resilience and agent handoff. If a WebSocket connection drops and the user reconnects, the session is restored from Redis: the conversation continues rather than restarting. More importantly, when a session escalates to a human agent, the agent's dashboard pulls the full conversation history and entity list from Redis. The agent sees exactly what the user said, what the chatbot understood, what data was retrieved, and why the escalation was triggered, before typing their first response. No cold transfer, no "can you tell me your policy number again?"

Session TTL is set to 30 minutes of inactivity. Sessions that escalate to a human agent have their TTL extended to 4 hours to cover longer agent interaction windows.
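The TTL rules can be illustrated with an in-memory stand-in for Redis. This sketch is not a Redis client: `SessionTtlStore` is a hypothetical class that mimics key expiry with an injectable clock so the 30-minute and 4-hour windows are easy to see and test. In a real deployment the same effect would come from setting an expiry on the session key.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class SessionTtlStore {
    static final Duration IDLE_TTL = Duration.ofMinutes(30);
    static final Duration ESCALATED_TTL = Duration.ofHours(4);

    private static final class Entry {
        boolean escalated;
        Instant expiresAt;
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final Supplier<Instant> now;

    SessionTtlStore(Supplier<Instant> now) {
        this.now = now;
    }

    /** Called on every message: creates the session if needed and resets its TTL. */
    void touch(String sessionId) {
        Entry e = liveOrNew(sessionId);
        e.expiresAt = now.get().plus(e.escalated ? ESCALATED_TTL : IDLE_TTL);
    }

    /** Marks the session as handed to an agent and extends its TTL. */
    void markEscalated(String sessionId) {
        Entry e = liveOrNew(sessionId);
        e.escalated = true;
        e.expiresAt = now.get().plus(ESCALATED_TTL);
    }

    boolean isAlive(String sessionId) {
        Entry e = entries.get(sessionId);
        return e != null && now.get().isBefore(e.expiresAt);
    }

    // An expired entry is treated as absent, the way an expired key would be.
    private Entry liveOrNew(String sessionId) {
        Entry e = entries.get(sessionId);
        if (e == null || !now.get().isBefore(e.expiresAt)) {
            e = new Entry();
            entries.put(sessionId, e);
        }
        return e;
    }
}
```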

Confidence-Threshold Escalation

The escalation mechanism is the system's most important safety property. Three conditions trigger escalation to a human agent:

Low intent confidence. The NLU classifier's confidence score falls below the configured threshold (tuned during shadow mode). The message is ambiguous or the classifier is uncertain.
Entity extraction failure. A required entity for the classified intent could not be extracted. A claim.status intent with no claim ID identified cannot retrieve meaningful data.
Retrieval failure or empty result. The backend API call returned an error or empty result for a factual intent, or the RAG retrieval returned no relevant document chunks above the similarity threshold for a complex intent.

In any of these cases, the chatbot does not attempt to answer. It sends the user a transition message, marks the session as escalated in Redis, and routes to the human agent queue, passing the complete session record as context. The agent sees the failure reason alongside the full conversation history.
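The three triggers reduce to a small decision function. The sketch below evaluates them in one call for brevity, although in the real flow the retrieval check can only happen after the first two pass. `EscalationPolicy`, `EscalationReason`, and the threshold passed to the constructor are illustrative; the actual threshold value is not stated here.

```java
import java.util.Optional;
import java.util.Set;

enum EscalationReason { LOW_CONFIDENCE, MISSING_ENTITY, RETRIEVAL_FAILURE }

final class EscalationPolicy {
    private final double confidenceThreshold;

    EscalationPolicy(double confidenceThreshold) {
        this.confidenceThreshold = confidenceThreshold;
    }

    /**
     * Returns the reason to escalate, or empty if the chatbot may answer.
     * Checks run in the order the pipeline would encounter them.
     */
    Optional<EscalationReason> evaluate(double intentConfidence,
                                        Set<String> requiredEntities,
                                        Set<String> extractedEntities,
                                        boolean retrievalSucceeded) {
        if (intentConfidence < confidenceThreshold) {
            return Optional.of(EscalationReason.LOW_CONFIDENCE);
        }
        if (!extractedEntities.containsAll(requiredEntities)) {
            return Optional.of(EscalationReason.MISSING_ENTITY);
        }
        if (!retrievalSucceeded) {
            return Optional.of(EscalationReason.RETRIEVAL_FAILURE);
        }
        return Optional.empty();
    }
}
```

Returning the reason, rather than a boolean, is what lets the agent dashboard show why the handoff happened.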

The confidence threshold itself was the subject of internal disagreement during development. The initial product direction was to set no threshold: let the chatbot attempt every response, on the theory that a low-confidence response might still be helpful. My position was that a wrong answer about an insurance policy is categorically worse than an escalation: it damages user trust, creates liability, and potentially causes the user to make incorrect decisions about their coverage. The data from the shadow period settled the argument: low-confidence responses had a substantially higher error rate than high-confidence ones, and the error rate above the threshold was within acceptable bounds. The threshold was set where the shadow data indicated the inflection point between acceptable and unacceptable accuracy.
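One plausible way to locate such a threshold from shadow data is a simple sweep: for each candidate, compute the error rate among responses that would have been answered, and take the lowest candidate that stays within an acceptable error budget. This is a sketch of that idea, not necessarily the procedure that was used; `ThresholdSweep`, the candidate list, and the error budget are all assumptions.

```java
import java.util.List;
import java.util.OptionalDouble;

final class ThresholdSweep {
    /** One shadow-mode response: classifier confidence and whether it was correct. */
    static final class Sample {
        final double confidence;
        final boolean correct;

        Sample(double confidence, boolean correct) {
            this.confidence = confidence;
            this.correct = correct;
        }
    }

    /** Error rate among samples the bot would have answered at this threshold. */
    static double errorRateAtOrAbove(List<Sample> samples, double threshold) {
        long answered = 0;
        long wrong = 0;
        for (Sample s : samples) {
            if (s.confidence >= threshold) {
                answered++;
                if (!s.correct) {
                    wrong++;
                }
            }
        }
        return answered == 0 ? 0.0 : (double) wrong / answered;
    }

    /** Lowest candidate whose answered-set error rate fits the budget, if any. */
    static OptionalDouble lowestAcceptable(List<Sample> samples,
                                           double[] candidatesAscending,
                                           double maxErrorRate) {
        for (double t : candidatesAscending) {
            if (errorRateAtOrAbove(samples, t) <= maxErrorRate) {
                return OptionalDouble.of(t);
            }
        }
        return OptionalDouble.empty();
    }
}
```

Choosing the lowest acceptable threshold keeps escalation volume as small as the error budget allows.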

Shadow Mode Rollout

Before any user was exposed to the chatbot, the system ran in shadow mode for two weeks. In shadow mode, every real support inquiry that came through the existing ticket system was simultaneously routed to the chatbot pipeline. The chatbot processed the inquiry, generated a response, and logged it, but the response was never shown to the user. The actual support workflow continued unchanged.

The shadow period produced two weeks of chatbot responses alongside ground-truth agent responses to the same queries. Each pair was evaluated for accuracy: did the chatbot response contain the correct policy data? Did the intent classification match what the agent's response indicated the user was actually asking? Was the escalation decision (would-have-escalated vs. would-have-answered) correct?
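The intent-match part of that comparison can be sketched as a tally of disagreements between the intent implied by the agent's response and the intent the chatbot predicted. A systematic misroute then shows up as one dominant row. `ShadowIntentConfusion` and `Pair` are hypothetical names, and how the agent-side intent label is derived is left out of this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShadowIntentConfusion {
    /** One shadow-mode pair: intent implied by the agent vs. intent predicted by the bot. */
    static final class Pair {
        final String agentIntent;
        final String botIntent;

        Pair(String agentIntent, String botIntent) {
            this.agentIntent = agentIntent;
            this.botIntent = botIntent;
        }
    }

    /** Counts each "agentIntent -> botIntent" disagreement; matching pairs are skipped. */
    static Map<String, Integer> misroutes(List<Pair> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Pair p : pairs) {
            if (!p.agentIntent.equals(p.botIntent)) {
                counts.merge(p.agentIntent + " -> " + p.botIntent, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```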

Shadow mode identified two systematic issues before they could affect users. First, a class of premium inquiry phrasings that the NLU classifier consistently misrouted to policy.status rather than premium.inquiry: a training data gap that was fixed by augmenting the training set with the misclassified examples and retraining. Second, an edge case in entity extraction where policy numbers containing a specific format variant (a legacy format used by one product line) weren't matched by the regex pattern: fixed by updating the extraction pattern and validating against the full policy number corpus.

Both issues were caught and corrected before cutover. The chatbot that went live had been validated against two weeks of real production traffic.

Key Technical Challenges

Challenge 1: The Generative vs. Retrieval Architecture Decision

The question of whether to use a generative LLM for all response types was the most consequential architectural decision in the project, and it was not a technical question. It was a product and risk question that required a technical answer.

The argument for fully generative responses was fluency and flexibility. The argument against was auditability and accuracy in a domain where factual errors carry real consequences. The resolution was to split the response pipeline by risk profile: low-risk, document-grounded complex inquiries could tolerate RAG-generated responses with explicit grounding constraints; high-risk factual data responses (amounts, dates, status values) could not tolerate any generative inference and used templated retrieval exclusively.

Figure: Two-track response pipeline. Factual data is answered by templated retrieval with zero generation; complex document-grounded inquiries are answered by RAG with grounding.

This hybrid architecture required more engineering than a single unified approach, but it was the only design that was safe to deploy on an insurance platform. The shadow mode data confirmed it: zero factual errors on templated responses, acceptable error rates on RAG responses with grounding, and an unacceptable hallucination rate on the generative-only responses that were sampled as a comparison baseline.

Challenge 2: Training the NLU Classifier on Real User Phrasing

Transformer-based intent classifiers trained on synthetic or curated examples perform poorly in production because real users don't ask questions the way documentation examples suggest. "What is my current monthly premium?" is the example. "How much am I paying you guys" is what a real user says.

The historical ticket data addressed this, but it required careful preparation. The ticket corpus contained noise: mislabeled tickets, tickets spanning multiple intents, tickets in incomplete sentences. Building a clean labeled training set from messy historical data required a manual labeling pass on a sample to establish ground truth, a quality filter to remove ambiguous or multi-intent tickets from the training set, and intentional augmentation of underrepresented intent categories to prevent the classifier from being biased toward the highest-volume intents.
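The filter-and-balance steps can be illustrated on labeled tickets. This is a simplified sketch under stated assumptions: each ticket already carries its manually assigned intent labels, "multi-intent" means more than one label, and `minExamples` is an arbitrary cutoff for flagging categories that need augmentation. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TrainingSetPrep {
    static final class Ticket {
        final String text;
        final List<String> intents;

        Ticket(String text, List<String> intents) {
            this.text = text;
            this.intents = intents;
        }
    }

    /** Quality filter: keep tickets with non-blank text and exactly one intent label. */
    static List<Ticket> keepSingleIntent(List<Ticket> tickets) {
        List<Ticket> kept = new ArrayList<>();
        for (Ticket t : tickets) {
            if (t.intents.size() == 1 && !t.text.trim().isEmpty()) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Per-intent example counts; expects tickets that passed keepSingleIntent. */
    static Map<String, Integer> countByIntent(List<Ticket> singleIntentTickets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Ticket t : singleIntentTickets) {
            counts.merge(t.intents.get(0), 1, Integer::sum);
        }
        return counts;
    }

    /** Intents present in the data but below minExamples, i.e. augmentation candidates. */
    static List<String> needsAugmentation(Map<String, Integer> counts, int minExamples) {
        List<String> low = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() < minExamples) {
                low.add(e.getKey());
            }
        }
        return low;
    }
}
```

Note that an intent with no surviving examples at all does not appear in the counts, so it would need a separate check against the full taxonomy.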

During shadow mode, the resulting classifier performed significantly better on real production traffic than a comparable classifier, trained on synthetic examples, that had been evaluated during development. That was the point of using historical data in the first place.

Challenge 3: Escalation Context That Actually Helps Agents

The naive implementation of human escalation is a cold transfer: the user is connected to an agent and has to re-explain everything. This is a known frustration point and partially negates the efficiency benefit of having a chatbot triage the queue.

The Redis session record was specifically designed to make agent handoffs useful rather than ceremonial. The agent dashboard pulls the full session record on connection and displays: a structured summary of what the user is asking (the classified intents and extracted entities), the complete message thread, what data the chatbot retrieved (if any) before deciding to escalate, and the specific escalation reason (low confidence / entity failure / retrieval failure). An agent receiving a handoff knows immediately what the user wants, what information the system already has, and why the chatbot couldn't complete the interaction, without asking the user a single question to re-establish context.

Figure: Confidence-threshold escalation. Below the confidence threshold the bot hands off to a human agent with the full conversation context, never a cold transfer.

This required the session record schema to be designed with the agent dashboard as a consumer, not just as internal chatbot state. The fields, their types, and their naming were specified against the agent UI requirements, not against what was convenient for the chatbot pipeline to produce.

Outcome and What It Demonstrates

The chatbot handled the majority of the platform's high-volume repetitive inquiry categories autonomously from day one of live deployment, delivering a 60% reduction in support ticket volume. Human agents shifted their time to complex, judgment-dependent cases (disputed claims, policy exception requests, escalated complaints) where their expertise actually mattered.

The two-week shadow mode validated the system before any user exposure and caught two systematic issues that would have degraded production accuracy if the system had gone live without it.

From an engineering standpoint, the project demonstrates:

Production AI System Design. Building an AI system that handles real user traffic on a high-stakes platform requires architectural decisions that go beyond getting the model working: hybrid response pipelines based on risk profile, confidence-gated escalation, shadow deployment for pre-cutover validation. These are production engineering concerns, not research concerns.

Risk-Calibrated Architecture. The deliberate choice to use templated retrieval for factual responses and RAG only for document-grounded complex inquiries reflects an understanding that AI accuracy requirements vary by consequence. Wrong answers about policy amounts are not equivalent to wrong answers about general product information. The architecture encodes that difference.

Training Data as Engineering Work. Preparing the historical ticket corpus into a clean, balanced, quality-filtered training set for the NLU classifier was as much engineering work as building the classifier itself. The quality of the training data directly determined the quality of the production system.

Escalation as a Product Feature, Not a Fallback. Designing the escalation path (the confidence threshold logic, the session context payload, the agent dashboard integration) with the same care as the happy path reflects understanding that the failure mode of an AI system is as important as its success mode. A well-designed escalation path is what makes an AI system safe to deploy in a context where errors have consequences.

Tech Stack Summary

Layer	Technology
Backend Framework	Java · Spring Boot
Real-Time Communication	WebSocket (`@EnableWebSocketMessageBroker`)
NLU Layer	Fine-tuned transformer-based classifier (intent + NER)
RAG Pipeline	LangChain · OpenAI API · Vector store (semantic search)
Session State	Redis (conversation history, entities, escalation state)
Response Layer	Templated retrieval (factual) + RAG (complex policy inquiry)
Training Data	Historical support ticket corpus (labeled + augmented)
Deployment Strategy	Shadow mode (2 weeks) → live cutover
Platform	Enterprise insurance platform (1M+ users)