RAG without the platform rewrite
How to add retrieval over your existing data without standing up a separate vector platform or pausing the product roadmap.
Retrieval-augmented generation (RAG) is often sold as a new platform decision: pick a vector database, build an ingestion pipeline, deploy a separate search service, then wire a chat UI on top.
For most product teams, that is the wrong framing. You already have databases, APIs, search indexes, and authorization. RAG should plug into those boundaries — not replace them.
Figure: separate platform (common pitch) compared with the integrated path (recommended).
Stages shown: your app + auth (existing session); retrieval middleware (SQL, APIs, search you own); LLM + citations in UI (embedded in product views).
The integrated path reuses auth, data access, and deployment you already operate.
Why the separate-platform pitch is tempting
Vendors bundle vector storage, chunking, and a chat widget because it is easy to demo. For a greenfield project, that can be fine. For an existing product with paying customers, it creates problems:
- Duplicate auth — a sidecar search service does not know your tenant model
- Stale data — another pipeline to keep in sync with your source of truth
- Detached UX — users live in your app; a floating chat widget fights your workflow design
- Ops overhead — another system to monitor, secure, and on-call for
Integrated RAG treats retrieval as middleware in your application — same deployment, same identity, same observability.
Start from the user workflow
Before choosing Pinecone, pgvector, or Elasticsearch, define the feature in product terms:
- What question is the user trying to answer?
- What data do they already have permission to see?
- Where in the UI does the answer need to appear — inline, sidebar, modal, or action suggestion?
The retrieval layer should assemble context from sources your app already trusts: Postgres rows, document metadata, CRM records, ticket history, internal APIs — scoped per user and tenant.
A concrete example
A support copilot embedded in a ticket view should retrieve: the current ticket thread, the customer's plan tier, relevant help articles, and recent similar resolved tickets. It should not search the entire knowledge base without tenant filters or return answers without citations your agents can verify.
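As a minimal sketch, the context for this ticket view could be modeled as a small bundle of citable sources. Everything named here (`Source`, `TicketContext`, `build_ticket_context`, and the in-memory `TICKETS`, `PLANS`, and `ARTICLES` stores) is hypothetical and stands in for your own tables and APIs. Similar resolved tickets are left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    source_id: str  # ID the UI can link back to
    kind: str       # "ticket", "plan", or "article"
    text: str


@dataclass
class TicketContext:
    tenant_id: str
    sources: list = field(default_factory=list)


# Hypothetical in-memory stores standing in for real tables and APIs.
TICKETS = {
    "t-1": {"tenant_id": "acme", "thread": "Export to CSV fails with a timeout."},
    "t-2": {"tenant_id": "globex", "thread": "Cannot reset password."},
}
PLANS = {"acme": "enterprise", "globex": "starter"}
ARTICLES = [
    {"id": "kb-7", "tenant_id": "acme", "text": "Large CSV exports run as background jobs."},
    {"id": "kb-9", "tenant_id": "globex", "text": "Password resets expire after one hour."},
]


def build_ticket_context(tenant_id: str, ticket_id: str) -> TicketContext:
    ticket = TICKETS.get(ticket_id)
    # Reject tickets outside the caller's tenant instead of filtering later.
    if ticket is None or ticket["tenant_id"] != tenant_id:
        raise PermissionError("ticket not visible to this tenant")
    ctx = TicketContext(tenant_id=tenant_id)
    ctx.sources.append(Source(ticket_id, "ticket", ticket["thread"]))
    ctx.sources.append(Source(f"plan:{tenant_id}", "plan", PLANS[tenant_id]))
    for article in ARTICLES:
        if article["tenant_id"] == tenant_id:  # tenant filter on help articles
            ctx.sources.append(Source(article["id"], "article", article["text"]))
    return ctx
```

Every source carries an ID, so whatever reaches the model can also be rendered as a citation in the ticket view.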
Middleware owns retrieval
Retrieval belongs on the server — after authentication, before the model call. A typical flow:
1. Authenticate: session / JWT
2. Fetch: DB, APIs, docs
3. Rank & trim: fit context window
4. Prompt + call: with citations
5. Render: answer + sources in UI
Retrieval runs after auth — never trust the client to assemble context.
The middleware layer should:
- Authenticate the request using your existing session or token
- Fetch candidate context from stores the user can access
- Rank and trim to fit the model's context window — quality over quantity
- Attach citations the UI can render — source IDs, links, or snippets
- Log what was retrieved for debugging bad answers
Keeping retrieval server-side prevents clients from bypassing permission checks, makes caching straightforward, and gives support a trail when something goes wrong.
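The responsibilities above can be sketched as one server-side handler. This is an illustration under stated assumptions, not a reference implementation: `SESSIONS`, `DOCS`, `authenticate`, `fetch`, `rank_and_trim`, and `handle` are hypothetical names, the word-overlap score is a placeholder for whatever ranking you already have, and `call_model` is any function that takes a prompt string and returns text.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("retrieval")


@dataclass
class Chunk:
    source_id: str
    text: str
    score: float


# Hypothetical session and document stores.
SESSIONS = {"token-abc": {"user_id": "u1", "tenant_id": "acme"}}
DOCS = [
    {"id": "doc-1", "tenant_id": "acme", "text": "Refunds are processed within 5 days."},
    {"id": "doc-2", "tenant_id": "acme", "text": "Invoices are issued monthly."},
    {"id": "doc-3", "tenant_id": "globex", "text": "Refunds require manager approval."},
]


def authenticate(token: str) -> dict:
    session = SESSIONS.get(token)
    if session is None:
        raise PermissionError("invalid session")
    return session


def fetch(session: dict, question: str) -> list:
    # The tenant filter is applied at fetch time, before any ranking.
    words = set(question.lower().split())
    chunks = []
    for doc in DOCS:
        if doc["tenant_id"] != session["tenant_id"]:
            continue
        doc_words = set(doc["text"].lower().rstrip(".").split())
        chunks.append(Chunk(doc["id"], doc["text"], float(len(words & doc_words))))
    return chunks


def rank_and_trim(chunks: list, max_chars: int) -> list:
    # Highest score first; skip zero-score chunks and anything over budget.
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score <= 0 or used + len(chunk.text) > max_chars:
            continue
        kept.append(chunk)
        used += len(chunk.text)
    return kept


def handle(token: str, question: str, call_model, max_chars: int = 200) -> dict:
    session = authenticate(token)
    chunks = rank_and_trim(fetch(session, question), max_chars)
    context = "\n".join(f"[{c.source_id}] {c.text}" for c in chunks)
    answer = call_model(f"Context:\n{context}\n\nQuestion: {question}")
    cited = [c.source_id for c in chunks]
    logger.info("tenant=%s sources=%s", session["tenant_id"], cited)
    return {"answer": answer, "citations": cited}
```

The client only ever sends a token and a question. Which documents are eligible is decided entirely on the server.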
You do not need a greenfield vector stack on day one
Vector search helps at scale, especially for semantic matching over large unstructured corpora. Many integrations start simpler:
- Structured retrieval: SQL with filters (tenant_id, status, date range)
- API composition: aggregate context from services you already call
- Full-text search: Elasticsearch, Postgres tsvector, or your existing search product
- Hybrid: metadata filters plus keyword search before adding embeddings
Figure: retrieval options arranged left to right.
- Left: best when queries are known and data is tabular
- Middle: best for docs + metadata search
- Right: best for semantic match at scale
Start left. Move right when structured retrieval stops working — not before.
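Structured retrieval can be as small as one parameterized query. The sketch below uses Python's built-in `sqlite3` so it runs anywhere; the `tickets` table, its columns, and the `resolved_tickets` helper are assumptions for illustration, and the same pattern would apply to Postgres through your usual driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets "
    "(id TEXT, tenant_id TEXT, status TEXT, updated_at TEXT, summary TEXT)"
)
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?, ?, ?)",
    [
        ("t-1", "acme", "resolved", "2024-05-01", "CSV export timeout fixed by background job"),
        ("t-2", "acme", "open", "2024-05-03", "CSV export still slow"),
        ("t-3", "globex", "resolved", "2024-05-02", "CSV import encoding issue"),
        ("t-4", "acme", "resolved", "2023-01-10", "Old billing question"),
    ],
)


def resolved_tickets(conn, tenant_id, since, limit=5):
    """Return recent resolved tickets for one tenant, newest first."""
    rows = conn.execute(
        """
        SELECT id, summary FROM tickets
        WHERE tenant_id = ? AND status = 'resolved' AND updated_at >= ?
        ORDER BY updated_at DESC
        LIMIT ?
        """,
        (tenant_id, since, limit),  # bound parameters, never string formatting
    ).fetchall()
    return [{"source_id": row[0], "text": row[1]} for row in rows]
```

The tenant, status, and date filters all live in the `WHERE` clause, so nothing outside the caller's scope can reach the prompt. Dates are stored as ISO strings here, which compare correctly as text.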
When to add embeddings
Consider vectors when:
- Users ask questions that do not match document titles or keywords
- Your corpus is large enough that brute-force fetch is too slow or expensive
- You have eval data showing structured retrieval misses too often
Defer vectors when:
- Most queries map to known entities (accounts, orders, projects)
- Your content is already well-structured with metadata
- Team bandwidth is limited — embeddings add indexing, re-embedding on change, and reranking complexity
Citations are not optional
For B2B products, "the AI said so" is not acceptable. Citations build trust, help users verify answers, and give support a starting point for escalations.
Good citation UX:
- Links or IDs back to source records in your product
- Snippets that match what was actually sent to the model
- Clear distinction when no relevant context was found — refuse or ask clarifying questions instead of guessing
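One way to keep snippets honest is to build the prompt and the citation list from the same chunks in the same pass. The sketch below assumes each chunk is a dict with `source_id` and `text` keys. `build_prompt`, `answer`, and the `no_context` status are hypothetical names, and `call_model` is any function that maps a prompt string to text.

```python
def build_prompt(question, chunks):
    """Return (prompt, citations), or (None, []) when there is nothing to cite."""
    if not chunks:
        return None, []
    numbered = list(enumerate(chunks, start=1))
    lines = [f"[{i}] {c['text']}" for i, c in numbered]
    prompt = (
        "Answer using only the numbered sources. Cite them like [1].\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )
    # Snippets are the exact text placed in the prompt, not a re-fetch.
    citations = [
        {"marker": i, "source_id": c["source_id"], "snippet": c["text"]}
        for i, c in numbered
    ]
    return prompt, citations


def answer(question, chunks, call_model):
    prompt, citations = build_prompt(question, chunks)
    if prompt is None:
        # No relevant context: report that instead of letting the model guess.
        return {"answer": None, "citations": [], "status": "no_context"}
    return {"answer": call_model(prompt), "citations": citations, "status": "ok"}
```

The UI can branch on `status` to show a clarifying question or a "nothing found" state rather than an uncited answer.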
Common failure modes
| Failure | Symptom | Mitigation |
|---|---|---|
| Wrong tenant context | Cross-customer data leakage | Enforce tenant filter at fetch time, never in prompt alone |
| Stale documents | Outdated policy answers | Tie retrieval to source version; surface "last updated" in UI |
| Over-retrieval | Slow responses, high cost | Rank aggressively; cap chunks per source |
| Under-retrieval | Hallucinated fill-in | Eval retrieval hit rate; expand sources incrementally |
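The over-retrieval mitigation (rank aggressively, cap chunks per source) can be sketched in a few lines. The chunk shape (dicts with `source` and `score` keys) and the default limits are assumptions for illustration; tune them against your own latency and cost numbers.

```python
from collections import defaultdict


def cap_per_source(chunks, per_source=2, total=5):
    """Keep the highest-scoring chunks, at most `per_source` from each source."""
    counts = defaultdict(int)
    kept = []
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if len(kept) >= total:
            break
        if counts[chunk["source"]] >= per_source:
            continue  # this source has already used its share of the budget
        kept.append(chunk)
        counts[chunk["source"]] += 1
    return kept
```

Capping per source stops one large corpus, such as a knowledge base, from crowding out a smaller but more relevant one, such as the current ticket thread.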
Ship a thin vertical slice
The biggest mistake is boiling the ocean: index every document, support every question type, launch a standalone chat. Instead:
1. One workflow: define the user question, pick one data source, build server retrieval
2. Harden: logging & evals, citation UI, feature flag
3. Expand: more sources, hybrid search, vectors if needed
Ship one end-to-end path before adding data sources or infrastructure.
Pick one workflow, one primary data source, one UI surface. Get it behind a feature flag with logging and evals. Measure answer quality and latency with real users. Then expand retrieval sources and add semantic search only when the data proves you need it.
Eval questions for your first slice
- Does the answer cite the right source 80%+ of the time on a golden set?
- What happens when no relevant context exists?
- What is p95 latency end-to-end — retrieval plus generation?
- What does it cost per successful resolution at current traffic?
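Three of these questions reduce to simple arithmetic over logged results. The sketch below assumes each golden-set result is a dict with an `expected_source` and a list of `cited_sources`; the function names and record shape are hypothetical, and the percentile uses the nearest-rank method.

```python
def citation_accuracy(results):
    """Share of golden-set cases where the expected source was cited."""
    hits = sum(1 for r in results if r["expected_source"] in r["cited_sources"])
    return hits / len(results)


def p95(values):
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def cost_per_resolution(total_cost, resolved_count):
    """Total spend divided by successful resolutions; infinite if there were none."""
    return total_cost / resolved_count if resolved_count else float("inf")
```

For example, a golden set where 4 of 5 answers cite the expected source scores 0.8, which sits exactly at the 80% bar above.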
Operating RAG in production
RAG systems decay as content changes. Plan for:
- Re-indexing or refresh when source documents update
- Retrieval regression tests when you add new data sources
- Dashboards for retrieval latency, chunk count, and empty-result rate
- Feedback loops — thumbs down should tag the retrieval set for review
This is ongoing product operations, not a one-time integration project.
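A minimal sketch of the bookkeeping behind those dashboards and feedback loops follows, kept in memory for illustration. `RetrievalLog` and `RetrievalMonitor` are hypothetical names; in practice these records would go to the logging or metrics system you already run.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalLog:
    request_id: str
    source_ids: list
    latency_ms: float
    flagged_for_review: bool = False


@dataclass
class RetrievalMonitor:
    logs: dict = field(default_factory=dict)

    def record(self, log: RetrievalLog) -> None:
        self.logs[log.request_id] = log

    def thumbs_down(self, request_id: str) -> None:
        # Tag the retrieval set, not just the answer, so reviewers see what was fetched.
        self.logs[request_id].flagged_for_review = True

    def empty_result_rate(self) -> float:
        if not self.logs:
            return 0.0
        empty = sum(1 for log in self.logs.values() if not log.source_ids)
        return empty / len(self.logs)

    def mean_chunk_count(self) -> float:
        if not self.logs:
            return 0.0
        return sum(len(log.source_ids) for log in self.logs.values()) / len(self.logs)

    def review_queue(self) -> list:
        return [log for log in self.logs.values() if log.flagged_for_review]
```

A rising empty-result rate after a content migration is an early sign that retrieval has drifted from the source data.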
The integration mindset
RAG without the platform rewrite means: use your auth, your data access patterns, your deployment pipeline, and your UI. Add retrieval middleware and citations. Grow complexity only when measured need appears.