1. Introduction to RAG Architecture
Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications by combining the power of large language models with domain-specific knowledge bases. At Echo Algori Data, we've implemented over 500 RAG systems for Norwegian businesses, and Supabase with pgvector has emerged as our go-to solution.
RAG systems solve the fundamental limitation of LLMs: their knowledge cutoff and lack of proprietary data. By retrieving relevant context from a vector database before generation, we can create AI systems that are both accurate and current.
Why Supabase for RAG?
- ✓Built-in pgvector: Native PostgreSQL extension for vector operations
- ✓Real-time subscriptions: Live updates when knowledge base changes
- ✓Row-level security: GDPR-compliant data isolation
- ✓Edge functions: Serverless embedding generation
- ✓Cost effective: 10x cheaper than dedicated vector databases
2. Setting Up Supabase with pgvector
Let's start by setting up a Supabase project with pgvector enabled. This process takes about 5 minutes and gives you a production-ready vector database.
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create documents table with embedding column
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
content TEXT NOT NULL,
metadata JSONB,
embedding vector(1536), -- OpenAI ada-002 dimensions
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Create HNSW index for fast similarity search
CREATE INDEX documents_embedding_idx ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Create function for similarity search
CREATE OR REPLACE FUNCTION search_documents(
query_embedding vector(1536),
match_count INT DEFAULT 5,
filter JSONB DEFAULT '{}'::JSONB
)
RETURNS TABLE(
id UUID,
content TEXT,
metadata JSONB,
similarity FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY
SELECT
d.id,
d.content,
d.metadata,
1 - (d.embedding <=> query_embedding) AS similarity
FROM documents d
WHERE d.metadata @> filter
ORDER BY d.embedding <=> query_embedding
LIMIT match_count;
END;
$$;💡 Pro Tip
HNSW (Hierarchical Navigable Small World) indexes provide 95% accuracy with 10x speed improvement over IVFFlat indexes for datasets under 1M vectors.
3. Generating and Storing Embeddings
The quality of your RAG system depends heavily on embedding generation. We'll use OpenAI's ada-002 model, which offers the best balance of performance and cost for most use cases.
// TypeScript implementation with Supabase client
import { createClient } from '@supabase/supabase-js';
import OpenAI from 'openai';
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_ANON_KEY!
);
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY!
});
async function generateEmbedding(text: string): Promise<number[]> {
const response = await openai.embeddings.create({
model: 'text-embedding-ada-002',
input: text
});
return response.data[0].embedding;
}
async function indexDocument(
content: string,
metadata: Record<string, any> = {}
) {
// Chunk long content for better retrieval
const chunks = chunkText(content, 500, 100); // 500 tokens, 100 overlap
for (const chunk of chunks) {
const embedding = await generateEmbedding(chunk);
const { error } = await supabase
.from('documents')
.insert({
content: chunk,
metadata: {
...metadata,
source: 'manual_upload',
chunk_index: chunks.indexOf(chunk),
total_chunks: chunks.length
},
embedding
});
if (error) throw error;
}
}
function chunkText(
text: string,
maxTokens: number = 500,
overlap: number = 100
): string[] {
// Simple word-based chunking - use tiktoken for production
const words = text.split(' ');
const chunks: string[] = [];
for (let i = 0; i < words.length; i += (maxTokens - overlap)) {
const chunk = words.slice(i, i + maxTokens).join(' ');
if (chunk.length > 50) chunks.push(chunk);
}
return chunks;
}🎯 Best Practice
Always implement overlapping chunks to maintain context across boundaries. We've found 20% overlap (100 tokens for 500-token chunks) provides optimal retrieval accuracy.
4. Implementing Similarity Search
With embeddings stored, we can now implement semantic search. The key is combining vector similarity with metadata filtering for precise retrieval.
// Advanced RAG retrieval with hybrid search
interface RAGQuery {
question: string;
filters?: Record<string, any>;
topK?: number;
threshold?: number;
}
async function retrieveContext(query: RAGQuery): Promise<string> {
const { question, filters = {}, topK = 5, threshold = 0.7 } = query;
// Generate embedding for the query
const queryEmbedding = await generateEmbedding(question);
// Perform similarity search
const { data: documents, error } = await supabase
.rpc('search_documents', {
query_embedding: queryEmbedding,
match_count: topK * 2, // Get more for re-ranking
filter: filters
});
if (error) throw error;
// Re-rank using cross-encoder for better accuracy
const reranked = await rerankDocuments(question, documents);
// Filter by similarity threshold
const relevant = reranked
.filter(doc => doc.similarity >= threshold)
.slice(0, topK);
// Combine chunks intelligently
const context = combineChunks(relevant);
return context;
}
async function rerankDocuments(
query: string,
documents: any[]
): Promise<any[]> {
// Use a cross-encoder model for re-ranking
// This significantly improves retrieval quality
const scores = await Promise.all(
documents.map(async doc => {
const score = await crossEncoderScore(query, doc.content);
return { ...doc, rerank_score: score };
})
);
return scores.sort((a, b) => b.rerank_score - a.rerank_score);
}
function combineChunks(documents: any[]): string {
// Group by source and maintain order
const grouped = documents.reduce((acc, doc) => {
const source = doc.metadata.source;
if (!acc[source]) acc[source] = [];
acc[source].push(doc);
return acc;
}, {} as Record<string, any[]>);
// Intelligently combine chunks
let context = '';
for (const [source, docs] of Object.entries(grouped)) {
const sorted = docs.sort((a, b) =>
a.metadata.chunk_index - b.metadata.chunk_index
);
context += `\n\nSource: ${source}\n`;
context += sorted.map(d => d.content).join('\n');
}
return context.trim();
}Similarity Metrics Comparison
| Metric | Speed | Accuracy | Use Case |
|---|---|---|---|
| Cosine | Fast | High | General purpose (recommended) |
| Euclidean | Fast | Medium | When magnitude matters |
| Inner Product | Fastest | High | Normalized vectors only |
5. Performance Optimization Techniques
After deploying hundreds of RAG systems, we've identified key optimization strategies that can improve performance by 10x.
🚀 Indexing Strategies
- • Use HNSW for <1M vectors
- • IVFFlat for 1M-10M vectors
- • Partition tables by date/category
- • Maintain multiple indexes for different query types
⚡ Query Optimization
- • Cache frequent embeddings
- • Batch similarity searches
- • Use connection pooling
- • Implement query result caching
💾 Storage Optimization
- • Compress embeddings with quantization
- • Archive old embeddings to cold storage
- • Use dimension reduction for non-critical data
- • Implement TTL for temporary embeddings
🎯 Accuracy Improvements
- • Hybrid search (vector + keyword)
- • Cross-encoder re-ranking
- • Query expansion techniques
- • Feedback loop for relevance
-- Performance monitoring query
WITH search_stats AS (
SELECT
date_trunc('hour', created_at) as hour,
COUNT(*) as searches,
AVG(similarity) as avg_similarity,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY similarity) as p95_similarity
FROM search_logs
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY hour
)
SELECT
hour,
searches,
ROUND(avg_similarity::numeric, 3) as avg_similarity,
ROUND(p95_similarity::numeric, 3) as p95_similarity,
CASE
WHEN avg_similarity < 0.7 THEN 'Poor'
WHEN avg_similarity < 0.8 THEN 'Fair'
WHEN avg_similarity < 0.9 THEN 'Good'
ELSE 'Excellent'
END as quality_rating
FROM search_stats
ORDER BY hour DESC;6. Production Deployment Strategies
Deploying RAG systems to production requires careful consideration of scalability, reliability, and compliance. Here's our battle-tested approach.
Production Checklist
// Production-ready RAG API endpoint
import { NextRequest, NextResponse } from 'next/server';
import { z } from 'zod';
import { RateLimiter } from '@/lib/rate-limiter';
import { metrics } from '@/lib/monitoring';
const querySchema = z.object({
question: z.string().min(1).max(500),
context: z.enum(['general', 'technical', 'legal']).optional(),
language: z.enum(['en', 'no', 'sv', 'da']).default('en')
});
const rateLimiter = new RateLimiter({
tokensPerInterval: 100,
interval: 'minute'
});
export async function POST(req: NextRequest) {
const startTime = Date.now();
try {
// Rate limiting
const identifier = req.headers.get('x-api-key') || req.ip;
if (!rateLimiter.check(identifier)) {
return NextResponse.json(
{ error: 'Rate limit exceeded' },
{ status: 429 }
);
}
// Validate input
const body = await req.json();
const { question, context, language } = querySchema.parse(body);
// Log request
await logRequest({ question, context, language, identifier });
// Retrieve context with caching
const cacheKey = `rag:${hashQuery(question)}:${context}:${language}`;
let retrievedContext = await cache.get(cacheKey);
if (!retrievedContext) {
retrievedContext = await retrieveContext({
question,
filters: { context, language },
topK: 5,
threshold: 0.75
});
// Cache for 1 hour
await cache.set(cacheKey, retrievedContext, 3600);
}
// Generate response
const response = await generateResponse(question, retrievedContext);
// Track metrics
const duration = Date.now() - startTime;
metrics.recordLatency('rag_query', duration);
metrics.incrementCounter('rag_queries_total');
// Return response with metadata
return NextResponse.json({
answer: response.answer,
sources: response.sources,
confidence: response.confidence,
metadata: {
processingTime: duration,
contextLength: retrievedContext.length,
model: 'gpt-4-turbo',
timestamp: new Date().toISOString()
}
});
} catch (error) {
metrics.incrementCounter('rag_errors_total');
console.error('RAG query error:', error);
if (error instanceof z.ZodError) {
return NextResponse.json(
{ error: 'Invalid input', details: error.errors },
{ status: 400 }
);
}
return NextResponse.json(
{ error: 'Internal server error' },
{ status: 500 }
);
}
}7. Real-World Case Studies
Let's examine three successful RAG implementations from our portfolio, each demonstrating different optimization strategies.
🏦 Nordic FinTech AS - Compliance Assistant
Challenge
10,000+ pages of financial regulations needed instant searchability with 100% accuracy for compliance queries.
Solution
Hybrid RAG system combining vector search with rule-based filters, cross-encoder re-ranking for legal precision.
Results
🏥 Oslo MedTech Solutions - Medical Knowledge Base
Challenge
Real-time access to medical literature with patient privacy compliance and multi-language support.
Solution
Multi-lingual embeddings, RLS for patient data isolation, edge functions for GDPR compliance.
Results
📦 Scandinavian Logistics - Supply Chain Intelligence
Challenge
Real-time tracking data from 10,000+ daily shipments needed instant analysis and pattern recognition.
Solution
Streaming embeddings pipeline, time-series aware retrieval, predictive analytics integration.
Results
8. Common Issues and Solutions
After hundreds of deployments, we've cataloged the most common RAG issues and their solutions.
🐛 Issue: Poor retrieval accuracy
Symptoms: Irrelevant documents returned, low similarity scores
Solutions:
- Implement chunking with overlap
- Use hybrid search (vector + keyword)
- Add cross-encoder re-ranking
- Fine-tune embedding model on domain data
- Adjust similarity threshold dynamically
⚡ Issue: Slow query performance
Symptoms: Queries taking >2 seconds, timeouts under load
Solutions:
- Add HNSW index with appropriate parameters
- Implement embedding caching layer
- Use connection pooling (pgBouncer)
- Partition large tables by date/category
- Pre-compute frequent query embeddings
💰 Issue: High operational costs
Symptoms: Embedding API costs exceeding budget, storage growing rapidly
Solutions:
- Batch embedding generation
- Implement smart caching strategies
- Use smaller embedding models for non-critical data
- Compress embeddings with quantization
- Archive old embeddings to cold storage
🔒 Issue: GDPR compliance concerns
Symptoms: User data in embeddings, deletion requests, audit requirements
Solutions:
- Implement row-level security (RLS)
- Separate PII from embeddings
- Create deletion workflows with cascading
- Maintain comprehensive audit logs
- Use European region deployments
Ready to Build Your RAG System?
With over 500 successful RAG implementations, Echo Algori Data is Norway's leading expert in Supabase-powered AI systems. Our team can help you design, implement, and optimize a production-ready RAG solution tailored to your specific needs.
Additional Resources
n8n vs Make: Ultimate Comparison
Detailed analysis of automation platforms for AI workflows
Read More →GDPR Compliance for AI Systems
Norwegian legal framework and implementation guide
Read More →RAG Starter Template
Open-source Supabase RAG template with best practices
View on GitHub →Free RAG Assessment
Get a custom analysis of your RAG requirements
Book Now →