Financial services organizations accumulate enormous volumes of proprietary text: deal memos, CIM summaries, loan agreements, board presentations, compliance documentation. The instinct is to fine-tune a language model on this corpus and treat it as a knowledge base. That instinct is usually wrong.
Fine-tuning bakes knowledge into model weights at a point in time. When a deal closes, a policy updates, or a loan covenant changes, the model cannot reflect that change until it is fine-tuned again. Retrieval-Augmented Generation (RAG) inverts this: the model stays static, and authoritative documents are retrieved dynamically at query time. The result is a system that is as current as its document index, citable at the passage level, and far easier to audit.
RAG answers with citations. Every response traces back to the specific document and passage that grounded it. In a regulated industry where "the model said so" is not an acceptable explanation, that auditability is not a nice-to-have — it is a requirement.
Before a document can be retrieved, it must be split into chunks small enough to embed meaningfully but large enough to carry context. For financial documents, naive fixed-size chunking produces poor retrieval results. A 512-token chunk that splits mid-sentence across a loan covenant removes exactly the context that makes the clause meaningful.
Three strategies are worth evaluating. Fixed-size chunking is fast and predictable but context-blind. Recursive text splitting with overlap (typically 50–100 tokens) preserves more coherence by splitting at paragraph and sentence boundaries first. Semantic chunking is usually the most accurate but also the most expensive, because every sentence must be embedded: it computes embedding similarity between adjacent sentences and splits only when semantic distance exceeds a threshold. For financial documents where a single section may span multiple pages, semantic chunking can meaningfully improve retrieval precision.
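A minimal sketch of semantic chunking, under stated assumptions: the `semantic_chunks` name, the regex sentence splitter, and the 0.5 similarity threshold are illustrative choices, not values from any library. The `embed` argument is any callable that maps a list of sentences to a list of vectors.

```python
import math
import re
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    text: str,
    embed: Callable[[List[str]], List[List[float]]],
    min_similarity: float = 0.5,  # assumed threshold; tune on your own corpus
) -> List[str]:
    """Group adjacent sentences, starting a new chunk when similarity drops."""
    # Naive splitter for illustration; legal text with abbreviations such as
    # "Sec." needs a more careful sentence tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []

    vectors = embed(sentences)
    chunks: List[str] = []
    current = [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        # A low similarity between neighbours marks a topic boundary.
        if cosine_similarity(prev_vec, vec) < min_similarity:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Any embedding function with a list-in, list-out signature fits the `embed` parameter. In practice a maximum chunk size is also worth enforcing so that a long run of similar sentences does not exceed the embedding model's input limit.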
The embedding model determines how well semantic similarity maps to actual document relevance. General-purpose models work adequately but tend to underperform on domain-specific terminology. A query for "subordinated mezzanine yield" is more likely to surface the right passages from a model trained on financial text than from one trained on general web data.
| Model | Dimensions | Best Use | Notes |
|---|---|---|---|
| text-embedding-3-small | 1,536 | General; cost-efficient | Good baseline; weaker on financial jargon |
| text-embedding-3-large | 3,072 | High-precision retrieval | Better recall; roughly 6.5x the per-token list price of small |
| voyage-finance-2 | 1,024 | Financial documents | Purpose-built; best results on SEC filings and CIMs |
| nomic-embed-text-v1 | 768 | Self-hosted deployments | Open-source; runs locally; no API dependency |
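PostgreSQL with the pgvector extension is a strong default for most financial services deployments. It keeps vector search inside a database that already handles your transactional workload, avoids a separate vector store dependency, and gives you full SQL expressiveness for metadata filtering: restricting results by document date, deal type, or counterparty in the same query as the vector search. One caveat: with an approximate index such as IVFFlat, the filter is applied to the candidates the index scan returns, so a highly selective filter can yield fewer rows than requested unless you raise `ivfflat.probes`.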
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
    id          BIGSERIAL PRIMARY KEY,
    doc_id      TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    content     TEXT NOT NULL,
    embedding   vector(1024),          -- matches voyage-finance-2 output size
    metadata    JSONB,
    created_at  TIMESTAMPTZ DEFAULT now()
);

-- IVFFlat: pgvector suggests lists = rows / 1000 for up to 1M rows and
-- sqrt(rows) beyond that. Build the index after loading data, since the
-- list centroids are computed from the rows present at creation time.
CREATE INDEX ON document_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

CREATE INDEX ON document_chunks (doc_id);
CREATE INDEX ON document_chunks USING GIN (metadata);
```
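The `lists` value in the index definition depends on table size. As a rough sketch, the helper below encodes the starting points given in the pgvector README: `rows / 1000` for tables up to 1M rows, `sqrt(rows)` above that, and `probes` near `sqrt(lists)` at query time. The function name `suggest_ivfflat_params` is hypothetical, and the outputs are starting points to tune against measured recall, not guarantees.

```python
import math
from typing import Dict


def suggest_ivfflat_params(total_rows: int) -> Dict[str, int]:
    """Return starting values for IVFFlat `lists` and `ivfflat.probes`."""
    if total_rows <= 1_000_000:
        lists = max(1, total_rows // 1000)
    else:
        lists = math.isqrt(total_rows)
    # More probes improves recall at the cost of query speed.
    probes = max(1, math.isqrt(lists))
    return {"lists": lists, "probes": probes}
```

For a table of 100,000 chunks this suggests `lists = 100`, which matches the index definition above, and `probes = 10`, which can be applied per session with `SET ivfflat.probes = 10;`.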
The RAGPipeline class below handles the four core operations: embedding text, indexing document chunks idempotently (deleting existing chunks for a doc_id before inserting), retrieving the most semantically similar chunks for a query, and generating a grounded answer with citations.
```python
from typing import Dict, List, Optional

import anthropic
import psycopg2
import psycopg2.extras
import voyageai


class RAGPipeline:
    def __init__(self, conn_string: str):
        self.db = psycopg2.connect(conn_string)
        self.vo = voyageai.Client()       # reads VOYAGE_API_KEY from the environment
        self.llm = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def embed(self, texts: List[str]) -> List[List[float]]:
        result = self.vo.embed(
            texts, model="voyage-finance-2", input_type="document"
        )
        return result.embeddings

    def index_chunks(self, doc_id: str, chunks: List[str],
                     metadata: Optional[dict] = None):
        # Embed first so an API failure leaves the existing chunks untouched.
        embeddings = self.embed(chunks)
        # Idempotent: delete stale chunks and insert the new ones in one
        # transaction. `with self.db` commits on success, rolls back on error.
        with self.db, self.db.cursor() as cur:
            cur.execute("DELETE FROM document_chunks WHERE doc_id = %s", (doc_id,))
            for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
                cur.execute(
                    """INSERT INTO document_chunks
                           (doc_id, chunk_index, content, embedding, metadata)
                       VALUES (%s, %s, %s, %s, %s)""",
                    (doc_id, i, chunk, emb, psycopg2.extras.Json(metadata or {}))
                )

    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        # Queries use input_type="query"; documents use "document".
        q_emb = self.vo.embed(
            [query], model="voyage-finance-2", input_type="query"
        ).embeddings[0]
        with self.db, self.db.cursor() as cur:
            # <=> is cosine distance, so similarity = 1 - distance.
            cur.execute(
                """SELECT doc_id, chunk_index, content,
                          1 - (embedding <=> %s::vector) AS similarity
                   FROM document_chunks
                   ORDER BY embedding <=> %s::vector
                   LIMIT %s""",
                (q_emb, q_emb, top_k)
            )
            return [
                {'doc_id': r[0], 'chunk_index': r[1],
                 'content': r[2], 'similarity': r[3]}
                for r in cur.fetchall()
            ]

    def answer(self, query: str) -> Dict:
        chunks = self.retrieve(query)
        # Label each chunk so the answer can be traced back to its source.
        context = "\n\n".join(
            f"[{c['doc_id']} chunk {c['chunk_index']}]\n{c['content']}"
            for c in chunks
        )
        message = self.llm.messages.create(
            model="claude-opus-4-7", max_tokens=1024,
            messages=[{"role": "user", "content":
                f"Answer using only these sources:\n\n{context}\n\nQuestion: {query}"}]
        )
        return {'answer': message.content[0].text, 'sources': chunks}
```
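The `retrieve` method above searches every chunk in the table. The metadata filtering described earlier could be sketched as a query builder like the one below. The `build_filtered_query` name and the `deal_type` key are hypothetical; the sketch assumes metadata was stored as a flat JSONB object when the chunks were indexed.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def build_filtered_query(
    query_embedding: List[float],
    top_k: int = 5,
    metadata_filter: Optional[Dict[str, Any]] = None,
) -> Tuple[str, tuple]:
    """Build a parameterized similarity query with an optional JSONB filter."""
    sql = (
        "SELECT doc_id, chunk_index, content, "
        "1 - (embedding <=> %s::vector) AS similarity "
        "FROM document_chunks "
    )
    params: list = [query_embedding]
    if metadata_filter:
        # JSONB containment (@>) is supported by the GIN index on metadata.
        sql += "WHERE metadata @> %s::jsonb "
        params.append(json.dumps(metadata_filter))
    sql += "ORDER BY embedding <=> %s::vector LIMIT %s"
    params.extend([query_embedding, top_k])
    return sql, tuple(params)
```

The returned pair can be passed straight to `cur.execute(sql, params)`. Filter values travel as parameters and are never interpolated into the SQL string, which matters when filter values originate from user input.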
A pipeline that works in development diverges from one that works in production in several important ways. The most common failure mode is embedding drift: indexing documents with one model version and querying with another after an API update. Pin your embedding model version explicitly and version your indexes alongside your model configuration.
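One way to guard against embedding drift is to record which model built the index and to refuse queries when the running configuration disagrees. The sketch below assumes a version string is stored somewhere alongside the index, for example in a small settings table. `EmbeddingConfig`, `EmbeddingDriftError`, and `assert_no_drift` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    """Pinned embedding settings shared by the indexing and query paths."""
    model: str
    dimensions: int

    @property
    def index_version(self) -> str:
        # Stored with the index when it is built.
        return f"{self.model}:{self.dimensions}"


class EmbeddingDriftError(RuntimeError):
    """Raised when the index and the running pipeline disagree on the model."""


def assert_no_drift(stored_version: str, config: EmbeddingConfig) -> None:
    """Fail fast instead of comparing vectors from different models."""
    if stored_version != config.index_version:
        raise EmbeddingDriftError(
            f"index built with {stored_version!r}, but pipeline is configured "
            f"for {config.index_version!r}; reindex before querying"
        )
```

Calling `assert_no_drift` once at startup is usually enough, since a mismatch means every similarity score in the index is meaningless until the corpus is re-embedded.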
Chunk freshness is a second operational concern. Financial documents are amended, superseded, and revoked. Without a reindexing workflow triggered by document updates, your retrieval corpus drifts from your source of truth. The idempotent index_chunks method handles this cleanly — calling it on an updated document deletes stale chunks and reindexes from scratch.
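A reindexing trigger can be as simple as comparing content hashes. The sketch below assumes the hash of each document's text is stored at index time; the `content_hash` and `plan_reindex` names are hypothetical. It reports which documents should be passed to `index_chunks` again and which have disappeared from the source of truth.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: Dict[str, str],    # doc_id -> current text in the source system
    indexed_hashes: Dict[str, str],  # doc_id -> hash recorded at index time
) -> Dict[str, List[str]]:
    """Return doc_ids to reindex (new or changed) and to delete (removed)."""
    reindex = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return {"reindex": reindex, "delete": delete}
```

Documents in the `delete` list need their chunks removed outright. The pipeline class has no dedicated method for that, so a revoked document would require a separate `DELETE FROM document_chunks WHERE doc_id = ...` statement.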
Finally, retrieval quality degrades when top-k results include chunks with low similarity scores. Set a minimum similarity threshold and have the pipeline respond with "insufficient information in available documents" rather than hallucinate from weak context. A cosine similarity floor of 0.65–0.75 is a reasonable starting range, but score distributions differ between embedding models, so calibrate the cutoff against a set of queries with known answers. In financial services, a confident wrong answer is far more dangerous than an honest admission that the documents do not contain the answer.
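A minimal sketch of that threshold check, assuming chunks shaped like the dictionaries returned by `retrieve`. The `answer_or_decline` wrapper and its `generate` callback are hypothetical, and the 0.7 default is only a midpoint of the range above.

```python
from typing import Callable, Dict, List

NO_ANSWER = "insufficient information in available documents"


def filter_by_similarity(chunks: List[Dict], min_similarity: float = 0.7) -> List[Dict]:
    """Keep only chunks whose cosine similarity meets the threshold."""
    return [c for c in chunks if c["similarity"] >= min_similarity]


def answer_or_decline(
    chunks: List[Dict],
    generate: Callable[[List[Dict]], str],  # e.g. a function that calls the LLM
    min_similarity: float = 0.7,
) -> Dict:
    """Generate an answer only when strong context exists; otherwise decline."""
    strong = filter_by_similarity(chunks, min_similarity)
    if not strong:
        # Skip the LLM call entirely rather than answer from weak context.
        return {"answer": NO_ANSWER, "sources": []}
    return {"answer": generate(strong), "sources": strong}
```

Declining before the model is called also saves the cost of a generation request that would have had nothing reliable to draw on.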