
RAG Architecture for Financial Services: Building a Private Knowledge Engine

Why RAG Over Fine-Tuning for Financial Documents

Financial services organizations accumulate enormous volumes of proprietary text: deal memos, CIM summaries, loan agreements, board presentations, compliance documentation. The instinct is to fine-tune a language model on this corpus and treat it as a knowledge base. That instinct is usually wrong.

Fine-tuning bakes knowledge into model weights at a point in time. When a deal closes, a policy updates, or a loan covenant changes, the model cannot reflect the change without another training run. Retrieval-Augmented Generation (RAG) inverts this: the model stays static, and authoritative documents are retrieved at query time. The result is a system that is as current as its index, citable down to the passage, and far easier to audit.

The Core Advantage

RAG can answer with citations. Each response can be traced to the specific document and passage that grounded it. In a regulated industry where "the model said so" is not an acceptable explanation, that auditability is a requirement, not a nice-to-have.

Chunking Strategy for Financial Text

Before a document can be retrieved, it must be split into chunks small enough to embed meaningfully but large enough to carry context. For financial documents, naive fixed-size chunking produces poor retrieval results. A 512-token chunk that splits mid-sentence across a loan covenant removes exactly the context that makes the clause meaningful.

Three strategies are worth evaluating. Fixed-size chunking is fast and predictable but context-blind. Recursive text splitting with overlap (typically 50–100 tokens) preserves more coherence by splitting at paragraph boundaries first, then at sentence boundaries. Semantic chunking is usually the most accurate and the most expensive to run: it computes embedding similarity between adjacent sentences and starts a new chunk only where the semantic distance exceeds a threshold. For financial documents, where a single section may span multiple pages, semantic chunking can meaningfully improve retrieval precision.
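
The semantic chunking strategy can be sketched in a few lines. This is a minimal illustration, not a production splitter: `semantic_chunks`, `split_sentences`, and the `max_distance` default of 0.5 are hypothetical names and values, and `embed_fn` stands in for whatever embedding call you use (it takes a list of sentences and returns one vector per sentence).

```python
import math
import re
from typing import Callable, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break on whitespace that follows ., ! or ?
    # Real financial text ("U.S. Treasury", "No. 5") needs something sturdier.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunks(
    text: str,
    embed_fn: Callable[[List[str]], List[List[float]]],
    max_distance: float = 0.5,  # assumed threshold; tune on your own corpus
) -> List[str]:
    """Group adjacent sentences, starting a new chunk at each semantic jump."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = embed_fn(sentences)
    chunks, current = [], [sentences[0]]
    # Compare each sentence's vector with the one immediately before it
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev_vec, cur_vec) > max_distance:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

In practice you would also cap chunk length, since a long run of similar sentences would otherwise produce a chunk too large to embed well.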

Embedding Model Comparison

The embedding model determines how well semantic similarity maps to actual document relevance. General-purpose models work adequately but tend to underperform on domain-specific terminology. A query for "subordinated mezzanine yield" is likely to return better results from a model trained on financial text than from one trained on general web data.

| Model | Dimensions | Best Use | Notes |
|---|---|---|---|
| text-embedding-3-small | 1,536 | General; cost-efficient | Good baseline; weaker on financial jargon |
| text-embedding-3-large | 3,072 | High-precision retrieval | Better recall; several times the cost of small (about 6.5x at launch list prices) |
| voyage-finance-2 | 1,024 | Financial documents | Purpose-built; best results on SEC filings and CIMs |
| nomic-embed-text-v1 | 768 | Self-hosted deployments | Open-source; runs locally; no API dependency |
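
Dimension count also drives storage. As a rough sketch, pgvector stores each vector as single-precision floats, which its documentation gives as 4 bytes per dimension plus 8 bytes of overhead. The corpus size of one million chunks below is an assumption chosen for illustration, and the helper names are hypothetical.

```python
# Dimensions taken from the comparison table above
DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "voyage-finance-2": 1024,
    "nomic-embed-text-v1": 768,
}


def vector_bytes(dimensions: int) -> int:
    # pgvector: 4 bytes per dimension + 8 bytes of per-vector overhead
    return 4 * dimensions + 8


def corpus_gib(dimensions: int, n_chunks: int) -> float:
    """Raw vector storage in GiB, excluding indexes, chunk text, and row overhead."""
    return vector_bytes(dimensions) * n_chunks / 2**30


if __name__ == "__main__":
    for model, dims in DIMENSIONS.items():
        print(f"{model:24s} {vector_bytes(dims):6d} B/vector  "
              f"{corpus_gib(dims, 1_000_000):6.2f} GiB per 1M chunks")
```

Under these assumptions, one million chunks need about 3.8 GiB of raw vector storage at 1,024 dimensions and about 11.5 GiB at 3,072, before any index is built.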

pgvector Schema and Indexing

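PostgreSQL with the pgvector extension is a strong default for most financial services deployments. It keeps vector search inside a database that already handles your transactional workload, avoids a separate vector store dependency, and gives you full SQL expressiveness for metadata filtering: a single query can combine the vector search with filters on document date, deal type, or counterparty. One caveat: with an approximate index such as IVFFlat, pgvector applies the WHERE clause after the index scan, so a selective filter can return fewer than top-k rows unless you raise ivfflat.probes or partition the data.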
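
A filtered search against the document_chunks table defined in this article might look like the sketch below. `build_filtered_search` is a hypothetical helper, and the `deal_type` metadata key is an assumed example. It only builds the SQL and parameters, so it can be inspected without a database.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


def build_filtered_search(
    query_embedding: List[float],
    metadata_filter: Optional[Dict[str, Any]] = None,
    top_k: int = 5,
) -> Tuple[str, tuple]:
    """Return (sql, params) for a cosine-distance search with an optional JSONB filter."""
    # pgvector accepts the text form '[0.1,0.2,...]' cast to vector
    vec = "[" + ",".join(repr(float(x)) for x in query_embedding) + "]"
    where, params = "", [vec]
    if metadata_filter:
        # @> is JSONB containment; a GIN index on metadata can serve it
        where = "WHERE metadata @> %s::jsonb"
        params.append(json.dumps(metadata_filter))
    sql = f"""
        SELECT doc_id, chunk_index, content,
               1 - (embedding <=> %s::vector) AS similarity
        FROM document_chunks
        {where}
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    # Placeholder order: SELECT vector, optional filter, ORDER BY vector, LIMIT
    params.extend([vec, top_k])
    return sql, tuple(params)
```

With a psycopg2 cursor this would be run as `cur.execute(*build_filtered_search(q_emb, {"deal_type": "mezzanine"}))`. If filtered queries come back with too few rows, `SET ivfflat.probes` to a higher value is the first knob to try.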

SQL schema.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
    id          BIGSERIAL PRIMARY KEY,
    doc_id      TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    content     TEXT NOT NULL,
    embedding   vector(1024),  -- must match the embedding model's output size (1,024 for voyage-finance-2)
    metadata    JSONB,
    created_at  TIMESTAMPTZ DEFAULT now()
);

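-- IVFFlat: pgvector's guidance is lists = rows / 1000 for up to 1M rows
-- and sqrt(rows) beyond that. Build the index after loading data, because
-- IVFFlat derives its cluster centroids from the rows present at build time.
CREATE INDEX ON document_chunks
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);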

CREATE INDEX ON document_chunks (doc_id);
CREATE INDEX ON document_chunks USING GIN (metadata);
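
The IVFFlat parameters in the schema depend on table size. The sketch below encodes the starting points suggested in the pgvector README (rows / 1000 lists up to one million rows, sqrt(rows) beyond that, and sqrt(lists) probes at query time). `suggest_ivfflat_params` is a hypothetical helper, and its output is a starting point to tune against measured recall, not a guarantee.

```python
import math


def suggest_ivfflat_params(total_rows: int) -> dict:
    """Starting values for IVFFlat lists (build time) and probes (query time)."""
    if total_rows <= 1_000_000:
        lists = max(1, total_rows // 1000)
    else:
        lists = math.isqrt(total_rows)
    # More probes improves recall at the cost of query speed
    probes = max(1, math.isqrt(lists))
    return {"lists": lists, "probes": probes}
```

By this heuristic, the `lists = 100` used in the schema corresponds to a table of roughly 100,000 chunks, queried with `SET ivfflat.probes = 10`.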

Complete Pipeline Implementation

The RAGPipeline class below handles the four core operations: embedding text, indexing document chunks idempotently (deleting existing chunks for a doc_id before inserting), retrieving the most semantically similar chunks for a query, and generating a grounded answer with citations.

Python rag_pipeline.py
import psycopg2, psycopg2.extras
import anthropic
import voyageai
from typing import List, Dict

class RAGPipeline:
    def __init__(self, conn_string: str):
        self.db  = psycopg2.connect(conn_string)
        self.vo  = voyageai.Client()
        self.llm = anthropic.Anthropic()

    def embed(self, texts: List[str]) -> List[List[float]]:
        result = self.vo.embed(
            texts, model="voyage-finance-2", input_type="document"
        )
        return result.embeddings

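    def index_chunks(self, doc_id: str, chunks: List[str], metadata: dict = None):
        # Embed before touching the database so an embedding API failure
        # cannot leave a half-finished transaction open.
        embeddings = self.embed(chunks)
        # Idempotent: delete stale chunks and insert the new ones in a single
        # transaction. `with self.db` commits on success and rolls back on error.
        with self.db, self.db.cursor() as cur:
            cur.execute("DELETE FROM document_chunks WHERE doc_id = %s", (doc_id,))
            for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
                cur.execute(
                    """INSERT INTO document_chunks
                       (doc_id, chunk_index, content, embedding, metadata)
                       VALUES (%s, %s, %s, %s::vector, %s)""",
                    # Store {} rather than JSON null so metadata filters behave predictably
                    (doc_id, i, chunk, emb, psycopg2.extras.Json(metadata or {}))
                )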

    def retrieve(self, query: str, top_k: int = 5) -> List[Dict]:
        q_emb = self.vo.embed(
            [query], model="voyage-finance-2", input_type="query"
        ).embeddings[0]
        with self.db.cursor() as cur:
            cur.execute(
                """SELECT doc_id, chunk_index, content,
                          1 - (embedding <=> %s::vector) AS similarity
                   FROM document_chunks
                   ORDER BY embedding <=> %s::vector
                   LIMIT %s""",
                (q_emb, q_emb, top_k)
            )
            return [
                {'doc_id': r[0], 'chunk_index': r[1],
                 'content': r[2], 'similarity': r[3]}
                for r in cur.fetchall()
            ]

    def answer(self, query: str) -> Dict:
        chunks = self.retrieve(query)
        # Label each chunk with its doc_id and chunk index so the model can cite it
        context = "\n\n".join(
            f"[{c['doc_id']} chunk {c['chunk_index']}]\n{c['content']}"
            for c in chunks
        )
        message = self.llm.messages.create(
            model="claude-opus-4-7", max_tokens=1024,
            messages=[{"role": "user", "content":
                "Answer using only these sources, and cite the bracketed "
                "source label for each claim:\n\n"
                f"{context}\n\nQuestion: {query}"}]
        )
        # Return the retrieved chunks alongside the answer for auditability
        return {'answer': message.content[0].text, 'sources': chunks}

Production Considerations

A pipeline that works in development differs from one that survives production in several important ways. A common failure mode is embedding drift: documents are indexed with one model version and queried with another after an API update, and vectors produced by different models are not comparable. Pin your embedding model version explicitly and version your indexes alongside your model configuration.
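
One way to make that pinning enforceable is to stamp every chunk with the configuration that produced its vector and check the stamp at query time. The sketch below is an illustration under assumptions: `EmbeddingConfig`, `EmbeddingMismatch`, `assert_compatible`, and the `index_version` label are hypothetical names, and the tag is assumed to live in the chunk's metadata column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    model: str          # e.g. "voyage-finance-2"
    dimensions: int     # must match the vector(n) column
    index_version: str  # hypothetical label, bumped on every full re-embed

    def tag(self) -> dict:
        """Metadata to store with every chunk at index time."""
        return {"embedding_model": self.model, "index_version": self.index_version}


class EmbeddingMismatch(RuntimeError):
    """Raised when stored vectors and the query use different configurations."""


def assert_compatible(query_cfg: EmbeddingConfig, chunk_metadata: dict) -> None:
    """Raise if a stored chunk was embedded under a different configuration."""
    expected = query_cfg.tag()
    stored = {key: chunk_metadata.get(key) for key in expected}
    if stored != expected:
        raise EmbeddingMismatch(f"index has {stored}, query uses {expected}")
```

Failing loudly here is deliberate: a mismatched query still returns rows with plausible-looking scores, so the problem is otherwise easy to miss.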

Chunk freshness is a second operational concern. Financial documents are amended, superseded, and revoked. Without a reindexing workflow triggered by document updates, your retrieval corpus drifts from your source of truth. The idempotent index_chunks method covers the reindexing step: calling it on an updated document deletes stale chunks and reindexes from scratch. Detecting the update, and removing chunks for documents that have been revoked, still has to be wired into your document workflow.
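
A simple way to detect which documents need attention is to compare content hashes. The sketch below assumes you keep a mapping of doc_id to the hash recorded at last indexing. `content_hash`, `needs_reindex`, and `plan_sync` are hypothetical helpers, not part of the pipeline class.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(doc_id: str, text: str, stored_hashes: Dict[str, str]) -> bool:
    """True when the document is new or its text changed since last indexing."""
    return stored_hashes.get(doc_id) != content_hash(text)


def plan_sync(
    current_docs: Dict[str, str],   # doc_id -> current text in the source of truth
    stored_hashes: Dict[str, str],  # doc_id -> hash recorded at last indexing
) -> Tuple[List[str], List[str]]:
    """Return (doc_ids to index or reindex, doc_ids to delete from the index)."""
    to_index = [d for d, text in current_docs.items()
                if needs_reindex(d, text, stored_hashes)]
    to_delete = [d for d in stored_hashes if d not in current_docs]
    return to_index, to_delete
```

Each doc_id in the first list would be passed to index_chunks with freshly chunked text. Each doc_id in the second list needs its rows deleted, which covers revoked documents that a reindex alone would never touch.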

Finally, retrieval quality degrades when top-k results include chunks with low similarity scores. Set a minimum similarity threshold (0.65–0.75 cosine similarity is a common starting range, but the right value depends on the embedding model and should be tuned against real queries) and have the pipeline respond with "insufficient information in available documents" rather than hallucinate from weak context. In financial services, a confident wrong answer is far more dangerous than an honest admission that the documents do not contain the answer.
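
The threshold gate can be sketched independently of the database and the LLM. Here `retrieve_fn` and `generate_fn` are stand-ins for the pipeline's retrieval and generation steps, `gated_answer` and `filter_by_similarity` are hypothetical names, and the 0.7 default is an assumption taken from the middle of the range above.

```python
from typing import Callable, Dict, List

INSUFFICIENT = "Insufficient information in available documents."


def filter_by_similarity(chunks: List[Dict], min_similarity: float = 0.7) -> List[Dict]:
    """Keep only chunks whose cosine similarity meets the threshold."""
    return [c for c in chunks if c["similarity"] >= min_similarity]


def gated_answer(
    query: str,
    retrieve_fn: Callable[[str], List[Dict]],
    generate_fn: Callable[[str, List[Dict]], str],
    min_similarity: float = 0.7,  # assumed default; tune per embedding model
) -> Dict:
    """Generate only when at least one retrieved chunk clears the threshold."""
    chunks = filter_by_similarity(retrieve_fn(query), min_similarity)
    if not chunks:
        # Refuse instead of generating from weak context
        return {"answer": INSUFFICIENT, "sources": []}
    return {"answer": generate_fn(query, chunks), "sources": chunks}
```

Because the gate runs before generation, a query with no strong matches never reaches the model, which saves a call and removes the opportunity to answer from weak context.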
