Béranger Thomas
Python NLP Embeddings Benchmark RAG Chunking Open Source

ForzaEmbed

Python benchmarking framework for text embedding models: grid search over chunking strategies and similarity metrics, with textual heatmap and embedding space visualizations.

Context

Choosing an embedding model and a text chunking strategy for a RAG pipeline is rarely straightforward: performance varies with language, chunk size, overlap, and the similarity metric in use. ForzaEmbed automates this evaluation by running an exhaustive grid search across all these parameters, and produces interactive reports to visually analyze the quality of the resulting embeddings.

Architecture

The framework is built around three stages:

  1. Configuration expansion: from a YAML file, ForzaEmbed generates the Cartesian product of all parameters, namely embedding model, chunking strategy (langchain, raw, semchunk, nltk, spacy), chunk size, overlap, and similarity metric (cosine, euclidean, dot_product, etc.). Sentence-based chunkers (nltk, spacy) ignore size parameters, so combinations that differ only in those values are collapsed into one, which prunes redundant combinations (up to 40% of the grid).
  2. Execution and caching: for each combination, the text is chunked, embeddings are computed, and chunks are scored against user-defined keyword themes. Every result is cached in a SQLite database, quantized to save space (float16 for embeddings, uint16 for similarities). When a run is resumed, combinations already in the cache are skipped.
  3. Interactive report: a standalone HTML file is generated with a textual heatmap (spans color-coded by thematic similarity), t-SNE/UMAP/PCA projections of chunk embeddings with original text tooltips, and a draggable floating similarity threshold slider to dim irrelevant passages. Evaluation metrics (silhouette score with intra/inter-cluster decomposition, computation time) are displayed per configuration.
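
The first two stages can be illustrated with a minimal, standard-library-only sketch. It is not ForzaEmbed's actual code: the `GRID` keys, the `expand` and `run` helpers, the table layout, and the placeholder `compute` callback are all hypothetical, and it assumes that sentence-based chunkers ignore overlap as well as chunk size. It expands a grid, collapses combinations that differ only in parameters a sentence chunker ignores, stores a float16-packed vector per configuration in SQLite, and skips configurations already present when run again.

```python
import itertools
import json
import sqlite3
import struct

# Hypothetical grid; keys mirror the parameters listed above,
# not ForzaEmbed's actual YAML schema.
GRID = {
    "model": ["model-a", "model-b"],
    "chunker": ["langchain", "raw", "nltk"],
    "chunk_size": [256, 512],
    "overlap": [0, 64],
    "metric": ["cosine", "dot_product"],
}
SENTENCE_CHUNKERS = {"nltk", "spacy"}


def expand(grid):
    """Cartesian product, deduplicated for chunkers that ignore size settings."""
    seen, configs = set(), []
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        if cfg["chunker"] in SENTENCE_CHUNKERS:
            # Assumption: both size and overlap are irrelevant here.
            cfg["chunk_size"] = cfg["overlap"] = None
        key = json.dumps(cfg, sort_keys=True)
        if key not in seen:
            seen.add(key)
            configs.append(cfg)
    return configs


def pack_f16(vec):
    """Serialize floats as little-endian IEEE 754 half precision."""
    return struct.pack(f"<{len(vec)}e", *vec)


def unpack_f16(blob):
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


def run(configs, conn, compute):
    """Compute and cache each config; return how many were newly computed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, embedding BLOB)"
    )
    computed = 0
    for cfg in configs:
        key = json.dumps(cfg, sort_keys=True)
        if conn.execute("SELECT 1 FROM results WHERE key = ?", (key,)).fetchone():
            continue  # already cached: skipped on resume
        conn.execute("INSERT INTO results VALUES (?, ?)", (key, pack_f16(compute(cfg))))
        computed += 1
    conn.commit()
    return computed


configs = expand(GRID)
conn = sqlite3.connect(":memory:")
first = run(configs, conn, lambda cfg: [0.5, -0.25])   # placeholder embedding
second = run(configs, conn, lambda cfg: [0.5, -0.25])  # simulated resume
print(len(configs), first, second)  # 36 36 0
```

With this toy grid the full product has 48 combinations and 36 remain after deduplication, a 25% reduction; the actual saving depends on how many chunkers and size values the grid contains. The uint16 quantization of similarity scores is not shown, since the source does not describe how scores are mapped to integers.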

Impact

ForzaEmbed addresses a concrete question in RAG pipeline design: which model/chunking combination maximizes thematic coherence of embeddings on my documents? By making this evaluation systematic and visual, it turns what often remains an intuitive call into an evidence-based decision.
