Wiki: Docs2db Generated: 2026-02-13
Relevant source files The following files were used as context for generating this wiki page: - [src/docs2db/chunks.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/chunks.py) - [src/docs2db/docs2db.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/docs2db.py) - [src/docs2db/database.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/database.py) - [src/docs2db/ingest.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/ingest.py) - [README.md](https://github.com/b08x/docs2db/blob/main/README.md)

Embedding Models

Introduction

Embedding models in the docs2db system serve as the mechanism for converting textual content into vector representations that enable semantic similarity search within a PostgreSQL database. The system supports multiple embedding models through a configurable provider architecture, allowing users to select different embedding models based on their accuracy, performance, and resource requirements. The embedding process occurs after document ingestion and chunking, transforming the processed text chunks into numerical vectors that are stored alongside metadata for hybrid retrieval operations combining vector similarity with traditional BM25 full-text search.

The embedding workflow integrates with the broader RAG (Retrieval-Augmented Generation) pipeline, receiving input from the chunking stage and producing output that the database loading stage consumes. This architectural positioning means embedding models act as a critical transformation layer between raw text processing and the final searchable vector database.

Architecture Overview

Data Flow Architecture

The embedding system follows a sequential pipeline architecture where each stage produces artifacts consumed by the subsequent stage. The complete flow begins with document ingestion, proceeds through chunking with optional contextual enrichment, generates embeddings from the processed chunks, and culminates with database loading.

Processing Pipeline Sequence

The embedding operation is invoked through the CLI command docs2db embed or automatically through the docs2db pipeline command. The system uses a batch processing architecture that distributes embedding computation across multiple workers for improved throughput.

Supported Embedding Models

The docs2db system supports four embedding models, each identified by a specific configuration key used throughout the codebase. The model selection occurs through command-line parameters or configuration settings, with the default model being granite-30m-english.

Model Identifier File Suffix Description
ibm-granite/granite-embedding-30m-english gran.json Default model, 30M parameter English-specific embedding
intfloat/e5-small-v2 e5sm.json Small efficient encoder model
ibm-granite/slate-125m-english slate.json 125M parameter slate model
bnlplab/noinstruct-small noin.json No-instruction variant small model

Sources: README.md#L1-L100, database.py#L1-L50

The model selection determines both the embedding file naming convention and the vector dimensionality stored in the database. Each model produces vectors of specific dimensionality that must be compatible with the database schema’s embedding column definition.

Configuration System

Embedding Configuration Structure

The embedding models are configured through a centralized configuration system that defines model-specific parameters including dimensionality, batch sizes, and API endpoints. The configuration is accessed through the EMBEDDING_CONFIGS dictionary in the database module.

# Configuration pattern observed in database.py

if model not in EMBEDDING_CONFIGS:
    available = ", ".join(EMBEDDING_CONFIGS.keys())
    logger.error(f"Unknown model '{model}'. ")

Sources: database.py#L200-L210

CLI Configuration Options

The embedding command supports multiple configuration parameters that override or supplement the default settings. These parameters enable users to customize model selection, file patterns, processing behavior, and database connection details.

docs2db embed --model granite-30m-english
docs2db embed --pattern "docs/**"
docs2db embed --content-dir my-content

Sources: docs2db.py#L1-L50, README.md#L50-L80

The following table enumerates the available CLI options for the embed command:

Parameter Type Default Purpose
--model string ibm-granite/granite-embedding-30m-english Embedding model identifier
--pattern string ** Directory glob pattern for content files
--content-dir string docs2db_content/ Base content directory path
--force boolean false Force regeneration of existing embeddings
--host string auto-detect Database host override
--port integer auto-detect Database port override
--db string auto-detect Database name override
--user string auto-detect Database user override
--password string auto-detect Database password override
--batch-size integer varies Files processed per worker batch

Sources: docs2db.py#L80-L150

Processing Implementation

Batch Processing Architecture

The embedding system uses a BatchProcessor class that distributes work across multiple worker processes. This architecture enables parallel processing of embedding generation, significantly improving throughput for large document collections.

# BatchProcessor initialization for embedding

processor = BatchProcessor(
    worker_function=process_embedding_batch,
    worker_args=(model, host, port, db, user, password),
    progress_message="Generating embeddings...",
    batch_size=batch_size,
    mem_threshold_mb=2000,
    max_workers=max_workers,
)

Sources: database.py#L250-L280

The batch processor tracks errors and successful operations, returning counts of processed files and any failures that occurred during the embedding generation phase.

File Processing Flow

Each document’s embedding generation follows a specific file naming convention based on the selected model. The system reads the chunks.json file produced by the previous chunking stage and generates corresponding embedding vectors stored in model-specific files.

Sources: database.py#L150-L200, chunks.py#L300-L350

The embedding files are named according to the model family: gran.json for granite models, slate.json for slate models, e5sm.json for E5 models, and so forth. This naming convention allows the system to identify which model generated each embedding file.

Database Schema Integration

Vector Storage Structure

The embedding vectors are stored in a dedicated table that links to the chunks table through foreign key relationships. The database uses the pgvector extension to enable efficient similarity search operations on the embedding data.

-- Schema relationship (inferred from load operations)
INSERT INTO documents (...) VALUES (...)
INSERT INTO chunks (document_id, ...) VALUES (...)
INSERT INTO embeddings (chunk_id, vector, model_name) VALUES (...)

Sources: database.py#L180-L220

Metadata Tracking

Each embedding operation records metadata including the model used, processing timestamps, and chunk associations. This metadata enables auditing and supports features like model version tracking and reprocessing determination.

# Metadata captured during embedding

processing_metadata = {
    "embedder": model,
    "parameters": {...},
    "embedding_dimension": dim,
}

Sources: database.py#L220-L250

Provider Integration

Model Inference Mechanism

The embedding generation uses a model inference abstraction that supports different backends. The system uses the HuggingFace Transformers library for local embedding generation, with the model loaded based on the configuration for each supported model type.

# Model loading pattern

if model not in EMBEDDING_CONFIGS:
    raise ConfigurationError(f"Unknown model '{model}'")

Sources: database.py#L200-L210

Configuration Validation

The system validates model selection before processing begins, ensuring that only supported models can be used for embedding generation. This validation prevents runtime errors and provides clear feedback to users about available options.

Chunking Dependency

The embedding system depends entirely on the chunking stage’s output. The chunking process, controlled by configuration in chunks.py, determines both the text segments to embed and any contextual enrichment applied before embedding generation.

# Chunker configuration affects embedding input

CHUNKING_CONFIG = {
    "chunker_class": ...,
    "max_tokens": ...,
    "merge_peers": ...,
    "tokenizer_model": ...,
}

Sources: chunks.py#L400-L420

The contextual enrichment feature, when enabled, uses LLM providers (OpenAI, WatsonX, OpenRouter, Mistral) to generate semantic context for each chunk before embedding. This enriched text serves as input to the embedding model, potentially improving retrieval accuracy.

Performance Considerations

Parallelization Strategy

The system uses multiprocessing for embedding generation, with configurable worker counts. The default worker count derives from settings, but can be overridden via CLI parameters to match available hardware resources.

# Worker configuration

max_workers = workers if workers is not None else settings.embedding_workers

Sources: docs2db.py#L150-L180

Memory Management

The BatchProcessor includes memory threshold configuration (mem_threshold_mb=2000) to prevent excessive memory consumption during batch processing. This safeguard ensures stable operation on resource-constrained systems.

Conclusion

Embedding models constitute a fundamental transformation layer within the docs2db RAG pipeline, converting processed textual chunks into vector representations that enable semantic search capabilities. The system’s architecture supports multiple embedding models through a consistent configuration interface, with the pgvector extension providing the database-level infrastructure for efficient similarity operations. The processing pipeline integrates with preceding chunking and succeeding database loading stages through well-defined file artifacts, enabling both batch processing and incremental updates. Configuration options provide flexibility in model selection, worker parallelization, and database connectivity while maintaining sensible defaults that prioritize the granite-30m-english model as the standard option.

Navigation