Wiki: Docs2db Generated: 2026-02-13

Relevant source files

The following files were used as context for generating this wiki page:

- [src/docs2db/chunks.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/chunks.py)
- [src/docs2db/docs2db.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/docs2db.py)
- [src/docs2db/multiproc.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/multiproc.py)
- [src/docs2db/ingest.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/ingest.py)
- [README.md](https://github.com/b08x/docs2db/blob/main/README.md)
- [CHANGELOG.md](https://github.com/b08x/docs2db/blob/main/CHANGELOG.md)

Chunking and Contextual Retrieval

1. Introduction

Chunking and Contextual Retrieval constitutes the second major processing stage in the docs2db pipeline, following document ingestion. This mechanism transforms raw Docling JSON documents into semantically enriched text segments optimized for retrieval-augmented generation (RAG) applications.

The system takes a hybrid approach, combining structural analysis from Docling’s chunker with optional LLM-generated semantic context. Each document is split into chunks, and an optional contextual enrichment layer is applied using external LLM providers (OpenAI-compatible APIs, IBM WatsonX, OpenRouter, or Mistral). The enrichment step resembles the contextual retrieval technique described by Anthropic: each chunk receives a brief semantic description situating it within the broader document.

The processing pipeline supports parallel execution via a BatchProcessor, enabling efficient handling of large document collections. The system maintains incremental processing capabilities, automatically skipping documents that have not changed since the last processing run.

Sources: src/docs2db/chunks.py#L1-L50, README.md#L1-L30

2. Architecture Overview

2.1 Pipeline Position

The chunking stage occupies the second position in the docs2db processing pipeline:

[1] Ingest → [2] Chunk → [3] Embed → [4] Load → Database

Each stage produces intermediate files stored in the content directory (docs2db_content/). The chunking stage reads source.json files produced by ingestion and generates chunks.json files containing text segments with optional semantic context.

Sources: src/docs2db/docs2db.py#L1-L30, README.md#L30-L50

2.2 Core Components

The chunking subsystem comprises five primary components:

Component	File	Responsibility
`generate_chunks`	`chunks.py`	Main CLI command and orchestration
`HybridChunker`	`chunks.py`	Docling-based document splitting
`LLMSession`	`chunks.py`	LLM provider abstraction and session management
LLM Providers	`chunks.py`	API clients for WatsonX, OpenRouter, Mistral
`BatchProcessor`	`multiproc.py`	Parallel file processing

Sources: src/docs2db/chunks.py#L100-L300, src/docs2db/multiproc.py#L1-L50

3. Document Chunking Mechanism

3.1 HybridChunker Usage

The system employs Docling’s HybridChunker for document segmentation. This chunker provides structural context including heading hierarchy, page numbers, and document organization. The chunker is initialized with a tokenizer and peer-merging enabled to produce coherent segments.

# From chunks.py - chunk generation logic

chunker = HybridChunker(tokenizer=get_tokenizer(), merge_peers=True)

for chunk in chunker.chunk(dl_doc=dl_doc):
    # Get structural context from docling
    structural_context_text = chunker.contextualize(chunk=chunk)
    chunk_text = structural_context_text.replace("\xa0", " ")

Sources: src/docs2db/chunks.py#L200-L220

3.2 Chunk Data Structure

Each chunk produces a structured dictionary containing multiple text representations:

chunk_data = {
    "text": chunk_text,              # Structural context + chunk text - shown to LLM
    "contextual_text": contextual_text,  # Semantic context + structural context + chunk text - for indexing
    "metadata": chunk.meta.model_dump(),
}

The text field holds the structural context followed by the chunk text; it is the version shown to language models during question answering. The contextual_text field prepends the LLM-generated semantic context to that same text and is used for vector embedding and retrieval.

Sources: src/docs2db/chunks.py#L230-L240

3.3 Processing Flow

The chunking process branches based on whether contextual enrichment is enabled. When skip_context is True, the system relies solely on Docling’s structural context. When False, an LLM generates semantic context for each chunk.

Sources: src/docs2db/chunks.py#L180-L250, README.md#L50-L70
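The branch described in 3.3 can be sketched as follows. This is a minimal illustration rather than the code in chunks.py: build_chunk_records and the get_context callable are hypothetical names, joining the semantic context to the chunk text with a blank line is an assumption, and so is the fallback of contextual_text to text when context is skipped.

```python
from typing import Callable, Optional


def build_chunk_records(
    chunk_texts: list[str],
    skip_context: bool,
    get_context: Optional[Callable[[str], str]] = None,
) -> list[dict]:
    """Build chunk records, optionally prefixing LLM-generated context.

    Hypothetical helper: field names follow the wiki, the join format is assumed.
    """
    records = []
    for text in chunk_texts:
        if skip_context or get_context is None:
            # Structural context only: index the same text that is shown to the LLM.
            contextual_text = text
        else:
            # Semantic context first, then structural context + chunk text.
            contextual_text = f"{get_context(text)}\n\n{text}"
        records.append({"text": text, "contextual_text": contextual_text})
    return records


# A stand-in context function replaces the real LLM call here.
fast = build_chunk_records(["Intro\nHello"], skip_context=True)
rich = build_chunk_records(
    ["Intro\nHello"], skip_context=False, get_context=lambda t: "Opening section."
)
```

Passing the context generator as a callable keeps the branch testable without any network access.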

4. Contextual Enrichment System

4.1 LLMSession Class

The LLMSession class manages the interaction between chunk generation and LLM providers. It handles provider initialization, document-level context management, summarization for large documents, and per-chunk context generation.

class LLMSession:
    def __init__(
        self,
        model: str,
        provider: str,
        openai_url: str | None = None,
        watsonx_url: str | None = None,
        openrouter_url: str | None = None,
        mistral_url: str = "https://api.mistral.ai/v1",
        context_limit_override: int | None = None,
        shared_state: dict | None = None,
    ):
        # Provider instantiation based on configuration

Sources: src/docs2db/chunks.py#L300-L380

4.2 Document Context Management

The session maintains document-level context and handles documents exceeding model token limits through summarization:

def set_document(self, doc_text: str):
    # Analyze document size
    doc_words = len(doc_text.split())
    doc_tokens = estimate_tokens(doc_text)
    
    # Determine model limits
    model_limit = self.context_limit_override or MODEL_CONTEXT_LIMITS.get(self.model, 32768)
    usable_limit = int(model_limit * CONTEXT_SAFETY_MARGIN)
    
    # Summarize if document exceeds limits
    if doc_tokens > usable_limit:
        doc_text = self._summarize_document(doc_text)

The system applies a safety margin (0.8 default) to prevent exceeding context windows. When documents exceed usable limits, they are summarized to fit within available context.

Sources: src/docs2db/chunks.py#L400-L450
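To make the limit arithmetic in 4.2 concrete, the sketch below isolates the decision as pure functions. The constant values come from this page (0.8 margin, 32768 default, chars / 3 estimate); usable_limit and needs_summary are hypothetical helper names, not functions from chunks.py.

```python
from typing import Optional

CONTEXT_SAFETY_MARGIN = 0.8  # default stated in the text
MODEL_CONTEXT_LIMITS = {"qwen2.5:7b-instruct": 32768}
DEFAULT_CONTEXT_LIMIT = 32768


def estimate_tokens(text: str) -> int:
    """Character-based estimate: roughly one token per 3 characters."""
    return int(len(text) / 3.0)


def usable_limit(model: str, override: Optional[int] = None) -> int:
    """Apply the safety margin to the override or the model's known limit."""
    model_limit = override or MODEL_CONTEXT_LIMITS.get(model, DEFAULT_CONTEXT_LIMIT)
    return int(model_limit * CONTEXT_SAFETY_MARGIN)


def needs_summary(doc_text: str, model: str, override: Optional[int] = None) -> bool:
    """True when the estimated token count exceeds the usable limit."""
    return estimate_tokens(doc_text) > usable_limit(model, override)


limit = usable_limit("qwen2.5:7b-instruct")                      # int(32768 * 0.8)
small = needs_summary("x" * 60_000, "qwen2.5:7b-instruct")       # ~20000 tokens
large = needs_summary("x" * 90_000, "qwen2.5:7b-instruct")       # ~30000 tokens
raised = needs_summary("x" * 90_000, "any-model", override=65536)
```

With the default 32768-token limit the usable budget is 26214 tokens, so a document of about 90,000 characters is summarized while one of 60,000 characters is not.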

4.3 Chunk Context Generation

For each chunk, the system generates a succinct semantic context:

def get_chunk_context(self, chunk_text: str) -> str:
    chunk_prompt = f"""Here is a chunk from the document:
<chunk>
{chunk_text}
</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
    
    return self.provider.get_chunk_context(chunk_prompt)

The prompt explicitly instructs the LLM to provide only the contextual description without additional commentary, ensuring clean context for retrieval.

Sources: src/docs2db/chunks.py#L460-L475

5. LLM Provider Implementation

5.1 Provider Architecture

The system implements a provider abstraction pattern: a base LLMProvider class with one concrete implementation per service. Each provider implements the get_chunk_context method with provider-specific API calls while exposing a unified interface.

Sources: src/docs2db/chunks.py#L500-L700
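The shape of such an abstraction might look like the sketch below. It is an assumption-laden illustration: only the class name LLMProvider, the get_chunk_context method, the shared_state argument and the doc_text attribute appear on this page. The set_document method on the provider and the StubProvider class are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Optional


class LLMProvider(ABC):
    """Illustrative base class; the real one in chunks.py may differ."""

    def __init__(self, shared_state: Optional[dict] = None):
        self.shared_state = shared_state if shared_state is not None else {}
        self.doc_text = ""

    def set_document(self, doc_text: str) -> None:
        # Hypothetical: store the (possibly summarized) document for later prompts.
        self.doc_text = doc_text

    @abstractmethod
    def get_chunk_context(self, chunk_prompt: str) -> str:
        """Return a short context string for one chunk."""


class StubProvider(LLMProvider):
    """Hypothetical offline provider for tests; it makes no API calls."""

    def get_chunk_context(self, chunk_prompt: str) -> str:
        return f"Context from a {len(self.doc_text.split())}-word document."


provider: LLMProvider = StubProvider()
provider.set_document("alpha beta gamma")
context = provider.get_chunk_context("<chunk>beta</chunk>")
```

Because callers depend only on get_chunk_context, a stub like this can stand in for a paid API during tests.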

5.2 WatsonX Provider

The WatsonX provider uses IBM’s WatsonX SDK for model inference:

class WatsonXProvider(LLMProvider):
    def __init__(self, api_key: str, project_id: str, url: str, model: str, shared_state: dict | None = None):
        credentials = Credentials(api_key=api_key, url=url)
        self.api_client = APIClient(credentials=credentials, project_id=project_id)
        self.model_inference = ModelInference(model_id=model, api_client=self.api_client)
    
    def get_chunk_context(self, chunk_prompt: str) -> str:
        messages = [
            {"role": "system", "content": "You are an expert at providing concise context..."},
            {"role": "user", "content": f"<document>\n{self.doc_text}\n</document>"},
            {"role": "assistant", "content": "I have read the document..."},
            {"role": "user", "content": chunk_prompt},
        ]
        # `params` (generation parameters) is defined outside this excerpt
        response = self.model_inference.chat(messages=messages, params=params)
        return response["choices"][0]["message"]["content"].strip()

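Each request replays the same conversation prefix (system prompt, full document, and a short acknowledgement) before the chunk-specific prompt, so the model answers every context query with the whole document in view. In the excerpt above that prefix is rebuilt on every call rather than stored as running history.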

Sources: src/docs2db/chunks.py#L520-L580
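The message layout from the WatsonX excerpt can be isolated into a pure function, which makes it easy to see that the first three messages are identical for every chunk of a document. build_context_messages is a hypothetical helper name, and the two prompt strings are kept exactly as truncated on this page rather than completed.

```python
# Prompt strings are truncated in the source excerpt and are kept as-is here.
SYSTEM_PROMPT = "You are an expert at providing concise context..."
ASSISTANT_ACK = "I have read the document..."


def build_context_messages(doc_text: str, chunk_prompt: str) -> list[dict]:
    """Assemble the four-message conversation sent for one chunk."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"<document>\n{doc_text}\n</document>"},
        {"role": "assistant", "content": ASSISTANT_ACK},
        {"role": "user", "content": chunk_prompt},
    ]


first = build_context_messages("Full document text", "Prompt for chunk 1")
second = build_context_messages("Full document text", "Prompt for chunk 2")
shared_prefix = first[:3] == second[:3]  # only the last message differs
```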

5.3 OpenRouter Provider

OpenRouter provides access to multiple models through a unified OpenAI-compatible API:

class OpenRouterProvider(LLMProvider):
    def __init__(self, base_url: str, model: str, api_key: str, site_url: str | None = None, 
                 app_name: str | None = None, shared_state: dict | None = None):
        super().__init__(shared_state=shared_state)
        self.base_url = base_url.rstrip("/")
        self.model = model
        self.api_key = api_key
        self.site_url = site_url
        self.app_name = app_name
        self.client = httpx.Client(timeout=600.0)

Sources: src/docs2db/chunks.py#L600-L650
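The constructor above stores everything needed for an OpenAI-style chat completion call. The sketch below shows how such a request could be assembled without sending it. The mapping of site_url to the HTTP-Referer header and of app_name to the X-Title header follows OpenRouter's documented optional headers, but whether docs2db builds its request this way is an assumption; build_openrouter_request, the model identifier and the key are placeholders.

```python
from typing import Optional


def build_openrouter_request(
    base_url: str,
    model: str,
    api_key: str,
    messages: list[dict],
    site_url: Optional[str] = None,
    app_name: Optional[str] = None,
) -> tuple[str, dict, dict]:
    """Return (url, headers, payload) for a chat completion request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if site_url:
        headers["HTTP-Referer"] = site_url  # optional attribution header
    if app_name:
        headers["X-Title"] = app_name       # optional attribution header
    url = f"{base_url.rstrip('/')}/chat/completions"
    payload = {"model": model, "messages": messages}
    return url, headers, payload


url, headers, payload = build_openrouter_request(
    base_url="https://openrouter.ai/api/v1/",
    model="example/model-name",  # placeholder model identifier
    api_key="sk-placeholder",    # placeholder key
    messages=[{"role": "user", "content": "Hello"}],
    app_name="docs2db",
)
```

The resulting triple could then be passed to the stored client, for example `self.client.post(url, headers=headers, json=payload)`.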

5.4 Provider Configuration Parameters

Parameter	WatsonX	OpenRouter	Mistral	Description
`api_key`	Required	Required	Required	Service-specific API key
`model`	Required	Required	Required	Model identifier
`base_url`	N/A	Required	Optional	API endpoint URL
`project_id`	Required	N/A	N/A	WatsonX project ID
`site_url`	N/A	Optional	N/A	Referrer for OpenRouter
`app_name`	N/A	Optional	N/A	Application name for OpenRouter

Sources: src/docs2db/chunks.py#L520-L700

6. Parallel Processing

6.1 BatchProcessor Integration

The chunking command uses BatchProcessor from multiproc.py to process multiple documents in parallel:

chunker = BatchProcessor(
    worker_function=generate_chunks_batch,
    worker_args=(
        content_dir,
        force,
        skip_context,
        context_model,
        provider,
        openai_url,
        watsonx_url,
        openrouter_url,
        mistral_url,
        context_limit_override,
    ),
    progress_message="Chunking files...",
    batch_size=1,
    mem_threshold_mb=2000,
    max_workers=max_workers,
    use_shared_state=True,
)

Sources: src/docs2db/docs2db.py#L150-L180
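The internals of BatchProcessor are not shown on this page. As a rough mental model only, the sketch below splits files into batches of batch_size and hands each batch to a worker function along with the shared worker_args. It uses a thread pool from the standard library as a stand-in, the function names are hypothetical, and the calling convention (batch first, then worker_args) is an assumption; it also ignores memory thresholds and shared state.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_in_parallel(files, worker_function, worker_args=(), batch_size=1, max_workers=2):
    """Run worker_function(batch, *worker_args) for each batch and flatten results."""
    batches = make_batches(files, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input batches.
        results = pool.map(lambda batch: worker_function(batch, *worker_args), batches)
        return [item for batch_result in results for item in batch_result]


def fake_worker(batch, suffix):
    """Stand-in for generate_chunks_batch: tags each file name."""
    return [name + suffix for name in batch]


done = process_in_parallel(["a", "b", "c"], fake_worker, worker_args=(".done",), batch_size=1)
```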

6.2 Shared State for Rate Limiting

The system uses shared state to coordinate rate limiting across parallel workers:

shared_state = {"watsonx": 0, "mistral": 0, "openrouter": 0}

Each provider tracks request counts to implement throttling when necessary. The use_shared_state=True parameter enables this coordination.

Sources: src/docs2db/chunks.py#L250-L280, src/docs2db/multiproc.py#L50-L100
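How the counters translate into throttling is not spelled out here. One plausible pattern, shown purely as an assumption, is a lock-protected counter per provider that workers consult before sending a request. SharedRequestCounter and max_requests are hypothetical names; across processes the plain dict and lock would be replaced by multiprocessing.Manager().dict() and Manager().Lock().

```python
import threading


class SharedRequestCounter:
    """Hypothetical per-provider request counter guarded by a lock."""

    def __init__(self, shared_state: dict, lock=None):
        self.state = shared_state
        self.lock = lock or threading.Lock()

    def record(self, provider: str) -> int:
        """Increment and return the request count for a provider."""
        with self.lock:
            self.state[provider] = self.state.get(provider, 0) + 1
            return self.state[provider]

    def should_throttle(self, provider: str, max_requests: int) -> bool:
        """True once a provider has reached the assumed request budget."""
        with self.lock:
            return self.state.get(provider, 0) >= max_requests


shared_state = {"watsonx": 0, "mistral": 0, "openrouter": 0}
counter = SharedRequestCounter(shared_state)
for _ in range(3):
    counter.record("watsonx")
```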

7. CLI Interface

7.1 Chunk Command

The chunk command exposes chunking functionality through the CLI:

@app.command()
def chunk(
    content_dir: Annotated[str | None, typer.Option(help="Path to content directory")] = None,
    pattern: Annotated[str, typer.Option(help="Directory pattern")] = "**",
    force: Annotated[bool, typer.Option(help="Force reprocessing even if up-to-date")] = False,
    dry_run: Annotated[bool, typer.Option(help="Show what would process")] = False,
    skip_context: Annotated[bool | None, typer.Option(help="Skip LLM contextual chunk generation")] = None,
    context_model: Annotated[str | None, typer.Option(help="LLM model for context generation")] = None,
    llm_provider: Annotated[str | None, typer.Option(help="LLM provider: 'openai', 'watsonx', 'openrouter', or 'mistral'")] = None,
    openai_url: Annotated[str | None, typer.Option(...)] = None,
    watsonx_url: Annotated[str | None, typer.Option(...)] = None,
    openrouter_url: Annotated[str | None, typer.Option(...)] = None,
    mistral_url: Annotated[str | None, typer.Option(...)] = None,
    context_limit_override: Annotated[int | None, typer.Option(...)] = None,
    workers: Annotated[int | None, typer.Option(help="Number of parallel workers")] = None,
):

Sources: src/docs2db/docs2db.py#L120-L200

7.2 Usage Examples

# Fast chunking without LLM context

docs2db chunk --skip-context

# Using Ollama with local model

docs2db chunk --context-model qwen2.5:7b-instruct

# Using OpenAI

docs2db chunk --openai-url https://api.openai.com --context-model gpt-4o-mini

# Using WatsonX

docs2db chunk --watsonx-url https://us-south.ml.cloud.ibm.com --context-model granite-3-8b

Sources: README.md#L70-L90

8. Token Estimation and Limits

8.1 Estimation Algorithm

The system uses character-based token estimation for efficiency:

def estimate_tokens(text: str) -> int:
    """Estimate token count using conservative approximation.
    
    Formula: chars / 3.0
    - Regular English prose: ~4-5 chars/token (conservative at 3)
    - Code/data/numbers: ~2-3 chars/token
    """
    char_count = len(text)
    return int(char_count / 3.0)

Dividing by 3 characters per token deliberately overestimates the token count for English prose (roughly 4-5 characters per token), which leaves headroom for denser content such as code and numeric data.

Sources: src/docs2db/chunks.py#L320-L345
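A short worked example shows how much headroom the estimate leaves. It assumes prose tokenizes at about 4 characters per token, the low end of the range quoted in the docstring; real tokenizers will vary.

```python
def estimate_tokens(text: str) -> int:
    """Same formula as above: characters divided by 3."""
    return int(len(text) / 3.0)


prose = "word " * 240                     # 1200 characters of filler prose
estimated = estimate_tokens(prose)        # 1200 / 3
assumed_actual = len(prose) / 4           # assumption: ~4 chars per token
overshoot = estimated / assumed_actual    # ratio of estimate to assumed actual
```

Under that assumption the estimate is about one third higher than the true count (400 versus 300 tokens), so a document judged to fit will rarely overflow the context window.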

8.2 Model Context Limits

The system maintains a dictionary of known model context limits:

MODEL_CONTEXT_LIMITS = {
    "qwen2.5:7b-instruct": 32768,
    # Additional models may be added
}

When no explicit limit is found, a default of 32768 tokens is used.

Sources: src/docs2db/chunks.py#L80-L95

9. Incremental Processing

9.1 Staleness Detection

The system tracks file modification times to avoid reprocessing unchanged documents:

if not force and not is_chunks_stale(chunks_file, source_file):
    return chunks_file  # Skip processing

The staleness check compares the modification time of the chunks file against the source file. If chunks exist and are newer than the source, processing is skipped.

Sources: src/docs2db/chunks.py#L180-L195
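A modification-time comparison of this kind can be sketched in a few lines. chunks_are_stale is a hypothetical stand-in, not the project's is_chunks_stale, which may check more than timestamps. Treating equal timestamps as stale is an assumption chosen to err on the side of reprocessing.

```python
import os
import tempfile
from pathlib import Path


def chunks_are_stale(chunks_file: Path, source_file: Path) -> bool:
    """Stale if chunks are missing or not strictly newer than the source."""
    if not chunks_file.exists():
        return True
    return chunks_file.stat().st_mtime <= source_file.stat().st_mtime


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.json"
    chunks = Path(tmp) / "chunks.json"
    source.write_text("{}")
    missing = chunks_are_stale(chunks, source)   # no chunks file yet

    chunks.write_text("[]")
    os.utime(source, (1_000, 1_000))             # source written first
    os.utime(chunks, (2_000, 2_000))             # chunks written later
    fresh = chunks_are_stale(chunks, source)

    os.utime(source, (3_000, 3_000))             # source edited afterwards
    edited = chunks_are_stale(chunks, source)
```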

9.2 Lazy LLM Session Initialization

LLM sessions are created only when needed:

# Check if any files need LLM processing

llm_session_needed = False
for source_file in source_list:
    chunks_file = source_file.parent / "chunks.json"
    if force or is_chunks_stale(chunks_file, source_file):
        llm_session_needed = True
        break

if llm_session_needed:
    reusable_llm_session = LLMSession(...)

This optimization avoids initializing expensive LLM clients when all files are already processed.

Sources: src/docs2db/chunks.py#L280-L310
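The same lazy pattern can be written compactly with any(), which stops at the first file that needs work just as the break does. create_session_if_needed, is_stale and session_factory are hypothetical names used for illustration; the factory stands in for the LLMSession constructor.

```python
def create_session_if_needed(source_files, force, is_stale, session_factory):
    """Build a session only if at least one file would be processed."""
    needed = any(force or is_stale(f) for f in source_files)
    return session_factory() if needed else None


created = []


def fake_factory():
    """Stand-in for LLMSession(...): records that it was called."""
    created.append("session")
    return "session"


# All files up to date: the factory is never called.
idle = create_session_if_needed(["a", "b"], False, lambda f: False, fake_factory)
# One stale file is enough to trigger initialization.
busy = create_session_if_needed(["a", "b"], False, lambda f: f == "b", fake_factory)
```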

10. Output and Metadata

10.1 Chunks File Format

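The output chunks.json contains an array of chunk objects. The metadata keys shown here are illustrative; the actual keys are whatever chunk.meta.model_dump() produces: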

[
  {
    "text": "Structural context + chunk text",
    "contextual_text": "Semantic context + structural context + chunk text",
    "metadata": {
      "chunk_id": "...",
      "page": 1,
      "heading_hierarchy": ["Section", "Subsection"]
    }
  }
]
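A consumer of this file would typically pick a different field depending on the task. The sketch below parses an inline sample with the same two text fields and separates them; split_representations is a hypothetical helper, and the sample content is invented for illustration.

```python
import json

# Inline sample with the two text fields described above (content is made up).
sample = """
[
  {"text": "Install\\nRun the installer.",
   "contextual_text": "Setup instructions.\\n\\nInstall\\nRun the installer.",
   "metadata": {}}
]
"""


def split_representations(chunks: list[dict]) -> tuple[list[str], list[str]]:
    """Return (texts for embedding, texts for showing to the LLM)."""
    to_embed = [chunk["contextual_text"] for chunk in chunks]
    to_show = [chunk["text"] for chunk in chunks]
    return to_embed, to_show


chunks = json.loads(sample)
to_embed, to_show = split_representations(chunks)
```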

10.2 Processing Metadata

Each chunking operation records metadata about the configuration used:

processing_metadata = {
    "chunker": CHUNKING_CONFIG["chunker_class"],
    "parameters": {
        "max_tokens": CHUNKING_CONFIG["max_tokens"],
        "merge_peers": CHUNKING_CONFIG["merge_peers"],
        "tokenizer_model": CHUNKING_CONFIG["tokenizer_model"],
    },
}

if not skip_context:
    enrichment_metadata = {
        "model": context_model,
        "provider": provider,  # one of "watsonx", "openrouter", "mistral"
        "endpoint": url,
    }

Sources: src/docs2db/chunks.py#L240-L270

11. Observed Structural Patterns and Gaps

11.1 Provider Selection Logic

The provider selection follows a priority order: explicit --llm-provider flag > environment variable > URL-based inference. However, the code shows that URL flags can override provider selection even when a provider is explicitly specified:

if provider == "watsonx" or watsonx_url:
    ...  # Creates WatsonX provider
elif provider == "openrouter" or openrouter_url:
    ...  # Creates OpenRouter provider

This implicit override could lead to unexpected behavior if users specify conflicting options.

Sources: src/docs2db/chunks.py#L700-L750
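A resolution function that enforces the stated priority order strictly could look like the sketch below. It is one possible way to remove the ambiguity, not the project's implementation: resolve_provider is a hypothetical name, the environment value is passed in as a plain argument because the variable name is not given here, and the "openai" default is an assumption.

```python
from typing import Optional


def resolve_provider(
    explicit: Optional[str],
    env_value: Optional[str],
    urls: dict[str, Optional[str]],
    default: str = "openai",  # assumed default, not confirmed by the source
) -> str:
    """Pick a provider: explicit flag, then environment, then first URL given."""
    if explicit:
        return explicit
    if env_value:
        return env_value
    for name, url in urls.items():
        if url:
            return name  # URL-based inference only when nothing else is set
    return default


urls = {"watsonx": "https://us-south.ml.cloud.ibm.com", "openrouter": None}
explicit_wins = resolve_provider("openrouter", None, urls)
inferred = resolve_provider(None, None, urls)
fallback = resolve_provider(None, None, {"watsonx": None})
```

Here an explicit openrouter choice is honored even though a WatsonX URL is also present, which is the case the if/elif chain above handles differently.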

11.2 Missing Error Handling

The chunk generation function lacks explicit error handling for LLM API failures. If get_chunk_context raises an exception, the entire batch may fail. There’s no retry logic at the chunk level (only at the API request level via @_get_llm_retry_decorator).

Sources: src/docs2db/chunks.py#L220-L235
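One defensive option, offered as a suggestion rather than a description of existing behavior, is to catch failures per chunk and fall back to the structural text so a single API error does not abort the batch. context_or_none and enrich are hypothetical helpers.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("chunking_sketch")


def context_or_none(get_context: Callable[[str], str], chunk_text: str) -> Optional[str]:
    """Return the LLM context, or None if the call fails."""
    try:
        return get_context(chunk_text)
    except Exception as exc:  # broad on purpose: any provider failure is non-fatal
        logger.warning("Context generation failed, using structural text: %s", exc)
        return None


def enrich(chunk_text: str, get_context: Callable[[str], str]) -> str:
    """Prefix the context when available; otherwise keep the chunk text as-is."""
    context = context_or_none(get_context, chunk_text)
    return f"{context}\n\n{chunk_text}" if context else chunk_text


def failing_context(_text: str) -> str:
    """Stand-in for a provider call that always fails."""
    raise RuntimeError("simulated API failure")


degraded = enrich("Chunk body", failing_context)
enriched = enrich("Chunk body", lambda t: "About the chunk.")
```

The trade-off is that degraded chunks are indexed without semantic context, so a production version would likely record which chunks fell back.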

11.3 Summarization Once Per Session

Summarization is triggered from set_document, so it runs only when that method is called. If a session is reused across many documents without a fresh set_document call for each, subsequent large documents may fail to generate context properly.

Sources: src/docs2db/chunks.py#L400-L420

12. Conclusion

Chunking and Contextual Retrieval in docs2db transforms ingested documents into retrieval-optimized text segments. The architecture separates structural chunking (handled by Docling’s HybridChunker) from semantic enrichment (delegated to configurable LLM providers).

Key structural observations:

- Modular Provider Design: The abstraction enables flexible LLM provider selection without modifying core chunking logic.
- Parallel Processing: Integration with BatchProcessor enables scalable document processing while maintaining coordination through shared state.
- Incremental Processing: Staleness detection and lazy LLM initialization optimize for repeated runs over evolving document collections.
- Context Safety: Token estimation, model limits, and summarization mechanisms protect against context overflow.
- Structured Output: Dual text representations (text and contextual_text) serve different purposes: LLM consumption versus vector indexing.

The system’s design reflects practical RAG engineering principles, balancing computational efficiency against retrieval quality through contextual enrichment. The observed gaps in error handling and provider selection logic represent areas where defensive programming could strengthen robustness.

Sources: src/docs2db/chunks.py#L1-L50, src/docs2db/docs2db.py#L120-L200, README.md#L50-L90, CHANGELOG.md#L1-L30