Wiki: Docs2db
Generated: 2026-02-13

Relevant source files

The following files were used as context for generating this wiki page:
- [README.md](https://github.com/b08x/docs2db/blob/main/README.md)
- [src/docs2db/docs2db.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/docs2db.py)
- [src/docs2db/ingest.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/ingest.py)
- [src/docs2db/chunks.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/chunks.py)
- [src/docs2db/multiproc.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/multiproc.py)

Quickstart Guide

Introduction

docs2db is a RAG (Retrieval-Augmented Generation) pipeline tool that transforms source documents into a searchable vector database. This guide covers its four pipeline stages: ingestion, chunking, embedding generation, and database loading. Each stage writes intermediate artifacts that the next stage consumes, and each is exposed as a CLI command built with the typer framework in docs2db.py.

The repository is modular. docs2db.py defines the command-line interface and its four main commands: ingest, chunk, embed, and load. ingest.py converts documents to Docling JSON format, chunks.py splits text and can add LLM-generated contextual enrichment, and multiproc.py provides the batch processing infrastructure for parallel execution across multiple workers.

System Architecture Overview

The docs2db system follows a staged pipeline architecture where each stage operates on files produced by the previous stage. The content directory structure mirrors the source document hierarchy, with each source file receiving its own subdirectory containing processing artifacts.

Pipeline Stages

The complete processing pipeline consists of four sequential stages, each with distinct responsibilities:

Stage	Command	Input	Output	Primary Module
Ingestion	`docs2db ingest`	Source files (PDF, HTML, MD)	`source.json` (Docling JSON)	`ingest.py`
Chunking	`docs2db chunk`	`source.json`	`chunks.json`	`chunks.py`
Embedding	`docs2db embed`	`chunks.json`	`gran.json` (or model-specific)	`embed.py`
Loading	`docs2db load`	`gran.json` + `chunks.json`	Database records	`database.py`

Sources: README.md#L1-L50, src/docs2db/docs2db.py#L1-L100
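The stage order in the table can be scripted. The following is a minimal sketch, assuming the `docs2db` executable is on PATH and that `chunk`, `embed`, and `load` run acceptably with their default options. The helper names `pipeline_commands` and `run_pipeline` are hypothetical and not part of docs2db.

```python
import subprocess
from typing import List


def pipeline_commands(source_path: str) -> List[List[str]]:
    """Return the four stage commands in the order they must run."""
    return [
        ["docs2db", "ingest", source_path],  # writes source.json
        ["docs2db", "chunk"],                # reads source.json, writes chunks.json
        ["docs2db", "embed"],                # reads chunks.json, writes embeddings
        ["docs2db", "load"],                 # loads the artifacts into the database
    ]


def run_pipeline(source_path: str) -> None:
    """Run each stage in turn; check=True stops at the first failing stage."""
    for command in pipeline_commands(source_path):
        subprocess.run(command, check=True)
```

Stopping at the first failure matters here because each stage depends on the files written by the stage before it.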

Content Directory Structure

The system maintains intermediate processing files in a content directory (default: docs2db_content/). This directory must be committed to version control as it contains expensive preprocessing that can be reused across updates.

docs2db_content/
├── path/
│   └── to/
│       └── your/
│           └── document/
│               ├── source.json      # Docling ingested document
│               ├── chunks.json      # Text chunks with context
│               ├── gran.json       # Granite embeddings
│               └── meta.json       # Processing metadata
└── README.md

Sources: README.md#L60-L80
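Because every stage leaves a named file behind, the progress of a document can be read straight from its directory. The sketch below assumes the artifact names shown in the listing above (with `gran.json` as the embedding file) and treats any directory containing `source.json` as one document. `stage_status` and `find_document_dirs` are hypothetical helpers, not docs2db functions.

```python
from pathlib import Path
from typing import Dict, List

# Artifact names taken from the directory listing, in pipeline order.
STAGE_ARTIFACTS = {
    "ingest": "source.json",
    "chunk": "chunks.json",
    "embed": "gran.json",
}


def stage_status(document_dir: Path) -> Dict[str, bool]:
    """Report which stage artifacts exist in one document directory."""
    return {
        stage: (document_dir / name).is_file()
        for stage, name in STAGE_ARTIFACTS.items()
    }


def find_document_dirs(content_root: Path) -> List[Path]:
    """Treat every directory that holds a source.json as one document."""
    return sorted(p.parent for p in content_root.rglob("source.json"))
```

A document whose status shows `chunk` as true and `embed` as false, for example, has been chunked but not yet embedded.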

Core Components

CLI Entry Point

The main CLI application is defined using typer in docs2db.py, providing subcommands for each pipeline stage. The application automatically detects database configuration from compose files when not explicitly provided.

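import typer

app = typer.Typer(help="Make a RAG Database from source content")

# One subcommand per pipeline stage. Parameters and bodies are elided
# here; see docs2db.py for the full signatures.
@app.command()
def ingest() -> None: ...

@app.command()
def chunk() -> None: ...

@app.command()
def embed() -> None: ...

@app.command()
def load() -> None: ...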

Sources: src/docs2db/docs2db.py#L1-L30

Ingestion Module

The ingestion module converts various document formats into Docling JSON format. It uses a singleton pattern for the DocumentConverter to avoid redundant initialization overhead across multiple files.

from typing import Any

def _get_converter() -> Any:
    """Get or create the DocumentConverter singleton."""
    global _converter, _last_converter_settings
    # ... configuration logic

The module supports configurable pipelines (standard or vlm), models, devices (auto, cpu, cuda, mps), and batch sizes. Source files are discovered via find_ingestible_files() and processed in sorted order for deterministic behavior.

Sources: src/docs2db/ingest.py#L1-L100, src/docs2db/ingest.py#L200-L280
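The singleton idea can be illustrated independently of Docling. This is a sketch only: it assumes the cached object is rebuilt when the settings change, which is an inference from the `_last_converter_settings` name rather than something the source confirms. `get_converter`, `factory`, and the settings tuple are hypothetical.

```python
from typing import Any, Callable, Optional, Tuple

_converter: Optional[Any] = None
_last_settings: Optional[Tuple] = None


def get_converter(settings: Tuple, factory: Callable[[Tuple], Any]) -> Any:
    """Return the cached converter, rebuilding it only when settings change."""
    global _converter, _last_settings
    if _converter is None or settings != _last_settings:
        # The expensive construction happens once per distinct settings value.
        _converter = factory(settings)
        _last_settings = settings
    return _converter
```

Repeated calls with the same settings return the same object, so per-file processing pays the initialization cost only once.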

Chunking Module

The chunking module splits documents into smaller text segments suitable for embedding and retrieval. It supports two types of context generation:

- Structural context: heading hierarchy and page numbers from Docling
- Semantic context: LLM-generated descriptions of chunk relevance

The module integrates with multiple LLM providers (OpenAI, WatsonX, OpenRouter, Mistral) for contextual enrichment, using a chat-based message format.

from abc import ABC

class LLMProvider(ABC):
    """Abstract base class for LLM providers."""

    def get_chunk_context(self, chunk_prompt: str) -> str:
        """Get context for a chunk using LLM."""

    def summarize_text(self, text: str) -> str:
        """Summarize text using LLM."""

Sources: src/docs2db/chunks.py#L1-L80, src/docs2db/chunks.py#L200-L250
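To show how a provider might plug into this interface, the sketch below defines a stand-in base class with the same two method names and an offline stub that implements them. Both `ProviderSketch` and `TruncatingProvider` are hypothetical. The real providers call an LLM, and whether the real base class marks its methods abstract is not confirmed by the source.

```python
from abc import ABC, abstractmethod


class ProviderSketch(ABC):
    """Stand-in for the LLMProvider interface, using the same method names."""

    @abstractmethod
    def get_chunk_context(self, chunk_prompt: str) -> str: ...

    @abstractmethod
    def summarize_text(self, text: str) -> str: ...


class TruncatingProvider(ProviderSketch):
    """Offline stub: makes no LLM call, only truncates deterministically."""

    def __init__(self, max_chars: int = 80) -> None:
        self.max_chars = max_chars

    def get_chunk_context(self, chunk_prompt: str) -> str:
        return chunk_prompt[: self.max_chars]

    def summarize_text(self, text: str) -> str:
        return text[: self.max_chars]
```

A stub like this is useful in tests, where chunking logic can be exercised without network access or API keys.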

Batch Processing Infrastructure

The multiproc.py module provides the BatchProcessor class, which processes files in parallel with progress tracking and memory management.

class BatchProcessor:
    def __init__(
        self,
        worker_function,
        worker_args,
        progress_message: str,
        batch_size: int,
        mem_threshold_mb: int,
        max_workers: int,
        use_shared_state: bool = False,
    ):
        ...  # constructor body elided

The system uses worker-based parallelism with configurable memory thresholds to prevent OOM conditions during processing.

Sources: src/docs2db/multiproc.py#L1-L50
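The batching idea can be sketched with the standard library. This is not the BatchProcessor implementation: it uses a thread pool from `concurrent.futures` for simplicity and leaves out progress reporting and the memory threshold. `make_batches` and `process_in_batches` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def make_batches(items: Sequence[T], batch_size: int) -> List[List[T]]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def process_in_batches(
    worker: Callable[[T], R],
    items: Sequence[T],
    batch_size: int,
    max_workers: int,
) -> List[R]:
    """Apply worker to every item, one batch per pool task, keeping input order."""
    batches = make_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order, not completion order.
        batch_results = pool.map(lambda batch: [worker(x) for x in batch], batches)
        return [result for batch in batch_results for result in batch]
```

Grouping files into batches keeps the per-task scheduling overhead low when individual files are cheap to process.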

Data Flow and Processing Sequence

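Processing flows from source documents to a searchable database in four steps: `docs2db ingest` writes `source.json`, `docs2db chunk` reads it and writes `chunks.json`, `docs2db embed` reads the chunks and writes the embedding file, and `docs2db load` writes the results to the database.

The pipeline automatically skips unchanged files when commands are re-run, enabling efficient incremental updates. Whether a file has changed is determined by comparing the timestamps of the source and output files.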

Sources: src/docs2db/docs2db.py#L100-L200, src/docs2db/ingest.py#L150-L180
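A timestamp check of this kind can be sketched in a few lines. The version below assumes modification times are the timestamps being compared and that a `force` flag bypasses the check, mirroring the `--force` option. `needs_processing` is a hypothetical helper, not the function docs2db uses.

```python
from pathlib import Path


def needs_processing(source: Path, output: Path, force: bool = False) -> bool:
    """Return True when output is missing, stale, or reprocessing is forced."""
    if force or not output.exists():
        return True
    # Stale means the source was modified after the output was written.
    return source.stat().st_mtime > output.stat().st_mtime
```

With this rule, touching a source file is enough to make the next run regenerate its output.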

Configuration and Environment

CLI Options by Command

Ingest Command

Parameter	Type	Default	Description
`source_path`	Argument	Required	Path to directory or file to ingest
`--dry-run`	Option	False	Show what would be processed
`--force`	Option	False	Force reprocessing
`--pipeline`	Option	"standard"	Docling pipeline: "standard" or "vlm"
`--model`	Option	None	Docling model (pipeline-specific)
`--device`	Option	"auto"	Device: "auto", "cpu", "cuda", "mps"
`--batch-size`	Option	None	Batch size per worker
`--workers`	Option	None	Number of parallel workers

Sources: src/docs2db/docs2db.py#L30-L60

Chunk Command

Parameter	Type	Default	Description
`--content-dir`	Option	None	Path to content directory
`--pattern`	Option	"**"	Directory pattern for files
`--force`	Option	False	Force reprocessing
`--dry-run`	Option	False	Show what would process
`--skip-context`	Option	None	Skip LLM contextual generation
`--context-model`	Option	None	LLM model for context
`--llm-provider`	Option	None	Provider: "openai", "watsonx", "openrouter", "mistral"

Sources: src/docs2db/docs2db.py#L200-L260

LLM Provider Configuration

The system supports multiple LLM providers for contextual chunk generation:

Provider	Environment Variables	URL Parameter
OpenAI	`OPENAI_API_KEY`	`--openai-url`
WatsonX	`WATSONX_API_KEY`, `WATSONX_PROJECT_ID`	`--watsonx-url`
OpenRouter	`OPENROUTER_API_KEY`	`--openrouter-url`
Mistral	`MISTRAL_API_KEY`	`--mistral-url`

# Example: Using Ollama with OpenAI-compatible endpoint
docs2db chunk --context-model qwen2.5:7b-instruct

# Example: Using OpenAI API
docs2db chunk --openai-url https://api.openai.com --context-model gpt-4o-mini

# Example: Using WatsonX
docs2db chunk --watsonx-url https://us-south.ml.cloud.ibm.com

Sources: README.md#L90-L110, src/docs2db/chunks.py#L300-L350
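Before starting a long chunking run, it can help to confirm that the variables from the table are set. The sketch below copies the variable names from that table. The `missing_env` helper is hypothetical, and whether docs2db strictly requires each variable in every setup (a local OpenAI-compatible endpoint may not need a key, for instance) is an assumption.

```python
import os
from typing import Dict, List, Mapping, Optional

# Variable names copied from the provider table.
REQUIRED_ENV: Dict[str, List[str]] = {
    "openai": ["OPENAI_API_KEY"],
    "watsonx": ["WATSONX_API_KEY", "WATSONX_PROJECT_ID"],
    "openrouter": ["OPENROUTER_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
}


def missing_env(provider: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List the variables for a provider that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]
```

Passing an explicit mapping instead of relying on `os.environ` keeps the check easy to test.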

Using as a Library

The docs2db package can be imported and used programmatically in Python applications:

from pathlib import Path
from docs2db.ingest import ingest_file, ingest_from_content

# Ingest a file from disk
ingest_file(
    source_file=Path("document.pdf"),
    content_path=Path("docs2db_content/my_docs/document"),
    source_metadata={"source": "my_system", "retrieved_at": "2024-01-01"}
)

# Ingest content from memory (HTML, markdown, etc.)
ingest_from_content(
    content="<html>...</html>",
    content_path=Path("docs2db_content/my_docs/page"),
    stream_name="page.html",
    source_metadata={"url": "https://example.com"},
    content_encoding="utf-8"
)

Sources: README.md#L130-L155
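When ingesting many files programmatically, each one needs its own `content_path`. The sketch below derives one by mirroring the source hierarchy and dropping the file extension, which matches the `document.pdf` example above but is otherwise an assumption about the naming convention. `content_path_for` is a hypothetical helper.

```python
from pathlib import Path


def content_path_for(source_file: Path, source_root: Path, content_root: Path) -> Path:
    """Mirror the source hierarchy under content_root, one directory per file."""
    relative = source_file.relative_to(source_root)
    # Drop the final extension so document.pdf maps to a directory named document.
    return content_root / relative.with_suffix("")
```

The returned path can be passed as the `content_path` argument of `ingest_file`.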

Database Operations

The system manages PostgreSQL database lifecycle through dedicated commands:

Command	Function
`docs2db db-start`	Start database container
`docs2db db-stop`	Stop database container
`docs2db db-status`	Check database connection
`docs2db db-destroy`	Destroy database

The database uses pgvector extension for vector similarity search and GIN indexing for full-text search (tsvector/BM25).

Sources: README.md#L115-L125

RAG Features

The system implements several retrieval-augmented generation features:

- Contextual chunks: LLM-generated context for each chunk, following Anthropic’s contextual retrieval approach
- Vector embeddings: multiple models (granite-30m, e5-small-v2, slate-125m, noinstruct-small)
- Full-text search: PostgreSQL tsvector with GIN indexing for BM25
- Vector similarity: pgvector extension with HNSW indexes
- Schema versioning: tracks metadata and schema changes
- Portable dumps: self-contained SQL files

Sources: README.md#L45-L58

Troubleshooting

Common Issues

Issue	Solution
“Neither Podman nor Docker found”	Install Podman or Docker
“Database connection refused”	Run `docs2db db-start` to start database

Sources: README.md#L160-L165
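The first issue in the table can be checked ahead of time by looking for either executable on PATH. This is a sketch using `shutil.which`. The `find_container_runtime` helper is hypothetical, and trying Podman before Docker is an assumption based only on the order in the error message.

```python
import shutil
from typing import Callable, Optional


def find_container_runtime(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first available runtime name, or None if neither is installed."""
    for name in ("podman", "docker"):
        if which(name):
            return name
    return None
```

A `None` result corresponds to the situation in which the "Neither Podman nor Docker found" message would be expected.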

Conclusion

docs2db separates its concerns cleanly: ingest.py converts documents with Docling, chunks.py segments text with optional LLM enrichment, multiproc.py supplies the parallel processing infrastructure, and docs2db.py orchestrates everything through the typer CLI. Because each stage writes intermediate files, runs can be incremental and each stage can be debugged on its own. The LLM provider abstraction keeps a consistent interface across API backends, so the same pipeline can run in different deployment environments.