Relevant source files
The following files were used as context for generating this wiki page:

- [README.md](https://github.com/b08x/docs2db/blob/main/README.md)
- [src/docs2db/chunks.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/chunks.py)
- [src/docs2db/docs2db.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/docs2db.py)
- [src/docs2db/ingest.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/ingest.py)
- [src/docs2db/multiproc.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/multiproc.py)

Installation
Introduction
Installing docs2db involves three parts: Python package dependencies, a container runtime environment for the PostgreSQL database, and environment-variable configuration. The system is a command-line tool that processes documents through a RAG (Retrieval-Augmented Generation) pipeline, so it requires both a Python runtime with specific package dependencies and a container runtime (Docker or Podman) for database services. The installation process supports both end-user deployment and development workflows, with the uv package manager as the primary installation method.
Prerequisites
Runtime Requirements
The docs2db system requires two distinct runtime environments: a Python environment for document processing and a container runtime for the PostgreSQL database service.
| Requirement | Type | Description |
|---|---|---|
| Python | Runtime | Package processing and CLI operations |
| Docker or Podman | Runtime | Database container management |
| uv | Package Manager | Python package installation |
Sources: README.md#L1-L50
Container Runtime Detection
The system performs automatic detection of container runtimes during database initialization. The detection logic searches for either Docker or Podman executables in the system PATH. If neither is found, the system logs an error message instructing users to install either Podman or Docker.
```python
# From database.py - Container runtime detection
# (excerpt; wrapped in a function here so the bare returns have context)
import subprocess

def detect_container_runtime():
    try:
        result = subprocess.run(
            ["docker", "ps"], capture_output=True, text=True, timeout=5
        )
        if result.returncode == 0:
            return "docker"
    except FileNotFoundError:
        pass
    try:
        result = subprocess.run(
            ["podman", "ps"], capture_output=True, text=True, timeout=5
        )
        if result.returncode == 0:
            return "podman"
    except FileNotFoundError:
        pass
    return None
```
Sources: README.md#troubleshooting
Python Package Installation
Using uv (Recommended)
The recommended installation method utilizes the uv package manager, which provides faster dependency resolution and installation compared to traditional pip-based approaches.
```bash
uv add docs2db
```
This command installs the package and its dependencies into the current Python environment. The uv tool resolves and installs all required packages defined in the project’s dependency specifications.
Sources: README.md#using-as-a-library
Development Installation
For contributors setting up a development environment, the installation process involves cloning the repository and synchronizing dependencies.
```bash
git clone https://github.com/rhel-lightspeed/docs2db
cd docs2db
uv sync
pre-commit install
```
The uv sync command installs all development dependencies defined in the project configuration, while pre-commit install sets up Git hooks for code quality checks.
Sources: README.md#development
Environment Configuration
Configuration Methods
The system supports multiple configuration approaches, with environment variables and .env files being the primary mechanisms for API credentials and service endpoints.
| Configuration Type | Location | Priority |
|---|---|---|
| CLI arguments | Command line | Highest |
| Environment variables | System shell | Medium |
| `.env` file | Project root | Lowest |
Sources: src/docs2db/chunks.py#L1-L100
Required Environment Variables
The configuration requirements vary depending on which processing stages are enabled. The following environment variables control different aspects of the system:
Database Configuration:
- Database connection parameters (host, port, user, password, db name)
LLM Provider Configuration:
- `WATSONX_API_KEY` - IBM WatsonX API authentication
- `WATSONX_PROJECT_ID` - WatsonX project identifier
- `OPENAI_API_KEY` - OpenAI API authentication
- `MISTRAL_API_KEY` - Mistral AI API authentication
- `OPENROUTER_API_KEY` - OpenRouter API authentication
Processing Configuration:
- `DOCLING_PIPELINE` - Document processing pipeline selection
- `DOCLING_MODEL` - Specific model for docling
- `DOCLING_DEVICE` - Processing device (auto, cpu, cuda, mps)
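A minimal `.env` file combining these variables might look like the following. All values shown are placeholders, and only the variables for the providers and features you actually use need to be set:

```bash
# .env - example only; all values are placeholders
WATSONX_API_KEY=your-watsonx-key
WATSONX_PROJECT_ID=your-project-id
OPENAI_API_KEY=sk-your-openai-key
MISTRAL_API_KEY=your-mistral-key
OPENROUTER_API_KEY=your-openrouter-key
DOCLING_PIPELINE=standard
DOCLING_DEVICE=auto
# DOCLING_MODEL depends on your docling setup; omit to use the default
```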
Sources: src/docs2db/chunks.py#L1-L50, src/docs2db/ingest.py#L1-L50
WatsonX Provider Configuration
When using IBM WatsonX as the LLM provider for contextual chunk generation, specific configuration is required:
```python
# From chunks.py - WatsonX provider initialization
if provider == "watsonx":
    if not self.watsonx_url:
        raise ValueError(
            "provider is 'watsonx' but watsonx_url is None. "
            "WatsonX API URL is required."
        )
    api_key = settings.watsonx_api_key
    project_id = settings.watsonx_project_id
    if not api_key or not project_id:
        raise ValueError(
            "WATSONX_API_KEY and WATSONX_PROJECT_ID must be set (via env vars or .env file)"
        )
```
Sources: src/docs2db/chunks.py#L1-L50
Mistral Provider Configuration
The Mistral AI provider requires an API key that can be set via environment variable:
```python
# From chunks.py - Mistral provider validation
if not api_key:
    raise ValueError(
        "Mistral API key required. "
        "Set MISTRAL_API_KEY environment variable or get one ..."
    )
```
Sources: src/docs2db/chunks.py#L1-L100
Database Setup
Container-Based Database
The system uses a containerized PostgreSQL database with the pgvector extension for vector similarity search. Database lifecycle management is handled through the CLI commands.
| Command | Function |
|---|---|
| `docs2db db-start` | Start the database container |
| `docs2db db-stop` | Stop the database container |
| `docs2db db-status` | Check database connection status |
Sources: README.md#troubleshooting, src/docs2db/docs2db.py#L1-L50
Database Initialization Flow
When the database is started, the system first detects an available container runtime (Docker or Podman, as shown above), then launches the PostgreSQL container with the pgvector extension; `docs2db db-status` can be used to confirm that the database is accepting connections.
Sources: README.md#troubleshooting
CLI Installation Verification
Command Availability
After installation, the docs2db CLI command becomes available. The CLI provides multiple subcommands for different stages of the RAG pipeline:
```bash
docs2db --help
```
The help output displays available commands including:
- `ingest` - Convert source documents to Docling JSON format
- `chunk` - Generate text chunks with optional LLM context
- `embed` - Generate vector embeddings
- `load` - Load processed data into the database
- `db-start`, `db-stop`, `db-status` - Database lifecycle management
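Assuming a directory of source documents, a full pipeline run chains these subcommands in order (the path is illustrative):

```bash
docs2db db-start        # bring up the PostgreSQL/pgvector container
docs2db ingest ./docs   # convert sources to Docling JSON
docs2db chunk           # generate text chunks
docs2db embed           # generate vector embeddings
docs2db load            # load processed data into the database
```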
Sources: src/docs2db/docs2db.py#L1-L100
Ingest Command
The ingest command processes source files and converts them to the internal Docling JSON format:
```bash
docs2db ingest SOURCE_PATH [OPTIONS]
```
Key options include:
- `--dry-run` - Preview processing without execution
- `--force` - Force reprocessing of existing files
- `--pipeline` - Docling pipeline selection (standard or vlm)
- `--device` - Processing device (auto, cpu, cuda, mps)
- `--batch-size` - Documents per worker batch
- `--workers` - Number of parallel workers
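For example, a CPU-only ingest of a directory tree might combine several of these options (path and values illustrative):

```bash
docs2db ingest ./docs --pipeline standard --device cpu --workers 4 --batch-size 8
docs2db ingest ./docs --dry-run   # preview what would be processed
```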
Sources: src/docs2db/docs2db.py#L1-L100, src/docs2db/ingest.py#L1-L50
Chunk Command
The chunk command generates text chunks from ingested documents with optional LLM-generated contextual enrichment:
```bash
docs2db chunk [OPTIONS]
```
Key options include:
- `--skip-context` - Disable LLM contextual generation (faster processing)
- `--context-model` - LLM model for context generation
- `--llm-provider` - Provider selection (openai, watsonx, openrouter, mistral)
- `--openai-url` - OpenAI-compatible API endpoint
- `--watsonx-url` - WatsonX API endpoint
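Two illustrative invocations (the endpoint URL and model name below are placeholders, not defaults shipped with the project):

```bash
# Fast pass without LLM-generated context
docs2db chunk --skip-context

# Contextual chunking via an OpenAI-compatible endpoint
docs2db chunk --llm-provider openai --openai-url http://localhost:8000/v1 --context-model my-model
```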
Sources: src/docs2db/docs2db.py#L1-L100, src/docs2db/chunks.py#L1-L50
Processing Pipeline Architecture
Pipeline Stages
The docs2db system implements a sequential processing pipeline where each stage produces artifacts consumed by subsequent stages:
Each stage creates intermediate files in the content directory (docs2db_content/), allowing incremental processing and reuse of expensive preprocessing steps.
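The incremental behavior can be sketched as a simple freshness check. This is an illustration of the general pattern, not the project's actual implementation; the function and file names here are hypothetical:

```python
from pathlib import Path

def needs_processing(source: Path, artifact: Path, force: bool = False) -> bool:
    """Re-run a stage only if its output is missing, stale, or forced."""
    if force or not artifact.exists():
        return True
    # Stale if the source changed after the artifact was produced
    return source.stat().st_mtime > artifact.stat().st_mtime

# Example: decide whether to re-ingest a document (paths illustrative)
doc = Path("docs/guide.pdf")
out = Path("docs2db_content/docs/guide.pdf/source.json")
```

With a check like this, re-running the pipeline skips documents whose intermediate files are already up to date, which is what makes expensive preprocessing reusable across runs.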
Sources: README.md#content-directory
Parallel Processing Configuration
The system utilizes multiprocessing for computationally intensive operations, with configurable worker counts and memory thresholds:
```python
# From multiproc.py - Batch processor initialization
processor = BatchProcessor(
    worker_function=ingest_batch,
    worker_args=(str(source_root), force, pipeline, model, device, batch_size),
    progress_message="Ingesting files...",
    batch_size=settings.docling_batch_size,
    mem_threshold_mb=1500,  # Lower threshold for docling processes
    max_workers=settings.docling_workers,
)
```
The BatchProcessor class manages parallel file processing with progress tracking and error handling.
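A simplified analogue of this batch-parallel pattern can be built with the standard library alone; the real `BatchProcessor` additionally handles memory thresholds, progress reporting, and error handling:

```python
from multiprocessing import Pool

def process_batch(batch):
    """Worker: process one batch of file names (here, just count characters)."""
    return [len(name) for name in batch]

def run_batches(files, batch_size=2, max_workers=2):
    """Split files into batches and process the batches in parallel."""
    batches = [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
    with Pool(processes=max_workers) as pool:
        results = pool.map(process_batch, batches)
    # Flatten per-batch results back into one list
    return [r for batch in results for r in batch]

if __name__ == "__main__":
    print(run_batches(["a.pdf", "doc.md", "x.html", "notes.txt"]))
```

Batching amortizes worker startup cost over several files, which matters when each worker must load heavy models, as docling workers do.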
Sources: src/docs2db/multiproc.py#L1-L50, src/docs2db/ingest.py#L1-L50
Content Directory Structure
Directory Initialization
The content directory (docs2db_content/) stores all intermediate processing files. The system automatically ensures the README exists in this directory:
```python
# From ingest.py - Directory setup
ensure_content_dir_readme()
```
The directory structure mirrors the source document hierarchy, with each source file receiving its own subdirectory containing:
| File | Purpose |
|---|---|
| `source.json` | Ingested document in Docling JSON format |
| `chunks.json` | Text chunks with optional LLM context |
| `gran.json` | Vector embeddings (filename varies by model) |
| `meta.json` | Processing metadata and timestamps |
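The mirrored layout can be illustrated with a small path-mapping helper. This is hypothetical; the project's actual naming logic may differ:

```python
from pathlib import Path

CONTENT_ROOT = Path("docs2db_content")

def artifact_dir(source_root: Path, source_file: Path) -> Path:
    """Map a source file to its per-document artifact directory,
    mirroring the source hierarchy under the content root."""
    relative = source_file.relative_to(source_root)
    return CONTENT_ROOT / relative

# e.g. docs/guides/install.pdf -> docs2db_content/guides/install.pdf/chunks.json
print(artifact_dir(Path("docs"), Path("docs/guides/install.pdf")) / "chunks.json")
```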
Sources: README.md#content-directory, src/docs2db/ingest.py#L1-L50
Troubleshooting Installation Issues
Container Runtime Errors
If the system reports “Neither Podman nor Docker found”, users must install one of these container runtimes:
| Runtime | Installation URL |
|---|---|
| Podman | https://podman.io/getting-started/installation |
| Docker | https://docs.docker.com/get-docker/ |
Sources: README.md#troubleshooting
Database Connection Errors
When experiencing “Database connection refused”:
```bash
docs2db db-start   # Start the database
docs2db db-status  # Check connection status
```
Sources: README.md#troubleshooting
Configuration Inheritance
Settings Precedence
The system implements a hierarchical configuration resolution: CLI arguments override environment variables, which in turn override default values.
This precedence ensures that users can override defaults at runtime while maintaining sensible fallbacks for unspecified options.
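A minimal sketch of this resolution order, assuming a simple CLI-then-environment-then-default lookup (the project uses its own settings machinery):

```python
import os

def resolve_setting(name, cli_value=None, default=None):
    """CLI argument wins, then the environment variable, then the default."""
    if cli_value is not None:
        return cli_value
    env_value = os.environ.get(name)
    if env_value is not None:
        return env_value
    return default

# Example: device selection falls back to "auto" when nothing else is set
device = resolve_setting("DOCLING_DEVICE", cli_value=None, default="auto")
```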
Sources: src/docs2db/ingest.py#L1-L50, src/docs2db/chunks.py#L1-L50
Conclusion
The installation architecture of docs2db reflects a modular design where Python package installation, container runtime setup, and environment configuration operate as independent but interconnected subsystems. The system prioritizes developer experience through multiple configuration mechanisms and clear error messaging for common installation failures. The prerequisite requirements are minimal—Python with uv, and Docker or Podman—making deployment straightforward across different environments. The sequential pipeline design with intermediate file storage enables incremental processing and supports efficient workflows for large document collections. The absence of a unified configuration file format (relying instead on environment variables and CLI arguments) represents a structural choice that prioritizes flexibility over opinionated defaults, though it may increase initial setup complexity for users preferring declarative configuration management.