Wiki: Docs2db Generated: 2026-02-13
Relevant source files

The following files were used as context for generating this wiki page:

  • [README.md](https://github.com/b08x/docs2db/blob/main/README.md)
  • [src/docs2db/chunks.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/chunks.py)
  • [src/docs2db/docs2db.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/docs2db.py)
  • [src/docs2db/ingest.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/ingest.py)
  • [src/docs2db/multiproc.py](https://github.com/b08x/docs2db/blob/main/src/docs2db/multiproc.py)

Installation

Introduction

Installing docs2db involves three things: installing the Python package and its dependencies, providing a container runtime (Docker or Podman) for the PostgreSQL database, and setting environment variables. docs2db is a command-line tool that processes documents through a RAG (Retrieval-Augmented Generation) pipeline, so it needs both a Python runtime and a container runtime for database services. The installation supports both end-user deployment and development workflows, and the primary installation method uses the uv package manager.

Prerequisites

Runtime Requirements

The docs2db system requires two distinct runtime environments: a Python environment for document processing and a container runtime for the PostgreSQL database service.

| Requirement | Type | Description |
|---|---|---|
| Python | Runtime | Package processing and CLI operations |
| Docker or Podman | Runtime | Database container management |
| uv | Package Manager | Python package installation |

Sources: README.md#L1-L50
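
A quick preflight check for the requirements above can be sketched in Python. This helper is hypothetical and not part of docs2db: the name check_prerequisites and the use of shutil.which to look for tools on PATH are assumptions made for illustration.

```python
import shutil
import sys


def check_prerequisites(which=shutil.which):
    """Report which prerequisite tools are visible on PATH.

    `which` is injectable so the check can be exercised on a machine
    without the tools installed; by default it is shutil.which.
    """
    found = {tool: which(tool) is not None for tool in ("uv", "docker", "podman")}
    return {
        "python": sys.version.split()[0],
        "uv": found["uv"],
        # Either container runtime is sufficient.
        "container_runtime": found["docker"] or found["podman"],
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

Finding an executable on PATH does not prove the runtime is usable (for example, the Docker daemon may not be running), so this is only a first filter.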

Container Runtime Detection

The system detects the container runtime automatically during database initialization. It tries Docker first and then Podman, running the ps subcommand of each and accepting the first runtime that exits successfully. If neither is found, the system logs an error message instructing users to install either Podman or Docker.

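# From database.py - Container runtime detection
# (wrapped in a function so the excerpt runs on its own; the function name
# is illustrative and not taken from the source)

import subprocess


def detect_container_runtime():
    # Try Docker first: a zero exit code from "docker ps" means it is usable.
    try:
        result = subprocess.run(
            ["docker", "ps"], capture_output=True, text=True, timeout=5
        )
        if result.returncode == 0:
            return "docker"
    except FileNotFoundError:
        pass  # docker executable is not on PATH

    # Fall back to Podman using the same check.
    try:
        result = subprocess.run(
            ["podman", "ps"], capture_output=True, text=True, timeout=5
        )
        if result.returncode == 0:
            return "podman"
    except FileNotFoundError:
        pass  # podman executable is not on PATH

    # Neither runtime responded. A hung runtime would raise
    # subprocess.TimeoutExpired, which this excerpt does not handle.
    return None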

Sources: README.md#troubleshooting

Python Package Installation

The recommended installation method utilizes the uv package manager, which provides faster dependency resolution and installation compared to traditional pip-based approaches.

uv add docs2db

This command adds docs2db to the current project's dependencies and installs it, together with its own dependencies, into the project's environment. The uv tool resolves and installs all required packages defined in the package's dependency specifications.

Sources: README.md#using-as-a-library

Development Installation

For contributors setting up a development environment, the installation process involves cloning the repository and synchronizing dependencies.

git clone https://github.com/rhel-lightspeed/docs2db
cd docs2db
uv sync
pre-commit install

The uv sync command installs all development dependencies defined in the project configuration, while pre-commit install sets up Git hooks for code quality checks.

Sources: README.md#development

Environment Configuration

Configuration Methods

The system supports multiple configuration approaches, with environment variables and .env files being the primary mechanisms for API credentials and service endpoints.

| Configuration Type | Location | Priority |
|---|---|---|
| CLI arguments | Command line | Highest (overrides everything below) |
| Environment variables | System shell | High |
| .env file | Project root | Medium |

Sources: src/docs2db/chunks.py#L1-L100
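
The interplay between a .env file and shell environment variables can be illustrated with a minimal sketch. It is not the loader docs2db uses, which is not shown on this page; the function names parse_dotenv and effective_environment are hypothetical, and the parser handles only simple KEY=VALUE lines. It assumes, as the priorities above suggest, that a variable set in the shell wins over the same variable in the .env file.

```python
import os


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def effective_environment(dotenv_text, environ=None):
    """Merge .env values with the process environment; the shell wins."""
    environ = os.environ if environ is None else environ
    merged = parse_dotenv(dotenv_text)
    merged.update(environ)
    return merged
```

In this sketch, a DOCLING_DEVICE value exported in the shell replaces the one read from the file, while variables present only in the file are kept.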

Required Environment Variables

The configuration requirements vary depending on which processing stages are enabled. The following environment variables control different aspects of the system:

Database Configuration:

  • Database connection parameters (host, port, user, password, db name)

LLM Provider Configuration:

  • WATSONX_API_KEY - IBM WatsonX API authentication
  • WATSONX_PROJECT_ID - WatsonX project identifier
  • OPENAI_API_KEY - OpenAI API authentication
  • MISTRAL_API_KEY - Mistral AI API authentication
  • OPENROUTER_API_KEY - OpenRouter API authentication

Processing Configuration:

  • DOCLING_PIPELINE - Document processing pipeline selection
  • DOCLING_MODEL - Specific model for docling
  • DOCLING_DEVICE - Processing device (auto, cpu, cuda, mps)

Sources: src/docs2db/chunks.py#L1-L50, src/docs2db/ingest.py#L1-L50
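
Before running a stage that contacts an LLM, it can help to confirm that the relevant credentials are set. The sketch below is a hypothetical helper, not docs2db code: the function name missing_provider_vars is invented, and the mapping from provider to variables is inferred from the variable names listed above rather than confirmed by the source.

```python
import os

# Assumed mapping, inferred from the variable names in the list above.
REQUIRED_VARS = {
    "watsonx": ("WATSONX_API_KEY", "WATSONX_PROJECT_ID"),
    "openai": ("OPENAI_API_KEY",),
    "mistral": ("MISTRAL_API_KEY",),
    "openrouter": ("OPENROUTER_API_KEY",),
}


def missing_provider_vars(provider, environ=None):
    """Return the names of unset (or empty) variables for `provider`."""
    environ = os.environ if environ is None else environ
    if provider not in REQUIRED_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    return [name for name in REQUIRED_VARS[provider] if not environ.get(name)]
```

An empty list means every variable assumed for that provider is present; whether a given endpoint really needs a key (for example, a local OpenAI-compatible server) depends on the deployment.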

WatsonX Provider Configuration

When using IBM WatsonX as the LLM provider for contextual chunk generation, specific configuration is required:

# From chunks.py - WatsonX provider initialization

if provider == "watsonx":
    if not self.watsonx_url:
        raise ValueError(
            "provider is 'watsonx' but watsonx_url is None. "
            "WatsonX API URL is required."
        )

    api_key = settings.watsonx_api_key
    project_id = settings.watsonx_project_id

    if not api_key or not project_id:
        raise ValueError(
            "WATSONX_API_KEY and WATSONX_PROJECT_ID must be set (via env vars or .env file)"
        )

Sources: src/docs2db/chunks.py#L1-L50

Mistral Provider Configuration

The Mistral AI provider requires an API key that can be set via environment variable:

# From chunks.py - Mistral provider validation

if not api_key:
    raise ValueError(
        "Mistral API key required. "
        "Set MISTRAL_API_KEY environment variable or get one ..."
    )

Sources: src/docs2db/chunks.py#L1-L100

Database Setup

Container-Based Database

The system uses a containerized PostgreSQL database with the pgvector extension for vector similarity search. Database lifecycle management is handled through the CLI commands.

| Command | Function |
|---|---|
| docs2db db-start | Start the database container |
| docs2db db-stop | Stop the database container |
| docs2db db-status | Check database connection status |

Sources: README.md#troubleshooting, src/docs2db/docs2db.py#L1-L50
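
A script that starts the database may want to wait until it accepts connections before moving on. The following sketch polls the db-status command from the table above. It assumes that docs2db db-status exits with code 0 when the database is reachable, which this page does not confirm; the function names db_status_ok and wait_for_database are hypothetical.

```python
import subprocess
import time


def db_status_ok():
    """Return True when `docs2db db-status` exits with code 0 (assumed convention)."""
    try:
        result = subprocess.run(
            ["docs2db", "db-status"], capture_output=True, text=True, timeout=30
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def wait_for_database(check=db_status_ok, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll `check` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to sleep after the final attempt
    return False
```

The check and sleep parameters are injectable so the polling logic can be tested without a running database.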

Database Initialization Flow

Sources: README.md#troubleshooting

CLI Installation Verification

Command Availability

After installation, the docs2db CLI command becomes available. The CLI provides multiple subcommands for different stages of the RAG pipeline:

docs2db --help

The help output displays available commands including:

  • ingest - Convert source documents to Docling JSON format
  • chunk - Generate text chunks with optional LLM context
  • embed - Generate vector embeddings
  • load - Load processed data into the database
  • db-start, db-stop, db-status - Database lifecycle management

Sources: src/docs2db/docs2db.py#L1-L100

Ingest Command

The ingest command processes source files and converts them to the internal Docling JSON format:

docs2db ingest SOURCE_PATH [OPTIONS]

Key options include:

  • --dry-run - Preview processing without execution
  • --force - Force reprocessing of existing files
  • --pipeline - Docling pipeline selection (standard or vlm)
  • --device - Processing device (auto, cpu, cuda, mps)
  • --batch-size - Documents per worker batch
  • --workers - Number of parallel workers

Sources: src/docs2db/docs2db.py#L1-L100, src/docs2db/ingest.py#L1-L50
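
When ingest is driven from a script, the options above can be assembled into an argument list. This is a sketch with a hypothetical helper name (build_ingest_command). It assumes that --dry-run and --force are boolean switches and that the remaining options take their value as the following argument; the page lists the flags but does not show their exact syntax.

```python
def build_ingest_command(source_path, *, dry_run=False, force=False,
                         pipeline=None, device=None, batch_size=None, workers=None):
    """Build an argv list for `docs2db ingest` from keyword options."""
    command = ["docs2db", "ingest", str(source_path)]
    if dry_run:
        command.append("--dry-run")
    if force:
        command.append("--force")
    # Options that carry a value are emitted only when explicitly set.
    for flag, value in (("--pipeline", pipeline), ("--device", device),
                        ("--batch-size", batch_size), ("--workers", workers)):
        if value is not None:
            command += [flag, str(value)]
    return command
```

The resulting list can be passed to subprocess.run; building a list rather than a shell string avoids quoting problems with paths that contain spaces.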

Chunk Command

The chunk command generates text chunks from ingested documents with optional LLM-generated contextual enrichment:

docs2db chunk [OPTIONS]

Key options include:

  • --skip-context - Disable LLM contextual generation (faster processing)
  • --context-model - LLM model for context generation
  • --llm-provider - Provider selection (openai, watsonx, openrouter, mistral)
  • --openai-url - OpenAI-compatible API endpoint
  • --watsonx-url - WatsonX API endpoint

Sources: src/docs2db/docs2db.py#L1-L100, src/docs2db/chunks.py#L1-L50
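
The provider options interact: the WatsonX excerpt earlier on this page raises an error when the provider is watsonx but no URL is set. A pre-flight check in that spirit might look like the sketch below. The function validate_chunk_options is hypothetical, and it assumes the URL is supplied as an option rather than through settings, and that provider settings do not matter when context generation is skipped.

```python
PROVIDERS = ("openai", "watsonx", "openrouter", "mistral")


def validate_chunk_options(provider, *, skip_context=False, watsonx_url=None):
    """Return a list of problems; an empty list means the options look usable."""
    if skip_context:
        # Assumption: with context generation disabled, no LLM is contacted.
        return []
    problems = []
    if provider not in PROVIDERS:
        problems.append(
            f"unknown provider {provider!r}; expected one of {', '.join(PROVIDERS)}"
        )
    elif provider == "watsonx" and not watsonx_url:
        problems.append("provider 'watsonx' requires a WatsonX API URL (--watsonx-url)")
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every issue at once.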

Processing Pipeline Architecture

Pipeline Stages

The docs2db system implements a sequential processing pipeline in which each stage produces artifacts consumed by the next: ingest, chunk, embed, then load.

Each stage creates intermediate files in the content directory (docs2db_content/), allowing incremental processing and reuse of expensive preprocessing steps.

Sources: README.md#content-directory
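
Running the stages in order from a script can be sketched as follows. The subcommand names come from the CLI help summary on this page; everything else is an assumption. In particular, only ingest is shown with a positional path, so the sketch calls the other stages with no arguments, and it treats a non-zero exit code as failure. The names pipeline_commands and run_pipeline are hypothetical.

```python
import subprocess

STAGES = ("ingest", "chunk", "embed", "load")


def pipeline_commands(source_path):
    """Return one argv list per stage, in execution order."""
    commands = []
    for stage in STAGES:
        argv = ["docs2db", stage]
        if stage == "ingest":
            # Only ingest is documented here as taking a source path.
            argv.append(str(source_path))
        commands.append(argv)
    return commands


def run_pipeline(source_path, runner=subprocess.run):
    """Run the stages in order; return the name of the first failed stage, or None."""
    for argv in pipeline_commands(source_path):
        result = runner(argv)
        if result.returncode != 0:
            return argv[1]
    return None
```

Stopping at the first failure matters because each later stage depends on the files written by the one before it.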

Parallel Processing Configuration

The system utilizes multiprocessing for computationally intensive operations, with configurable worker counts and memory thresholds:

# From multiproc.py - Batch processor initialization

processor = BatchProcessor(
    worker_function=ingest_batch,
    worker_args=(str(source_root), force, pipeline, model, device, batch_size),
    progress_message="Ingesting files...",
    batch_size=settings.docling_batch_size,
    mem_threshold_mb=1500,  # Lower threshold for docling processes
    max_workers=settings.docling_workers,
)

The BatchProcessor class manages parallel file processing with progress tracking and error handling.

Sources: src/docs2db/multiproc.py#L1-L50, src/docs2db/ingest.py#L1-L50
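
The general pattern of batching work and fanning it out to a worker pool can be sketched with the standard library. This is not the BatchProcessor implementation, which is not reproduced on this page: it omits progress tracking, error handling, and the memory threshold, and the names make_batches and process_in_batches are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_in_batches(items, worker, batch_size=4, max_workers=2,
                       executor_cls=ProcessPoolExecutor):
    """Apply `worker` to each batch in parallel and flatten the results.

    `worker` receives one batch (a list) and returns a list of results.
    With the default process pool it must be a picklable top-level function.
    """
    batches = make_batches(list(items), batch_size)
    results = []
    with executor_cls(max_workers=max_workers) as pool:
        # map() yields batch results in submission order.
        for batch_result in pool.map(worker, batches):
            results.extend(batch_result)
    return results
```

Larger batches reduce per-task overhead, while smaller batches spread uneven workloads more evenly across workers; the right balance depends on document size and available memory.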

Content Directory Structure

Directory Initialization

The content directory (docs2db_content/) stores all intermediate processing files. The system automatically ensures the README exists in this directory:

# From ingest.py - Directory setup

ensure_content_dir_readme()

The directory structure mirrors the source document hierarchy, with each source file receiving its own subdirectory containing:

| File | Purpose |
|---|---|
| source.json | Ingested document in Docling JSON format |
| chunks.json | Text chunks with optional LLM context |
| gran.json | Vector embeddings (filename varies by model) |
| meta.json | Processing metadata and timestamps |

Sources: README.md#content-directory, src/docs2db/ingest.py#L1-L50
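
Because each stage leaves a file behind, the progress of a single document can be read off its subdirectory. The sketch below is a hypothetical inspection helper (stage_status is not a docs2db function). It uses the file names from the table above, and since the embeddings file name varies by model, gran.json is only the example given there and can be overridden.

```python
from pathlib import Path

# File names from the table above; "gran.json" is model-dependent.
ARTIFACTS = {
    "ingested": "source.json",
    "chunked": "chunks.json",
    "embedded": "gran.json",
}


def stage_status(doc_dir, artifacts=ARTIFACTS):
    """Report which stage artifacts exist in one document's subdirectory."""
    doc_dir = Path(doc_dir)
    return {stage: (doc_dir / name).is_file() for stage, name in artifacts.items()}
```

A result such as ingested true, chunked false, embedded false indicates that the document still needs the chunk stage.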

Troubleshooting Installation Issues

Container Runtime Errors

If the system reports “Neither Podman nor Docker found”, users must install one of these container runtimes:

| Runtime | Installation URL |
|---|---|
| Podman | https://podman.io/getting-started/installation |
| Docker | https://docs.docker.com/get-docker/ |

Sources: README.md#troubleshooting

Database Connection Errors

When experiencing “Database connection refused”:

docs2db db-start      # Start the database
docs2db db-status     # Check connection status

Sources: README.md#troubleshooting

Configuration Inheritance

Settings Precedence

The system resolves configuration hierarchically: CLI arguments override environment variables, which override default values.

This precedence ensures that users can override defaults at runtime while maintaining sensible fallbacks for unspecified options.

Sources: src/docs2db/ingest.py#L1-L50, src/docs2db/chunks.py#L1-L50
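
The precedence described above amounts to taking the first value that is present. A minimal sketch, using a hypothetical resolve_setting function that is not the docs2db settings code:

```python
import os


def resolve_setting(name, cli_value=None, environ=None, default=None):
    """Return the first value found: CLI argument, then environment, then default."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if name in environ:
        return environ[name]
    return default
```

For example, resolve_setting("DOCLING_DEVICE", cli_value="cpu", environ={"DOCLING_DEVICE": "cuda"}) returns "cpu", and dropping the CLI value returns "cuda". The default passed in is whatever the caller supplies; the actual docs2db defaults are not listed on this page.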

Conclusion

Installing docs2db involves three largely independent pieces: the Python package, a container runtime, and environment configuration. The prerequisites are minimal (Python with uv, plus Docker or Podman), and common installation failures produce clear error messages. The sequential pipeline stores intermediate files between stages, which enables incremental processing and suits large document collections. Configuration relies on environment variables, .env files, and CLI arguments rather than a single declarative configuration file; this favors flexibility over opinionated defaults, though it may increase initial setup complexity for users who prefer declarative configuration management.
