Linguistic RAG Architecture

An interactive exploration of a novel database architecture for Retrieval-Augmented Generation, grounded in formal linguistic theory.

The Core Dichotomy: Structure vs. Content

⚙️

Universal Grammar (UG) as "Firmware"

Represents the immutable, deep structures of language. It's the stable, rule-governed framework (syntax, phrase structures) that is universal across contexts.

🎨

Systemic Functional Linguistics (SFL) as "Software"

Represents the malleable, functional content of language. It's how specific utterances create meaning in a particular context (e.g., mood, theme, participant roles).
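The firmware/software split can be made concrete as a data model. Below is a minimal sketch in Python; the class and field names (`SyntaxNode`, `SFLAnnotation`, `Clause`) are illustrative assumptions, not the architecture's actual schema:

```python
from dataclasses import dataclass, field

# UG "firmware": immutable phrase-structure nodes, stable across contexts.
@dataclass(frozen=True)
class SyntaxNode:
    category: str            # e.g. "S", "NP", "VP"
    children: tuple = ()     # nested SyntaxNode instances

# SFL "software": malleable, context-dependent annotation of a clause.
@dataclass
class SFLAnnotation:
    mood: str                                          # e.g. "declarative"
    theme: str                                         # the clause's point of departure
    participants: dict = field(default_factory=dict)   # role -> text span

# One clause record links an immutable structure to its functional layer.
@dataclass
class Clause:
    text: str
    structure: SyntaxNode
    annotation: SFLAnnotation

clause = Clause(
    text="The company acquired a startup.",
    structure=SyntaxNode("S", (SyntaxNode("NP"), SyntaxNode("VP"))),
    annotation=SFLAnnotation(
        mood="declarative",
        theme="The company",
        participants={"Actor": "the company", "Goal": "a startup"},
    ),
)
```

Freezing the UG layer while leaving the SFL layer mutable mirrors the dichotomy directly: the parse is fixed once derived, while functional annotations can be revised as context is refined.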

Hybrid Data Processing Pipeline

This 5-stage pipeline processes raw text to populate both the UG and SFL models.
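The individual stages are not spelled out in this excerpt, so the following is a hypothetical sketch of such a pipeline; the stage names and stub logic are assumptions for illustration only:

```python
def ingest(raw_text: str) -> dict:
    """Stage 1 (assumed): normalize raw text into sentence records."""
    return {"sentences": [s.strip() for s in raw_text.split(".") if s.strip()]}

def parse_ug(doc: dict) -> dict:
    """Stage 2 (assumed): attach a stub phrase-structure parse per sentence."""
    doc["parses"] = [{"category": "S", "tokens": s.split()} for s in doc["sentences"]]
    return doc

def annotate_sfl(doc: dict) -> dict:
    """Stage 3 (assumed): attach stub functional annotations per clause."""
    doc["sfl"] = [{"mood": "declarative", "theme": p["tokens"][0]} for p in doc["parses"]]
    return doc

def link_models(doc: dict) -> dict:
    """Stage 4 (assumed): join structural and functional layers into clause records."""
    doc["clauses"] = list(zip(doc["parses"], doc["sfl"]))
    return doc

def index_clauses(doc: dict) -> dict:
    """Stage 5 (assumed): build a simple retrieval index keyed on theme."""
    doc["index"] = {sfl["theme"]: i for i, (_, sfl) in enumerate(doc["clauses"])}
    return doc

PIPELINE = [ingest, parse_ug, annotate_sfl, link_models, index_clauses]

def run(raw_text: str) -> dict:
    """Run the raw text through all five stages in order."""
    doc = raw_text
    for stage in PIPELINE:
        doc = stage(doc)
    return doc
```

In practice stages 2 and 3 would call real parsers rather than stubs; the point here is the shape of the pipeline, where each stage enriches a shared document record before indexing.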

Comparison with Standard RAG

✅ Advantages of this Architecture

  • Contextual Precision: Retrieves information based on grammatical role and function, not just keyword similarity.
  • Reduced Hallucinations: Provides the LLM with structured, less ambiguous context, reducing factual errors.
  • Explainability: Queries can be traced through a formal linguistic structure, making results more transparent.
  • Complex Queries: Enables queries that combine structural and semantic criteria (e.g., "Find all clauses where 'the company' is the actor in a material process").
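The example query above can be sketched as a filter over clause records; the field names (`process_type`, `participants`) are hypothetical, standing in for whatever the SFL model actually stores:

```python
# Toy clause store: one material-process clause, one relational-process clause.
clauses = [
    {"text": "The company acquired a startup.",
     "process_type": "material",
     "participants": {"Actor": "the company", "Goal": "a startup"}},
    {"text": "The company seems confident.",
     "process_type": "relational",
     "participants": {"Carrier": "the company", "Attribute": "confident"}},
]

# "Find all clauses where 'the company' is the Actor in a material process":
# the query combines a structural criterion (process type) with a semantic
# one (participant role), which keyword or vector search alone cannot express.
hits = [
    c for c in clauses
    if c["process_type"] == "material"
    and c["participants"].get("Actor", "").lower() == "the company"
]
```

A production system would express this as a database query over the linked UG/SFL tables, but the filtering logic is the same.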

❌ Challenges & Disadvantages

  • Complexity: Requires significant upfront design and linguistic expertise to model the database schema.
  • Performance Overhead: Parsing text into deep linguistic structures is computationally more expensive than vectorization.
  • Scalability: Complex relational queries might be slower at massive scale compared to optimized vector index lookups.
  • Brittleness: The formal structure might struggle to flexibly handle highly idiomatic or ungrammatical language.