Modeling Language as Data
This application explores a novel database architecture for Natural Language Processing. It translates abstract linguistic theories—Universal Grammar (UG) and Systemic Functional Linguistics (SFL)—into a concrete, integrated ActiveRecord schema designed for contextual Retrieval-Augmented Generation (RAG) systems.
Core Linguistic Theories
The architecture models language on two axes: its formal, hierarchical structure (the "what") and its functional, contextual meaning (the "why" and "how").
Universal Grammar (UG)
UG provides a blueprint for the formal syntactic structure of text. It lets us model a sentence's underlying grammatical representation (its Deep Structure) in a canonical form, independent of surface-level variation.
- Focus: Formal Syntax & Structure
- Key Concept: X-bar theory models phrasal hierarchies (NP, VP, etc.).
- Application: Defines the schema for `Lexeme`, `Phrase`, and `DeepStructure` models.
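To make the structural side more concrete, here is a minimal sketch of how the three UG models might relate to one another. It uses plain Ruby `Struct`s as stand-ins for ActiveRecord models so it runs without a database; every field name (`lemma`, `category`, `head`, `specifier`, `complement`, `root`) is an assumption made for illustration, not the actual schema.

```ruby
# Plain-Ruby stand-ins for the Lexeme, Phrase, and DeepStructure models.
# Field names are illustrative assumptions, not the real schema.
Lexeme = Struct.new(:lemma, :category, keyword_init: true)

# An X-bar style phrase: a head lexeme plus optional specifier and
# complement phrases (both default to nil when omitted).
Phrase = Struct.new(:label, :head, :specifier, :complement, keyword_init: true) do
  # Collect lemmas in specifier, head, complement order.
  def lemmas
    [specifier&.lemmas, head.lemma, complement&.lemmas].flatten.compact
  end
end

# The deep structure wraps the root phrase and exposes a canonical string.
DeepStructure = Struct.new(:root, keyword_init: true) do
  def canonical_form
    root.lemmas.join(" ")
  end
end

the = Lexeme.new(lemma: "the", category: "D")
dog = Phrase.new(label: "NP", head: Lexeme.new(lemma: "dog", category: "N"),
                 specifier: Phrase.new(label: "DP", head: the))
cat = Phrase.new(label: "NP", head: Lexeme.new(lemma: "cat", category: "N"),
                 specifier: Phrase.new(label: "DP", head: the))
vp  = Phrase.new(label: "VP", head: Lexeme.new(lemma: "chase", category: "V"),
                 specifier: dog, complement: cat)

deep = DeepStructure.new(root: vp)
puts deep.canonical_form  # prints "the dog chase the cat"
```

Because the canonical form is built from lemmas rather than inflected tokens, "The dog chased the cat" and "The dog chases the cat" would map to the same string in this sketch. In an ActiveRecord version, the `head`, `specifier`, and `complement` references would presumably become foreign-key associations.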
Systemic Functional Linguistics (SFL)
SFL provides a framework for modeling the functional and contextual dimensions of language. It analyzes how language is used to make meaning in a given situation.
- Focus: Function & Contextual Meaning
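- Key Concept: The `ContextOfSituation` shapes meaning through its field (what is going on), tenor (who is taking part), and mode (the role language plays).
- Application: Defines the schema for `Clause` and its three metafunctional frames (ideational, interpersonal, and textual).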
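The functional side can be sketched the same way. The example below assumes a `Clause` that belongs to one `ContextOfSituation` and carries one frame per SFL metafunction (ideational, interpersonal, textual). It again uses plain `Struct`s rather than ActiveRecord, and the keys inside each frame (`process`, `participants`, `mood`, `theme`, `rheme`) are illustrative choices, not the project's actual columns.

```ruby
# Plain-Ruby stand-ins for the SFL models; attribute names are assumptions.
ContextOfSituation = Struct.new(:field, :tenor, :mode, keyword_init: true)

Clause = Struct.new(:text, :context, :ideational, :interpersonal, :textual,
                    keyword_init: true) do
  # Return the three metafunctional frames keyed by metafunction name.
  def frames
    { ideational: ideational, interpersonal: interpersonal, textual: textual }
  end
end

context = ContextOfSituation.new(
  field: "project reporting",
  tenor: "manager to team member",
  mode:  "written email"
)

clause = Clause.new(
  text: "Could you send the report?",
  context: context,
  ideational:    { process: "send", participants: ["you", "the report"] },
  interpersonal: { mood: "interrogative" },
  textual:       { theme: "Could you", rheme: "send the report" }
)

clause.frames.each { |name, frame| puts "#{name}: #{frame}" }
```

Keeping the context on a separate object means many clauses from the same text can share one field/tenor/mode description, which is what an ActiveRecord `belongs_to` association would express.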
Interactive Schema Explorer
This diagram visualizes the integrated ActiveRecord schema. The top half represents the UG-based structural models, while the bottom half shows the SFL-based functional models. Click on any model to view its detailed schema.
Data Processing Pipeline
Raw text is processed through a series of specialized Ruby gems to extract structural and functional information, which then populates the schema models.
- Tokenization: Pragmatic Tokenizer
- POS & DEP Parsing: ruby-spacy
- Deep Grammar: Link Parser
- Semantic Relations: ruby-wordnet
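The staged flow described above can be sketched as a chain of callables, each of which takes the analysis so far and returns an enriched copy. The stages here are toy stand-ins written in plain Ruby so the example runs without any gems; they do not call Pragmatic Tokenizer, ruby-spacy, Link Parser, or ruby-wordnet, and the names `STAGES`, `TOY_POS`, and `run_pipeline` are hypothetical.

```ruby
# Toy part-of-speech lookup standing in for a real tagger.
TOY_POS = { "the" => "DET", "dog" => "NOUN", "barks" => "VERB" }.freeze

# Each stage receives the accumulated analysis hash and returns a new hash
# with its own results merged in. Only two stages are shown; deep-grammar
# and semantic-relation stages would follow the same pattern.
STAGES = {
  tokenization: ->(doc) { doc.merge(tokens: doc[:text].downcase.scan(/\w+/)) },
  pos_tagging:  ->(doc) { doc.merge(pos: doc[:tokens].map { |t| TOY_POS.fetch(t, "X") }) }
}.freeze

# Run the stages in insertion order, threading the hash through each one.
def run_pipeline(text, stages = STAGES)
  stages.values.reduce({ text: text }) { |doc, stage| stage.call(doc) }
end

result = run_pipeline("The dog barks.")
p result[:tokens]  # prints ["the", "dog", "barks"]
p result[:pos]     # prints ["DET", "NOUN", "VERB"]
```

Modeling each stage as a function from hash to hash keeps the stages independently replaceable: swapping a toy lambda for a wrapper around a real gem would not change `run_pipeline`.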
Interactive RAG Demo
This demo simulates how a RAG application queries the SFL models. Select a `ContextOfSituation` to see how the three metafunctional frames are weighted differently to retrieve or generate the most relevant response.
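One way to picture the weighting is as a per-context weight profile applied to per-frame relevance scores. The sketch below is an assumption about how such a ranking could work, not the demo's actual logic: the context names, the weights, and the candidate scores are all made-up values, and `score` and `best_match` are hypothetical helpers.

```ruby
METAFUNCTIONS = %i[ideational interpersonal textual].freeze

# Hypothetical weight profiles, one per context of situation.
WEIGHTS = {
  technical_support: { ideational: 0.6, interpersonal: 0.1, textual: 0.3 },
  casual_chat:       { ideational: 0.2, interpersonal: 0.6, textual: 0.2 }
}.freeze

# Weighted sum of a candidate's per-frame relevance scores.
def score(frame_scores, weights)
  METAFUNCTIONS.sum { |m| weights.fetch(m) * frame_scores.fetch(m) }
end

# Pick the candidate with the highest weighted score for the given context.
def best_match(candidates, context)
  weights = WEIGHTS.fetch(context)
  candidates.max_by { |c| score(c[:frame_scores], weights) }
end

candidates = [
  { text: "Restart the service, then check the logs.",
    frame_scores: { ideational: 0.9, interpersonal: 0.2, textual: 0.5 } },
  { text: "No worries, happy to help!",
    frame_scores: { ideational: 0.1, interpersonal: 0.9, textual: 0.4 } }
]

puts best_match(candidates, :technical_support)[:text]
puts best_match(candidates, :casual_chat)[:text]
```

With these illustrative numbers, the technical context favors the candidate with the strongest ideational score, while the casual context favors the one with the strongest interpersonal score, so the same candidate pool yields different responses under different contexts.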