Modeling Language as Data
This application explores a novel approach to computational linguistics: modeling language with a dual-system database framework inspired by ActiveRecord. We separate the immutable, universal rules of language (Universal Grammar) from its fluid, context-dependent usage (Systemic Functional Linguistics) to create a powerful model for Contextual Retrieval-Augmented Generation (RAG) applications.
Universal Grammar (UG) as Schema
UG represents the innate, structural blueprint of language. In our model, this translates to the database schema—the immutable tables, columns, and relationships that define grammatical possibility. It's the fixed architecture that all language instances must adhere to.
Systemic Functional Linguistics (SFL) as Records
SFL describes how language is used in specific contexts to create meaning. This becomes the data in our database—the malleable records and content that populate the UG schema. Each record is an instance of language in use, rich with contextual metadata.
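The schema/records split can be sketched in plain Ruby (a gem-free sketch; a real implementation would express the schema as ActiveRecord migrations and the records as model instances — `UG_SCHEMA` and `build_record` here are illustrative names, not part of any library):

```ruby
# UG: the immutable schema -- a frozen blueprint shared by every instance of language.
UG_SCHEMA = {
  constituents: %i[id type],
  lexemes:      %i[id lemma pos_tag]
}.freeze

# SFL: malleable records that must conform to the UG schema.
def build_record(table, attrs)
  allowed = UG_SCHEMA.fetch(table)            # raises KeyError if UG has no such table
  unknown = attrs.keys - allowed
  raise ArgumentError, "not licensed by UG: #{unknown}" unless unknown.empty?
  attrs                                       # a valid instance of language in use
end

fox = build_record(:lexemes, id: 1, lemma: "fox", pos_tag: "NOUN")
```

The schema never changes at runtime; only the records do — mirroring the claim that UG constrains what SFL data may exist.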
The Dual Model Architecture
Below is a conceptual representation of the ORM. On the left, the Universal Grammar (UG) models define the rigid structure. On the right, the Systemic Functional Linguistics (SFL) models contain the dynamic content. Click on a UG model to see how it relates to SFL records.
UG: The Immutable Schema
Constituent
Defines basic phrasal units.
- id: integer (PK)
- type: string (e.g., 'Noun Phrase')
DependencyRule
Grammatical relationships between words.
- id: integer (PK)
- type: string (e.g., 'nsubj')
- description: string
Lexeme
The abstract dictionary form of a word.
- id: integer (PK)
- lemma: string
- pos_tag: string (e.g., 'VERB')
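The three tables above can be mirrored with plain Ruby Structs (a sketch of the model layer only; the ActiveRecord versions would add persistence, validations, and associations):

```ruby
# Each Struct mirrors one UG table and its columns as listed above.
Constituent    = Struct.new(:id, :type)                 # e.g. 'Noun Phrase'
DependencyRule = Struct.new(:id, :type, :description)   # e.g. 'nsubj'
Lexeme         = Struct.new(:id, :lemma, :pos_tag)      # e.g. 'jump' / 'VERB'

np    = Constituent.new(1, "Noun Phrase")
nsubj = DependencyRule.new(1, "nsubj", "nominal subject")
jump  = Lexeme.new(1, "jump", "VERB")
```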
SFL: The Malleable Content
Utterance #123
"[The quick brown fox] jumps..."
Constituent: Noun Phrase

Utterance #123
"The quick brown [fox] [jumps]..."
Dependency: nsubj (nominal subject)

Token: "jumps"
Context: part of Utterance #123
Lexeme: 'jump' (VERB)

Utterance #456
"...[over the lazy dog]."
Constituent: Prepositional Phrase

Utterance #456
"...[over] the lazy [dog]."
Dependency: pobj (object of preposition)

Token: "dog"
Context: part of Utterance #456
Lexeme: 'dog' (NOUN)

Interactive RAG Demo
See how the UG/SFL model powers a Contextual RAG application. Enter a query or use the example below, and watch how the system parses, retrieves, and generates a response based on our linguistic database model.
Parsing (UG)
The query is broken down into its grammatical structure based on the UG schema. Key constituents and dependencies are identified.
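A minimal version of this step might look as follows (a naive regex tokenizer with a toy lexicon stands in for pragmatic_tokenizer and a real POS tagger; `LEXICON` and `parse` are illustrative names):

```ruby
# Toy lexicon standing in for the UG Lexeme table.
LEXICON = { "fox" => "NOUN", "jumps" => "VERB", "dog" => "NOUN" }.freeze

def parse(query)
  tokens = query.downcase.scan(/[a-z]+/)      # naive tokenization
  # A real pipeline would also lemmatize ("jumps" -> "jump") before lookup.
  tokens.map { |t| { token: t, pos: LEXICON.fetch(t, "UNK") } }
end

parsed = parse("The fox jumps")
# Each token is now tied to a UG category, ready for retrieval.
```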
Contextual Retrieval (SFL)
The parsed structure is used to query the SFL database for relevant records (utterances, documents). A relevance score is computed for each retrieved chunk.
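Retrieval can then be sketched as scoring SFL records by overlap with the parsed query (a simple lexeme-overlap score for illustration; a production system might use embeddings or full-text search — `RECORDS` and `retrieve` are hypothetical names):

```ruby
# SFL records: utterances annotated with the lexemes they contain.
RECORDS = [
  { id: 123, text: "The quick brown fox jumps...", lexemes: %w[fox jump] },
  { id: 456, text: "...over the lazy dog.",        lexemes: %w[dog] }
].freeze

# Score each record by how many query lexemes it shares, highest first.
def retrieve(query_lexemes, records = RECORDS, top_k: 1)
  records
    .map { |r| r.merge(score: (r[:lexemes] & query_lexemes).size) }
    .sort_by { |r| -r[:score] }
    .first(top_k)
end

top = retrieve(%w[fox jump])
```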
Generation (LLM)
The highest-scoring retrieved chunks are passed to a Large Language Model (LLM) as context to generate a precise, grounded answer.
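The final step assembles the retrieved chunks into a grounded prompt; the LLM call itself (e.g. via a client gem such as ruby_llm) is left out here, and `build_prompt` is an illustrative helper:

```ruby
# Build a grounded prompt from the highest-scoring retrieved chunks.
def build_prompt(question, chunks)
  context = chunks.map { |c| "- #{c[:text]} (utterance ##{c[:id]})" }.join("\n")
  <<~PROMPT
    Answer using only the context below.

    Context:
    #{context}

    Question: #{question}
  PROMPT
end

prompt = build_prompt("What does the fox do?",
                      [{ id: 123, text: "The quick brown fox jumps..." }])
# `prompt` would then be sent to the LLM through the chosen client gem.
```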
Supporting Tech Stack
This conceptual model would be built using a combination of powerful Ruby gems for linguistics, ORM, and interfacing with language models.
Parsing & Segmentation
Gems like pragmatic_tokenizer, pragmatic_segmenter, and linkparser would handle the initial processing of text into structured UG components.
Linguistic Analysis
Libraries such as lingua, ruby-wordnet, and ruby-spacy would provide deeper semantic and syntactic analysis to populate the SFL models.
Object-Relational Mapping
ActiveRecord is the primary inspiration, providing the framework for defining the UG/SFL models and their interactions with the database.
Database & Data Stores
For key-value or graph-like relationships within the SFL data, libraries like ohm could complement a traditional SQL database.
LLM Integration
The ruby_llm gem would provide the interface for sending retrieved SFL context to a language model for the final generation step.