Modeling Language as Data

This application explores a novel approach to computational linguistics: modeling language using a dual-system database framework inspired by ActiveRecord. We separate the immutable, universal rules of language (Universal Grammar) from the fluid, context-dependent usage (Systemic Functional Linguistics) to create a powerful model for Contextual Retrieval-Augmented Generation (RAG) applications.

Universal Grammar (UG) as Schema

UG represents the innate, structural blueprint of language. In our model, this translates to the database schema—the immutable tables, columns, and relationships that define grammatical possibility. It's the fixed architecture that all language instances must adhere to.

Systemic Functional Linguistics (SFL) as Records

SFL describes how language is used in specific contexts to create meaning. This becomes the data in our database—the malleable records and content that populate the UG schema. Each record is an instance of language in use, rich with contextual metadata.

The Dual Model Architecture

Below is a conceptual representation of the ORM. On the left, the Universal Grammar (UG) models define the rigid structure. On the right, the Systemic Functional Linguistics (SFL) models contain the dynamic content. Click on a UG model to see how it relates to SFL records.

UG: The Immutable Schema

Constituent

Defines basic phrasal units.
  • id: integer (PK)
  • type: string (e.g., 'Noun Phrase')

DependencyRule

Grammatical relationships between words.
  • id: integer (PK)
  • type: string (e.g., 'nsubj')
  • description: string

Lexeme

The abstract dictionary form of a word.
  • id: integer (PK)
  • lemma: string
  • pos_tag: string (e.g., 'VERB')
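As a sketch, the UG schema above could be expressed as ActiveRecord models like the following. All class, table, and association names are illustrative assumptions, not an existing codebase; note that a plain `type` column requires disabling ActiveRecord's single-table-inheritance convention.

```ruby
# Hypothetical ActiveRecord models for the UG schema.
# Names and associations are assumptions for illustration only.
require "active_record"

class Constituent < ActiveRecord::Base
  # 'type' holds the phrasal category (e.g. 'Noun Phrase'), so STI
  # on the reserved 'type' column must be switched off.
  self.inheritance_column = :_type_disabled
end

class DependencyRule < ActiveRecord::Base
  self.inheritance_column = :_type_disabled
  validates :type, presence: true # e.g. 'nsubj'
end

class Lexeme < ActiveRecord::Base
  has_many :tokens # SFL records point back into the UG schema
  validates :lemma, :pos_tag, presence: true
end
```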

SFL: The Malleable Content

  • Utterance #123: "[The quick brown fox] jumps..." → Constituent: Noun Phrase
  • Utterance #123: "The quick brown [fox] [jumps]..." → Dependency: nsubj (nominal subject)
  • Token: "jumps" (part of Utterance #123) → Lexeme: 'jump' (VERB)
  • Utterance #456: "...[over the lazy dog]." → Constituent: Prepositional Phrase
  • Utterance #456: "...[over] the lazy [dog]." → Dependency: pobj (object of preposition)
  • Token: "dog" (part of Utterance #456) → Lexeme: 'dog' (NOUN)
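In gem-free Ruby, with Structs standing in for the ActiveRecord models, the relationship between the malleable SFL records and the fixed UG lexemes looks like this (values taken from the example cards above):

```ruby
# Gem-free sketch: Structs stand in for the ActiveRecord models.
Lexeme    = Struct.new(:lemma, :pos_tag)
Token     = Struct.new(:surface, :lexeme, :utterance_id)
Utterance = Struct.new(:id, :text, :tokens)

jump = Lexeme.new("jump", "VERB")
dog  = Lexeme.new("dog", "NOUN")

u123 = Utterance.new(123, "The quick brown fox jumps...", [])
u123.tokens << Token.new("jumps", jump, 123)

u456 = Utterance.new(456, "...over the lazy dog.", [])
u456.tokens << Token.new("dog", dog, 456)

# Each SFL token is grounded in an immutable UG lexeme:
u123.tokens.each do |t|
  puts "#{t.surface} -> #{t.lexeme.lemma} (#{t.lexeme.pos_tag})"
end
```

The point of the split is visible here: deleting or rewording an Utterance never touches a Lexeme, while every Token remains traceable to its schema-level dictionary form.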

Interactive RAG Demo

See how the UG/SFL model powers a Contextual RAG application. Enter a query or use the example below, and watch how the system parses, retrieves, and generates a response based on our linguistic database model.

Step 1: Parsing (UG)

The query is broken down into its grammatical structure based on the UG schema. Key constituents and dependencies are identified.

[NP: What] [VP: is [NP: the primary goal] [PP: of [NP: the new initiative]]]?
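The bracketed notation above can be read back into UG constituent records with a small extractor. This is a toy sketch over the demo's own notation, not a real parser; a production system would use something like ruby-spacy here.

```ruby
# Toy extractor: pulls constituent labels out of the bracketed
# parse notation shown above (nested brackets included).
parse = "[NP: What] [VP: is [NP: the primary goal] " \
        "[PP: of [NP: the new initiative]]]?"

labels = parse.scan(/\[([A-Z]+):/).flatten
counts = labels.tally # tally which UG constituent types the query uses

puts counts # {"NP"=>3, "VP"=>1, "PP"=>1}
```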
Step 2: Contextual Retrieval (SFL)

The parsed structure is used to query the SFL database for relevant records (utterances, documents). A relevance score is computed for each retrieved chunk.
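A minimal relevance score can be sketched as lemma overlap between the parsed query and each candidate chunk. This in-memory version is an assumption about one reasonable scoring choice; the real system would run this as a query against the SFL tables.

```ruby
# Minimal relevance scorer: fraction of query lemmas found in a chunk.
def relevance(query_lemmas, chunk)
  chunk_words = chunk.downcase.scan(/[a-z]+/)
  hits = query_lemmas.count { |lemma| chunk_words.include?(lemma) }
  hits.to_f / query_lemmas.size
end

query_lemmas = %w[primary goal new initiative]
chunks = [
  "The project charter defines the primary goal of the new initiative.",
  "Lunch options near the office have expanded this quarter."
]

# Rank chunks by score, highest first.
scored = chunks.map { |c| [relevance(query_lemmas, c), c] }.sort.reverse
puts scored.first.last # the best-matching chunk
```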

Step 3: Generation (LLM)

The highest-scoring retrieved chunks are passed to a Large Language Model (LLM) as context to generate a precise, grounded answer.
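Assembling that grounded context into a prompt might look like the sketch below. The prompt template and `top_k` parameter are illustrative assumptions; the actual LLM call (e.g. via ruby_llm) is deliberately omitted.

```ruby
# Build the grounded prompt from the top-ranked SFL chunks.
# The template is an assumption; the LLM call itself is omitted.
def build_prompt(question, ranked_chunks, top_k: 2)
  context = ranked_chunks.first(top_k)
                         .map.with_index(1) { |c, i| "[#{i}] #{c}" }
                         .join("\n")
  <<~PROMPT
    Answer using ONLY the context below. Cite the chunk number.

    Context:
    #{context}

    Question: #{question}
  PROMPT
end

prompt = build_prompt(
  "What is the primary goal of the new initiative?",
  ["The charter: enhance customer engagement by 20% through " \
   "personalized digital experiences.",
   "Background on prior initiatives."]
)
puts prompt
```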

The primary goal of the new initiative, as stated in the project charter, is to "enhance customer engagement by 20% through personalized digital experiences."

Supporting Tech Stack

This conceptual model would be built using a combination of powerful Ruby gems for linguistics, ORM, and interfacing with language models.

Parsing & Segmentation

Gems like pragmatic_tokenizer, pragmatic_segmenter, and linkparser would handle the initial processing of text into structured UG components.

Linguistic Analysis

Libraries such as lingua, ruby-wordnet, and ruby-spacy would provide deeper semantic and syntactic analysis to populate the SFL models.

Object-Relational Mapping

ActiveRecord is the primary inspiration, providing the framework for defining the UG/SFL models and their interactions with the database.

Database & Data Stores

For key-value or graph-like relationships within the SFL data, libraries like ohm, a Redis-backed object-hash mapper, could complement a traditional SQL database.

LLM Integration

The ruby_llm gem would provide the interface for sending retrieved SFL context to a language model for the final generation step.