Beyond Keyword Search
Traditional Retrieval-Augmented Generation (RAG) systems find relevant text chunks, but they don't understand the language within them. We propose a Linguistic Query Engine that models language structure to enable far more precise, meaningful, and powerful queries.
Traditional RAG
User Query
"Did the project fail?"
Semantic Search
Finds vectors close to the query.
Retrieved Text Chunks
"The project launch was a success..."
"...initial failure was overcome..."
"...a project to mitigate failure..."
This approach often retrieves chunks whose surface wording matches the query but whose meaning conflicts with it or is irrelevant, leading to inaccurate answers.
Linguistic Query Engine
User Query
"Did the project fail?"
Linguistic Parsing
Analyzes grammar, roles, and context.
Structured Query on Model
SELECT * WHERE
  Participant = 'project'
  AND Process = 'fail'
  AND Polarity = 'positive'
By querying a structured model of language, we can precisely target concepts and their relationships, ignoring superficial keyword matches.
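As a minimal sketch of the idea (the record fields mirror the Participant / Process / Polarity terms above; the data structures and sample records are assumptions for illustration, not the engine's actual representation), the question becomes an exact filter over parsed propositions rather than a similarity search:

```python
# Illustrative only: structured proposition records using the fields from the
# example query. The records themselves are invented sample data.
propositions = [
    {"participant": "project", "process": "fail",    "polarity": "positive",
     "source": "The project failed to meet its goals."},
    {"participant": "project", "process": "fail",    "polarity": "negative",
     "source": "The project did not fail; the launch succeeded."},
    {"participant": "launch",  "process": "succeed", "polarity": "positive",
     "source": "The project launch was a success."},
]

def query(records, **criteria):
    """Return only the records whose fields exactly match every criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

# "Did the project fail?" as a structural question:
for match in query(propositions, participant="project", process="fail", polarity="positive"):
    print(match["source"])
```

Only the first record is returned; the superficially similar "success" and "did not fail" chunks are excluded by structure, not by luck.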
A Dual-Model Architecture
The engine's power comes from integrating two linguistic theories: Universal Grammar (UG) for timeless, stable structure, and Systemic Functional Linguistics (SFL) for dynamic, contextual meaning.
Firmware (The Deep Structure - UG)
Represents the core, abstract knowledge. It's the stable "firmware" of meaning, defining concepts and their relationships independent of how they are expressed.
Proposition
The core unit of knowledge: a timeless, abstract claim that links concepts (e.g., 'Acme Corp' stands in an 'acquire' relation to 'Innovate Inc').
Concept
The atomic entities (nouns, ideas) that build propositions. The building blocks of the ontology.
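A minimal sketch of how this layer might be represented in code (class and field names are assumptions, not the engine's actual schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    """An atomic entity in the ontology, e.g. 'Acme Corp' or 'acquire'."""
    label: str

@dataclass(frozen=True)
class Proposition:
    """A timeless, abstract claim relating concepts, independent of any utterance.
    The role names here are illustrative."""
    participant: Concept          # e.g. Concept("Acme Corp")
    process: Concept              # e.g. Concept("acquire")
    second_participant: Concept   # e.g. Concept("Innovate Inc")
    polarity: str = "positive"

acquisition = Proposition(Concept("Acme Corp"), Concept("acquire"), Concept("Innovate Inc"))
```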
Software (The Contextual Meaning - SFL)
Represents a single, concrete communicative act. It's the dynamic "software" that runs on that firmware, capturing who is saying what to whom, and why.
Discourse Event
The central hub for context. Represents a specific utterance, such as an email or a sentence in a report.
Interpersonal Stance
Captures the "Tenor" - the relationship, attitude, and certainty of the speaker.
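A companion sketch for this layer (again, class and field names are illustrative assumptions):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class InterpersonalStance:
    """Tenor: the speaker's relationship to the audience and to the claim."""
    speaker: str        # who is speaking
    attitude: str       # e.g. "neutral", "critical"
    certainty: float    # 0.0 (speculative) .. 1.0 (asserted as fact)

@dataclass
class DiscourseEvent:
    """A single, concrete communicative act: an email, a sentence in a report, etc."""
    text: str
    occurred_at: datetime
    stance: InterpersonalStance

# Invented example data for illustration only.
event = DiscourseEvent(
    text="We believe the project may have failed.",
    occurred_at=datetime(2024, 3, 1),
    stance=InterpersonalStance(speaker="analyst", attitude="cautious", certainty=0.4),
)
```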
The Critical Bridge: Text Chunk
The `TextChunk` model is the linchpin that connects the two worlds. It links a specific piece of text (e.g., "The company announced a layoff") to both its dynamic SFL context (who announced it, when, with what certainty) and the timeless UG propositions it represents (company -> layoff).
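A minimal sketch of that bridge, using identifiers so the link can live in a database (field names are assumed):

```python
from dataclasses import dataclass

@dataclass
class TextChunk:
    """Bridges one span of source text to both models."""
    text: str                    # the literal span, e.g. "The company announced a layoff"
    discourse_event_id: str      # SFL side: who said it, when, with what certainty
    proposition_ids: list[str]   # UG side: the timeless claims the span expresses

chunk = TextChunk(
    text="The company announced a layoff",
    discourse_event_id="event-42",
    proposition_ids=["prop-company-announce-layoff"],
)
```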
Hybrid Processing Pipeline
Raw text is transformed into the structured dual-model through a five-stage automated pipeline, combining traditional computational linguistics with modern LLM capabilities.
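The stages themselves are not spelled out here; purely as a hypothetical sketch of how such a hybrid pipeline could be composed (every stage name and signature below is an assumption, not the actual design), the flow might look like:

```python
# Hypothetical five-stage skeleton. Stage names are invented for illustration;
# the real pipeline's stages may differ.

def segment(raw_text: str) -> list[str]:
    """Stage 1 (rule-based): split raw text into candidate chunks."""
    return [s.strip() for s in raw_text.split(".") if s.strip()]

def parse_grammar(chunk: str) -> dict:
    """Stage 2 (traditional NLP): syntactic parse of the chunk."""
    return {"chunk": chunk, "parse": None}  # placeholder for a real parser's output

def label_sfl(parsed: dict) -> dict:
    """Stage 3: annotate SFL features (process, participants, polarity, tenor)."""
    parsed["sfl"] = {}
    return parsed

def extract_propositions(parsed: dict) -> dict:
    """Stage 4 (LLM-assisted): lift the clause into timeless UG propositions."""
    parsed["propositions"] = []
    return parsed

def persist(parsed: dict) -> dict:
    """Stage 5: store TextChunk, DiscourseEvent, and Proposition records."""
    return parsed

def run_pipeline(raw_text: str) -> list[dict]:
    """Raw text in, structured dual-model records out."""
    return [persist(extract_propositions(label_sfl(parse_grammar(c))))
            for c in segment(raw_text)]
```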
Integrated Database Schema
The architecture is realized as a set of interconnected database tables. Explore the models below to see how linguistic concepts are stored.
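As one possible rendering of those tables (a sketch only: table and column names are assumed, and SQLite is used purely for concreteness):

```python
import sqlite3

# Illustrative schema sketch; names and types are assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concepts (                         -- UG: atomic entities
    id    INTEGER PRIMARY KEY,
    label TEXT NOT NULL                         -- e.g. 'project', 'fail'
);

CREATE TABLE propositions (                     -- UG: timeless abstract claims
    id             INTEGER PRIMARY KEY,
    participant_id INTEGER REFERENCES concepts(id),
    process_id     INTEGER REFERENCES concepts(id),
    polarity       TEXT CHECK (polarity IN ('positive', 'negative'))
);

CREATE TABLE discourse_events (                 -- SFL: concrete communicative acts
    id        INTEGER PRIMARY KEY,
    speaker   TEXT,
    tenor     TEXT,                             -- interpersonal stance / attitude
    certainty REAL                              -- 0.0 speculative .. 1.0 asserted
);

CREATE TABLE text_chunks (                      -- the bridge between the two models
    id                 INTEGER PRIMARY KEY,
    text               TEXT NOT NULL,
    discourse_event_id INTEGER REFERENCES discourse_events(id)
);

CREATE TABLE chunk_propositions (               -- many-to-many: chunk <-> propositions
    chunk_id       INTEGER REFERENCES text_chunks(id),
    proposition_id INTEGER REFERENCES propositions(id),
    PRIMARY KEY (chunk_id, proposition_id)
);
""")
```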
Unlocking Powerful Queries
This structured approach enables queries that are impossible for systems based on semantic similarity alone. See the difference in specificity and power.
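For example, a question such as "who asserted, with high certainty, that the project failed?" is out of reach for similarity search alone but becomes an ordinary join over the structured model. The toy below is self-contained and invented end to end (schema, rows, and the 0.8 certainty threshold are all assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE propositions     (id INTEGER PRIMARY KEY, participant TEXT, process TEXT, polarity TEXT);
CREATE TABLE discourse_events (id INTEGER PRIMARY KEY, speaker TEXT, certainty REAL);
CREATE TABLE text_chunks      (id INTEGER PRIMARY KEY, text TEXT,
                               discourse_event_id INTEGER, proposition_id INTEGER);

INSERT INTO propositions VALUES (1, 'project', 'fail', 'positive'),
                                (2, 'project', 'fail', 'negative');
INSERT INTO discourse_events VALUES (1, 'project manager', 0.9),
                                    (2, 'analyst', 0.4);
INSERT INTO text_chunks VALUES (1, 'The project failed to meet its goals.', 1, 1),
                               (2, 'The project may not have failed after all.', 2, 2);
""")

# Structure plus stance: participant, process, polarity AND the speaker's certainty.
rows = conn.execute("""
    SELECT e.speaker, c.text
    FROM text_chunks      AS c
    JOIN propositions     AS p ON p.id = c.proposition_id
    JOIN discourse_events AS e ON e.id = c.discourse_event_id
    WHERE p.participant = 'project'
      AND p.process     = 'fail'
      AND p.polarity    = 'positive'
      AND e.certainty  >= 0.8
""").fetchall()

for speaker, text in rows:
    print(f"{speaker}: {text}")   # -> project manager: The project failed to meet its goals.
```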