Home / Blog / AI Models

LinearRAG on a single RTX 4090: relation-free GraphRAG, hands-on

LinearRAG (ICLR 2026) builds a reasoning graph over your documents using only entity recognition and embeddings — zero LLM tokens at build time. No brittle relation extraction, just a Tri-Graph and a two-stage retrieval that does multi-hop reasoning in a single pass. I ran the whole thing end-to-end on one 24 GB RTX 4090 — here is how it works, three real multi-hop examples, and the honest setup story with gpt-5-mini.

The LinearRAG Tri-Graph: three stacked layers of nodes — blue entities on top, violet sentences in the middle, green passages at the bottom — connected into a tidy knowledge graph
The LinearRAG Tri-Graph — entities (blue), sentences (violet) and passages (green), built with no LLM relation extraction

TL;DR

Conventional GraphRAG asks an LLM to read every passage and extract relation triples. That is expensive and noisy — extractors misread facts ("Einstein did not win the Nobel" becomes "Einstein won Nobel") and never reconcile triples across the corpus. LinearRAG's claim is blunt: explicit relation extraction is unnecessary. Shared entities are the anchors that connect passages, and the relationships are already preserved, in context, in the original text.

So LinearRAG builds a relation-free Tri-Graph — entity, sentence and passage nodes — using only spaCy NER and sentence embeddings. No LLM is called during construction, so indexing scales linearly with the corpus and costs zero tokens. Retrieval is two stages: semantic bridging spreads activation from query entities through shared sentences to multi-hop "bridge" entities, then Personalized PageRank ranks the passages.

On a 24 GB 4090 it runs comfortably: GPU sentence embeddings, an optional --use_vectorized_retrieval sparse-matrix path, and an answer model of your choice. Swapping the reader from gpt-4o-mini to gpt-5-mini took my 12-question demo from 0.67 → 1.00 LLM-judge accuracy on the same retrieved passages. The one real cost is a one-time CPU spaCy NER pass at index time — cached afterward.

What LinearRAG is, in plain words

Retrieval-augmented generation grounds a language model in your own documents so it hallucinates less. That is easy when the answer sits in one passage. It gets hard on multi-hop questions — "when did the performer of this song's mother die?" — where the evidence is scattered across several documents and no single chunk holds the answer.

GraphRAG tackles this by building a knowledge graph: entities become nodes, and the relationships between them become edges, so retrieval can walk multiple hops. The standard recipe builds those edges by having an LLM read each passage and emit structured relation triples. LinearRAG keeps the graph idea but throws away the relation extraction — and with it, the cost and the noise.

Two passage cards joined by one shared, glowing entity node, with a caption noting that relations stay in the text
The key insight: shared entities are the anchors that connect passages — the relations stay in the original text
The one-sentence version. Don't ask an LLM to extract relationships into a graph at build time; let entities link the passages, keep the original sentences, and let the LLM interpret the relationships at answer time, when it reads the retrieved text.

Why conventional GraphRAG breaks

The LinearRAG paper opens with an uncomfortable finding: on many real tasks, GraphRAG systems underperform plain vanilla RAG. The culprit is the automatically-constructed graph, which fails in two distinct ways.

A knowledge graph of entity nodes connected by labeled relation edges, with several edges marked by red warning triangles to indicate noisy, error-prone LLM extraction
LLM-extracted relation edges are error-prone — wrong triples (red) corrupt the graph and mislead retrieval

Local inaccuracy. Relation extraction routinely misreads the text. Negations get dropped, compositional clauses get flattened into a single atomic triple, and the meaning inverts. Every wrong triple is a wrong edge that quietly distorts retrieval.

Global inconsistency. Triples are extracted from each passage in isolation, with no mechanism to reconcile them corpus-wide. The same entity ends up linked inconsistently across documents, and redundant or contradictory edges accumulate. Bottom-up community clustering on top of a noisy graph only propagates the errors upward.

And it isn't cheap. Every passage in the corpus has to pass through the LLM extractor, so the token bill — and the indexing time — grows with corpus size. LinearRAG's pitch is that you are paying that bill to make retrieval worse.

The Tri-Graph — built with zero LLM tokens

LinearRAG constructs a three-layer graph it calls the Tri-Graph. There are three kinds of nodes — entity, sentence and passage — and just two relations between them: a contain edge when a passage holds an entity, and a mention edge when a sentence mentions one. The diagram below lays out all three layers at once.

Diagram of the Tri-Graph: a blue entity layer, a violet sentence layer and a green passage layer, joined by labeled contain and mention edges, with a 'zero LLM tokens' badge
The Tri-Graph — entity / sentence / passage layers joined by contain & mention edges, with the relation words living on the sentence pills, not on graph edges

Construction uses only two cheap, deterministic tools: spaCy NER to pull entities out of every passage and sentence, and a sentence-transformer to embed the text. No language model is involved, so there is no token cost and the work is linear in the corpus size. The embeddings are cached to parquet, so re-indexing only touches new documents.

# Construction = spaCy NER + sentence embeddings. No LLM, no relation triples.
ner = SpacyNER("en_core_web_trf")
passage_entities, sentence_entities = ner.batch_ner(passages, max_workers)

# contain edges: passage -> entity  (weighted by normalized mention count)
for passage, entities in passage_entities.items():
    for ent in entities:
        graph.add_edge(passage, ent, weight=count(ent) / total)

# mention edges: sentence <-> entity  (these drive multi-hop bridging)
for sentence, entities in sentence_entities.items():
    for ent in entities:
        entity_to_sentence[ent].add(sentence)
Why it scales. Because the two relations are just sparse contain and mention matrices, the index is linear in time and space — and the paper reports cutting indexing time by over 77% versus relation-extraction GraphRAG, while keeping every original passage as a lossless knowledge carrier.

Two-stage retrieval: bridging, then PageRank

With the graph built, answering a question is two stages. The first is the clever part — relevant entity activation via semantic bridging — and it is how LinearRAG does multi-hop reasoning without any explicit relation edges.

Diagram of semantic bridging: a query activates a seed entity, the path runs through shared sentence pills to light up bridge entities (one weak branch pruned below threshold), then feeds a PageRank-ranked list of passages
The two-stage flow — activation bridges from the query seed through shared sentences to multi-hop "bridge" entities (weak paths pruned), then Personalized PageRank ranks the passages

The query's entities are matched to seed entity nodes. Activation then spreads hop by hop: a lit entity activates the sentences that mention it; the sentence most relevant to the query (by embedding similarity) lights up; and the other entities in that sentence become newly-activated bridge entities. A threshold prunes weak paths, and after a few iterations you have a small, query-focused set of entities — the multi-hop chain — without ever consulting a relation edge.

Stage two is global importance aggregation via Personalized PageRank. The activated entities seed a PPR run over the passage–entity subgraph, with a hybrid initialization that blends each passage's direct similarity to the query with the evidence accumulated from its entities. The top-ranked passages go to the LLM, which reads them and writes the answer.

Diagram of Personalized PageRank: a cluster of activated entity nodes feeds a ranked list of passage cards, rank 1 brightest
Stage 2 — activated entities seed Personalized PageRank, which ranks the passages handed to the reader model

Three worked examples (2WikiMultiHop)

The best way to feel the mechanism is to trace it on real questions. These three are from the 2WikiMultiHop set I indexed (658 passages), and each has a different graph shape.

① Compositional · a two-hop chain

2-HOP

Q: "When did Lothair II's mother die?"

Diagram: seed entity Lothair II bridges via a 'mother' sentence to Ermengarde of Tours, then via 'date of death' to the answer 20 March 851
Lothair II → (mother) → Ermengarde of Tours → (date of death) → 20 March 851

What happened

No single passage states the answer. LinearRAG seeds Lothair II, bridges through a shared sentence to his mother Ermengarde of Tours, then bridges again to her date of death. Genuine two-hop reasoning, in one retrieval pass, with no relation edges.

② Comparison · two parallel chains

PARALLEL

Q: "Which film was released first, Aas Ka Panchhi or Phoolwari?"

Diagram: two seed entities in parallel, one bridging to 1961 and the other to 1946, converging on a compare node where 1946 wins
Two parallel chains → 1961 vs 1946 → compare → Phoolwari (1946)

What happened

A comparison question lights up two seeds at once. One chain bridges to 1961, the other to 1946; both feed a compare step, and the earlier date wins. Notice the graph shape is different from example ① — two parallel chains converging, rather than one linear bridge.

③ Compositional · song → performer → birthplace

2-HOP

Q: "What is the place of birth of the performer of the song 'Changed It'?"

Diagram: the song Changed It bridges via 'performer' to Nicki Minaj, then via 'place of birth' to Port of Spain
Changed It → (performer) → Nicki Minaj → (place of birth) → Port of Spain

What happened

Same compositional pattern as ①, different domain: seed the song, bridge through the performer relation to Nicki Minaj, then through place of birth to the answer. Three questions, three traversal shapes, one mechanism — and not a single hand-built relation.

Running it end-to-end on a 4090

LinearRAG prefers Python 3.9 (its pins are old). Install a CUDA build of PyTorch first so the embeddings and the vectorized retrieval land on the GPU, then the repo's requirements and the spaCy transformer model.

# Python 3.9 venv (uv). GPU torch FIRST, then the repo pins.
uv venv .venv --python 3.9.19 && source .venv/bin/activate
uv pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
uv pip install -r requirements.txt
uv pip install \
  'https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.6.1/en_core_web_trf-3.6.1-py3-none-any.whl'

Datasets come from the project's HuggingFace repo. I used 2WikiMultiHop (standard spaCy model, 658 passages); the four bundled sets span compositional, comparison and multi-hop reasoning.

Four labeled dataset cards: 2WikiMultiHop, HotpotQA, MuSiQue and Medical
The four bundled benchmarks — I ran 2WikiMultiHop; Medical needs a scientific spaCy model

One full command indexes, retrieves, answers and evaluates. The first run pays the one-time spaCy NER pass (CPU, then cached to import/); every run after that reuses the cache and retrieves in seconds.

python run.py \
    --spacy_model en_core_web_trf \
    --embedding_model sentence-transformers/all-mpnet-base-v2 \
    --dataset_name 2wikimultihop \
    --llm_model gpt-5-mini \
    --max_workers 16 \
    --use_vectorized_retrieval        # GPU sparse-matrix retrieval

To make it tangible I wired up a small Gradio demo over the cached index: type any question and it returns the answer alongside the PPR-ranked evidence passages and their scores.

Screenshot of the LinearRAG Gradio demo answering 'When did Lothair II's mother die?' with '20 March 851' and a ranked list of evidence passages
The Gradio demo over the cached index — answer plus the Personalized-PageRank-ranked evidence

Modes I tested, and the results

Vectorized GPU retrieval. The semantic-bridging step ships twice: a readable CPU breadth-first-search reference, and a vectorized version that expresses the same propagation as sparse matrix multiplications on CUDA. The --use_vectorized_retrieval flag switches between them; on the 4090, with the index cached, 12 queries retrieved in about 1.5 seconds.

The reader model matters as much as retrieval. On the same retrieved passages, swapping the answer model lifted accuracy sharply. (GPT-5-family models need max_completion_tokens instead of max_tokens and reject temperature=0 — a one-line client patch.)

Reader modelLLM-judge accuracyContain accuracy
gpt-4o-mini0.667 (8/12)0.833 (10/12)
gpt-5-mini1.000 (12/12)0.917 (11/12)

Attribute-query fallback. Pure entity bridging can miss simple attribute lookups (where was X born?). An opt-in hybrid mode boosts passages that share attribute keywords with the query — off by default, enabled through the config object.

config = LinearRAGConfig(
    dataset_name="2wikimultihop",
    enable_hybrid_attribute_fallback=True,   # default: False
    attribute_keyword_boost=0.25,            # born / died / located / founded ...
)

Evaluation. After answering, LinearRAG scores predictions two ways in parallel — an LLM judge (correct / incorrect against the gold answer) and a strict substring contain check — writing everything to a timestamped results/ folder. The knobs worth sweeping first on a new dataset are iteration_threshold, max_iterations, passage_ratio and top_k_sentence.

A control panel of labeled sliders and dials for iteration_threshold, max_iterations, passage_ratio, top_k_sentence and damping
The tuning knobs — the repo ships per-dataset values; MuSiQue needs deeper bridging (5 iterations, low threshold)

Honest verdict

LinearRAG is the rare paper whose central move is to remove something — relation extraction — and come out ahead on cost, speed and quality. The Tri-Graph is genuinely cheap to build, the semantic-bridging idea is an elegant way to get single-pass multi-hop reasoning out of a relation-free graph, and on a 24 GB 4090 the whole pipeline is comfortable, with the GPU doing the embeddings and the sparse-matrix retrieval.

The one honest caveat: spaCy's transformer NER runs on CPU by the repo's design and is the indexing bottleneck — about six minutes for 658 passages on my machine. It is a one-time cost (the result caches to import/), but on a large corpus it is the thing to plan around. Everything that matters at query time — embeddings, bridging, PageRank — is fast and GPU-friendly. For multi-hop RAG on a single workstation GPU, this is one of the most pragmatic designs I've run.

Bottom line. If you've been put off GraphRAG by its token bill and noisy graphs, LinearRAG is the counter-argument: keep the graph, drop the relation extraction, and let entities plus the original text do the work.

References & links