
Status & Roadmap

Last updated: Phase 66 completed (2026-04-23)


Current Best Results

| Metric | Model | Value | Phase |
|---|---|---|---|
| LP MRR (DELTA-Full, temp) | Q: K-anneal + edge=7.0 | 0.4905 | 52 |
| LP MRR (brain_hybrid) | A: d=0.01, seed=456 | 0.4956 | 58 |
| LP MRR (brain_hybrid, 3-seed mean) | d=0.01, seeds 42/123/456 | 0.4844 ± 0.0097 | 58 |
| LP H@10 (brain_hybrid) | A: baseline @ d=0.01 | 0.8076 | 57 |
| LP H@10 (DELTA-Full, temp) | S: anneal + edge=7.0 | 0.8045 | 52 |
| 3p MRR (multi-hop) | DELTA-Matched @10% drop | 0.742 ± 0.009 | 45 (3-seed) |
| 5p MRR | DELTA-Matched @0% drop | 0.790 | 44 |
| Depth advantage (5p) | DELTA vs GraphGPS | +0.100 | 44 |
| Per-query inference | DELTA vs GraphGPS | 0.8–0.9× (faster) | 45 |
| LP MRR (1-layer DELTA, N=2000) | delta_1layer | 0.3338 (val peak) | 59 |
| LP MRR (DistMult, N=2000) | DistMult (no GNN) | 0.3185 | 59 |
| LP test MRR (1-layer DELTA, N=5000) | delta_1layer (sub E_adj) | 0.2404 | 62 |
| LP test MRR (1-layer DELTA, N=5000, 47.6% E_adj, full softmax) | delta_1layer (30M subsample, full softmax) | 0.2471 | 63 |
| LP test MRR (1-layer DELTA, N=5000, topk=128) | delta_1layer (63M, topk=128) | 0.2472 | 64 |
| LP test MRR (DistMult, N=5000) | DistMult (no GNN) | 0.2244 | 62 |
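
For reference, the LP numbers above follow the standard filtered ranking protocol. A minimal sketch of how MRR and Hits@k fall out of per-query ranks (illustrative helper, not the project's evaluation code):

```python
def mrr_and_hits(ranks, k=10):
    """Mean Reciprocal Rank and Hits@k from 1-indexed filtered ranks.

    `ranks` holds, for each test triple, the rank of the true entity
    among all candidates after filtering known true triples (1 = best).
    """
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

# Example: three queries ranked 1, 4, and 20
# -> MRR = (1 + 0.25 + 0.05) / 3 ≈ 0.433, H@10 = 2/3 ≈ 0.667
print(mrr_and_hits([1, 4, 20]))
```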

What's Validated (23 Propositions)

| # | Proposition | Confidence | Key Evidence |
|---|---|---|---|
| P1 | Edge attention beats node attention on relational tasks | High | Phases 1, 9, 11, 13, 28 |
| P2 | Dual parallel attention is the key differentiator at high noise | High | Phase 28: +24% at extreme noise |
| P3 | Graph structure adds value over transformers on relational tasks | High | Phase 27b: +4.4%; Phase 42: multi-hop dominance |
| P4 | Soft gating achieves sparsity without accuracy loss | High | Phases 22/29: 100.0% at 50% sparsity |
| P5 | DELTA beats CompGCN on real FB15k-237 | High | Phases 25/29: 97.4% vs 96.9% |
| P6 | Results stable across random seeds | High | Phase 29 (5 seeds), Phase 45 (3 seeds multi-hop) |
| P7 | Sampling robustness at 26% VRAM budget | High | Phase 30: all 4 strategies within ±0.2% |
| P8 | DELTA competitive with GraphGPS on LP | High | Phase 40: SBHybrid MRR 0.5089 (within 0.004) |
| P9 | DELTA excels on multi-hop compositional queries | High | Phases 42–44: only model with 2p→3p improvement; 5p MRR 0.790 |
| P10 | Advantage accelerates with reasoning depth | High | Phase 44: +0.004 at 2p → +0.100 at 5p vs GraphGPS |
| P11 | Per-query inference competitive with GraphGPS | High | Phase 45: 0.8–0.9× per query |
| P12 | Learnable temperature discovers edge/node asymmetry | High | Phases 46–48: edge wants sharper, node wants softer |
| P13 | Selective sharpening outperforms uniform | High | Phase 47: B (L0=1, L1+L2=4) best of 4 configs |
| P14 | Asymmetric node/edge temperature improves LP | High | Phase 48: E (node=2, edge=6) MRR 0.4856 (+1.5%) |
| P15 | Differentiable graph construction is viable for LP | High | Phases 55–57: brain_hybrid matches delta_full MRR with +4.7% H@10 |
| P16 | Constructor density controls the precision/recall trade-off | High | Phase 56: d=0.01 strictly dominates d=0.02 on MRR, H@3, H@10 |
| P17 | Temperature annealing is counterproductive on brain_hybrid | High | Phase 57: annealing trades H@10 for marginal MRR; baseline optimal |
| P18 | d=0.01 is the optimal brain_hybrid density sweet spot | High | Phase 58: d=0.01 mean MRR 0.4844 ± 0.0097 (robust); d=0.005 MRR 0.4673 (−0.017). Density optimization CLOSED. |
| P19 | Edge-to-edge attention works at N=2000 (1-layer) | High | Phase 59: 1-layer DELTA val MRR 0.3338 surpasses DistMult (0.3185). Mechanism viable; depth is the bottleneck. |
| P20 | 3-layer DELTA catastrophically over-smooths at N=2000 | High | Phase 59: MRR 0.0018 across 3 training configs. 15.2M E_adj pairs × 3 layers = representation collapse. |
| P21 | DELTA's test-MRR advantage over DistMult is non-monotonic and modest at N=5000 | High | Phase 62: gap +0.004 (N=500), +0.076 (N=2000), +0.016 (N=5000). The N=2000 spike is inflated by DistMult overfitting. |
| P22 | E_adj subsampling is a minor confound at N=5000, not the primary bottleneck | High | Phase 63: raising retention from 23.8% to 47.6% gains only +0.007 test MRR; full retention (100%) gap vs DistMult is +0.021. Non-monotonic response: 47.6% > 100% > 71.4% > 23.8%. |
| P23 | Top-k sparse attention at k=128 preserves full-attention quality; k=64 degrades −5.6%; k=256 OOMs on 98GB | High | Phase 64: topk=128 test MRR 0.2472 matches Phase 63 full softmax (0.2457, Δ=+0.0015); topk=64 test MRR 0.2332 (−5.6%); topk=256 OOM after epoch 25 on 98GB Blackwell. No epoch-time speedup at N=5000. |
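
Several propositions (P19, P21, P22) benchmark against DistMult. For orientation, a minimal sketch of the standard DistMult trilinear score (tensor names are hypothetical; not this project's training code):

```python
import torch

def distmult_score(h, r, t):
    """DistMult triple score <h, r, t> = sum_d h_d * r_d * t_d.

    h, r, t: (batch, dim) embeddings for head, relation, tail.
    Higher score = more plausible triple.
    """
    return (h * r * t).sum(dim=-1)

# Ranking against every entity (full softmax over candidates):
# scores = (h * r) @ entity_emb.T   # (batch, num_entities)
```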

Known Issues

  • Phase 37 leakage — invalidated (5 evaluation bugs). Replaced by Phase 40.
  • Training cost — DELTA 34x slower per epoch than GraphGPS; inference is comparable.
  • LP/3p trade-off is fundamental — After 9 phases (46–54) testing 20+ temperature configurations, temperature reliably improves LP but has no statistically supported effect on multi-hop reasoning depth. Three operating modes: LP-optimized (P/Q), balanced-3p (K), deep-reasoning (N).
  • Multi-hop temperature claims not statistically robust — Phase 53 multi-seed validation shows single-seed 3p/4p/5p advantages are seed-dependent. Only LP improvements are statistically supported.
  • brain_hybrid overfitting — Both density conditions overfit after epoch 150. Early stopping around epoch 150–180 preserves best test performance.
  • 3-layer depth over-smoothing at N=2000 — Phase 59: catastrophic collapse (MRR=0.002). 1-layer works (MRR=0.334). Depth management is the critical architectural challenge for scaling.

Open Gaps

Gap 1: Full-scale evaluation — SUBSAMPLING RESOLVED (Phase 63)

  • Phase 62 scaled to N=5000: DELTA test MRR=0.2404 vs DM=0.2244, gap=+0.016 (below hypothesized ≥0.04)
  • Scaling curve is non-monotonic: +0.004 (N=500) → +0.076 (N=2000) → +0.016 (N=5000)
  • N=2000 gap was inflated by DistMult catastrophic overfitting (val→test drop of 27%)
  • Phase 63 subsampling ablation: 4 conditions from 23.8% to 100% retention. Best test MRR: B (47.6%) = 0.2471, Δ vs baseline = +0.007. Subsampling is a minor confound (see the retention sketch after this list).
  • Even at 100% E_adj retention (D), gap vs DM is only +0.021 — still below 0.04 threshold
  • Non-monotonic test MRR: 47.6% > 100% > 71.4% > 23.8% — attention dilution at high retention causes overfitting
  • Full FB15k-237 (14,505 entities, ~211M E_adj pairs) requires attention sparsification
  • Next: accept modest N=5000 advantage as genuine; investigate sparse attention or multi-head scaling
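
The retention sketch referenced above: uniform E_adj subsampling in the spirit of the Phase 63 ablation (tensor names are hypothetical; the actual sampler may differ):

```python
import torch

def subsample_e_adj(e_adj_pairs, retention):
    """Keep a uniform random fraction of edge-adjacency pairs.

    e_adj_pairs: (num_pairs, 2) long tensor of (edge_i, edge_j) indices.
    retention:   fraction in (0, 1], e.g. 0.476 for the 47.6% condition.
    """
    keep = torch.rand(e_adj_pairs.size(0)) < retention
    return e_adj_pairs[keep]
```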

Gap 2: LP/3p trade-off — CHARACTERIZED (Phases 46–54)

  • After 9 phases and 20+ configurations, the trade-off is confirmed fundamental at the temperature level
  • Three operating modes: LP-optimized (Q: 0.4905), balanced-3p (K: 0.4148), deep-reasoning (N: 5p 0.3788)
  • Dynamic temperature (annealing) helps but doesn't fully resolve the trade-off; a minimal annealing sketch follows
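
A minimal sketch of a linear anneal of the attention sharpness scalar, using the convention here that larger values (e.g. edge=7.0) mean sharper attention; the endpoints and schedule shape are illustrative, not the exact K/Q configurations:

```python
def annealed_sharpness(epoch, total_epochs, s_start=1.0, s_end=7.0):
    """Linearly anneal an attention sharpness multiplier.

    Attention logits are scaled by s before the softmax, so s > 1
    sharpens the distribution and s < 1 softens it.
    """
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return s_start + frac * (s_end - s_start)

# usage inside attention: attn = softmax(s * (q @ k.T) / sqrt(d))
```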

Gap 3: Cross-family generalization

  • WN18RR (transfer): 0.961 probe accuracy (Phase 35) — but frozen encoder, not trained LP
  • YAGO3-10: untested. Would demonstrate generalization beyond Freebase

Gap 4: Brain architecture optimization — SPARSE ATTENTION RESOLVED (Phase 64)

  • BrainEncoder validated (Phases 55–58): the single-seed MRR gain over delta_full is marginal (+0.002), but the multi-seed mean of 0.4844 ± 0.0097 is robust
  • H@10 advantage (+4.7%) is substantial — constructed edges genuinely add recall
  • Density optimization CLOSED: d=0.01 (2,435 edges) is the sweet spot. d=0.02 too noisy, d=0.005 too sparse.
  • brain_hybrid OOMs at N=2000 (102K edges). Scaling brain architecture requires solving attention dilution first.
  • Phase 64 result: topk=128 sparse attention validated — matches full-softmax quality (test MRR 0.2472 vs 0.2457). The memory limit at N=5000 lies between topk=128 (OK) and topk=256 (OOM on 98GB). The sparse attention mechanism is ready for the full Brain stack (minimal top-k sketch after this list).
  • Phase 65 (deferred): Brain hybrid at N=5000 ran 1 epoch (816.7 s, healthy signals) but was aborted due to projected compute cost (~$64 over ~34 hr). Brain is architecturally sound; deferred to future work post-paper-submission.
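
The top-k sketch referenced above: per-row top-k masking of edge-to-edge attention logits, assuming a dense logit row per edge (the production path presumably operates on sparse E_adj structures):

```python
import torch

def topk_sparse_attention(logits, values, k=128):
    """Per-row top-k attention: mask all but the k largest logits.

    logits: (num_edges, num_neighbors) edge-to-edge attention scores.
    values: (num_edges, num_neighbors, dim) neighbor messages.
    Masked positions become -inf and receive zero softmax weight.
    """
    k = min(k, logits.size(-1))
    threshold = logits.topk(k, dim=-1).values[..., -1:]   # k-th largest per row
    masked = logits.masked_fill(logits < threshold, float("-inf"))
    attn = torch.softmax(masked, dim=-1)
    return torch.einsum("en,end->ed", attn, values)
```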

Gap 5: Depth management at scale — CLOSED (Phases 59–63)

  • Phase 59: 3-layer over-smooths catastrophically at N=2000 (MRR=0.002); 1-layer works (MRR=0.334)
  • Phase 60: residual gating eliminates over-smoothing — 3L+gate MRR=0.3138 (174× the ungated 0.0018); see the gating sketch after this list
  • However, all depths converge to MRR ≈ 0.31 — matching DistMult (0.3185), which uses no GNN at all
  • Gates freeze at ~10% layer / 90% residual → the model has learned that the layers contribute noise, not signal
  • Phase 61/61b: DELTA provides genuine but modest +0.017 val MRR advantage over DistMult at N=2000
  • Phase 62: At N=5000, gap narrows to +0.016 test MRR — advantage is real but doesn't scale
  • Phase 63: E_adj subsampling ablation confirms attention dilution, not subsampling, is the bottleneck
  • 1-layer accepted as scaling architecture; depth provides no benefit at N≥2000
  • Phase 64: topk=128 sparse attention CONFIRMED — preserves full-attention quality (test MRR=0.2472 vs full softmax 0.2457). Dilution controlled. Memory ceiling at topk=128 (256 OOM on 98GB).
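
The gating sketch referenced above, with one learned scalar gate per layer (hypothetical module shape; the real gate may be per-dimension or per-head):

```python
import torch
import torch.nn as nn

class GatedResidualLayer(nn.Module):
    """Blend a layer's output with its input through a learned gate.

    A gate near 0.1 (as in Phase 60's frozen gates) routes ~90% of
    the signal through the residual path, suppressing over-smoothing.
    """
    def __init__(self, layer):
        super().__init__()
        self.layer = layer
        self.gate_logit = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 init

    def forward(self, x):
        g = torch.sigmoid(self.gate_logit)
        return g * self.layer(x) + (1.0 - g) * x
```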

Gap 6: Hop-depth ablation — ADDRESSED BUT OPEN (Phase 66 → Phase 67)

  • Phase 66 (3 seeds × 500 epochs, RTX 3080 Ti): the hops=2 vs hops=1 gap on the N=500 dense subgraph is within 1σ (2p=+0.005, 3p=+0.002), and node_only outperforms both. The 2-hop hypothesis is REJECTED on the dense subgraph (mean degree ≈19.7).
  • The dense-subgraph result does not answer the question for the sparse full graph: at mean degree ≈4.1, 1-hop edge adjacency covers only ~16 pairs per edge, leaving far more room for 2-hop to add signal (see the adjacency sketch after this list).
  • Phase 67 (PLANNED): repeat on full sparse FB15k-237 (14.5K entities) with topk=128 sparse attention + NBFNet/A*Net baselines. Requires RunPod H100.
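
The adjacency sketch referenced above: 1-hop edge adjacency (two edges are adjacent iff they share an entity) over a plain triple list; illustrative, not the project's adjacency builder:

```python
from collections import defaultdict
from itertools import combinations

def one_hop_edge_adjacency(triples):
    """All pairs of edges (triples) that share an entity endpoint.

    triples: list of (head, relation, tail) tuples. At mean degree ~4,
    each entity contributes only a handful of pairs, which is why
    1-hop coverage is so thin on the sparse full graph.
    """
    edges_at = defaultdict(list)
    for i, (h, _, t) in enumerate(triples):
        edges_at[h].append(i)
        edges_at[t].append(i)
    pairs = set()
    for edge_ids in edges_at.values():
        pairs.update(combinations(sorted(edge_ids), 2))
    return pairs
```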

Gap 7: Standard benchmark comparison — FUTURE (Phase 68)

  • All multi-hop results use a custom chain-query generator — not directly comparable to published BetaE/GQE results
  • Phase 68 (PLANNED): BetaE 9-query-type evaluation (1p/2p/3p/2i/3i/ip/pi/2u/up) on the standard BetaE splits; the query shapes are spelled out after this list
  • Dataset: https://snap.stanford.edu/betae/KG_data.zip
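
The query shapes referenced above, as plain-language templates (our notation for this page, not the dataset's internal encoding):

```python
# Plain-language shapes for the nine BetaE query types ("?" is the
# answer variable, x/y are intermediate variables):
BETAE_QUERY_TYPES = {
    "1p": "e1 -r1-> ?",
    "2p": "e1 -r1-> x -r2-> ?",
    "3p": "e1 -r1-> x -r2-> y -r3-> ?",
    "2i": "(e1 -r1-> ?) AND (e2 -r2-> ?)",
    "3i": "(e1 -r1-> ?) AND (e2 -r2-> ?) AND (e3 -r3-> ?)",
    "ip": "((e1 -r1-> x) AND (e2 -r2-> x)) -r3-> ?",
    "pi": "(e1 -r1-> x -r2-> ?) AND (e2 -r3-> ?)",
    "2u": "(e1 -r1-> ?) OR (e2 -r2-> ?)",
    "up": "((e1 -r1-> x) OR (e2 -r2-> x)) -r3-> ?",
}
```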

Gap 8: Sequence domain generalization — FUTURE

  • All current evidence is on knowledge graphs where structure is pre-defined or semi-explicit
  • BrainEncoder's Gumbel-sigmoid construction (Phases 55–58) is the mechanism for sequence domains (minimal sampling sketch after this list)
  • Prerequisites: Phase 67 (full graph), Phase 68 (BetaE benchmark), then LRA ListOps pilot
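
The sampling sketch referenced above: the standard straight-through Gumbel-sigmoid relaxation behind differentiable edge construction (names and hyperparameters are illustrative, not the BrainEncoder's exact implementation):

```python
import torch

def gumbel_sigmoid(logits, tau=1.0, hard=True):
    """Differentiable soft/hard edge indicators from edge logits.

    Adds Logistic noise (the difference of two Gumbels), squashes it
    through a temperature-controlled sigmoid, and optionally applies
    a straight-through hard threshold at 0.5.
    """
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)          # Logistic(0, 1) sample
    y_soft = torch.sigmoid((logits + noise) / tau)
    if hard:
        y_hard = (y_soft > 0.5).float()
        return y_hard + (y_soft - y_soft.detach())  # straight-through gradient
    return y_soft
```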

Roadmap

Horizon 2: Adaptive Architecture (Phases 46–58) — Complete

| Phase | Goal | Status |
|---|---|---|
| 46 | Learnable per-head temperature | Done — dead heads 83% → 38%; edge/node asymmetry |
| 47 | Layer-specific temperature | Done — B (L0 soft, L1+L2 sharp) = best LP 0.4783 |
| 48 | Asymmetric node/edge temperature | Done — E = LP record 0.4856; node stable, edge drifts UP |
| 49 | L0 temperature + asymmetric L1+L2 | Done — H achieves LP 0.4887 but 3p still below D |
| 50 | Temperature annealing | Done — K breaks the 3p ceiling (0.4148), first to beat D's 0.4018 |
| 51 | Static vs trajectory | Done — the annealing trajectory creates representations static init cannot replicate |
| 52 | Edge sharpness + closing LP gap | Done — Q achieves LP 0.4905 (record); LP/3p trade-off confirmed fundamental |
| 53 | Multi-seed validation | Done — multi-hop claims are seed-dependent; only LP is robust |
| 54 | High-power multi-hop eval | Done — 10k queries confirm evaluation noise dominates; investigation CLOSED |
| 55 | Brain architecture port | Done — BrainEncoder MRR 0.4773, H@10 +3.7% over delta_full |
| 56 | Constructor density ablation | Done — d=0.01 strictly dominates d=0.02 |
| 57 | Brain temperature annealing | Done — baseline (no annealing) optimal; MRR 0.4808–0.4818 |
| 58 | Multi-seed density validation | Done — d=0.01 robust (mean MRR 0.4844 ± 0.0097); d=0.005 fails (−0.017). Density CLOSED. |
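
A minimal sketch of learnable per-head sharpness on attention logits, matching the convention in this table that larger values sharpen (node=2 vs edge=6 in Phase 48); the softplus parameterization is an assumption, not necessarily the project's:

```python
import torch
import torch.nn as nn

class PerHeadSharpness(nn.Module):
    """Scale attention logits by a learned positive factor per head.

    s > 1 sharpens a head's attention, s < 1 softens it; softplus
    keeps s positive while leaving it free to drift during training.
    """
    def __init__(self, num_heads, init=1.0):
        super().__init__()
        # inverse softplus of `init`, so s starts exactly at `init`
        inv = torch.log(torch.expm1(torch.tensor(float(init))))
        self.raw = nn.Parameter(inv * torch.ones(num_heads))

    def forward(self, logits):                  # logits: (batch, heads, q, k)
        s = nn.functional.softplus(self.raw)    # (heads,)
        return logits * s.view(1, -1, 1, 1)
```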

Horizon 3: Sparse Attention & Full Brain Stack (Phases 59–67+) — Active

KG scaling investigation (Phases 59–63) CLOSED. Key finding: attention dilution — not subsampling — is the real scaling bottleneck. At N=5000, each edge attends to ~15M adjacency pairs; signal-to-noise collapses. Phase 63 subsampling ablation confirmed subsampling is a minor confound (+0.007 at best). Non-monotonic response (47.6% > 100% > 71.4% > 23.8%) proves dilution causes the performance ceiling.

Strategic direction: Path C — sparse attention bridges KG completion and Brain activation.

| Phase | Goal | Status |
|---|---|---|
| 59 | Depth scaling at N=2000 | Done — 1-layer surpasses DistMult; 3-layer catastrophically over-smooths |
| 60 | Residual gating for depth | Done — eliminates over-smoothing, but layers contribute nothing |
| 61/61b | DELTA vs DistMult controlled comparison (N=2000) | Done — genuine +0.017 val MRR advantage confirmed |
| 62 | Scale to N=5000 | Done — advantage real but modest (+0.016 test MRR) |
| 63 | E_adj subsampling ablation (N=5000) | Done — attention dilution confirmed as the bottleneck, not subsampling |
| 64 | Top-k sparse edge-to-edge attention | Done — topk=128 matches full softmax (MRR 0.2472); topk=64 −5.6%; topk=256 OOM |
| 65 | Full Brain stack activation | Deferred — 1 epoch ran (816.7 s, healthy), aborted at ~$64/34 hr compute cost. Engineering fixes committed. |
| 66 | 1-hop vs 2-hop edge adjacency ablation | Done — REJECTED on the dense N=500 subgraph (hops=2 vs hops=1 within 1σ). Paper table updated with actual numbers. |
| 67 | Full sparse FB15k-237 hop-depth ablation + NBFNet baselines | Planned — RunPod H100, topk=128 sparse attention, 3 seeds |
| 68 | BetaE 9-query-type benchmark | Planned — standard comparison to published BetaE/GQE results |
| 69+ | Brain N=5000 retry + iterative refinement | Planned — post-paper-submission |

Horizon 4: Dynamic Reasoning — Future

Iterative graph refinement, temporal reasoning, multi-scale construction. See The Brain.

Horizon 5: The Brain

Multi-modal construction, associative memory, compositional generalization. See The Brain.


Publication Pathway

Target: NeurIPS / ICLR — "DELTA: Dual Edge-Linked Transformer Architecture for Compositional Reasoning on Knowledge Graphs"

What We Have

  • [x] Novel architecture with theoretical motivation
  • [x] 66 experiment phases with honest failure documentation
  • [x] Multi-seed statistical validation (Phases 29, 45, 53)
  • [x] Competitive LP on real FB15k-237 (Phases 40, 48, 52)
  • [x] Multi-hop compositional dominance (Phases 42–44)
  • [x] Inference efficiency story (Phase 45)
  • [x] Learnable temperature with mechanistic insight (Phases 46–52)
  • [x] Self-bootstrap result (Phase 39: 157% of FixedChain)
  • [x] Differentiable graph construction via BrainEncoder (Phases 55–57)
  • [x] LP/3p trade-off fully characterized (Phases 46–54)

What We Still Need

  • [ ] Sparse attention at full FB15k-237 scale (mechanism validated at N=5000 in Phase 64) — Gaps 4/5
  • [ ] Cross-family benchmark (YAGO3-10 or WN18RR LP) — Gap 3
  • [ ] Sequence domain evaluation (LRA ListOps) — Gap 8
  • [ ] Interpretability figure (attention heatmap on known reasoning chain)
  • [x] Multi-seed brain_hybrid validation (Phase 58: 3 seeds, mean MRR 0.4844 ± 0.0097)

Paper Structure

  1. Introduction — The three-paradigm gap (Transformers → GNNs → DELTA); edges as first-class computational citizens; self-bootstrapped graph construction
  2. Related Work — Message-passing GNNs (CompGCN, RGCN); Transformers on graphs (GraphGPS, GRIT); KG completion (TransE, RotatE, BetaE)
  3. Architecture — DualParallelAttention; 2-hop edge adjacency; ReconciliationBridge; BrainEncoder (differentiable construction); learnable per-head temperature with edge/node asymmetry
  4. Experiments — Setup (FB15k-237 subset, evaluation protocol); LP results (Phases 40, 48, 52); multi-hop path queries (Phases 42–44, 1p–5p); robustness (Phase 43 DropEdge, Phase 45 multi-seed); temperature analysis (Phases 46–54); brain architecture (Phases 55–57)
  5. Inference Efficiency — Phase 45: per-query faster than GraphGPS
  6. Self-Bootstrap & Brain — Phase 39: 157% of FixedChain; Phases 55–57: differentiable construction matches delta_full with +4.7% H@10
  7. Conclusion — The Brain vision

See The Brain for long-term vision. See Validation Phases for all experiment results. See Key Findings for the 38 consolidated findings.