
Status & Roadmap

Last updated: Phase 66 completed (2026-04-23)


Current Best Results

| Metric | Model | Value | Phase |
|---|---|---|---|
| LP MRR (DELTA-Full, temp) | Q: K-anneal + edge=7.0 | 0.4905 | 52 |
| LP MRR (brain_hybrid) | A: d=0.01, seed=456 | 0.4956 | 58 |
| LP MRR (brain_hybrid, 3-seed mean) | d=0.01, seeds 42/123/456 | 0.4844 ± 0.0097 | 58 |
| LP H@10 (brain_hybrid) | A: baseline @ d=0.01 | 0.8076 | 57 |
| LP H@10 (DELTA-Full, temp) | S: anneal + edge=7.0 | 0.8045 | 52 |
| 3p MRR (multi-hop) | DELTA-Matched @10% drop | 0.742 ± 0.009 | 45 (3-seed) |
| 5p MRR | DELTA-Matched @0% drop | 0.790 | 44 |
| Depth advantage (5p) | DELTA vs GraphGPS | +0.100 | 44 |
| Per-query inference | DELTA vs GraphGPS | 0.8–0.9× (faster) | 45 |
| LP MRR (1-layer DELTA, N=2000) | delta_1layer | 0.3338 (val peak) | 59 |
| LP MRR (DistMult, N=2000) | DistMult (no GNN) | 0.3185 | 59 |
| LP test MRR (1-layer DELTA, N=5000) | delta_1layer (sub E_adj) | 0.2404 | 62 |
| LP test MRR (1-layer DELTA, N=5000, 47.6% E_adj, full softmax) | delta_1layer (30M subsample, full softmax) | 0.2471 | 63 |
| LP test MRR (1-layer DELTA, N=5000, topk=128) | delta_1layer (63M, topk=128) | 0.2472 | 64 |
| LP test MRR (DistMult, N=5000) | DistMult (no GNN) | 0.2244 | 62 |
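
For reference, the LP numbers above follow the standard filtered ranking protocol. A minimal sketch of how MRR and Hits@k fall out of per-query ranks (illustrative helper, not the project's evaluation code):

```python
def mrr_and_hits(ranks, k=10):
    """Mean Reciprocal Rank and Hits@k from 1-indexed filtered ranks.

    `ranks` holds, for each test triple, the rank of the true entity
    among all candidates after filtering known true triples (1 = best).
    """
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

# Example: three queries ranked 1, 4, and 20
# -> MRR = (1 + 0.25 + 0.05) / 3 ≈ 0.433, H@10 = 2/3 ≈ 0.667
print(mrr_and_hits([1, 4, 20]))
```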

What's Validated (23 Propositions)

| # | Proposition | Confidence | Key Evidence |
|---|---|---|---|
| P1 | Edge attention beats node attention on relational tasks | High | Phases 1, 9, 11, 13, 28 |
| P2 | Dual parallel attention is the key differentiator at high noise | High | Phase 28: +24% at extreme noise |
| P3 | Graph structure adds value over transformers on relational tasks | High | Phase 27b: +4.4%; Phase 42: multi-hop dominance |
| P4 | Soft gating achieves sparsity without accuracy loss | High | Phases 22/29: 100.0% at 50% sparsity |
| P5 | DELTA beats CompGCN on real FB15k-237 | High | Phases 25/29: 97.4% vs 96.9% |
| P6 | Results stable across random seeds | High | Phase 29 (5 seeds), Phase 45 (3 seeds multi-hop) |
| P7 | Sampling robustness at 26% VRAM budget | High | Phase 30: all 4 strategies within ±0.2% |
| P8 | DELTA competitive with GraphGPS on LP | High | Phase 40: SBHybrid MRR 0.5089 (within 0.004) |
| P9 | DELTA excels on multi-hop compositional queries | High | Phases 42–44: only model with 2p→3p improvement; 5p MRR 0.790 |
| P10 | Advantage accelerates with reasoning depth | High | Phase 44: +0.004 at 2p → +0.100 at 5p vs GraphGPS |
| P11 | Per-query inference competitive with GraphGPS | High | Phase 45: 0.8–0.9× per query |
| P12 | Learnable temperature discovers edge/node asymmetry | High | Phases 46–48: edge wants sharper, node wants softer |
| P13 | Selective sharpening outperforms uniform | High | Phase 47: B (L0=1, L1+L2=4) best of 4 configs |
| P14 | Asymmetric node/edge temperature improves LP | High | Phase 48: E (node=2, edge=6) MRR 0.4856 (+1.5%) |
| P15 | Differentiable graph construction is viable for LP | High | Phases 55–57: brain_hybrid matches delta_full MRR with +4.7% H@10 |
| P16 | Constructor density controls the precision/recall trade-off | High | Phase 56: d=0.01 strictly dominates d=0.02 on MRR, H@3, H@10 |
| P17 | Temperature annealing is counterproductive on brain_hybrid | High | Phase 57: annealing trades H@10 for marginal MRR; baseline optimal |
| P18 | d=0.01 is the optimal brain_hybrid density sweet spot | High | Phase 58: d=0.01 mean MRR 0.4844 ± 0.0097 (robust); d=0.005 MRR 0.4673 (−0.017). Density optimization CLOSED. |
| P19 | Edge-to-edge attention works at N=2000 (1-layer) | High | Phase 59: 1-layer DELTA val MRR 0.3338 surpasses DistMult (0.3185). Mechanism viable; depth is the bottleneck. |
| P20 | 3-layer DELTA catastrophically over-smooths at N=2000 | High | Phase 59: MRR 0.0018 across 3 training configs. 15.2M E_adj pairs × 3 layers = representation collapse. |
| P21 | DELTA's test-MRR advantage over DistMult is non-monotonic and modest at N=5000 | High | Phase 62: gap +0.004 (N=500), +0.076 (N=2000), +0.016 (N=5000). The N=2000 spike is inflated by DistMult overfitting. |
| P22 | E_adj subsampling is a minor confound at N=5000, not the primary bottleneck | High | Phase 63: raising retention from 23.8% to 47.6% gains only +0.007 test MRR; full retention (100%) gap vs DistMult is +0.021. Non-monotonic response: 47.6% > 100% > 71.4% > 23.8%. |
| P23 | Top-k sparse attention at k=128 preserves full-attention quality; k=64 degrades −5.6%; k=256 OOMs on 98GB | High | Phase 64: topk=128 test MRR 0.2472 matches Phase 63 full softmax (0.2457, Δ=+0.0015); topk=64 test MRR 0.2332 (−5.6%); topk=256 OOM after epoch 25 on 98GB Blackwell. No epoch-time speedup at N=5000. |
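
Several propositions (P19, P21, P22) benchmark against DistMult. For orientation, a minimal sketch of the standard DistMult trilinear score (tensor names are hypothetical; not this project's training code):

```python
import torch

def distmult_score(h, r, t):
    """DistMult triple score <h, r, t> = sum_d h_d * r_d * t_d.

    h, r, t: (batch, dim) embeddings for head, relation, tail.
    Higher score = more plausible triple.
    """
    return (h * r * t).sum(dim=-1)

# Ranking against every entity (full softmax over candidates):
# scores = (h * r) @ entity_emb.T   # (batch, num_entities)
```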

Known Issues

  • Phase 37 leakage — invalidated (5 evaluation bugs). Replaced by Phase 40.
  • Training cost — DELTA 34x slower per epoch than GraphGPS; inference is comparable.
  • LP/3p trade-off is fundamental — After 9 phases (46–54) testing 20+ temperature configurations, temperature reliably improves LP but has no statistically supported effect on multi-hop reasoning depth. Three operating modes: LP-optimized (P/Q), balanced-3p (K), deep-reasoning (N).
  • Multi-hop temperature claims not statistically robust — Phase 53 multi-seed validation shows single-seed 3p/4p/5p advantages are seed-dependent. Only LP improvements are statistically supported.
  • brain_hybrid overfitting — Both density conditions overfit after epoch 150. Early stopping around epoch 150–180 preserves best test performance.
  • 3-layer depth over-smoothing at N=2000 — Phase 59: catastrophic collapse (MRR=0.002). 1-layer works (MRR=0.334). Depth management is the critical architectural challenge for scaling.

Open Gaps

Gap 1: Full-scale evaluation — SUBSAMPLING RESOLVED (Phase 63)

  • Phase 62 scaled to N=5000: DELTA test MRR=0.2404 vs DM=0.2244, gap=+0.016 (below hypothesized ≥0.04)
  • Scaling curve is non-monotonic: +0.004 (N=500) → +0.076 (N=2000) → +0.016 (N=5000)
  • N=2000 gap was inflated by DistMult catastrophic overfitting (val→test drop of 27%)
  • Phase 63 subsampling ablation: 4 conditions from 23.8% to 100% retention. Best test MRR: B (47.6%) = 0.2471, Δ vs baseline = +0.007. Subsampling is a minor confound (see the retention sketch after this list).
  • Even at 100% E_adj retention (D), gap vs DM is only +0.021 — still below 0.04 threshold
  • Non-monotonic test MRR: 47.6% > 100% > 71.4% > 23.8% — attention dilution at high retention causes overfitting
  • Full FB15k-237 (14,505 entities, ~211M E_adj pairs) requires attention sparsification
  • Next: accept modest N=5000 advantage as genuine; investigate sparse attention or multi-head scaling
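
The retention sketch referenced above: uniform E_adj subsampling in the spirit of the Phase 63 ablation (tensor names are hypothetical; the actual sampler may differ):

```python
import torch

def subsample_e_adj(e_adj_pairs, retention):
    """Keep a uniform random fraction of edge-adjacency pairs.

    e_adj_pairs: (num_pairs, 2) long tensor of (edge_i, edge_j) indices.
    retention:   fraction in (0, 1], e.g. 0.476 for the 47.6% condition.
    """
    keep = torch.rand(e_adj_pairs.size(0)) < retention
    return e_adj_pairs[keep]
```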

Gap 2: LP/3p trade-off — CHARACTERIZED (Phases 46–54)

  • After 9 phases and 20+ configurations, the trade-off is confirmed fundamental at the temperature level
  • Three operating modes: LP-optimized (Q: 0.4905), balanced-3p (K: 0.4148), deep-reasoning (N: 5p 0.3788)
  • Dynamic temperature (annealing) helps but doesn't fully resolve the trade-off; a minimal annealing sketch follows
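
A minimal sketch of a linear anneal of the attention sharpness scalar, using the convention here that larger values (e.g. edge=7.0) mean sharper attention; the endpoints and schedule shape are illustrative, not the exact K/Q configurations:

```python
def annealed_sharpness(epoch, total_epochs, s_start=1.0, s_end=7.0):
    """Linearly anneal an attention sharpness multiplier.

    Attention logits are scaled by s before the softmax, so s > 1
    sharpens the distribution and s < 1 softens it.
    """
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return s_start + frac * (s_end - s_start)

# usage inside attention: attn = softmax(s * (q @ k.T) / sqrt(d))
```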

Gap 3: Cross-family generalization

  • WN18RR (transfer): 0.961 probe accuracy (Phase 35) — but frozen encoder, not trained LP
  • YAGO3-10: untested. Would demonstrate generalization beyond Freebase

Gap 4: Brain architecture optimization — SPARSE ATTENTION RESOLVED (Phase 64)

  • BrainEncoder validated (Phases 55–58): the single-seed MRR gain over delta_full is marginal (+0.002), but the multi-seed mean of 0.4844 ± 0.0097 is robust
  • H@10 advantage (+4.7%) is substantial — constructed edges genuinely add recall
  • Density optimization CLOSED: d=0.01 (2,435 edges) is the sweet spot. d=0.02 too noisy, d=0.005 too sparse.
  • brain_hybrid OOMs at N=2000 (102K edges). Scaling brain architecture requires solving attention dilution first.
  • Phase 64 result: topk=128 sparse attention validated — matches full-softmax quality (test MRR 0.2472 vs 0.2457). The memory limit at N=5000 lies between topk=128 (OK) and topk=256 (OOM on 98GB). The sparse attention mechanism is ready for the full Brain stack (minimal top-k sketch after this list).
  • Phase 65 (deferred): Brain hybrid at N=5000 ran 1 epoch (816.7 s, healthy signals) but was aborted due to projected compute cost (~$64 over ~34 hr). Brain is architecturally sound; deferred to future work post-paper-submission.
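
The top-k sketch referenced above: per-row top-k masking of edge-to-edge attention logits, assuming a dense logit row per edge (the production path presumably operates on sparse E_adj structures):

```python
import torch

def topk_sparse_attention(logits, values, k=128):
    """Per-row top-k attention: mask all but the k largest logits.

    logits: (num_edges, num_neighbors) edge-to-edge attention scores.
    values: (num_edges, num_neighbors, dim) neighbor messages.
    Masked positions become -inf and receive zero softmax weight.
    """
    k = min(k, logits.size(-1))
    threshold = logits.topk(k, dim=-1).values[..., -1:]   # k-th largest per row
    masked = logits.masked_fill(logits < threshold, float("-inf"))
    attn = torch.softmax(masked, dim=-1)
    return torch.einsum("en,end->ed", attn, values)
```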

Gap 5: Depth management at scale — CLOSED (Phases 59–63)

  • Phase 59: 3-layer over-smooths catastrophically at N=2000 (MRR=0.002); 1-layer works (MRR=0.334)
  • Phase 60: residual gating eliminates over-smoothing — 3L+gate MRR=0.3138 (174× the ungated 0.0018); see the gating sketch after this list
  • However, all depths converge to MRR ≈ 0.31 — matching DistMult (0.3185), which uses no GNN at all
  • Gates freeze at ~10% layer / 90% residual → the model has learned that the layers contribute noise, not signal
  • Phase 61/61b: DELTA provides genuine but modest +0.017 val MRR advantage over DistMult at N=2000
  • Phase 62: At N=5000, gap narrows to +0.016 test MRR — advantage is real but doesn't scale
  • Phase 63: E_adj subsampling ablation confirms attention dilution, not subsampling, is the bottleneck
  • 1-layer accepted as scaling architecture; depth provides no benefit at N≥2000
  • Phase 64: topk=128 sparse attention CONFIRMED — preserves full-attention quality (test MRR=0.2472 vs full softmax 0.2457). Dilution controlled. Memory ceiling at topk=128 (256 OOM on 98GB).
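
The gating sketch referenced above, with one learned scalar gate per layer (hypothetical module shape; the real gate may be per-dimension or per-head):

```python
import torch
import torch.nn as nn

class GatedResidualLayer(nn.Module):
    """Blend a layer's output with its input through a learned gate.

    A gate near 0.1 (as in Phase 60's frozen gates) routes ~90% of
    the signal through the residual path, suppressing over-smoothing.
    """
    def __init__(self, layer):
        super().__init__()
        self.layer = layer
        self.gate_logit = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 init

    def forward(self, x):
        g = torch.sigmoid(self.gate_logit)
        return g * self.layer(x) + (1.0 - g) * x
```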

Gap 6: Hop-depth ablation — ADDRESSED BUT OPEN (Phase 66 → Phase 67)

  • Phase 66 (3 seeds × 500 epochs, RTX 3080 Ti): the hops=2 vs hops=1 gap on the N=500 dense subgraph is within 1σ (2p=+0.005, 3p=+0.002), and node_only outperforms both. The 2-hop hypothesis is REJECTED on the dense subgraph (mean degree ≈19.7).
  • The dense-subgraph result does not answer the question for the sparse full graph: at mean degree ≈4.1, 1-hop edge adjacency covers only ~16 pairs per edge, leaving far more room for 2-hop to add signal (see the adjacency sketch after this list).
  • Phase 67 (PLANNED): repeat on full sparse FB15k-237 (14.5K entities) with topk=128 sparse attention + NBFNet/A*Net baselines. Requires RunPod H100.
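
The adjacency sketch referenced above: 1-hop edge adjacency (two edges are adjacent iff they share an entity) over a plain triple list; illustrative, not the project's adjacency builder:

```python
from collections import defaultdict
from itertools import combinations

def one_hop_edge_adjacency(triples):
    """All pairs of edges (triples) that share an entity endpoint.

    triples: list of (head, relation, tail) tuples. At mean degree ~4,
    each entity contributes only a handful of pairs, which is why
    1-hop coverage is so thin on the sparse full graph.
    """
    edges_at = defaultdict(list)
    for i, (h, _, t) in enumerate(triples):
        edges_at[h].append(i)
        edges_at[t].append(i)
    pairs = set()
    for edge_ids in edges_at.values():
        pairs.update(combinations(sorted(edge_ids), 2))
    return pairs
```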

Gap 7: Standard benchmark comparison — FUTURE (Phase 68)

  • All multi-hop results use a custom chain-query generator — not directly comparable to published BetaE/GQE results
  • Phase 68 (PLANNED): BetaE 9-query-type evaluation (1p/2p/3p/2i/3i/ip/pi/2u/up) on the standard BetaE splits; the query shapes are spelled out after this list
  • Dataset: https://snap.stanford.edu/betae/KG_data.zip
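
The query shapes referenced above, as plain-language templates (our notation for this page, not the dataset's internal encoding):

```python
# Plain-language shapes for the nine BetaE query types ("?" is the
# answer variable, x/y are intermediate variables):
BETAE_QUERY_TYPES = {
    "1p": "e1 -r1-> ?",
    "2p": "e1 -r1-> x -r2-> ?",
    "3p": "e1 -r1-> x -r2-> y -r3-> ?",
    "2i": "(e1 -r1-> ?) AND (e2 -r2-> ?)",
    "3i": "(e1 -r1-> ?) AND (e2 -r2-> ?) AND (e3 -r3-> ?)",
    "ip": "((e1 -r1-> x) AND (e2 -r2-> x)) -r3-> ?",
    "pi": "(e1 -r1-> x -r2-> ?) AND (e2 -r3-> ?)",
    "2u": "(e1 -r1-> ?) OR (e2 -r2-> ?)",
    "up": "((e1 -r1-> x) OR (e2 -r2-> x)) -r3-> ?",
}
```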

Gap 8: Sequence domain generalization — FUTURE

  • All current evidence is on knowledge graphs where structure is pre-defined or semi-explicit
  • BrainEncoder's Gumbel-sigmoid construction (Phases 55–58) is the mechanism for sequence domains (minimal sampling sketch after this list)
  • Prerequisites: Phase 67 (full graph), Phase 68 (BetaE benchmark), then LRA ListOps pilot
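
The sampling sketch referenced above: the standard straight-through Gumbel-sigmoid relaxation behind differentiable edge construction (names and hyperparameters are illustrative, not the BrainEncoder's exact implementation):

```python
import torch

def gumbel_sigmoid(logits, tau=1.0, hard=True):
    """Differentiable soft/hard edge indicators from edge logits.

    Adds Logistic noise (the difference of two Gumbels), squashes it
    through a temperature-controlled sigmoid, and optionally applies
    a straight-through hard threshold at 0.5.
    """
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)          # Logistic(0, 1) sample
    y_soft = torch.sigmoid((logits + noise) / tau)
    if hard:
        y_hard = (y_soft > 0.5).float()
        return y_hard + (y_soft - y_soft.detach())  # straight-through gradient
    return y_soft
```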

Roadmap

Horizon 2: Adaptive Architecture (Phases 46–58) — Complete

| Phase | Goal | Status |
|---|---|---|
| 46 | Learnable per-head temperature | Done — dead heads 83% → 38%; edge/node asymmetry |
| 47 | Layer-specific temperature | Done — B (L0 soft, L1+L2 sharp) = best LP 0.4783 |
| 48 | Asymmetric node/edge temperature | Done — E = LP record 0.4856; node stable, edge drifts UP |
| 49 | L0 temperature + asymmetric L1+L2 | Done — H achieves LP 0.4887 but 3p still below D |
| 50 | Temperature annealing | Done — K breaks the 3p ceiling (0.4148), first to beat D's 0.4018 |
| 51 | Static vs trajectory | Done — the annealing trajectory creates representations static init cannot replicate |
| 52 | Edge sharpness + closing LP gap | Done — Q achieves LP 0.4905 (record); LP/3p trade-off confirmed fundamental |
| 53 | Multi-seed validation | Done — multi-hop claims are seed-dependent; only LP is robust |
| 54 | High-power multi-hop eval | Done — 10k queries confirm evaluation noise dominates; investigation CLOSED |
| 55 | Brain architecture port | Done — BrainEncoder MRR 0.4773, H@10 +3.7% over delta_full |
| 56 | Constructor density ablation | Done — d=0.01 strictly dominates d=0.02 |
| 57 | Brain temperature annealing | Done — baseline (no annealing) optimal; MRR 0.4808–0.4818 |
| 58 | Multi-seed density validation | Done — d=0.01 robust (mean MRR 0.4844 ± 0.0097); d=0.005 fails (−0.017). Density CLOSED. |
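
A minimal sketch of learnable per-head sharpness on attention logits, matching the convention in this table that larger values sharpen (node=2 vs edge=6 in Phase 48); the softplus parameterization is an assumption, not necessarily the project's:

```python
import torch
import torch.nn as nn

class PerHeadSharpness(nn.Module):
    """Scale attention logits by a learned positive factor per head.

    s > 1 sharpens a head's attention, s < 1 softens it; softplus
    keeps s positive while leaving it free to drift during training.
    """
    def __init__(self, num_heads, init=1.0):
        super().__init__()
        # inverse softplus of `init`, so s starts exactly at `init`
        inv = torch.log(torch.expm1(torch.tensor(float(init))))
        self.raw = nn.Parameter(inv * torch.ones(num_heads))

    def forward(self, logits):                  # logits: (batch, heads, q, k)
        s = nn.functional.softplus(self.raw)    # (heads,)
        return logits * s.view(1, -1, 1, 1)
```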

Horizon 3: Sparse Attention & Full Brain Stack (Phases 59–67+) — Active

KG scaling investigation (Phases 59–63) CLOSED. Key finding: attention dilution — not subsampling — is the real scaling bottleneck. At N=5000, each edge attends to ~15M adjacency pairs; signal-to-noise collapses. Phase 63 subsampling ablation confirmed subsampling is a minor confound (+0.007 at best). Non-monotonic response (47.6% > 100% > 71.4% > 23.8%) proves dilution causes the performance ceiling.

Strategic direction: Path C — sparse attention bridges KG completion and Brain activation.

| Phase | Goal | Status |
|---|---|---|
| 59 | Depth scaling at N=2000 | Done — 1-layer surpasses DistMult; 3-layer catastrophically over-smooths |
| 60 | Residual gating for depth | Done — eliminates over-smoothing, but layers contribute nothing |
| 61/61b | DELTA vs DistMult controlled comparison (N=2000) | Done — genuine +0.017 val MRR advantage confirmed |
| 62 | Scale to N=5000 | Done — advantage real but modest (+0.016 test MRR) |
| 63 | E_adj subsampling ablation (N=5000) | Done — attention dilution confirmed as the bottleneck, not subsampling |
| 64 | Top-k sparse edge-to-edge attention | Done — topk=128 matches full softmax (MRR 0.2472); topk=64 −5.6%; topk=256 OOM |
| 65 | Full Brain stack activation | Deferred — 1 epoch ran (816.7 s, healthy), aborted at ~$64/34 hr compute cost. Engineering fixes committed. |
| 66 | 1-hop vs 2-hop edge adjacency ablation | Done — REJECTED on the dense N=500 subgraph (hops=2 vs hops=1 within 1σ). Paper table updated with actual numbers. |
| 67 | Full sparse FB15k-237 hop-depth ablation + NBFNet baselines | Planned — RunPod H100, topk=128 sparse attention, 3 seeds |
| 68 | BetaE 9-query-type benchmark | Planned — standard comparison to published BetaE/GQE results |
| 69+ | Brain N=5000 retry + iterative refinement | Planned — post-paper-submission |

Horizon 4: Dynamic Reasoning — Future

Iterative graph refinement, temporal reasoning, multi-scale construction. See The Brain.

Horizon 5: The Brain

Multi-modal construction, associative memory, compositional generalization. See The Brain.


Publication Pathway

Target: NeurIPS / ICLR — "DELTA: Dual Edge-Linked Transformer Architecture for Compositional Reasoning on Knowledge Graphs"

What We Have

  • [x] Novel architecture with theoretical motivation
  • [x] 66 experiment phases with honest failure documentation
  • [x] Multi-seed statistical validation (Phases 29, 45, 53)
  • [x] Competitive LP on real FB15k-237 (Phases 40, 48, 52)
  • [x] Multi-hop compositional dominance (Phases 42–44)
  • [x] Inference efficiency story (Phase 45)
  • [x] Learnable temperature with mechanistic insight (Phases 46–52)
  • [x] Self-bootstrap result (Phase 39: 157% of FixedChain)
  • [x] Differentiable graph construction via BrainEncoder (Phases 55–57)
  • [x] LP/3p trade-off fully characterized (Phases 46–54)

What We Still Need

  • [ ] Sparse attention at full FB15k-237 scale (mechanism validated at N=5000 in Phase 64) — Gaps 4/5
  • [ ] Cross-family benchmark (YAGO3-10 or WN18RR LP) — Gap 3
  • [ ] Sequence domain evaluation (LRA ListOps) — Gap 8
  • [ ] Interpretability figure (attention heatmap on known reasoning chain)
  • [x] Multi-seed brain_hybrid validation (Phase 58: 3 seeds, mean MRR 0.4844 ± 0.0097)

Paper Structure

  1. Introduction — The three-paradigm gap (Transformers → GNNs → DELTA); edges as first-class computational citizens; self-bootstrapped graph construction
  2. Related Work — Message-passing GNNs (CompGCN, RGCN); Transformers on graphs (GraphGPS, GRIT); KG completion (TransE, RotatE, BetaE)
  3. Architecture — DualParallelAttention; 2-hop edge adjacency; ReconciliationBridge; BrainEncoder (differentiable construction); learnable per-head temperature with edge/node asymmetry
  4. Experiments — Setup (FB15k-237 subset, evaluation protocol); LP results (Phases 40, 48, 52); multi-hop path queries (Phases 42–44, 1p–5p); robustness (Phase 43 DropEdge, Phase 45 multi-seed); temperature analysis (Phases 46–54); brain architecture (Phases 55–57)
  5. Inference Efficiency — Phase 45: per-query faster than GraphGPS
  6. Self-Bootstrap & Brain — Phase 39: 157% of FixedChain; Phases 55–57: differentiable construction matches delta_full with +4.7% H@10
  7. Conclusion — The Brain vision

See The Brain for long-term vision. See Validation Phases for all experiment results. See Key Findings for the 38 consolidated findings.