Skip to content

Validation Phases

All experiment phases with results. Phases 1-30 validated core architecture and fixes. Phases 31-37 scaled to real-world data. Phases 38-40 address graph construction and honest evaluation. Phases 41-45 establish multi-hop dominance. Phases 46-49 optimize attention temperature.


Phase 1–15: Core Architecture Validation

Phase Validates Task Result
1 Edge-to-edge attention discovers relational patterns Edge attention vs node attention vs MLP Edge 100%, Node 26.7%, MLP 100%
2 Dual parallel attention outperforms sequential Sequential vs dual (1/2 layers) All 100%; dual converges 2.7x faster
3 Importance router enables sparse attention Accuracy vs sparsity (100% → 20%) 100% accuracy at 80% sparsity
4 Tiered memory maintains recall Sequential recall task Perfect recall maintained
5 Transformer-bootstrapped graph construction Transformer alone vs transformer→DELTA (150 epochs each) TF: 98.3%, DELTA: 98.3% — pipeline preserves accuracy (non-relational task)
6 Full DELTA integration All components end-to-end All 4 sub-tests PASS
7 Gumbel-softmax differentiable routing Hard top-k vs Gumbel-softmax 12/12 router params get gradients (vs 0/12)
8 Scaling behavior 20 → 400 nodes, time vs accuracy O(n^0.81) sub-linear scaling
9 Multi-hop relational reasoning Knowledge graph with derived relations DELTA: 90.6%, Node GNN: 87.5%, MLP: 37.5%
10 Analogical reasoning "A is to B as C is to ?" Edge classification 100%; analogy retrieval needs contrastive training
11 Multi-hop edge adjacency 1-hop vs 2-hop on transitive relations 2-hop: 100% derived (vs 1-hop 61.1%, Node GNN 83.3%)
12 Gumbel curriculum routing Dense→sparse curriculum vs fixed Gradient flow confirmed (12/12); curriculum needs larger scale
13 Compositional benchmarks Logical rule-derived relations (7 types) DELTA 100% all relations (vs Node GNN 87.5% derived)
14 Contrastive analogy training Classification vs contrastive vs joint All methods 100% retrieval — edge attention is inherently discriminative
15 Synthetic KG benchmark 100 entities, 10 relations, 500 triples Node GNN, DELTA 1-hop, DELTA 2-hop all reach 100%; router at 50% sparsity degrades to 65.3%

Phase 16–21: Architectural Fix Benchmarks

After Phase 15, a pitfall analysis identified 6 architectural weaknesses. Each was fixed and validated:

The Six Fixes

Fix Problem Solution Phase
1 Router scores elements before seeing attention (chicken-and-egg) PostAttentionPruner: prune based on observed attention weights 16
2 Spectral partitioning is O(N³) — won't scale BFS seed-expansion partitioner in O(N+E) 20
3 Fixed linear memory compression loses information Variational bottleneck with KL regularization 18
4 Single averaged edge projection ignores per-layer structure Per-layer edge projections + edge combiner + active edge_type_head 19
5 Dense O(E²) multi-hop adjacency times out at ~500 edges Sparse COO tensor operations 17
6 Uniform dropout doesn't distinguish structural vs noisy edges LearnedAttentionDropout: per-edge dropout conditioned on features 21

Fix Validation Results

Phase Validates Benchmark Result
16 Post-attention soft gating + curriculum KG classification at 50% sparsity Soft gating 100%, curriculum 100%, old router 85.3%, hard post-attn 65.3%. Soft differentiable gating eliminates the original 29% gap and beats pre-attention routing by +14.7%
17 Sparse COO multi-hop scaling 1-hop/2-hop timing at 50→2500 edges O(E^0.97) scaling confirmed. 2500-edge 2-hop in 0.18s (was timeout with dense). All correctness checks pass
18 Variational memory compression Compression quality + downstream accuracy Accuracy preserved at 100% with compression. KL converges 0.126→0.026. Gradient flows through bottleneck. Similarity threshold is learnable
19 Per-layer edge projections Edge type diversity + relational classification Both reach 100% accuracy. Per-layer produces higher edge-type entropy (1.632 vs 1.562) — richer type diversity
20 BFS partition scaling Wall-clock time at 50→2500 nodes O(N^0.99) scaling. 2.0ms→90.8ms. Balance ratio 0.79. Importance-based seed spread: 100%
21 Learned attention dropout Generalization gap: no dropout vs uniform vs learned All modes reach 0 gap on current dataset (saturated). Eval-time passthrough confirmed. Needs harder benchmark

Phase 22–25: Scale & Integration Validation

Phase Validates Benchmark Result
22 Soft gating holds at 10× scale with noise N=1000, 15 relations, 5000 triples, 15% label noise Soft gating 100%, old router 81.6% (+18.4% gap). Zero generalization gap.
23 DELTA vs KG embedding baselines (TransE, RotatE, CompGCN) FB15k-237-like: 2000 entities, 20 typed relations, 8000 triples DELTA 100%, CompGCN 100%, TransE 67.6%, RotatE 70.7%. LP: TransE Hits@10=0.020, RotatE 0.010 (4×/2× random; sparse synthetic data). Soft gating maintains accuracy at 50% sparsity
24 All fixes integrated at scale N=1000, 15% noise, full pipeline + ablations All fixes integrate cleanly — zero degradation. 1-hop ablation runs 10× faster (44s vs 490s)
25 DELTA on real FB15k-237 (GPU) Actual Freebase triples: 2000-entity dense subgraph, 69,626 edges, 210 relation types, RTX 3080 Ti DELTA+Gate 97.6%, CompGCN 97.2%, TransE 78.8%, RotatE 77.8%. LP: TransE Hits@10=0.480, RotatE 0.335 (vs random 0.005). First real-data benchmark on GPU.

Phases 22, 23, and 25 were replicated with 5 seeds each in Phase 29 — see Phase 26–30 table for multi-seed statistics.


Phase 26–30 + 27b: Near-Term Roadmap Validation

Phase Validates Benchmark Result
26 Adaptive multi-hop: learn when to use 1-hop vs 2-hop N=200, 3 models (FixedHop, AdaptiveHop, AdaptiveHopGating) All models 100% at N=200. AdaptiveHopGate learns α→0.019 (suppresses 2-hop). Task saturates — validates cost-efficiency selection architecture.
27 (initial — broken training) Bootstrap on 2-hop path composition N=500, batch=1, 200 epochs, no scheduler TF 32.7%, Bootstrap DELTA 8.0%, Fixed Chain 5.3%. Results confounded by batch-1 training — see Phase 27b.
27b Corrected bootstrap evaluation (gradient accum + 2× data) N=1000, accum=32, 100 epochs, LR scheduler Fixed Chain DELTA 40.7% > Transformer 36.3% > Bootstrap 34.3%. Graph structure helps (+4.4% over Transformer). GraphConstructor attention-thresholding discards path ordering — the constructor is the bottleneck, not graph processing.
28 Hard ablation: find difficulty threshold where DELTA advantages emerge 4 levels (Easy/Medium/Hard/Extreme) × 3 models (Vanilla EdgeAttn, DualAttn, DELTA+Gate) Extreme: Dual Attention 64.2% >> Vanilla 40.2% (+24%). Node context is the key DELTA advantage at high noise. Soft gating adds ±0.6% beyond dual attention at extreme difficulty.
29 Multi-seed statistical credibility Phases 22, 23, 25 re-run with 5 seeds each DELTA+Gate 97.4% ± 0.1% on FB15k-237. Soft Gate 100.0% ± 0.0% vs Old Router 79.7% ± 1.1%. All results statistically stable. Total: 3.6 minutes.
30 GPU edge adj sampling strategy vs random Uniform, degree-weighted, stratified, importance-weighted sampling at 26% budget on FB15k-237 All 4 strategies within ±0.2% (~97.5%). Random sampling is sufficient at this graph density.

Phase 31–34: H100 / RTX PRO 6000 Experiments

Phase Validates Benchmark Result
31 Mini-batching scales to full FB15k-237 Full FB15k-237: 14,505 entities, 304K edges, 20 relations 100% test accuracy, converges by epoch 10. Mini-batch subgraph sampling + gradient accumulation confirmed on H100 80GB.
32 Cross-graph transfer Train FB15k-237, eval WN18RR Source 1.000. Zero-shot: 0.048 (≈ random 0.050) — features are domain-specific. Fine-tuned: 1.000 — pre-training helps with adaptation. Early stopping patience=10 reduces runtime from 24h → 3-4h.
33 Task-aware construction Fixed topology vs augmented with learned edges Fixed 0.347 ≈ Augmented 0.344 (5 seeds × 500 epochs). No improvement — 60-node tasks too small for learned edges to add value.
34 DELTA vs GraphGPS vs GRIT — synthetic Edge classification, noise robustness, path composition (H100) DELTA dominates all three tasks. Edge: DELTA 0.880 vs GraphGPS 0.293 vs GRIT 0.307 (+0.573). Noise@0.8: DELTA 1.000 vs GraphGPS 0.697 vs GRIT 0.710. Path: DELTA 1.000 vs GraphGPS 0.905 vs GRIT 0.893. All 5 seeds × 500 epochs.

Phase 35–37: Colab Pro+ Experiments

Phase 35: frozen encoder achieves 0.961 on WN18RR with 100 samples -- encoder is already domain-invariant, GRL unnecessary. Phase 36: constructor adds <=1.3% at scale -- de-emphasized.

Phase Validates Status Headline Result
35 Domain-agnostic relational transfer (GRL + linear probe) ✅ Complete Frozen encoder → 0.961 on WN18RR with 100 samples. GRL unnecessary — encoder already domain-invariant.
36 Task-aware construction at scale (500–5000 nodes) ✅ Complete Constructor adds ≤1.3%. De-emphasize in paper.
37 Real FB15k-237 parameter-matched comparison (4 models × 5 seeds) ⚠️ Invalidated Leakage audit failed — see Phase 37 Leakage Audit below. Scale validation (310K edges, GPU training, mini-batching) remains valid.

Phase 37: Leakage Audit

Phase 37's reported accuracy numbers (0.991–0.994) were invalidated after a systematic audit identified 5 critical evaluation issues:

Issue Problem Impact
Edge features encode answer relation_prototypes[r_id] + noise(0.1) in edge features — the model sees the label Trivial denoising, not relational reasoning
Wrong evaluation metric Edge classification accuracy instead of link prediction MRR/Hits@K Not comparable to published baselines
Test edges in training graph Test triples included in the message-passing graph Information leakage from test to train
No target masking Model can attend to the edge it's trying to predict Circular prediction
No negatives No corrupted triples for ranking evaluation Can't measure discrimination ability

What's still valid: Mini-batch subgraph sampling, multi-GPU training pipeline, gradient accumulation — the engineering infrastructure works at scale (14,505 entities, 304K edges). These claims are unaffected by the evaluation issues.

Resolution: Phase 40 rebuilds the evaluation pipeline from scratch, fixing all 5 issues with learned embeddings, train-only graphs, filtered MRR/Hits@K, and proper negative sampling.


Phase 38: Differentiable Task-Aware Constructor

(Experiment file: experiments/phase46_differentiable_constructor.py)

Phase 38 addresses the core philosophical tension from Phase 27b/33/36: DELTA's GraphConstructor used non-differentiable hard attention thresholding (attn > 0.1), meaning task loss couldn't influence which edges were created.

Three genuinely differentiable constructor variants using Gumbel-sigmoid edge selection with straight-through estimators. Full 3-seed results:

Variant Accuracy (3 seeds) vs FixedChain
Transformer (control) 0.387 ± 0.031 84%
FixedChain (control) 0.461 ± 0.034 100%
DifferentiableConstructor 0.393 ± 0.017 85%
TaskConditionedConstructor 0.397 ± 0.005 86%
HybridConstructor 0.452 ± 0.006 98%

Key findings:

  • Hybrid is the winner — preserving base sequential topology (gate=1) + learning additional edges reaches 98% of FixedChain with very low variance (±0.006)
  • Pure differentiable and TaskConditioned both plateau at ~85-86% — learning topology from scratch is harder than augmenting known good topology
  • All variants beat Transformer, confirming graph structure adds value
  • Sparsity regularization + temperature annealing (0.5→5.0) control edge density

Phase 39: Self-Bootstrapped DELTA

(Experiment file: experiments/phase46b_self_bootstrapped.py)

The breakthrough phase. Replace the transformer bootstrap with a FixedChain DELTA layer — DELTA constructs its own graph from trivial sequential input, then processes that self-constructed graph with a full DELTA stack. No transformer anywhere in the pipeline.

Full 3-seed results:

Model Accuracy (3 seeds) vs FixedChain
Transformer 0.429 ± 0.021 89%
FixedChain 0.481 ± 0.015 100%
P38_Hybrid 0.459 ± 0.036 95%
SelfBootstrap 0.757 ± 0.041 157%
SelfBootstrapHybrid 0.716 ± 0.038 149%

Why it works: The self-bootstrap DELTA runs a full edge-attention + reconciliation pass before the constructor sees the embeddings. These DELTA-enriched embeddings contain relational information that trivial positional embeddings lack — making edge discovery dramatically more reliable.

Implications:

  • The transformer scaffold is fully removable — DELTA bootstraps DELTA
  • DELTA's own pass enriches features more than any external bootstrap (+76% over Transformer)
  • This validates the path toward The Brain: self-constructing relational reasoning without external scaffolding

(Experiment file: experiments/phase46c_link_prediction.py)

Phase 40 rebuilds the entire evaluation pipeline to fix all 5 issues from the Phase 37 leakage audit:

Fix Implementation
Edge features nn.Embedding for entities and relations (no label information)
Evaluation metric Filtered MRR, Hits@1, Hits@3, Hits@10
Graph separation Train-only graph for message passing
Target masking Self-bootstrap encoder excludes prediction target
Negative sampling 1-vs-all BCE loss with label smoothing (0.1)

7 models tested: delta_full, delta_matched, graphgps, grit, distmult, self_bootstrap (1 bootstrap + 2 DELTA layers), self_bootstrap_hybrid (1 bootstrap + 3 DELTA layers).

Final results — 500 epochs, best val checkpoint (test MRR):

Model Params MRR Hits@1 Hits@3 Hits@10 Peak Epoch
GraphGPS 228K 0.5126 0.3745 0.5813 0.8128 200
SelfBootstrapHybrid 381K 0.5089 0.3632 0.5874 0.8158 250
DELTA-Matched 158K 0.4950 0.3457 0.5720 0.8035 200
DELTA-Full 293K 0.4938 0.3549 0.5586 0.7922 200
SelfBootstrap 299K 0.4891 0.3385 0.5617 0.7912 200
DistMult 47K 0.4841 0.3457 0.5484 0.7634 500+
GRIT 197K 0.4390 0.2953 0.4959 0.7603 200

Key findings:

  • SelfBootstrapHybrid vs GraphGPS: Only 0.004 MRR behind (-0.7%), and actually beats GraphGPS on Hits@10 (0.8158 vs 0.8128). The self-bootstrap variant is the most competitive DELTA model on real data.
  • All models overfit after ~200 epochs — GraphGPS, DELTA variants, and GRIT all peaked around epoch 200 and declined. The patience mechanism correctly selected best-val checkpoints.
  • SelfBootstrapHybrid peaked later (epoch 250) — slightly more robust to overfitting, consistent with the bootstrap pass adding regularization.
  • DistMult still climbing at epoch 500 — pure embedding method with no encoder continues improving; not yet converged.
  • DELTA-Matched efficiency: 0.4950 MRR with only 158K params vs GraphGPS at 0.5126 with 228K params — 69% of the parameters, 97% of the MRR.

Reference: published full FB15k-237 results

DistMult: MRR 0.241 · CompGCN: MRR 0.355 · RotatE: MRR 0.338. All Phase 40 models exceed published DistMult and CompGCN by large margins on the top-500 degree subset.


Phase 41: Generalization Gap Investigation

(Experiment file: experiments/phase41_generalization_gap.py)

Motivation: Phase 40 showed a consistent val/test gap for GRIT and DELTA variants — val MRR peaks ~5-10 points higher than test MRR. Phase 41 asked: does weight decay close this gap?

Sweep: delta_matched vs graphgps vs self_bootstrap_hybrid, weight decay ∈ {1e-4, 1e-3, 1e-2, 1e-1}, 500 epochs, 1 seed each.

Result: Negative — weight decay does not close the generalization gap. The gap is attributable to val-set noise (390 val vs 486 test triples; small splits cause volatile MRR estimates, not true overfitting). Peak val MRR varies ±0.01-0.02 with no consistent test improvement across any WD value.

Key finding: The gap is a measurement artifact of the small val split, not an overfitting signal. This motivates using the test checkpoint rather than best-val checkpoint for final reporting in future phases. Phase 40 results stand unchanged.


Phase 42: Multi-hop Path Query Evaluation

(Experiment file: experiments/phase42_multihop.py)

Motivation: Standard link prediction (1p) only evaluates immediate tail prediction. Multi-hop path queries test whether models can compose relational knowledge across multiple steps — the core capability DELTA's edge-to-edge attention is designed for.

Query types:

Type Description Count
1p Standard LP — direct tail prediction from test triples 486
2p Chain: (anchor, r₁, mid) ∈ TRAIN → (mid, r₂, answer) ∈ TEST 5,764
3p Chain: (anchor, r₁, m₁) ∈ TRAIN → (m₁, r₂, m₂) ∈ TRAIN → (m₂, r₃, answer) ∈ TEST 10,000

Leak-free construction: The script generates queries by chaining train edges to test edges, excluding 1-hop shortcuts (anchor→answer shortcut in TRAIN), deduplicating, and verifying via a built-in audit_queries() function. All 15,764 queries passed the leakage audit before any model was trained.

Scoring: Soft entity traversal — at each intermediate hop, score all entities via the DistMult decoder, apply softmax with temperature τ=1.0, take the weighted average entity embedding, pass to the next hop. Final hop scores are used for filtered ranking.

Filtered evaluation: Valid answers are computed from the full graph (train+val+test) for each query, following standard LP filtered-MRR protocol.

Results — Fast Models (seed=1)

Standard LP sanity check (matches Phase 40 exactly):

Model Test MRR Test H@10
DistMult 0.4841 0.7634
GraphGPS 0.5126 0.8128
GRIT 0.4390 0.7603

Results — DELTA Models (seed=1)

Standard LP sanity check:

Model Test MRR Test H@10
DELTA-Matched 0.4967 0.8025
DELTA-Full 0.4921 0.7942
SelfBootstrap 0.4872 0.7922
SelfBootstrapHybrid 0.5097 0.8169

Complete Multi-hop Results (all 7 models, seed=1)

Model Params 1p MRR 2p MRR 3p MRR 3p H@10 3p–1p Δ
DELTA-Matched 158K 0.5327 0.7332 0.7378 0.8731 +0.205
SelfBootstrapHybrid 381K 0.5412 0.7215 0.6946 0.8329 +0.154
GraphGPS 228K 0.5488 0.7180 0.6970 0.8498 +0.148
DELTA-Full 293K 0.5235 0.7108 0.6916 0.8361 +0.168
SelfBootstrap 299K 0.5123 0.7121 0.6861 0.8291 +0.174
GRIT 197K 0.4604 0.7122 0.6438 0.7834 +0.184
DistMult 47K 0.5315 0.7153 0.5657 0.7095 +0.034

Degradation analysis (MRR trajectory: 1p → 2p → 3p):

Model 1p 2p 3p 2p→3p trend
DELTA-Matched 0.533 0.733 0.738 +0.005 (improves)
SelfBootstrapHybrid 0.541 0.722 0.695 −0.027
GraphGPS 0.549 0.718 0.697 −0.021
DELTA-Full 0.524 0.711 0.692 −0.020
SelfBootstrap 0.512 0.712 0.686 −0.026
GRIT 0.460 0.712 0.644 −0.068
DistMult 0.532 0.715 0.566 −0.150

Key Findings

  1. DELTA-Matched is the 3p champion — 0.7378 MRR on 3-hop compositional queries, beating GraphGPS (0.6970) by +0.041 and every other model. With only 158K parameters (69% of GraphGPS), it dominates the task DELTA was designed for.

  2. DELTA-Matched is the ONLY model that improves from 2p→3p — MRR rises from 0.7332 to 0.7378. Every other model degrades as path length increases. This is the architectural thesis: 2-hop edge adjacency and dual attention compose without information loss.

  3. 2p MRR > 1p MRR for all models — 2-hop chains are "easier" because the relation pair constrains the answer space more tightly than a single relation.

  4. GNN advantage scales dramatically with hop count vs DistMult — At 1p, GraphGPS leads DistMult by only +0.017 MRR. At 3p, the gap grows to +0.131. Embedding baselines collapse on compositional reasoning.

  5. Capacity hurts multi-hop — DELTA-Matched (158K) beats DELTA-Full (293K) at 3p by +0.046. Larger models overfit to 1-hop link statistics; constrained capacity forces learning of more generalizable relational representations.

  6. Self-bootstrap pays a multi-hop tax — SelfBootstrap (0.686 3p) and SelfBootstrapHybrid (0.695 3p) trail DELTA-Matched (0.738) by 0.04–0.05. The bootstrap graph construction pass adds overhead without improving compositional reasoning on this dataset. However, SelfBootstrapHybrid has the best 1p MRR (0.541) — the hybrid uses learned construction where it helps (local predictions) and preserves enough structure for multi-hop.

  7. The paper reframing — DELTA trails GraphGPS by −0.018 on standard LP (1p). But at 3p, DELTA-Matched leads by +0.041. The correct evaluation for edge-first architectures is compositional reasoning, not memorization of local links.


Phase 43: DropEdge Regularization

Script: experiments/phase43_regularization.py

Question: Does DropEdge (random edge masking during training) improve multi-hop performance? Does it differentially help DELTA vs GraphGPS?

Protocol: 5 drop rates (0%, 10%, 20%, 30%, 40%) × 2 models (delta_matched, graphgps). DropEdge masks random edges in GNN input per batch; evaluation uses full graph. Same multi-hop queries as Phase 42 (16,250 total, leak-free).

Standard LP Results

Model Drop Peak Ep val_MRR test_MRR test_H@10
delta_matched 0% 125 0.5270 0.5086 0.8210
delta_matched 10% 125 0.5225 0.5081 0.8107
delta_matched 20% 225 0.5242 0.4840 0.7953
delta_matched 30% 225 0.5319 0.4819 0.7984
delta_matched 40% 125 0.5360 0.5139 0.7984
graphgps 0% 150 0.5243 0.5108 0.8241
graphgps 10% 150 0.5142 0.5008 0.8210
graphgps 20% 150 0.5195 0.5110 0.8128
graphgps 30% 150 0.5070 0.5072 0.8272
graphgps 40% 150 0.5122 0.4765 0.8117

Multi-hop Results (MRR)

Model Drop 1p 2p 3p 2p→3p Δ
delta_matched 0% 0.5397 0.7349 0.7403 +0.005
delta_matched 10% 0.5418 0.7401 0.7441 +0.004
delta_matched 20% 0.5155 0.7361 0.7235 −0.013
delta_matched 30% 0.5187 0.7367 0.7324 −0.004
delta_matched 40% 0.5484 0.7385 0.7443 +0.006
graphgps 0% 0.5260 0.7253 0.7113 −0.014
graphgps 10% 0.5326 0.7276 0.7155 −0.012
graphgps 20% 0.5419 0.7318 0.7227 −0.009
graphgps 30% 0.5285 0.7336 0.7202 −0.013
graphgps 40% 0.5096 0.7254 0.7249 −0.001

Key Findings

  1. Phase 43 is a robustness check on Phase 42. DELTA-Matched beats GraphGPS on 3p at every single drop rate (advantage ranges from +0.001 to +0.029). The multi-hop advantage is not a lucky hyperparameter choice — it's structural.

  2. DELTA leads on 2p at all 5 rates too. The pattern is consistent: DELTA's architectural bias toward structural reasoning gives it an advantage that regularization can't erase.

  3. GraphGPS benefits more from DropEdge in absolute terms (+0.014 on 3p) but starts lower and stays lower. Regularization helps both models avoid overfitting to local edge patterns, but can't close the architectural gap.

  4. Recommended headline configuration: DELTA-Matched @10% DropEdge — most consistent across all three query depths (1p: 0.542, 2p: 0.740, 3p: 0.744). Not the absolute best on any single metric, but the strongest claim for compositional depth.

  5. Honest limitation: 35× training cost. DELTA-Matched trains in ~3,600s vs GraphGPS ~106s, dominated by 2-hop edge adjacency on 9,703 triples. Inference time measurement needed to separate training cost from deployment cost.


Phase 44: Extended Multi-hop Depth (1p–5p)

Script: experiments/phase44_depth.py

Question: Does DELTA's compositional advantage extend to 4-hop and 5-hop chain queries? Does the gap grow or shrink with depth?

Protocol: 3 models (delta_matched, graphgps, distmult) evaluated on 5 query depths (1p–5p). 35,868 total queries (1p=486, 2p=5382, 3p=10000, 4p=10000, 5p=10000). Recursive chain builder for 4p/5p with strict leakage prevention. All queries verified leak-free.

Results by Depth

Depth n DELTA-Matched GraphGPS DistMult
1p 486 0.5413 0.5227 0.4936
2p 5,382 0.7578 0.7540 0.7280
3p 10,000 0.7531 0.7270 0.5830
4p 10,000 0.7665 0.7008 0.5112
5p 10,000 0.7896 0.6899 0.4567

MRR Trajectory (Δ between depths)

Model 2p→3p 3p→4p 4p→5p 2p→5p total
DELTA-Matched −0.005 +0.013 +0.023 +0.032
GraphGPS −0.027 −0.026 −0.011 −0.064
DistMult −0.145 −0.072 −0.055 −0.271

H@10 by Depth

Depth DELTA-Matched GraphGPS DistMult
1p 0.8539 0.8477 0.8004
2p 0.8894 0.8969 0.8575
3p 0.8810 0.8589 0.7443
4p 0.8849 0.8177 0.6719
5p 0.8953 0.8260 0.5866

Standard LP Sanity Check

Model test_MRR test_H@10 Peak Ep
delta_matched 0.5088 0.8220 125
graphgps 0.5085 0.8241 150
distmult 0.4651 0.7551 375

Key Findings

  1. DELTA-Matched is the only model that improves with reasoning depth. MRR rises from 3p→4p→5p (0.753→0.767→0.790). At 5p, DELTA's MRR (0.790) exceeds its own 2p (0.758). GraphGPS and DistMult degrade monotonically from 2p onward.

  2. The advantage accelerates. DELTA's lead over GraphGPS: +0.004 (2p) → +0.026 (3p) → +0.066 (4p) → +0.100 (5p). By 5p, the gap is 25× what it was at 2p.

  3. Degradation is proportional to structural capacity. GraphGPS (node attention) loses −0.064 from 2p→5p. DistMult (no structure) loses −0.271. DELTA (edge-first dual attention) gains +0.032. Edge adjacency enables cumulative compositional reasoning.

  4. H@10 tells the same story. DELTA maintains H@10 ≥ 0.88 at every depth. GraphGPS drops from 0.897 (2p) to 0.826 (5p). DistMult collapses to 0.587 (5p).


Phase 45: Inference Timing + Multi-seed Headline

Script: experiments/phase45_inference_timing.py

Question: Does the multi-hop advantage hold across seeds? Is DELTA's inference cost prohibitive for deployment?

Protocol: 2 configs (DELTA-Matched @10% DropEdge, GraphGPS @0% DropEdge) × 3 seeds. Same multi-hop queries as Phase 42 (16,250 total, leak-free). Inference timing: 3 warmup + 10 timed runs with CUDA synchronization. Measures encoding (GNN forward) and per-query scoring separately.

Multi-hop MRR (mean ± std, 3 seeds)

Config Params 1p MRR 2p MRR 3p MRR 2p→3p
DELTA-Matched @10% 157,696 0.5428±0.0057 0.7302±0.0112 0.7423±0.0086 +0.012
GraphGPS @0% 228,419 0.5287±0.0087 0.7273±0.0065 0.7128±0.0074 −0.014

Per-seed Breakdown

Config Seed 1p 2p 3p
DELTA 1 0.5429 0.7398 0.7460
DELTA 2 0.5357 0.7146 0.7305
DELTA 3 0.5497 0.7363 0.7504
GraphGPS 1 0.5237 0.7250 0.7042
GraphGPS 2 0.5215 0.7360 0.7121
GraphGPS 3 0.5410 0.7207 0.7222

Standard LP (mean ± std)

Config test MRR test H@10 Train time Peak epoch
DELTA-Matched @10% 0.4992±0.0076 0.7973±0.0196 3,782±81s 125
GraphGPS @0% 0.5005±0.0067 0.8141±0.0088 110±6s 167

Inference Timing

Metric DELTA-Matched GraphGPS Ratio
Encoding 454.08 ms 8.76 ms 51.8×
1p per-query 777.7 μs 921.9 μs 0.8×
2p per-query 1,380.4 μs 1,475.9 μs 0.9×
3p per-query 1,250.7 μs 1,371.2 μs 0.9×
Training 3,782 s 110 s 34.2×

Key Findings

  1. Multi-seed confirms the advantage. DELTA 3p MRR 0.742±0.009 vs GraphGPS 0.713±0.007. Std bars don't overlap. DELTA's worst seed (0.731) beats GraphGPS's best (0.722).

  2. Inference timing inverts the cost narrative. Encoding is 51.8× slower (2-hop edge adjacency), but per-query scoring is 0.8-0.9× GraphGPS — DELTA is faster per query. Encoding happens once; queries are scored many times. Deployment cost is comparable or better.

  3. Training cost is the honest limitation. 34.2× (3,782s vs 110s). This is a one-time cost dominated by edge adjacency computation, not a fundamental algorithmic limitation.

  4. Both models peak early and overtrain. DELTA peaks at ep 125 (all 3 seeds), GraphGPS at ep 150-200. Early stopping is critical for both.



See Key Findings for curated insights. See Status & Roadmap for current priorities. See The Brain for long-term vision.

Phase Experiment Status
41 Generalization gap investigation — weight decay sweep ✅ Complete — negative result (val-set noise, not overfitting)
42 Multi-hop path queries (1p/2p/3p) ✅ Complete — DELTA-Matched 3p MRR 0.738 beats GraphGPS (0.697) by +0.041
43 DropEdge robustness check ✅ Complete — DELTA leads on 3p at all 5 drop rates; advantage is structural
44 Extended multi-hop depth (4p/5p compositional queries) ✅ Complete — DELTA improves with depth (MRR 0.753→0.767→0.790); advantage over GraphGPS grows to +0.100 at 5p
45 Inference timing + multi-seed headline ✅ Complete — 3-seed: DELTA 0.742±0.009 vs GraphGPS 0.713±0.007; per-query inference 0.8-0.9× GraphGPS
46 Attention sharpening via learnable temperature (fix dead heads) ✅ Complete — dead heads 83%→38%, edge/node temp divergence discovered
47 Layer-specific temperature initialization ✅ Complete — B (L0=1, L1+L2=4) best LP MRR 0.4783; node attention needs sharpening to activate
48 Asymmetric node/edge temperature ✅ Complete — E (node=2, edge=6) LP MRR 0.4856 (+1.5%); LP/3p trade-off persists
49 L0 temperature + asymmetric L1+L2 ✅ Complete — H LP MRR 0.4887 (new record); L0 temp doesn't explain D's 3p; trade-off fundamental
50 Temperature annealing (node temp schedule) ✅ Complete — K (anneal 4→2 fast) 3p MRR 0.4148 (FIRST to beat D's 0.4018); LP=0.4819 misses target; Pareto frontier shifted
51 Static vs trajectory temperature ✅ Complete — P (anneal 4→2.5) new LP record 0.4890; trajectory confirmed (+0.015 3p bonus); N best-ever 4p/5p

Phase 46: Attention Sharpening via Learnable Temperature

Script: experiments/phase46_capacity_signal.py

Root cause (diagnostic): DELTA's attention weights are near-uniform after training. With d_head=12 (delta_matched: 48/4) and average degree ~40, softmax over 40 scores at std≈1.0 gives normalized entropy ≈ 0.87 — mathematically near-uniform. The model succeeds at LP/multi-hop entirely through entity embeddings + DistMult, bypassing attention via residual connections. Phase 45's 100% dead heads across all conditions confirm this.

Fix: Learnable per-head temperature (multiplier on attention scores before softmax). Stored as log-space parameter: temp = exp(_log_temp), always positive. Default init_temp=1.0 preserves backward compatibility. Higher init_temp → sharper attention.

Hypothesis: Starting with init_temp=4.0, dead-head fraction will drop from ~100% (temp=1.0 control) to < 50% for delta_matched. Learned per-head temperatures at convergence will be > 2.0 for ≥ 50% of heads, confirming the model benefits from sharp attention. Multi-hop 3p MRR ≥ 0.35 (regression safety).

Design: 4 conditions — delta_matched × {temp=1.0, temp=4.0} and delta_full × {temp=1.0, temp=4.0}. Same LP training pipeline as Phases 42–45. 500 epochs, early stopping patience=10.

Standard LP Results

Condition Params Best Val MRR Test MRR Test H@10 Peak Ep
delta_matched temp=1.0 158K 0.5519 0.5095 0.8179 150
delta_matched temp=4.0 158K 0.5395 0.4922 0.7984 425
delta_full temp=1.0 293K 0.5030 0.4744 0.7860 175
delta_full temp=4.0 293K 0.5106 0.4729 0.7901 450

Multi-hop Results (MRR)

Condition 1p 2p 3p 4p 5p
delta_matched temp=1.0 0.3161 0.2631 0.3855 0.3151 0.3710
delta_matched temp=4.0 0.3024 0.2511 0.3685 0.2965 0.3484
delta_full temp=1.0 0.2821 0.2435 0.3725 0.2823 0.3388
delta_full temp=4.0 0.2673 0.2444 0.4018 0.1598 0.2034

Attention Health

Condition Dead Heads Mean Entropy Entropy Range
delta_matched temp=1.0 11/16 (69%) 0.948 0.704–0.999
delta_matched temp=4.0 8/16 (50%) 0.848 0.457–0.997
delta_full temp=1.0 20/24 (83%) 0.966 0.779–0.999
delta_full temp=4.0 9/24 (38%) 0.757 0.233–0.997

Learned Temperature Evolution

Condition Head Type Init Final (mean) Trend
delta_matched temp=4.0 Edge (L1) 4.0 5.00 ↑ drifting UP
delta_matched temp=4.0 Node (L1) 4.0 3.99 ↓ drifting DOWN
delta_full temp=4.0 Edge (L2) 4.0 4.42 ↑ drifting UP fastest
delta_full temp=4.0 Node (L2) 4.0 3.98 ↓ drifting DOWN

Key Findings

  1. Temperature sharpening activates DELTA-Full's excess capacity. DELTA-Full temp=4.0 achieves 3p MRR 0.4018 (+0.029 over temp=1.0), with dead heads dropping from 83% to 38%. Layer 2 — which was 100% dead at temp=1.0 — comes fully alive (0% dead from epoch 100 onward).

  2. Temperature hurts DELTA-Matched. 3p MRR drops from 0.3855 to 0.3685 (−0.017) with temp=4.0. The smaller model's residual bypass was already optimal — forcing sharp attention disrupts what worked.

  3. Edge attention wants sharpness; node attention wants averaging. Learned temperatures diverge: edge temps drift UP from init (wants sharper), node temps drift DOWN (prefers softer). This is the most mechanistically informative result — the model discovers the distinction automatically via gradient descent.

  4. Layer 0 always stays dead regardless of temperature. First-layer attention performs neighborhood averaging, not selective routing. This is architecturally correct — initial embeddings carry no relational information worth selecting over.

  5. All heads retain temp > 2.0 at convergence. 16/16 (delta_matched) and 24/24 (delta_full) heads stay above 2.0, confirming the model benefits from maintaining sharp attention rather than collapsing temperatures back to 1.0.

  6. Cross-depth cosine similarity = 1.0 everywhere. Encoding is completely query-independent — the DistMult decoder and entity embeddings handle relational composition, not depth-dependent routing through the GNN layers.

Hypothesis Evaluation

Hypothesis Result Evidence
Dead heads < 50% for delta_matched PARTIAL 69%→50% at temp=4.0, missed <50% by 1 head
All heads retain temp > 2.0 CONFIRMED 16/16 and 24/24 above 2.0
3p MRR ≥ 0.35 (regression safety) CONFIRMED All conditions above 0.35
Temperature helps delta_full on 3p CONFIRMED +0.029 improvement (0.3725→0.4018)

Phase 47: Layer-Specific Temperature Initialization

Script: experiments/phase47_layer_specific_temp.py

Question: Can selective temperature sharpening (layer-specific or edge-only) achieve DELTA-Full temp=4.0's multi-hop gains without LP degradation? Where exactly does temperature matter?

Protocol: DELTA-Full (3 layers, 293K params) on FB15k-237 subset (494 entities). 500 epochs, eval every 25, patience 10. Phase 46 A & D hardcoded as reference. Two new conditions:

  • B (Layer-Sharp): L0 temp=1.0 (soft), L1+L2 temp=4.0 (sharp) — all attention types
  • C (Edge-Sharp): Node temp=1.0 (soft), Edge temp=4.0 (sharp) — all layers
Condition Config LP MRR LP H@10 Best Val MRR
A (Phase 46) All temp=1.0 0.4744 0.7860 0.5030
D (Phase 46) All temp=4.0 0.4729 0.7901 0.5106
B Layer-Sharp L0=1.0, L1+L2=4.0 0.4783 0.7757 0.5075
C Edge-Sharp node=1.0, edge=4.0 0.4745 0.7870 0.5026

Multi-hop Results (MRR)

Depth A (temp=1.0) D (temp=4.0) B (Layer-Sharp) C (Edge-Sharp)
1p — — 0.2713 0.2547
2p — — 0.2448 0.2458
3p 0.3725 0.4018 0.3908 0.3678
4p — — 0.1477 0.1374
5p — — 0.1768 0.1485

Attention Health (Final Model)

Layer.Type A Dead% D Dead% B Dead% C Dead%
L0.node 100% 100% 100% 100%
L0.edge 100% 100% 100% 100%
L1.node 100% 0% 0% 100%
L1.edge 100% 25% 25% 0%
L2.node 75% 0% 0% 50%
L2.edge 75% 0% 0% 0%
Total 20/24 (83%) 9/24 (38%) 9/24 (38%) 14/24 (58%)

Learned Temperature Evolution (B Layer-Sharp)

Epoch val_MRR L0_edge L0_node L1_edge L1_node L2_edge L2_node
25 0.0097 1.005 1.006 4.028 4.014 4.028 3.992
175 0.5075 1.068 1.040 4.187 4.042 4.415 3.980
425 0.4306 1.110 1.051 4.150 3.685 4.520 3.583

Learned Temperature Evolution (C Edge-Sharp)

Epoch val_MRR L0_edge L0_node L1_edge L1_node L2_edge L2_node
25 0.0085 4.017 1.007 4.033 1.001 4.033 0.999
200 0.5026 4.464 1.031 4.273 1.012 4.491 1.030
450 0.4272 4.496 1.053 4.143 1.001 4.570 1.004

Key Findings

  1. B is the new LP MRR champion (0.4783) — selective sharpening at L1+L2 beats both uniform temp=1.0 and uniform temp=4.0
  2. Node attention requires explicit sharpening — C (edge-only sharp) keeps L1 node heads 100% dead; B (all sharp at L1+L2) activates all node heads
  3. Layer 0 is structurally dead regardless of temperature — confirmed across all 4 conditions
  4. B achieves D-level dead head reduction (both 38%) while improving LP MRR
  5. The 3p gap remains: B's 3p (0.3908) is better than A (+0.018) but doesn't match D (0.4018, gap=0.011)
  6. Node temps drift DOWN from 4.0→3.5-3.7 in B — suggests optimal node temp is between 2-3, not 4.0
  7. Edge temps drift UP in both B and C (toward 4.4-4.5) — edge attention consistently wants more sharpness

Hypothesis Evaluation

Hypothesis Result Evidence
B or C matches D's 3p (≥0.4018) PARTIAL B: 0.3908, C: 0.3678 — closer but not matching
B or C maintains LP MRR (≥0.4744) CONFIRMED B: 0.4783 (best), C: 0.4745
Regression safety (3p ≥ 0.35) CONFIRMED Both above 0.35
Node attention can activate via edge sharpening alone REJECTED C keeps L1 node 100% dead

Phase 48: Asymmetric Node/Edge Temperature

Script: experiments/phase48_asymmetric_temp.py

Question: Phase 47 showed node temps drift DOWN (4.0→3.5) and edge temps drift UP (→4.5). Can initializing node and edge temperatures SEPARATELY — following each type's learned drift direction — achieve both B's LP MRR (0.4783) and D's 3p MRR (0.4018)?

Protocol: DELTA-Full (3 layers, 293K params) on FB15k-237 subset (494 entities). 500 epochs, eval every 25, patience 10. L0 always (1.0, 1.0). Three new conditions:

  • E (node=2, edge=6): Lower node temp, much higher edge temp — extreme asymmetry
  • F (node=3, edge=5): Moderate node, sharp edge — bracketing expected optimum
  • G (node=2.5, edge=5): Midpoint between E and F
Condition Config LP MRR LP H@10 Best Val MRR
A (Phase 46) All temp=1.0 0.4744 0.7860 0.5030
B (Phase 47) L0=1.0, L1+L2=4.0 0.4783 0.7757 0.5075
D (Phase 46) All temp=4.0 0.4729 0.7901 0.5106
E node2_edge6 L0=(1,1), L1+L2=(2,6) 0.4856 0.8004 0.4889
F node3_edge5 L0=(1,1), L1+L2=(3,5) 0.4837 0.8014 0.5113
G node2.5_edge5 L0=(1,1), L1+L2=(2.5,5) 0.4699 0.7881 0.4930

Multi-hop Results (MRR)

Depth A B D E F G
3p 0.3725 0.3908 0.4018 0.3872 0.3750 0.3970

Attention Health (Final Model)

Layer.Type E Dead% F Dead% G Dead%
L0.node 100% 100% 100%
L0.edge 100% 100% 100%
L1.node 0% 0% 0%
L1.edge 0% 0% 0%
L2.node 25% 0% 0%
L2.edge 0% 0% 0%
Total 9/24 (38%) 8/24 (33%) 8/24 (33%)

Learned Temperature Evolution (E node=2, edge=6)

Epoch val_MRR L1_edge L1_node L2_edge L2_node
25 0.0089 6.020 1.998 6.013 1.997
200 0.4889 6.225 2.005 6.494 1.991
450 0.4217 6.225 1.987 6.494 1.991

Learned Temperature Evolution (F node=3, edge=5)

Epoch val_MRR L1_edge L1_node L2_edge L2_node
25 0.0097 5.016 2.997 5.011 2.994
200 0.5113 5.235 2.974 5.420 2.995
450 0.4293 5.235 2.974 5.420 2.995

Key Findings

  1. E is the new LP MRR champion (0.4856, +1.5% over B's 0.4783) — asymmetric temperature following drift direction works
  2. F achieves highest-ever validation MRR (0.5113) and H@10 (0.8014) — node=3 is better for generalization, but E is better on test
  3. Node temps are "set and forget" — they stay within ±0.01 of initialization across all conditions and all training
  4. Edge temps always drift UP, and L2 drifts more than L1 (E: L1 6.0→6.27, L2 6.0→6.68) — deeper layers need more edge sharpness
  5. LP/3p trade-off persists: E (best LP 0.4856) has moderate 3p (0.3872); G (best 3p 0.3970) has lowest LP (0.4699)
  6. D's 3p advantage (0.4018) remains unmatched — all P48 conditions had L0=1.0 while D has L0=4.0. L0 temperature may explain the 3p gap
  7. All conditions peaked at epoch 200 and early-stopped at epoch 450 — consistent dynamics

Hypothesis Evaluation

Hypothesis Result Evidence
E or F achieves LP ≥ 0.4783 (Phase 47 B) CONFIRMED E: 0.4856, F: 0.4837
E or F achieves 3p ≥ 0.4018 (Phase 46 D) FAILED Best: G 0.3970 (−0.005)
Combined LP+3p improvement over all prior PARTIAL LP improved significantly, 3p gap persists
Regression safety (3p ≥ 0.35) CONFIRMED All above 0.35

Phase 49: L0 Temperature + Asymmetric L1+L2

Script: experiments/phase49_l0_temp.py

Question: Phase 48 showed D (all temp=4.0) is the only condition beating 3p MRR 0.40, and all P48 conditions had L0=1.0 while D has L0=4.0. Does adding L0=4.0 to E's asymmetric L1+L2 temperatures break the LP/3p trade-off?

Protocol: DELTA-Full (3 layers, 293K params) on FB15k-237 subset (494 entities). 500 epochs, eval every 25, patience 10. A, D, E hardcoded as reference from Phases 46/48. Three new conditions:

  • H (L0=4,4 + E's asymmetric): L0(node=4.0, edge=4.0), L1+L2(node=2.0, edge=6.0)
  • I (L0=4,4 + F's asymmetric): L0(node=4.0, edge=4.0), L1+L2(node=3.0, edge=5.0)
  • J (L0=2,4 + E's asymmetric): L0(node=2.0, edge=4.0), L1+L2(node=2.0, edge=6.0)
Condition Config LP MRR LP H@10 Best Val MRR
A (Phase 46) All temp=1.0 0.4744 0.7860 0.5030
D (Phase 46) All temp=4.0 0.4729 0.7901 0.5106
E (Phase 48) L0=(1,1), L1+L2=(2,6) 0.4856 0.8004 0.4889
H L0=(4,4), L1+L2=(2,6) 0.4887 0.7973 0.4925
I L0=(4,4), L1+L2=(3,5) 0.4836 0.8076 0.5071
J L0=(2,4), L1+L2=(2,6) 0.4872 0.8004 0.4903

Multi-hop Results (MRR)

Depth A D E H I J
1p — — — 0.2710 0.2665 0.2665
2p — — — 0.2555 0.2487 0.2560
3p 0.3725 0.4018 0.3872 0.3930 0.3783 0.3911
4p — — — 0.3333 0.2191 0.3323
5p — — — 0.3517 0.2312 0.3536

Attention Health (Final Model)

Layer.Type H Dead% I Dead% J Dead%
L0.node 100% 100% 100%
L0.edge 100% 100% 100%
L1.node 0% 0% 0%
L1.edge 0% 0% 0%
L2.node 25% 0% 25%
L2.edge 0% 0% 0%
Total 9/24 (38%) 8/24 (33%) 9/24 (38%)

Learned Temperature — Final Values

Condition L0_edge L0_node L1_edge L1_node L2_edge L2_node
H 4.447 4.039 6.266 2.004 6.660 2.014
I 4.409 4.076 5.260 3.007 5.631 2.991
J 4.447 2.048 6.262 2.005 6.666 2.013

Key Findings

  1. H is the new LP MRR champion (0.4887, +0.003 over E's 0.4856) — L0=4.0 adds a small LP boost on top of E's asymmetric L1+L2
  2. L0 temperature does NOT explain D's 3p advantage — H (L0=4,4 + E's asymmetry) achieves 3p=0.3930, still below D's 0.4018. The LP/3p trade-off is fundamental to asymmetric temperature initialization.
  3. I achieves best-ever H@10 (0.8076, +0.006 over F's 0.8014) but worst 3p of the three (0.3783) — F's L1+L2 config is LP/H@10-biased
  4. L0 node temperature is irrelevant — H (init=4.0, final=4.039) and J (init=2.0, final=2.048) produce near-identical L1+L2 temperatures and performance (LP 0.4887 vs 0.4872, 3p 0.3930 vs 0.3911). L0 is 100% dead so gradient is zero.
  5. H and J converge to nearly identical L1+L2 temps despite different L0 node inits — confirming L0 node is a dead parameter
  6. D's 3p=0.4018 is unique to uniform temp=4.0 initialization — 4 phases of targeted temperature experiments (46-49) have tested 10+ configurations and none replicate D's 3p advantage
  7. All conditions peaked at epoch 200 and early-stopped at epoch 450 — consistent with P48 dynamics

Hypothesis Evaluation

Hypothesis Result Evidence
H achieves LP ≥ 0.4856 (E's record) CONFIRMED H: 0.4887 (new best)
H achieves 3p ≥ 0.4018 (D's record) FAILED H: 0.3930 (−0.009)
Combined LP+3p trade-off broken PARTIAL LP record broken, 3p gap persists
L0 node temperature matters REJECTED H vs J: identical L1+L2 convergence despite L0_node 4.0 vs 2.0

Phase 50: Temperature Annealing (Node Temp Schedule)

Script: experiments/phase50_temp_anneal.py

Question: D's 3p MRR (0.4018) is unique to uniform temp=4.0, and no static asymmetric config replicates it (Phases 46-49). Does D's advantage come from the training trajectory — early epochs with high node temp — rather than final temperatures? Can annealing from 4.0→2.0 capture D's 3p while converging to E/H's LP-optimal asymmetry?

Protocol: DELTA-Full (3 layers, 293K params) on FB15k-237 subset (494 entities). 500 epochs, eval every 25, patience 10. All conditions start with L0=(4.0,4.0), L1+L2 node=4.0, edge=6.0. Node temps at L1+L2 annealed linearly, edge temps remain learnable. A, D, E, H hardcoded as references.

  • K (anneal_fast): node 4.0 → 2.0 over 250 epochs (50% of training), then learnable
  • L (anneal_slow): node 4.0 → 2.0 over 500 epochs (100% of training)
  • M (anneal_partial): node 4.0 → 3.0 over 250 epochs (50%), then learnable
Condition Config LP MRR LP H@10 Best Val MRR
A (Phase 46) All temp=1.0 0.4744 0.7860 0.5030
D (Phase 46) All temp=4.0 0.4729 0.7901 0.5106
E (Phase 48) L0=(1,1), L1+L2=(2,6) 0.4856 0.8004 0.4889
H (Phase 49) L0=(4,4), L1+L2=(2,6) 0.4887 0.7973 0.4925
K anneal 4→2 fast 0.4819 0.7901 0.5046
L anneal 4→2 slow 0.4803 0.8025 0.5002
M anneal 4→3 fast 0.4887 0.8004 0.5009

Multi-hop Results (MRR)

Depth A D E H K L M
1p — — — 0.2710 0.2661 0.2598 0.2610
2p — — — 0.2555 0.2560 0.2508 0.2558
3p 0.3725 0.4018 0.3872 0.3930 0.4148 0.3775 0.3803
4p — — — 0.3333 0.3107 0.2170 0.2110
5p — — — 0.3517 0.2811 0.2462 0.2522

Attention Health (Final Model)

Layer.Type K Dead% L Dead% M Dead%
L0.node 100% 100% 100%
L0.edge 100% 100% 100%
L1.node 0% 0% 0%
L1.edge 0% 0% 0%
L2.node 0% 0% 0%
L2.edge 0% 0% 0%
Total 8/24 (33%) 8/24 (33%) 8/24 (33%)

Learned Temperature — Final Values (Best Checkpoint)

Condition L0_edge L0_node L1_edge L1_node L2_edge L2_node
K (ep 175) 4.326 4.119 6.210 2.600 6.533 2.600
L (ep 200) 4.416 4.112 6.210 3.200 6.595 3.200
M (ep 200) 4.404 4.105 6.212 3.200 6.593 3.200

Annealing Trajectory (K — Best Condition)

Epoch val_MRR Node Sched L1_node L1_edge L2_node L2_edge
25 0.0095 3.80 3.800 6.045 3.800 6.045
100 0.2469 3.20 3.200 6.209 3.200 6.232
175 0.5046 2.60 2.600 6.210 2.600 6.533
250 0.4832 2.00 2.000 6.213 2.000 6.648
425 0.4378 learnable 1.933 6.004 1.933 6.456

Key Findings

  1. K is the FIRST configuration to beat D's 3p MRR — 0.4148 vs D's 0.4018 (+0.013). After 5 phases (46-50) testing 13+ configurations, temperature annealing finally breaks the 3p ceiling.
  2. K's best checkpoint (ep 175) has node temp=2.6 — not the anneal target (2.0). The optimal node temp for 3p is ~2.6, between D's effective ~3.6 and H's static 2.0.
  3. M ties H's LP MRR record (0.4887) with annealing 4→3 — gentler annealing preserves LP while giving a milder 3p boost (0.3803, above A's 0.3725).
  4. Pareto frontier shifted but combined target not met — K: 3p PASS but LP FAIL (−0.004); M: LP PASS but 3p FAIL. No single condition achieves both.
  5. K achieves strongest multi-hop across ALL depths — 4p=0.3107, 5p=0.2811 beat all prior conditions' values from reference data.
  6. All conditions achieve 33% dead heads — lowest of any configuration, one head fewer than H/E's 38%. Annealing activates slightly more capacity.
  7. L and M converge to identical best-checkpoint temps (node=3.2 at ep 200) despite different schedules — the training dynamics, not the schedule, determine the optimal checkpoint.
  8. After annealing ends, node temps continue drifting DOWN — K post-anneal: 2.0→1.93 at ep 425, confirming the "set-and-forget" pattern persists post-annealing.

Hypothesis Evaluation

Hypothesis Result Evidence
K achieves 3p ≥ 0.4018 (D's record) CONFIRMED K: 0.4148 (+0.013, new record!)
K achieves LP ≥ 0.4856 (E's record) FAILED K: 0.4819 (−0.004)
Combined LP+3p target met PARTIAL K beats 3p, M ties LP, no single condition achieves both
Slow annealing (L) outperforms fast (K) REJECTED L: LP=0.4803, 3p=0.3775 — worst of the three

Phase 51: Static vs Trajectory Temperature Optimization

Script: experiments/phase51_static_vs_trajectory.py

Question: K's best checkpoint has node temp=2.6. Does static node=2.6 (without annealing trajectory) replicate K's 3p=0.4148? Or does the training trajectory — early epochs at 4.0 then annealing down — create representations that static init cannot? Can moderate annealing (4→2.5) achieve both LP≥0.4856 AND 3p≥0.4018?

Protocol: DELTA-Full (3 layers, 293K params) on FB15k-237 subset (494 entities). 500 epochs, eval every 25, patience 10. All conditions start with L0=(4.0,4.0). Static conditions use train_with_temp_override(), annealing uses train_with_anneal(). A, D, E, H, K hardcoded as references.

  • N (static_2.6): L0=(4,4), L1+L2 node=2.6, edge=6.0 — tests if K's optimal checkpoint VALUE alone explains 3p
  • O (static_3.2): L0=(4,4), L1+L2 node=3.2, edge=6.0 — tests L/M checkpoint value
  • P (anneal_moderate): node 4.0 → 2.5 over 250 epochs (50%), then learnable — tests Goldilocks anneal endpoint
Condition Config LP MRR LP H@10 Best Val MRR
A (Phase 46) All temp=1.0 0.4744 0.7860 0.5030
D (Phase 46) All temp=4.0 0.4729 0.7901 0.5106
E (Phase 48) L0=(1,1), L1+L2=(2,6) 0.4856 0.8004 0.4889
H (Phase 49) L0=(4,4), L1+L2=(2,6) 0.4887 0.7973 0.4925
K (Phase 50) anneal 4→2 fast 0.4819 0.7901 0.5046
N static node=2.6 0.4746 0.7870 0.4926
O static node=3.2 0.4785 0.8004 0.4945
P anneal 4→2.5 0.4890 0.8014 0.5039

Multi-hop Results (MRR)

Depth A D E H K N O P
1p — — — 0.2710 0.2661 0.2560 0.2551 0.2621
2p — — — 0.2555 0.2560 0.2496 0.2468 0.2565
3p 0.3725 0.4018 0.3872 0.3930 0.4148 0.4001 0.3764 0.3823
4p — — — 0.3333 0.3107 0.3426 0.2445 0.2349
5p — — — 0.3517 0.2811 0.3788 0.2581 0.2693

Attention Health (Final Model)

Layer.Type N Dead% O Dead% P Dead%
L0.node 100% 100% 100%
L0.edge 100% 100% 100%
L1.node 0% 0% 0%
L1.edge 0% 0% 0%
L2.node 0% 0% 0%
L2.edge 0% 0% 0%
Total 8/24 (33%) 8/24 (33%) 8/24 (33%)

Learned Temperature — Final Values (Best Checkpoint)

Condition L0_edge L0_node L1_edge L1_node L2_edge L2_node
N (ep 200) 4.436 4.059 6.271 2.602 6.692 2.592
O (ep 200) 4.407 4.076 6.252 3.215 6.736 3.180
P (ep 200) 4.401 4.102 6.218 2.800 6.578 2.800

Trajectory vs Static Analysis

Metric K (annealed to ~2.6) N (static 2.6) Delta
LP MRR 0.4819 0.4746 −0.0073
3p MRR 0.4148 0.4001 −0.0147
4p MRR 0.3107 0.3426 +0.0319
5p MRR 0.2811 0.3788 +0.0977

Trajectory adds +0.015 to 3p but static is BETTER for 4p/5p (+0.032/+0.098).

Key Findings

  1. P achieves new LP MRR record (0.4890) — +0.0003 over H/M's 0.4887. Also new H@10 record (0.8014). Moderate anneal (4→2.5) preserves and slightly improves LP, but 3p=0.3823 misses target.
  2. Trajectory confirmed: +0.015 3p bonus from annealing — K (annealed to node=2.6 at checkpoint) achieves 3p=0.4148 vs N (static 2.6) 3p=0.4001. Early training at high node temp creates 3p-supportive representations that static init cannot replicate.
  3. N has best-ever deep-hop performance — 4p=0.3426, 5p=0.3788 exceed ALL previous conditions including K (4p=0.3107, 5p=0.2811). Static low node temp is uniquely strong for 4p/5p even though it's weaker for 3p.
  4. N nearly matches D's 3p — 0.4001 vs D's 0.4018 (gap=0.002). Static node=2.6 approaches but cannot match D's uniform temp=4.0 on 3p. The remaining 0.002 gap may be noise or a genuine D advantage from uniform initialization.
  5. O (static 3.2) disappoints — 3p=0.3764 is WORSE than N (0.4001), confirming optimal static node temp for 3p is closer to 2.6 than 3.2.
  6. P's best checkpoint has node temp=2.800 (the scheduled value at ep 200) — higher than K's 2.600, explaining why P prioritizes LP over 3p.
  7. All conditions peak at ep 200 and early-stop at ep 450 — matching Phase 50 dynamics exactly.
  8. K remains closest to combined target — LP gap only −0.004 while exceeding 3p by +0.013. P's LP gap is closed but 3p gap is −0.020.

Cumulative Pareto Frontier

Config LP MRR 3p MRR LP Target (0.4856) 3p Target (0.4018) Gap
K (P50) 0.4819 0.4148 −0.004 ✅ +0.013 LP only
P (P51) 0.4890 0.3823 ✅ +0.003 −0.020 3p only
H (P49) 0.4887 0.3930 ✅ +0.003 −0.009 3p only
N (P51) 0.4746 0.4001 −0.011 −0.002 Both

Hypothesis Evaluation

Hypothesis Result Evidence
N (static 2.6) matches K's 3p (≥0.4018) FAILED N: 0.4001 (−0.002)
O (static 3.2) matches K's 3p (≥0.4018) REJECTED O: 0.3764 (−0.025)
P (anneal 4→2.5) achieves LP ≥ 0.4856 CONFIRMED P: 0.4890 (new record!)
P achieves 3p ≥ 0.4018 FAILED P: 0.3823 (−0.020)
Trajectory matters (K > N at same final temp) CONFIRMED K 3p=0.4148 vs N 3p=0.4001 (+0.015)

Phase 52 — Closing K's LP Gap (Edge Sharpness + Faster Annealing)

Goal: Close K's LP gap (−0.004 from 0.4856 target) via two levers: (1) sharper edge init 7.0 (vs 6.0), (2) faster node anneal schedule (35% vs 50%).

Status: ✅ COMPLETE — 3 conditions tested. LP record broken (Q: 0.4905) but 3p anticorelation confirmed. Temperature investigation CLOSED after 7 phases.

Conditions

Condition Node Anneal Edge Init Anneal Fraction Key Change vs K
Q (K+edge7) 4.0→2.0 7.0 50% (250ep) Edge sharpness boost
R (K_faster) 4.0→2.0 6.0 35% (175ep) Faster anneal schedule
S (K+edge7+faster) 4.0→2.0 7.0 35% (175ep) Both changes combined
Condition LP MRR LP H@10 Val MRR Best Ep Dead Heads
A (baseline) 0.4744 0.7860 0.5030 — 20/24 (83%)
K (reference) 0.4819 0.7901 0.5046 175 8/24 (33%)
P (reference) 0.4890 0.8014 0.5039 200 8/24 (33%)
Q K+edge7 0.4905 0.8025 0.4992 200 8/24 (33%)
R K_faster 0.4793 0.7860 0.5050 175 8/24 (33%)
S K+edge7+faster 0.4902 0.8045 0.4932 200 8/24 (33%)

Multi-Hop MRR

Condition 1p 2p 3p 4p 5p
K (ref) 0.2655 0.2504 0.4148 0.3107 0.2811
N (ref) 0.2520 0.2513 0.4001 0.3426 0.3788
Q 0.2636 0.2557 0.3927 0.2863 0.2985
R 0.2679 0.2544 0.4114 0.3222 0.3031
S 0.2645 0.2532 0.3789 0.2828 0.2967

Edge Sharpness Effect (Controlled Comparison)

Comparison Edge Init LP MRR 3p MRR LP Δ 3p Δ
K (50% anneal) 6.0 0.4819 0.4148 — —
Q (50% anneal) 7.0 0.4905 0.3927 +0.009 −0.022
R (35% anneal) 6.0 0.4793 0.4114 — —
S (35% anneal) 7.0 0.4902 0.3789 +0.011 −0.033

Finding: Edge init 7.0 consistently boosts LP (+0.009 to +0.011) but consistently HURTS 3p (−0.022 to −0.033). The effect is purely anti-correlated.

Learned Temperature Analysis

Condition L2 Edge L2 Node L1 Edge L1 Node
Q 7.763 2.400 7.422 2.411
R 6.564 2.000 6.395 2.000
S 7.812 1.999 7.355 2.000

Key Findings

  1. Q achieves NEW LP MRR record: 0.4905 (+0.0015 over P's 0.4890). S achieves NEW H@10 record: 0.8045.
  2. Edge sharpness and 3p are anti-correlated — edge init 7.0 boosts LP but damages 3p by 2-3x the LP gain.
  3. Faster annealing (R) is slightly WORSE than K on both LP (0.4793 vs 0.4819) and 3p (0.4114 vs 0.4148). 50% schedule is optimal.
  4. R's deep-hop results (4p=0.3222, 5p=0.3031) are better than K but below N — confirming N's unique deep-reasoning advantage.
  5. Combined LP≥0.4856 AND 3p≥0.4018 target is UNACHIEVABLE in any single temperature configuration after 7 phases (46-52) and 20+ configs tested.
  6. Three distinct operating modes confirmed:
  7. LP-optimized: P/Q (LP≥0.4890, sharp edges, moderate node anneal)
  8. Balanced 3p: K (3p=0.4148, fast anneal 50%, edge=6.0)
  9. Deep reasoning: N (4p=0.3426, 5p=0.3788, static node=2.6)
  10. Temperature investigation CLOSED. The LP/3p trade-off is fundamental at the temperature level. Attention temperature controls reasoning depth, not just performance magnitude.

Updated Cumulative Pareto Frontier

Config LP MRR 3p MRR 4p MRR 5p MRR Character
Q (P52) 0.4905 0.3927 0.2863 0.2985 LP record
S (P52) 0.4902 0.3789 0.2828 0.2967 H@10 record (0.8045)
P (P51) 0.4890 0.3823 0.2349 0.2693 Prior LP leader
H (P49) 0.4887 0.3930 0.3333 0.3517 LP-optimized
K (P50) 0.4819 0.4148 0.3107 0.2811 3p record
R (P52) 0.4793 0.4114 0.3222 0.3031 Faster K variant
N (P51) 0.4746 0.4001 0.3426 0.3788 Deep reasoning

Hypothesis Evaluation

Hypothesis Result Evidence
Q (K+edge7) achieves LP≥0.4856 CONFIRMED Q: 0.4905 (new LP record!)
Q achieves 3p≥0.4018 FAILED Q: 0.3927 (−0.009 vs target)
R (faster anneal) closes K's LP gap FAILED R: 0.4793 (worse than K's 0.4819)
R preserves K's 3p≥0.4018 CONFIRMED R: 0.4114 (−0.003 vs K's 0.4148)
S (both changes) achieves combined target FAILED S: LP=0.4902 (PASS) but 3p=0.3789 (FAIL)
Combined LP+3p target achievable via temperature alone REJECTED 7 phases, 20+ configs, no solution. Trade-off is fundamental.

Phase 53 — Multi-Seed Validation of K and N

Goal: Validate K's 3p advantage and N's deep-hop advantage across 3 seeds (42, 123, 456) to confirm statistical robustness before publication.

Status: ✅ COMPLETE — CRITICAL REALITY CHECK. Multi-hop claims from Phases 46-52 are NOT robust. LP improvements ARE robust.

Multi-Seed Results: K (anneal 4→2, 50%)

Seed LP MRR LP H@10 3p MRR 4p MRR 5p MRR Dead
42 0.4880 0.8035 0.3812 0.2595 0.3009 8/24
123 0.4760 0.7788 0.3418 0.2074 0.2072 9/24
456 0.4856 0.8200 0.3866 0.2206 0.2363 8/24
Mean±Std 0.4832±0.0052 0.8008±0.0169 0.3699±0.0200 0.2292±0.0221 0.2481±0.0391

Multi-Seed Results: N (static node=2.6)

Seed LP MRR LP H@10 3p MRR 4p MRR 5p MRR Dead
42 0.4744 0.7891 0.3996 0.3228 0.3697 8/24
123 0.4822 0.7860 0.2859 0.1912 0.2012 9/24
456 0.4960 0.8241 0.3609 0.1923 0.2287 8/24
Mean±Std 0.4842±0.0089 0.7997±0.0173 0.3488±0.0472 0.2354±0.0618 0.2665±0.0738

Cross-Condition Comparison

Condition LP MRR 3p MRR 4p MRR 5p MRR
A baseline (ref) 0.4744 0.3725 — —
K (3 seeds) 0.4832±0.0052 0.3699±0.0200 0.2292±0.0221 0.2481±0.0391
N (3 seeds) 0.4842±0.0089 0.3488±0.0472 0.2354±0.0618 0.2665±0.0738

Reproducibility Check (Seed 42 vs Original)

Config Metric Original Re-run Delta
K LP MRR 0.4819 0.4880 +0.006
K 3p MRR 0.4148 0.3812 −0.034
N LP MRR 0.4746 0.4744 −0.000
N 3p MRR 0.4001 0.3996 −0.001
N 4p MRR 0.3426 0.3228 −0.020
N 5p MRR 0.3788 0.3697 −0.009

Key Findings

  1. K's 3p advantage is NOT statistically robust. Mean 3p=0.3699±0.0200 — BELOW baseline A (0.3725). K's Phase 50 result (3p=0.4148) was a single-seed outlier.
  2. N's deep-hop advantage is NOT statistically robust. Mean 4p=0.2354±0.0618, 5p=0.2665±0.0738 — HUGE variance. No seed consistently achieves 4p≥0.30 or 5p≥0.30.
  3. LP MRR IS robust. K: 0.4832±0.0052, N: 0.4842±0.0089 — consistent across seeds, both above baseline A (0.4744).
  4. CUDA non-determinism breaks seed reproducibility. K seed=42 re-run gives 3p=0.3812 vs original 0.4148 (delta=−0.034). LP is more stable (delta=+0.006).
  5. 500-query multi-hop evaluation is too noisy for single-seed temperature conclusions. N seed 123 has 3p=0.2859 while seed 42 has 3p=0.3996 — a 0.114 spread!
  6. All Phases 46-52 multi-hop claims (3p, 4p, 5p) must be treated as unreliable. Only LP MRR improvements from temperature tuning are statistically supported.
  7. The "three operating modes" narrative is revoked for multi-hop. Temperature improves LP reliably, but multi-hop effects are within noise.

Hypothesis Evaluation

Hypothesis Result Evidence
K's 3p≥0.4018 is robust (all seeds) REJECTED Mean 3p=0.3699, min=0.3418. Below baseline A.
K's 3p is above baseline A REJECTED Mean 3p=0.3699 < A's 0.3725. Overlapping std bars.
N's 4p≥0.30 is robust (all seeds) REJECTED Mean 4p=0.2354, min=0.1912.
N's 5p≥0.30 is robust (all seeds) REJECTED Mean 5p=0.2665, min=0.2012.
LP improvements from temperature are robust CONFIRMED K: 0.4832±0.0052, N: 0.4842±0.0089. Both > A (0.4744).

Phase 54 — High-Power Multi-Hop Evaluation (10k Queries)

Goal: Separate evaluation noise from model noise by increasing multi-hop query count from 500 to 10,000 per depth level. If variance drops ≥50%, evaluation noise was the dominant source.

Status: ✅ COMPLETE — Variance reduction hypothesis CONFIRMED (66-84%). Evaluation noise was the dominant variance source. Multi-hop investigation for DELTA-Full temperature tuning is CLOSED.

10k-Query Results: K (anneal 4→2, 50%)

Seed LP MRR LP H@10 3p MRR 4p MRR 5p MRR Dead
42 0.4888 0.8045 0.2622 0.3266 0.3406 8/24
123 0.4774 0.7912 0.2572 0.3170 0.3105 9/24
456 0.4874 0.8097 0.2649 0.3129 0.3172 8/24
Mean±Std 0.4845±0.0051 0.8018±0.0078 0.2614±0.0032 0.3188±0.0057 0.3228±0.0129

10k-Query Results: N (static node=2.6)

Seed LP MRR LP H@10 3p MRR 4p MRR 5p MRR Dead
42 0.4762 0.7922 0.2545 0.3118 0.3297 8/24
123 0.4860 0.7850 0.2456 0.2853 0.2890 9/24
456 0.4972 0.8220 0.2678 0.3107 0.3200 8/24
Mean±Std 0.4865±0.0086 0.7997±0.0160 0.2560±0.0091 0.3026±0.0123 0.3129±0.0174

Variance Comparison: 500q (Phase 53) vs 10,000q (Phase 54)

Condition Metric P53 std (500q) P54 std (10kq) Reduction
K 3p 0.0200 0.0032 84.1%
K 4p 0.0221 0.0057 74.1%
K 5p 0.0391 0.0129 66.9%
N 3p 0.0472 0.0091 80.7%
N 4p 0.0618 0.0123 80.1%
N 5p 0.0738 0.0174 76.5%

Average variance reduction: K = 75.0%, N = 79.1% (both far above 50% target).

Cross-Condition Comparison (10k queries)

Condition LP MRR 3p MRR 4p MRR 5p MRR
A baseline (ref, 500q) 0.4744 0.3725 — —
K (10kq, 3 seeds) 0.4845±0.0051 0.2614±0.0032 0.3188±0.0057 0.3228±0.0129
N (10kq, 3 seeds) 0.4865±0.0086 0.2560±0.0091 0.3026±0.0123 0.3129±0.0174

Note: Baseline A's 3p=0.3725 was measured at 500q. Direct comparison with 10kq absolute values is invalid — generate_extended_queries samples progressively harder paths at larger counts.

Key Findings

  1. Evaluation noise was the dominant variance source. 10k queries reduced cross-seed std by 66-84%, far exceeding the 50% target. P53's large variance was primarily from query sampling, not model noise.
  2. Model noise floor is small. Residual std at 10kq: 0.003-0.017 for multi-hop MRR. This is the irreducible floor from training stochasticity + CUDA non-determinism.
  3. K and N are statistically indistinguishable on multi-hop. 3p: 0.2614±0.0032 vs 0.2560±0.0091 (overlapping CIs). 4p and 5p also overlap. Temperature tuning does not differentially affect multi-hop reasoning.
  4. 500q vs 10kq gives dramatically different absolute MRR. K 3p: 0.37 (500q) → 0.26 (10kq). The evaluation protocol matters enormously — larger query samples include harder paths.
  5. LP MRR robust and consistent across protocols. K=0.4845±0.0051 (P54) vs 0.4832±0.0052 (P53). N=0.4865±0.0086 vs 0.4842±0.0089. LP uses full test set, so it's protocol-independent.
  6. DELTA-Full multi-hop investigation is CLOSED after 9 phases (46-54). Temperature reliably improves LP MRR but has no statistically supported effect on multi-hop reasoning depth.

Hypothesis Evaluation

Hypothesis Result Evidence
10kq reduces cross-seed std by ≥50% (K) CONFIRMED Avg 75.0% reduction (range: 66.9-84.1%)
10kq reduces cross-seed std by ≥50% (N) CONFIRMED Avg 79.1% reduction (range: 76.5-80.7%)
K multi-hop > N multi-hop with tight CIs NOT CONFIRMED K 3p=0.2614±0.0032 vs N=0.2560±0.0091. Overlapping CIs.
Evaluation noise dominated P53 variance CONFIRMED Model noise floor (10kq std) is 5-20x smaller than total variance (500q std).

All publication-grade results use 5 seeds, mean ± std reported. Phases 38–43 use 1-3 seeds for rapid iteration.


Phase 55 — Brain Architecture Port: Differentiable Graph Construction

Goal: First integration of BrainEncoder (Gumbel-sigmoid differentiable edge selection) with DELTA for link prediction on FB15k-237 (494-entity dense subset).

Configuration

  • Model: brain_hybrid (BrainEncoder + DELTAConv)
  • Constructor density: 0.02 (4,870 edges)
  • d_node=64, d_edge=32, 2 layers, 4 heads
  • 150 epochs, batch_size=512, lr=0.001
  • Hardware: RTX PRO 6000 Blackwell (98GB VRAM), Colab SSH

Results

Model MRR H@1 H@3 H@10 Params
brain_hybrid (d=0.02) 0.4773 0.3354 0.5473 0.7973 311,361
delta_full (reference) 0.4796 0.3476 0.5370 0.7603 264,385

Hypothesis Evaluation

Hypothesis Result Evidence
brain_hybrid LP MRR ≥ 0.475 PARTIAL MRR=0.4773, misses by 0.002. H@10=0.7973 (+3.7% over delta_full).
Constructed edges improve H@10 CONFIRMED +0.037 H@10 over delta_full baseline.

Phase 56 — Constructor Density Ablation: Precision vs Recall Trade-off

Goal: Test whether lower constructor density (0.01) outperforms higher density (0.02) by reducing noise while maintaining recall advantage.

Configuration

  • Condition C: brain_hybrid @ d=0.01 (2,435 edges), 300 epochs
  • Condition D: brain_hybrid @ d=0.02 (4,870 edges), 300 epochs
  • Other settings same as Phase 55

Results

Condition Density MRR H@1 H@3 H@10 Edges
C (d=0.01) 0.010 0.4794 0.3395 0.5473 0.8076 2,435
D (d=0.02) 0.020 0.4678 0.3189 0.5473 0.7500 4,870
delta_full (ref) — 0.4796 0.3476 0.5370 0.7603 —

Hypothesis Evaluation

Hypothesis Result Evidence
d=0.01 outperforms d=0.02 on LP MRR CONFIRMED +0.012 MRR with half the edges.
d=0.01 achieves MRR ≥ 0.480 PARTIAL MRR=0.4794, misses by 0.0006.
d=0.01 matches delta_full MRR CONFIRMED 0.4794 vs 0.4796 (within noise). +4.7% H@10.

Phase 57 — Brain Temperature Annealing: Closing the 0.480 MRR Gap

Goal: Apply K-style and Q-style temperature annealing to brain_hybrid @ d=0.01 with 200 epochs to cross the 0.480 MRR threshold.

Configuration

  • Condition A: baseline (temp=1.0, no anneal, 200 epochs)
  • Condition B: K-style anneal (node 4→2 over 50% training, edge=6.0, 200 epochs)
  • Condition C: Q-style anneal (node 4→2 over 50% training, edge=7.0, 200 epochs)

Results

Condition MRR H@1 H@3 H@10 Note
A (baseline) 0.4808 0.3230 0.5576 0.8076 Crosses 0.480!
B (K-anneal) 0.4818 0.3488 0.5597 0.7613 +0.001 MRR, −0.046 H@10
C (Q-anneal) 0.4769 0.3436 0.5288 0.7613 Worst — edge=7.0 hurts

Hypothesis Evaluation

Hypothesis Result Evidence
200ep baseline exceeds 0.480 MRR CONFIRMED A MRR=0.4808 (+0.0014 vs P56 C at 300ep).
K-anneal improves brain_hybrid LP REJECTED +0.001 MRR but −0.046 H@10. Not recommended.
Q-anneal (edge=7.0) improves LP REJECTED MRR=0.4769, WORSE than baseline. Edge=7.0 hurts brain_hybrid.

Phase 58 — Multi-seed Brain Density Validation

Goal: Validate brain_hybrid @ d=0.01 robustness across 3 seeds (42, 123, 456) and test whether d=0.005 continues the density improvement trend.

Configuration

  • Condition A: brain_hybrid @ d=0.01, 3 seeds, 200 epochs, temp=1.0
  • Condition B: brain_hybrid @ d=0.005, 3 seeds, 200 epochs, temp=1.0
  • batch_size=512, lr=0.001, sparsity_weight=0.01, patience=10

Results — Individual Runs

Condition Seed Density MRR H@1 H@3 H@10 Edges Time(s)
A 42 0.010 0.4856 0.3313 0.5586 0.8035 2,435 2,279
A 123 0.010 0.4719 0.3169 0.5607 0.7912 2,435 1,963
A 456 0.010 0.4956 0.3549 0.5514 0.8035 2,435 2,029
B 42 0.005 0.4737 0.3128 0.5556 0.8025 1,217 1,989
B 123 0.005 0.4567 0.3158 0.5267 0.7397 1,217 1,690
B 456 0.005 0.4715 0.3292 0.5442 0.7767 1,217 1,745

Results — Aggregated (mean ± std)

Condition MRR H@1 H@3 H@10
A (d=0.01) 0.4844±0.0097 0.3344±0.0157 0.5569±0.0040 0.7994±0.0058
B (d=0.005) 0.4673±0.0076 0.3193±0.0071 0.5422±0.0118 0.7730±0.0258

Key Observations

  1. d=0.01 mean MRR 0.4844 EXCEEDS 0.480 target (+0.0044). 2/3 seeds individually above 0.480.
  2. Seed=456 achieves new brain_hybrid MRR record: 0.4956, approaching delta_full's temperature-tuned LP record (0.4905).
  3. d=0.005 is significantly worse (−0.017 MRR, −0.026 H@10) with 4.4× higher H@10 variance.
  4. Density optimization path is exhausted: d=0.02→d=0.01 improves, d=0.01→d=0.005 degrades. d=0.01 is the sweet spot.

Hypothesis Evaluation

Hypothesis Result Evidence
d=0.01 mean MRR ≥ 0.480 (robust) CONFIRMED Mean=0.4844±0.0097, exceeds target.
d=0.005 continues improvement trend REJECTED Mean MRR=0.4673, below d=0.01 by −0.017.
d=0.01 H@10 is consistent across seeds CONFIRMED 0.7994±0.0058, all seeds >0.79.

Phase 59 — Medium-Scale Evaluation: DELTA at N=2000

Goal: Test whether DELTA's edge-to-edge attention scales from N≈500 to N=2000 on FB15k-237. First evaluation beyond the 3.4% entity subset used in Phases 40–58.

Configuration

  • Dataset: FB15k-237, max_entities=2000 → 1,991 entities, 207 relations, 62,733 train triples
  • Edge adjacency: 15,217,194 pairs (via torch_sparse.spspmm)
  • d_node=64, d_edge=32, num_heads=4, seed=42
  • Hardware: A100-SXM4-80GB (3-layer runs), RTX PRO 6000 Blackwell 98GB (1-layer diagnostic)

Results — 3-Layer DELTA (Condition A)

Attempt Mode LR Epochs MRR H@10 Time(s)
1 fullbatch 0.001 200 0.0018 0.0010 1,495
2 fullbatch 0.01 90 0.0021 — ~700
3 mini-batch (bs=4096) 0.003 30 0.0014 — ~2,700

Results — Diagnostics

Model Layers Epochs val_MRR (peak) test_MRR test_H@1 test_H@10 Time(s)
DistMult (no GNN) 0 200 0.3185 (ep100) — 0.1986 0.5820 72
delta_1layer 1 200 0.3338 (ep150) 0.3094 0.1798 0.5935 5,788

1-Layer DELTA Training Trajectory

Epoch Loss val_MRR val_H@1 val_H@3 val_H@10
25 — 0.0024 0.0000 0.0000 0.0000
50 0.0040 0.0716 0.0424 0.0792 0.1255
75 0.0038 0.1608 0.0814 0.1781 0.3161
100 0.0035 0.2652 0.1598 0.3039 0.4698
150 0.0032 0.3338 0.2131 0.3797 0.5849
200 0.0030 0.3184 0.1909 0.3575 0.5987

Key Observations

  1. 3-layer DELTA is categorically broken at N=2000 — MRR ~0.002 across all hyperparameter configurations, 200× below DistMult.
  2. 1-layer DELTA surpasses DistMult — val MRR 0.3338 vs 0.3185 (+0.015). H@10 0.5935 vs 0.5820 (+0.012 on test).
  3. Depth is the sole cause of over-smoothing — 1→3 layers causes 167× MRR degradation at N=2000.
  4. Edge-to-edge attention mechanism confirmed viable at scale — the architecture's identity is preserved.

Hypothesis Evaluation

Hypothesis Result Evidence
3-layer DELTA MRR ≥ 0.30 at N=2000 REJECTED MRR=0.0018, near-random across 3 training configs.
1-layer edge-to-edge attention works at N=2000 CONFIRMED val_MRR=0.3338, surpasses DistMult baseline.
brain_hybrid viable at N=2000 DEFERRED OOM at 102K edges on A100 (80GB).

Phase 60 — Residual Gating for Depth Scaling (N=2000)

Goal: Test whether learnable residual gates enable multi-layer DELTA at N=2000, recovering from Phase 59's catastrophic 3-layer over-smoothing.

Configuration

  • Dataset: FB15k-237, max_entities=2000 → 1,991 entities, 207 relations, 62,733 train triples
  • Edge adjacency: 15,217,194 pairs (shared, pre-computed)
  • Gate: per-layer learnable logits, init so sigmoid(α)≈0.1 → 90% residual
  • d_node=64, d_edge=32, num_heads=4, lr=0.003, bs=4096, seed=42, 200 epochs
  • Hardware: RTX PRO 6000 Blackwell 98GB (RunPod)

Summary Results (Test Set)

Condition Layers Gate Test MRR H@1 H@3 H@10 Params Time
A: 2L+gate 2 Yes 0.3065 0.1851 0.3490 0.5689 311,652 3.2hr
B: 3L+gate 3 Yes 0.3138 0.1962 0.3511 0.5591 393,830 4.8hr
C: 1L ctrl 1 No 0.3093 0.1796 0.3495 0.5939 229,472 1.6hr
Phase 59: 3L ungated 3 No 0.0018 — — — — —

Gate Evolution

Condition Final Node Gates Final Edge Gates
A: 2L+gate [0.135, 0.137] [0.097, 0.100]
B: 3L+gate [0.129, 0.130, 0.130] [0.105, 0.102, 0.100]

Gates barely shift from initialization (0.100) over 200 epochs. The model operates at ~10-13% layer contribution, ~87-90% residual.

Key Observations

  1. Residual gating is a complete fix for over-smoothing — 3L+gate MRR=0.3138 vs 3L ungated MRR=0.0018 (174× improvement).
  2. All depths produce equivalent test MRR (~0.31) — depth adds no accuracy benefit with gating.
  3. Gates are frozen hyperparameters, not learned parameters — they don't shift during training.
  4. 1-layer remains most efficient — 1.6hr vs 4.8hr for 3-layer, with comparable accuracy.

Hypothesis Evaluation

Hypothesis Result Evidence
2L+gate MRR ≥ 0.30 at N=2000 CONFIRMED Test MRR=0.3065, meets threshold.
3L+gate recovers from catastrophic over-smoothing CONFIRMED MRR 0.0018→0.3138 (174× improvement).
Multi-layer outperforms 1-layer with gating NOT CONFIRMED 3L=0.3138, 2L=0.3065, 1L=0.3093 — within noise.

Phase 61 — DistMult vs DELTA Across Scales

Goal: Determine whether DELTA's edge-to-edge attention provides measurable lift over DistMult at any scale (N=500, 1000, 2000).

Configuration

  • Dataset: FB15k-237 at N=500 (494 ent, 9703 train), N=1000 (998 ent, 26761 train), N=2000 (1991 ent, 62733 train)
  • Per-scale hyperparameters:
  • N=500: bs=512, lr=0.001, 500 epochs (9,500 total steps)
  • N=1000: bs=2048, lr=0.002, 300 epochs (4,200 total steps)
  • N=2000: bs=4096, lr=0.003, 200 epochs (3,200 total steps)
  • d_node=64, d_edge=32, num_heads=4, seed=42, Adam
  • Hardware: RTX PRO 6000 Blackwell 98GB (RunPod)

Summary Results (Test Set)

Model N val_MRR test_MRR test_H@1 test_H@10 Δ vs DM Time
DistMult 500 0.4736 0.3747 0.2407 0.6698 — 10s
1L-DELTA 500 0.4892 0.3583 0.2150 0.6636 -0.0163 1054s
3L-DELTA 500 0.4882 0.3816 0.2418 0.6862 +0.0069 3124s
DistMult 1000 0.3288 0.3342 0.2257 0.5503 — 12s
1L-DELTA 1000 0.4024 0.3604 0.2159 0.6613 +0.0262 1626s
DistMult 2000 0.2271 0.2297 0.1475 0.3845 — 60s
1L-DELTA 2000 0.3357 0.3088 0.1787 0.5963 +0.0791 5896s

Convergence Speed Comparison

Scale DM peak val step 1L peak val step 1L speedup
500 7,068 (ep372) 2,356 (ep124) 3.0×
1000 4,200+ (still rising) 2,072 (ep148) >2.0×
2000 3,200+ (still rising) 2,800 (ep175) >1.1×

Key Observations

  1. DELTA converges 2-3× faster per gradient step — this is the primary architectural contribution.
  2. At N=500 (sufficient training), DM wins test by -0.016. Both models overfit; DM overfits less.
  3. At N=1000/2000 (limited training), DELTA wins by +0.026 and +0.079 respectively.
  4. The growing advantage at larger N reflects DM's sensitivity to step count, not intrinsic DELTA superiority.
  5. 3L ≈ 1L at N=500 — additional depth adds nothing (consistent with Phase 60).
  6. DELTA is step-count invariant: reaches MRR~0.31 at N=2000 in 2,800 steps or 24,600 steps. DM needs all 24,600.

Hypothesis Evaluation

Hypothesis Result Evidence
DELTA test MRR ≥ DM + 0.01 at some scale CONFIRMED N=1000: +0.026, N=2000: +0.079
DELTA test MRR ≥ DM at N=500 REJECTED -0.016 due to harder overfitting
3L improves over 1L at N=500 NOT CONFIRMED Test +0.023 but peak val identical

Phase 61b — DistMult Extended Training at N=2000

Goal: Give DistMult 2000 epochs (10× Phase 61) at N=2000 to determine whether DELTA's +0.079 test MRR advantage is genuine or an artifact of under-training DistMult.

Key Result

DistMult with 2000 epochs peaks at val MRR=0.3185 (ep100, 12s wall time) then overfits catastrophically to ~0.23 by ep300 and never recovers. Test@best_val MRR=0.2329. DELTA's test MRR=0.3088 (+0.076 gap) reflects genuine generalization advantage — GNN parameter sharing acts as implicit regularization.


Phase 62 — Scaling DELTA to N=5000: Test MRR Generalization

Goal: Test whether DELTA's generalization advantage over DistMult (confirmed at N=2000, +0.076 test MRR gap) persists at N=5000.

Configuration

  • Dataset: FB15k-237 at N=5000 (4977 entities, 225 relations, 152,809 train / 8,788 val / 10,156 test)
  • d_node=64, d_edge=32, num_heads=4, seed=42, Adam, bs=4096, lr=0.003
  • DistMult: 2000 epochs, eval every 100
  • DELTA 1L: 200 epochs, eval every 25
  • Edge adjacency: 63,001,372 full pairs → subsampled to 15,000,000 (23.8%)
  • Hardware: RTX PRO 6000 Blackwell 98GB (RunPod), 29,202s (8.1hr)

Sanity Check — Subsampled E_adj at N=2000

Model peak_val best_ep test_MRR test_H@1 test_H@10 Time
1L (sub E_adj) 0.3362 175 0.3371 0.2122 0.5996 5726s
1L (full, P61 ref) 0.3357 175 0.3088 0.1787 0.5963 5896s

Delta: +0.0283. PASSED asymmetric gate (sub ≥ ref − 0.01).

N=5000 Results

Model N trip/ent peak_val best_ep test_MRR test_H@1 test_H@10 Time
DM_N=5000 5000 30.7 0.2222 200 0.2244 0.1320 0.4207 1784s
1L_N=5000 5000 30.7 0.2420 125 0.2404 0.1397 0.4566 21688s

Scaling Curve — DELTA vs DistMult Test MRR Gap

N DM test 1L test Gap
500 0.4778 0.4818 +0.004
2000 0.2329 0.3088 +0.076
5000 0.2244 0.2404 +0.016

Hypothesis Evaluation

Hypothesis Result Evidence
DELTA test MRR ≥ DM + 0.04 at N=5000 REJECTED Gap = +0.016, below 0.04 threshold
Subsampled E_adj preserves DELTA performance CONFIRMED Sanity check: +0.028 vs reference
DELTA advantage grows monotonically with scale REJECTED Non-monotonic: +0.004 → +0.076 → +0.016

Phase 63 — Edge Adjacency Subsampling Ablation at N=5000

Goal: Quantify the impact of edge adjacency subsampling on DELTA's performance at N=5000. Phase 62 used only 23.8% of the full E_adj (15M of 63M pairs) — three independent analyses identified this as a potential confound.

Configuration

  • Dataset: FB15k-237 at N=5000 (4977 entities, 225 relations)
  • Model: DELTA 1-layer, 4 heads, d_node=64, d_edge=32, init_temp=1.0
  • Epochs: 150 max, eval every 25, early stopping PATIENCE=2
  • bs=4096, lr=0.003, Adam, seed=42
  • Full E_adj: 63,001,372 pairs (built in 0.4s via spspmm)
  • Hardware: RTX PRO 6000 Blackwell 98GB (RunPod), 48.0hr total

Conditions

Condition E_adj Budget Retention Source
A (baseline) 15M 23.8% Phase 62 (reused)
B 30M 47.6% New
C 45M 71.4% New
D 63M 100.0% New

Results

Condition Budget Retention peak_val best_ep test_MRR test_H@1 test_H@10 gap_vs_DM Time
A (P62) 15M 23.8% 0.2420 125 0.2404 0.1397 0.4566 +0.0160 21688s
B 30M 47.6% 0.2499 125 0.2471 0.1481 0.4562 +0.0227 37653s
C 45M 71.4% 0.2460 125 0.2439 0.1446 0.4578 +0.0195 56365s
D 63M 100.0% 0.2513 125 0.2457 0.1459 0.4558 +0.0213 78872s

DistMult baseline: test_MRR=0.2244 (from Phase 62)

Key Observations

  1. Non-monotonic test MRR: B (47.6%) > D (100%) > C (71.4%) > A (23.8%). Optimal retention is ~47.6%, not full retention.
  2. Val→test gap grows with retention: A=0.0016, B=0.0028, C=0.0021, D=0.0056. Full E_adj causes mild overfitting.
  3. H@10 invariant to E_adj budget: 0.4558–0.4578 across all conditions. Differentiation occurs only in H@1 and MRR.
  4. Attention dilution warmup: B/C/D show near-zero MRR at ep25 (~0.001 vs A's 0.169). More neighbors dilute softmax weights, requiring ~50 epochs to recover.
  5. All conditions peak at ep125. Training duration is invariant to E_adj budget.

Hypothesis Evaluation

Hypothesis Result Evidence
E_adj retention ≥47.6% improves test MRR by ≥0.02 PARTIAL Best Δ = +0.0067 (B vs A), below 0.02 threshold
Full E_adj closes the gap vs DistMult to ≥0.04 No D gap = +0.0213, still below 0.04
Subsampling was the primary Phase 62 confound No Subsampling accounts for only ~0.007 of the "missing" gap

Phase 64 — Top-k Sparse Edge-to-Edge Attention at N=5000

Goal: Validate top-k sparse edge attention as a quality-preserving mechanism for scaling DELTA to larger graphs where full E_adj softmax is memory-infeasible. Characterize the quality vs. sparsity tradeoff across topk=64, 128, and 256.

Configuration

  • Dataset: FB15k-237 at N=5000 (4977 entities, 225 relations)
  • Model: DELTA 1-layer, 4 heads, d_node=64, d_edge=32, init_temp=1.0
  • Epochs: 150 max, eval every 25, early stopping PATIENCE=2
  • bs=4096, lr=0.003, Adam, seed=42
  • Full E_adj: 63,001,372 pairs (built in 0.4s via spspmm)
  • Hardware: RTX PRO 6000 Blackwell 98GB (RunPod), 46.0hr total

Conditions

Condition E_adj Budget TopK Notes
A (baseline) 30M (47.6%) — Phase 63 Cond B, reused
B 63M (100%) 128 Sparse attention
C 63M (100%) 64 Sparse attention
D 63M (100%) 256 OOM after ep25

Results

Condition Budget TopK peak_val best_ep test_MRR test_H@1 test_H@10 gap_vs_DM Time
A (P63) 30M — 0.2499 125 0.2471 0.1481 0.4562 +0.0227 37653s
B 63M 128 0.2465 125 0.2472 0.1481 0.4581 +0.0228 76367s
C 63M 64 0.2360 125 0.2332 0.1349 0.4432 +0.0088 76077s
D 63M 256 — — OOM — — — —

Phase 63 full-softmax reference (63M, no topk): test_MRR=0.2457, time=78,872s
DistMult baseline: test_MRR=0.2244 (from Phase 62)

Key Observations

  1. topk=128 perfectly preserves full-attention quality: B test MRR=0.2472 vs Phase 63 Cond D full softmax (0.2457, same 63M E_adj), Δ=+0.0015. Sparse attention at 128 neighbors is lossless.
  2. topk=64 degrades meaningfully: C test MRR=0.2332 vs B=0.2472, Δ=−0.0140 (−5.6%). H@10 also degrades: 0.4432 vs 0.4581 (−3.3%).
  3. topk=256 OOM on 98GB GPU. Memory limit at N=5000 lies between 128 and 256 neighbors per edge.
  4. No epoch-time speedup: B and C epoch times (508s, 506s) are near-identical to full-softmax (525s). Bottleneck is upstream E_adj processing, not the attention kernel.
  5. Attention dilution warmup persists under topk: B ep25 MRR=0.0013, C ep25 MRR=0.0062. Near-zero at ep25, recovering by ep50-75.
  6. topk=64 has faster ep25 recovery but slower mid-training: C=0.0062 vs B=0.0013 at ep25; C=0.0848 vs B=0.1333 at ep50.
  7. Transient inversion at ep100: C val MRR=0.2307 > B=0.2221, reversed by ep125 (C=0.2360 < B=0.2465). Smaller topk provides temporary advantage.
  8. All conditions peak at ep125. Best epoch invariant to topk width.
  9. Optimal cost/quality point remains Phase 63 Cond B (30M subsample, full softmax, 37,653s). Sparse attention over 63M delivers equal quality at 2× cost.

Hypothesis Evaluation

Hypothesis Result Evidence
topk=128 matches full softmax on 63M E_adj CONFIRMED B test_MRR=0.2472 vs Phase 63 D=0.2457, Δ=+0.0015
topk=64 degrades vs topk=128 CONFIRMED C test_MRR=0.2332 vs B=0.2472, Δ=−0.0140 (−5.6%)
topk=256 is feasible on 98GB GPU REJECTED OOM after ep25
Sparse attention reduces attention dilution warmup PARTIAL topk=64 faster at ep25 (0.0062 vs 0.0013); warmup persists in both
topk provides epoch-time speedup vs full softmax NOT CONFIRMED Epoch time near-identical across all topk values

Phase 65 — Brain Hybrid with Sparse Attention at N=5000 (Deferred)

Setup

Parameter Value
Dataset FB15k-237 subgraph (N=5000, 4977 entities, 225 relations, 152,809 train triples)
Architecture BrainEncoder: 1 bootstrap DELTALayer + BrainConstructor + 2 Stage-3 DELTALayers
topk 128 (same as Phase 64)
target_density 0.001 ? ~24,765 constructed edges/batch
Config MAX_EPOCHS=75, EVAL_EVERY=25, PATIENCE=3, BS=4096, LR=0.003, SPARSITY_W=0.01
Hardware RTX PRO 6000 Blackwell 98GB, RunPod
Seed 42

Conditions

Condition Description
B brain_hybrid, router=OFF (BrainConstructor + 2 Stage-3 DELTALayers, no PostAttentionPruner)
C brain_hybrid, router=ON (+ PostAttentionPruner in Stage 3)

Results

DEFERRED — no MRR result obtained.

Epoch 1 completed with healthy training signals before pod became unreachable:

Metric Epoch 1 Value
Epoch time 816.7s (1.6× Phase 64)
sp_loss 0.0407
constructed_edges 24,765
OOM None

Compute cost projection: 816.7s × 75ep × 2 conditions ˜ 34hr ˜ \ on RunPod. Aborted.

Phase 64 baseline remains best N=5000 result: test_MRR=0.2472 (topk=128, 63M E_adj).

Hypothesis Evaluation

Hypothesis Result Evidence
Brain hybrid + topk=128 achieves test MRR > 0.2472 NOT TESTED Experiment deferred due to compute cost
BrainConstructor is functional at N=5000 CONFIRMED (partial) Epoch 1: 24,765 edges, sp_loss=0.0407, no OOM

Engineering Contributions

Five infrastructure bugs found and fixed (commits 4def6d4, af8d292):

  1. graph.py torch_sparse fallback — pure PyTorch vectorized build_edge_adjacency() for environments without torch_sparse
  2. Stage 3 E_adj cache update — augmented_graph._edge_adj_cache now updated after subsampling
  3. evaluate_lp_fast — 70,656 GPU-CPU syncs → ~6; eval time 60-90min → ~23s
  4. Remove empty_cache() from hot path — was causing ~25min stalls per call under 92.5GB memory pool
  5. Stage 1 eval path subsample — fixed skipped subsample when _edge_adj_cache=None (was causing 30.7GB allocation → CUDA compaction)

Phase 66 — 1-Hop vs 2-Hop Edge Adjacency Ablation (NeurIPS Reviewer Response)

Setup

Parameter Value
Dataset FB15k-237 top-500 subgraph (494 entities, 160 relations, 9703/390/486 train/val/test)
Architecture DELTA-Matched (d_node=48, d_edge=24, 2 layers, 4 heads, 157,720 params)
Config MAX_EPOCHS=500, EVAL_EVERY=25, PATIENCE=10, BS=4096, LR=0.003, label_smoothing=0.1
Multi-hop eval 1p=486, 2p=5764, 3p=10000 (leakage audit PASSED)
max_adj_pairs 1,500,000 (caps hops=2 from 28.2M to same scale as hops=1)
Hardware NVIDIA RTX 3080 Ti (12.9GB VRAM), CUDA
Seeds [42, 123, 456]
Total runtime 1879s

Conditions

Condition Description adj_pairs
node_only Edge attention stream disabled (empty adj matrix) 0
hops=1 Default: edges sharing an endpoint (B^T B - diag) 1,535,500
hops=2 2-hop: (B^T B)^2 - diag, subsampled to 1.5M 1,500,000

Results

Condition LP MRR (mean±std) 2p MRR (mean±std) 3p MRR (mean±std) Time(s)
node_only 0.505 ± 0.006 0.728 ± 0.011 0.742 ± 0.012 115
hops=1 0.498 ± 0.008 0.721 ± 0.007 0.729 ± 0.007 849
hops=2 0.496 ± 0.003 0.726 ± 0.014 0.731 ± 0.006 744

All pairwise gaps are within 1σ of each condition's cross-seed standard deviation.

Hypothesis Evaluation

Hypothesis Result Evidence
hops=2 multi-hop MRR > hops=1 by ≥0.010 (2p and 3p) REJECTED hops=2 vs hops=1: 2p gap=+0.005, 3p gap=+0.002 (both within 1σ)
Edge attention provides measurable benefit over node_only REJECTED node_only outperforms both on all metrics (within noise but consistently)
hops=2 is the driver of DELTA's multi-hop compositional advantage NOT CONFIRMED Multi-hop dominated by entity embedding quality at this density (mean degree 19.7)

Key Insight

The N=500 dense subgraph (mean degree ≈19.7) saturates the 1-hop neighborhood — every node is reachable via 1 or 2 hops with high probability. In this regime, the 1-hop edge adjacency matrix already captures all structurally relevant edge relationships; extending to 2-hop produces a near-complete edge graph where structural locality information is diluted. Entity embeddings learned during LP training carry all the useful signal. The critical test is Phase 67 on the full sparse graph (14.5K entities, mean degree ≈4.1) where 1-hop genuinely bottlenecks the information available.

Paper Impact

Table tab:ablation_hop in delta_neurips2026.tex was populated with these results. Caption and discussion paragraph added to §5.3 (Ablation) with honest framing: "all differences within 1σ; the dense subgraph saturates 1-hop coverage; Phase 67 tests the sparse regime."