Pilgrim LLC
Version 1.0 · February 5, 2026
1) Why This Paper Exists
Modern overparameterized neural networks exhibit a constellation of strange, widely observed behaviors: flat minima, hyperbolic curvature, fractally rough loss boundaries, low intrinsic dimension, and low‑rank Fisher spectra. These observations are real but fragmented in the literature. There is no minimal theory explaining why these features co‑occur or how they jointly enable efficient training at scale. This paper proposes that all of them arise from three simple primitives governing the geometry of overparameterized optimization. The goal is to unify disparate empirical results into a causal and structural account, and to derive a practical three‑phase training protocol from that structure.
2) What the Paper Says (Plain‑Language Summary)
The paper introduces three primitives that, together, account for the full set of paradoxical behaviors observed in large‑scale learning:
Gradient Erosion (Negative‑Space Carving). Training removes redundant directions instead of filling the space. Erosion collapses vast degenerate regions into a resistant low‑dimensional core, producing flat minima, low intrinsic dimension, and the observed fractal roughness of boundaries.
Fisher Metric as a Parametric Friction Field. The Fisher information defines local friction: high eigenvalues correspond to tight data constraints, low eigenvalues to sloppy directions. This explains the low‑rank Fisher structure and why curvature spectra show hyperbolic traits.
Overparameterization as Degeneracy Amplifier. Extra parameters create vast families of nearly equivalent solutions. Degeneracy is not a bug but a feature: it accelerates exploration and enables erosion to find the stable manifold.
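To make the friction‑field picture of the second primitive concrete, here is a minimal sketch that estimates an empirical Fisher spectrum for a toy logistic‑regression model and summarizes it with an effective rank. The model, the synthetic data, and the participation‑ratio definition of effective rank are illustrative assumptions, not the paper's implementation.

```python
# Sketch: empirical Fisher spectrum for a tiny logistic-regression model.
# Pure NumPy; the model, data, and effective-rank definition are illustrative
# assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 32
X = rng.normal(size=(n, d))
w_true = np.concatenate([rng.normal(size=4), np.zeros(d - 4)])  # few "stiff" directions
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

w = rng.normal(scale=0.1, size=d)            # current parameters
p = 1.0 / (1.0 + np.exp(-(X @ w)))           # predicted probabilities

# Per-example gradient of the log-likelihood: g_i = (y_i - p_i) * x_i
G = (y - p)[:, None] * X                     # shape (n, d)

# Empirical Fisher: average outer product of per-example gradients
F = G.T @ G / n                              # shape (d, d)
eigvals = np.linalg.eigvalsh(F)[::-1]        # descending

# Participation-ratio "effective rank": (sum lambda)^2 / sum lambda^2
eff_rank = eigvals.sum() ** 2 / (eigvals ** 2).sum()
print("top 5 eigenvalues:", np.round(eigvals[:5], 4))
print("effective rank:   ", round(float(eff_rank), 2), "of", d)
```

In the friction‑field reading, the few large eigenvalues mark stiff, data‑constrained directions, while the long tail of near‑zero eigenvalues marks the sloppy directions that erosion is free to remove.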
From these primitives, the paper derives the Target Acquisition Protocol, a three‑phase training method that exploits the manifold's evolving geometry:
Phase 1: Acquisition (High‑LR). Rapid erosion of low‑friction directions.
Phase 2: Dope Re‑Ask (Mid‑Refinement). Re‑calibrate the optimizer once the manifold shape changes.
Phase 3: Execution (Low‑LR). Fine‑tune in flat valleys for generalization.
Across toy experiments and a small transformer, this protocol yields 15–35% fewer training steps to matched accuracy under equal compute.
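A minimal sketch of how such a schedule could be wired up is shown below. The phase names follow the paper, but the learning rates, the patience threshold, and the plateau‑based trigger for advancing phases are illustrative assumptions rather than values taken from the paper.

```python
# Sketch of a three-phase schedule in the spirit of the Target Acquisition
# Protocol. Phases advance when a monitored diagnostic (e.g. intrinsic
# dimension) stops improving; all rates and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class TargetAcquisitionSchedule:
    high_lr: float = 3e-3    # Phase 1: Acquisition
    mid_lr: float = 1e-3     # Phase 2: Re-Ask / recalibration
    low_lr: float = 1e-4     # Phase 3: Execution
    patience: int = 200      # steps without diagnostic improvement before advancing
    phase: int = 1
    best: float = float("inf")
    stale: int = 0

    def update(self, diagnostic: float) -> float:
        """Feed the latest diagnostic (lower = more collapsed manifold);
        return the learning rate to use for the next step."""
        if diagnostic < self.best - 1e-6:
            self.best, self.stale = diagnostic, 0
        else:
            self.stale += 1
        if self.stale >= self.patience and self.phase < 3:
            self.phase += 1      # diagnostic has plateaued: advance to the next phase
            self.stale = 0
        return {1: self.high_lr, 2: self.mid_lr, 3: self.low_lr}[self.phase]
```

A training loop would call update once per step (or every few steps) with whatever diagnostic it tracks, such as an intrinsic‑dimension or Fisher‑rank estimate, and apply the returned learning rate.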
3) What Distinguishes This Framework
Minimalism with explanatory reach. Only three primitives are needed to unify fractal roughness, hyperbolic curvature, low intrinsic dimension, low‑rank Fisher structure, and flat minima. No other framework connects them this succinctly.
Reframing learning as subtraction. Instead of “accumulating signal,” the paper shows that training subtracts structure, removing degrees of freedom until only the resistant manifold remains. This reverses the usual “capacity‑growth” intuition.
Optimization as manifold navigation, not line‑following. Overparameterization means there is no unique optimum, only a degenerate solution set. The protocol is derived from navigating that set as its geometry evolves under erosion and friction.
Actionable diagnostics. Intrinsic‑dimension collapse, Fisher‑rank plateauing, and friction‑field shifts provide measurable triggers for phase transitions. This gives the framework operational value beyond theory.
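As one possible realization of such a trigger, the sketch below detects when a monitored diagnostic (intrinsic dimension, Fisher effective rank) has stopped moving over a sliding window; the window length and tolerance are illustrative assumptions.

```python
# Sketch of a plateau detector for diagnostics such as intrinsic dimension or
# Fisher effective rank. Window size and relative tolerance are illustrative
# assumptions, not values from the paper.
from collections import deque

def make_plateau_detector(window: int = 100, rel_tol: float = 0.02):
    history = deque(maxlen=window)

    def has_plateaued(value: float) -> bool:
        history.append(value)
        if len(history) < window:
            return False                      # not enough evidence yet
        lo, hi = min(history), max(history)
        # Plateau: the diagnostic varied by less than rel_tol over the window.
        return (hi - lo) <= rel_tol * max(abs(hi), 1e-12)

    return has_plateaued
```

Firing such a detector on an intrinsic‑dimension trace would correspond to the “manifold has been carved” condition that, in the framework, licenses the switch to the next phase.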
4) Theoretical Implications (Assuming the Work Is Correct)
Overparameterization is beneficial because it creates “space to erode.” The model predicts that generalization improves when the network begins with more redundant structure, not less, providing richer negative space for gradient erosion to carve.
Fractality and hyperbolicity are emergent, not architectural. Multiscale roughness comes from layered erosion, and hyperbolic curvature emerges from the exponentially rare voids created during carving, not from a choice of architecture.
Intrinsic dimension must collapse during effective training. The framework predicts that good training runs always show an early‑to‑mid‑phase drop in intrinsic dimension, as observed empirically in grokking, transformers, and quadratic bowls.
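One way to measure that drop is a two‑nearest‑neighbor (TwoNN‑style) estimator applied to a fixed probe batch of hidden representations; the synthetic example below, and the use of raw activations as input, are illustrative assumptions rather than the paper's measurement procedure.

```python
# Sketch: TwoNN-style intrinsic-dimension estimate from nearest-neighbor
# distance ratios, applied here to synthetic data. Using it on a probe batch
# of hidden representations during training is an illustrative assumption.
import numpy as np

def twonn_intrinsic_dim(X: np.ndarray) -> float:
    """Maximum-likelihood intrinsic dimension from 2nd/1st nearest-neighbor ratios."""
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    np.maximum(d2, 0.0, out=d2)                      # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)                     # a point is not its own neighbor
    nearest = np.sqrt(np.sort(d2, axis=1)[:, :2])    # 1st and 2nd NN distances
    mu = nearest[:, 1] / nearest[:, 0]
    return n / np.log(mu).sum()

# Example: points on a 3-dimensional subspace embedded in 64 dimensions.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 64))
print(round(twonn_intrinsic_dim(Z), 2))              # should land near 3
```

Tracking this estimate across checkpoints on the same probe batch would make the predicted early‑to‑mid‑phase collapse directly observable.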
Optimizer perspective must be re‑asked once the manifold sharpens. Natural gradient–style methods (or Fisher‑aware schedulers) become most useful after the manifold has been carved, not before. This sequencing is structural, not heuristic.
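For reference, the update these methods implement preconditions the gradient by the inverse Fisher; the damping term below is the standard practical addition for invertibility, not something specific to the paper:

```latex
% Natural-gradient / Fisher-preconditioned update with damping \lambda I:
\theta_{t+1} \;=\; \theta_t \;-\; \eta \,\bigl(F(\theta_t) + \lambda I\bigr)^{-1} \nabla_\theta \mathcal{L}(\theta_t)
```

In the framework's reading, this preconditioning is most useful after erosion has already collapsed the spectrum of F, when the few remaining stiff directions dominate the update.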
5) Potential Implications (Downstream, Not Predictions)
A) Training Efficiency & Scaling
Phased schedules may outperform single‑phase or two‑phase approaches on degenerate manifolds.
Fisher‑aware recalibration at the right moment may become a standard primitive (similar to LR warmup).
Diagnostics like intrinsic‑dimension trajectories could guide early stopping or prevent mode collapse.
B) Model Design
Architectures that explicitly encourage degeneracy (wide layers, overcomplete blocks, redundant heads) may be structurally advantaged because they enhance the erosion process.
Future work may design architectures by shaping the initial manifold so erosion converges faster.
C) Theory of Generalization
Generalization emerges as a geometric outcome of landing in low‑friction valleys carved by erosion, not from explicit regularization or compression.
This relates grokking, mode connectivity, intrinsic dimension, and Fisher spectra to a single cause.
D) Tooling & Infrastructure
Training libraries may integrate real‑time Fisher‑rank and intrinsic‑dimension monitors as first‑class citizens.
Auto‑phase optimizers could replace hand‑crafted LR schedules.
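As a sketch of what such tooling might look like, the fragment below composes the earlier sketches: a monitor that periodically recomputes a diagnostic on a probe batch and asks a phase schedule for the next learning rate. All class and hook names here are hypothetical, not an existing library API.

```python
# Hypothetical monitor composing the earlier sketches: every `every` steps it
# recomputes a diagnostic (e.g. intrinsic dimension or Fisher effective rank)
# and lets the phase schedule decide the learning rate. Names are illustrative.
class ManifoldMonitor:
    def __init__(self, schedule, probe_fn, every: int = 100):
        self.schedule = schedule          # e.g. TargetAcquisitionSchedule above
        self.probe_fn = probe_fn          # callable returning a scalar diagnostic
        self.every = every
        self.lr = schedule.update(float("inf"))   # start in Phase 1

    def on_step(self, step: int) -> float:
        if step % self.every == 0:
            self.lr = self.schedule.update(self.probe_fn())
        return self.lr                    # learning rate to apply at this step
```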