Neural Fixed Points, Lagrangian, and Model Architecture
神经不动点、拉格朗日形式与模型架构
Why do modern neural networks admit so many different model architectures and attention mechanisms? The usual explanation is engineering convenience: one design improves memory, another improves speed, another improves long-context behavior. But that answer remains descriptive. It does not explain why architecture belongs to the mathematics of neural learning in the first place. The Lagrangian formulation of neural fixed points gives a deeper answer. In your formulation, the primitive neural equation is the fixed-point residual fθ(x)−x, and the Lagrangian extends it by adding a constraint term,
L(θ, λ)=Ex||fθ(x)−x||²+λg(θ).
Crucially, the formulation states that “architectural or data constraints” are represented by g(θ)=0, and that normalization, attention structure, parameterization, and data geometry impose the boundary conditions restricting the admissible region.
为什么现代神经网络会容纳如此多不同的模型架构和注意力机制?通常的解释是工程上的便利:一种设计提升内存效率,另一种提升速度,还有一种改善长上下文行为。但这种回答仍然只是描述性的。它并没有解释:为什么“架构”首先就属于神经网络学习的数学本体。神经不动点的拉格朗日表述给出了一个更深的答案。在你的表述中,原始的神经网络方程是不动点残差 fθ(x)−x,而拉格朗日形式则通过加入约束项将其扩展为
L(θ, λ)=Ex||fθ(x)−x||²+λg(θ)
关键在于,这一表述明确指出,“架构约束或数据约束”由 g(θ)=0 表示,而归一化、注意力结构、参数化方式以及数据几何,共同施加了限制可行区域的边界条件。
Seen this way, model architecture is not outside the neural equation. It enters through the constraint. A Transformer with standard multi-head attention, a grouped-query design, a latent-attention design, a sparse-attention design, or a hybrid architecture should not be treated as unrelated inventions. They are different realizations of the constraint structure acting on the same learnable numerical system. The weights W ⊂ θ shape the geometry of the forward operator, while λ enforces the admissible region through g(θ)=0. Different architectural choices therefore change which deformations are allowed, which intrinsic pathways remain stable, and which fixed points become reachable. In the Deep Manifold reading, the equilibrium is not taken on one smooth manifold, but on a moving stacked piecewise manifold whose pieces evolve during training.
从这个角度看,模型架构并不是神经网络方程之外的东西。它是通过约束项进入方程的。带有标准多头注意力的 Transformer、分组查询设计、潜在注意力设计、稀疏注意力设计,或者混合架构,都不应被看作彼此无关的发明。它们是在同一个可学习数值系统上,对约束结构的不同实现。权重 W ⊂ θ 决定前向算子的几何形状,而 λ 则通过 g(θ)=0 来施加可行区域的约束。因此,不同的架构选择会改变哪些形变是允许的,哪些内在路径能够保持稳定,以及哪些不动点最终可以达到。在深度流形的解读下,这种平衡并不是建立在单一光滑流形之上,而是建立在一个随训练不断演化的、堆叠的分片流形之上。
The 2019 Exploring Randomly Wired Neural Networks for Image Recognition paper supports this point from the architecture side. It introduces the idea of a network generator as a mapping from a parameter space to a space of neural architectures, and says directly that the generator determines how the computational graph is wired. It also makes clear that architecture search does not explore a neutral universe of all possible models: the generator itself already restricts the feasible network space. This is very close to the argument developed here, only expressed in more empirical language. In this view, the generator is one concrete realization of g(θ)=0: not the learned weights themselves, but the structural rule that defines what motions, couplings, and computation patterns are admissible before learning begins.
2019 年的Exploring Randomly Wired Neural Networks for Image Recognition 论文从架构一侧支持了这一观点。它提出了“网络生成器”的概念,即把参数空间映射到神经网络架构空间,并明确指出,这个生成器决定了计算图是如何连接与布线的。论文也清楚表明,架构搜索并不是在一个中性的、包含所有可能模型的宇宙中自由探索;生成器本身已经预先限制了可行的网络空间。这与本文所展开的论点非常接近,只是采用了更偏经验性的表达方式。在这一视角下,生成器正是 g(θ)=0 的一种具体实现:它不是学习得到的权重本身,而是在学习开始之前,就预先规定哪些运动、哪些耦合关系、以及哪些计算模式可以进入系统的结构性规则。
LLM Architecture Gallery by Sebastian Raschka and his visual guide to attention variants make the same landscape visible at today’s LLM scale. The gallery maps the dominant open-weight architecture families, while the companion article surveys the main attention forms already adopted in leading models and notes that many more, mostly niche, variants remain outside its scope. In itself, that observation is descriptive. But in the present argument, it has a deeper meaning. The fact that the literature already contains 70-plus attention variants should not be interpreted as a proliferation of unrelated mechanisms. It is what one should expect once model architecture is understood through the Lagrangian constraint term g(θ)=0. Different attention mechanisms are different constraint realizations acting on the same learnable numerical system.
Sebastian Raschka 的 LLM Architecture Gallery 以及他那篇关于注意力变体的可视化指南。使这一点在现代大语言模型的尺度上变得清晰可见。前者描绘了当前主导性的开放权重架构家族,后者则梳理了已经在领先模型中被采用的主要注意力形式,并指出在其讨论范围之外,还存在许多更多、且大多较为小众的变体。就其本身而言,这一观察仍然是描述性的。但在本文的论证中,它具有更深的含义。文献中已经出现 70 多种注意力变体,这一事实不应被理解为彼此无关机制的无序扩张;相反,一旦通过拉格朗日约束项 g(θ)=0 来理解模型架构,这恰恰是应当预期出现的现象。不同的注意力机制,本质上是在同一个可学习数值系统上,对约束的不同实现
The numerical-computation tradition has always been pragmatic. A solver is not judged by whether it resembles a clean symbolic theorem, but by whether it converges, remains stable, and produces useful results under the governing constraints. In that sense, neural networks inherit the deepest instinct of numerical computation. They are not theorem provers by nature; they are learnable solvers. This is why neural architectures can look so diverse. Once computation is understood as residual reduction under boundary conditions, the method itself is not sacred. Different architectures, attention mechanisms, routing rules, and structural tricks are all admissible as long as they help the learned system reach stable and effective computation. What looks like architectural proliferation is not intellectual disorder, but numerical pragmatism operating at scale.
数值计算的传统,本质上一直是务实的。一个求解器是否有价值,并不取决于它是否像一个漂亮的符号定理,而取决于它是否能够在给定约束下收敛、稳定,并产生可用结果。从这个意义上说,神经网络继承了数值计算最深层的气质。它天然不是定理证明器,而是可学习的求解器。这也解释了为什么神经网络架构会如此多样。一旦把计算理解为边界条件下对残差的持续消减,那么方法本身就不再神圣不可变。不同架构、不同注意力机制、不同路由规则、不同结构技巧,只要有助于系统达到稳定而有效的计算,都是允许的。表面上看似架构百花齐放,实则是数值计算的务实主义在大规模学习系统中的自然展开。
At the same time, this flexibility creates a profound tension. Classical numerical computation is usually anchored by a known governing equation and a well-defined fixed point. Neural networks are not. Their fixed points are learned, distributed, and often dynamic, shaped by data geometry rather than prescribed physics. Without a single explicit fixed point, the network is free to explore many admissible pathways across its learned manifold. That is why modern AI can absorb so many model forms and so many attention variants: the underlying system is not locked into one rigid path of computation. From the Deep Manifold perspective, this is the real source of architectural richness. The network behaves like learnable numerical computation without a fully pre-specified law, so many structural realizations can coexist, each crossing the same boundary in a different way, each useful if it contributes to convergence.
但与此同时,这种灵活性也带来了一个更深的张力。经典数值计算通常锚定在已知控制方程和明确不动点之上,而神经网络并非如此。它的不动点是学习出来的、分布式的、并且往往是动态的,由数据几何塑造,而不是由物理定律预先规定。正因为没有一个单一显式的不动点,网络才可以在其学习到的流形上探索多种可行路径。这正是为什么现代 AI 可以容纳如此多的模型形式与注意力变体:底层系统并没有被锁定在唯一的计算路径上。从深度流形的视角看,这正是架构丰富性的根源。神经网络表现为一种没有被完全预设定律束缚的可学习数值计算,因此可以存在多种结构实现,每一种都以不同方式跨越同一边界,只要它能促进收敛,就具有存在的合理性。





