Quiet day for optimization-dynamics and spectral-optimizer work — today's math.OC listing is dominated by classical control (MPC, reachability) with only a handful of ML-adjacent theory papers. Top hit: a rigorous one-layer-NN gradient-descent analysis with spectral bias and a new activation. Also worth skimming: a rank-decay analysis of Sinkhorn-normalized attention, and a structural result on PŁ functions from Boumal et al.
Fabricio Macià, Shu Nakamura · arXiv:2604.07715 · cs.LG
Exactly in your lane: rigorous gradient-descent convergence on a one-hidden-layer ReLU net with L2 loss, plus a spectral-bias proof. Simple NN proxies + training-dynamics — same template as your linear-regression / bigram references.
abstract
We analyze a simple one-hidden-layer neural network with ReLU activation functions and fixed biases, with one-dimensional input and output. We study both continuous and discrete versions of the model, and we rigorously prove the convergence of the learning process with the L^2 squared loss function and the gradient descent procedure. We also prove the spectral bias property for this learning process. Several conclusions of this analysis are discussed; in particular, regarding the structure and properties that activation functions should possess, as well as the relationships between the spectrum of certain operators and the learning process. Based on this, we also propose an alternative activation function, the full-wave rectified exponential function (FReX), and we discuss the convergence of the gradient descent with this alternative activation function.
Transformer training-dynamics flavor: theoretical bound on rank collapse under Sinkhorn-normalized attention, showing doubly exponential decay with depth. Adjacent to why Muon / preconditioning matter for deep networks.
abstract
The self-attention mechanism is central to the success of Transformer architectures. However, standard row-stochastic attention has been shown to suffer from significant signal degradation across layers. In particular, it can induce rank collapse, resulting in increasingly uniform token representations, as well as entropy collapse. Recent work has highlighted the benefits of doubly stochastic attention as a form of entropy regularization. In this paper, we study rank collapse across network depth and show that doubly stochastic attention matrices normalized with Sinkhorn algorithm preserve rank more effectively than standard Softmax row-stochastic ones. We derive a theoretical bound for the pure self-attention rank decay when using Sinkhorn normalization and find that rank decays to one doubly exponentially with depth.
Nicolas Boumal, Christopher Criscitiello, Quentin Rebjock · arXiv:2604.07972 · math.OC
Structural rigidity result for the PŁ condition — the standard "beyond strong convexity" assumption used everywhere in deep-learning optimization theory. Boumal et al. show smooth PŁ functions on contractible manifolds are *exactly* nonlinear sums-of-squares, which directly bears on what assumptions do and do not constrain landscape geometry.
abstract
The Polyak-Łojasiewicz (PŁ) condition is often invoked in nonconvex optimization because it allows fast convergence of algorithms beyond strong convexity. A function f on a Riemannian manifold M is globally PŁ if ||∇f(x)||² ≥ 2μ(f(x) − f*) for all x. We show that if f is also smooth and M is contractible, then the PŁ condition imposes a firm global structure: such a function is necessarily of the form f(x) = f* + ||φ(x)||² (a nonlinear sum of squares) where φ is a submersion. The proof hinges on showing that the end-point map of negative gradient flow on f is a trivial smooth fiber bundle over S. This leads to a striking dichotomy: either the minimizer set is diffeomorphic to Euclidean space (and f becomes a convex quadratic under smooth coordinates), or S must display genuinely exotic geometry.
SGD-style nonconvex convergence theory, but in the decentralized + biased-gradient-oracle regime. Tangential to your core interest (optimization dynamics of single-node deep learning), but the bias-handling analysis could be useful if you care about compressed-communication effects on training.
abstract
Decentralized stochastic optimization has emerged as a fundamental paradigm for large-scale machine learning. However, practical implementations often rely on biased gradient estimators arising from communication compression or inexact local oracles, which severely degrade convergence. We propose Decentralized Momentum Tracking with Biased Gradients (Biased-DMT). We establish a comprehensive convergence theory in nonconvex settings and show it achieves linear speedup in the number of agents. The analysis decouples network topology from data heterogeneity. When the gradient oracle introduces only absolute bias, the method eliminates structural heterogeneity error and converges to the exact physical error floor.
Momentum + variance reduction analysis, but again in the decentralized-network regime rather than centralized deep-learning training. Skim-level interest.
abstract
Decentralized optimization over directed networks is frequently challenged by asymmetric communication and the inherent high variance of stochastic gradients. We propose the Stochastic Momentum Tracking Push-Pull (SMTPP) algorithm, which tracks the momentum term rather than raw stochastic gradients within the Push-Pull architecture. This design decouples the variance reduction capacity from the algebraic connectivity of the graph. SMTPP rigorously compresses the unavoidable steady-state error floor into a minimal neighborhood determined by network connectivity and gradient variance.
Last-iterate convergence analysis — interesting from a dynamics lens since most SGD theory also moved from averaged to last-iterate recently. Not directly about deep-learning optimizers, but the proof techniques travel.
abstract
Monotone variational inequalities provide a unifying framework for convex minimization, equilibrium computation, and convex-concave saddle-point problems. We develop parameter-free extragradient methods with non-asymptotic last-iterate guarantees. For globally Lipschitz operators, our algorithm achieves an o(1/√T) last-iterate rate. We then extend the framework to locally Lipschitz operators via backtracking line search.