Neural Network Learnability
In the summer of 2016, I read Deep Learning by Goodfellow, Bengio, and Courville. I was struck by how simple the mathematics behind neural networks really was. Yet, remarkable results kept appearing in the field. I spent the next four months searching for a silver bullet, some deeper mathematical insight behind Deep Learning.
By Christmas, I had reached a sobering conclusion: there was nothing there. I told myself that without a solid mathematical foundation, Deep Learning could only go so far. I felt a bit deflated after the holidays, like chasing the moon for months, only to find myself standing in the fog.
In the spring of 2017, I built a simple vision toy example as part of my hands-on learning. Through it, I sensed for the first time that neural networks possess remarkable learnability.
Black and White Toy Classification (2017)
I designed a simple classification task, referred to as the "Black & White" toy experiment. The dataset consisted of two curated image classes downloaded from the internet:
Class 1: Iconic landscape photographs
Class 2: Famous quotations rendered as white text on black background
The experiment was designed to probe how a CNN distinguishes classes based on structural or compositional cues.
Surprisingly, when I introduced random, unrelated images into the classification pipeline, two of them were consistently misclassified as quotations. Upon closer analysis, the misclassified images exhibited the following visual traits:
High-contrast composition (white-on-black, similar to text)
Parallel alignment patterns, mimicking lines of textual layout
These results led to two key insights:
Contrast Composition as Learned Feature:
The model had internalized the contrast pattern (white text on black) as a discriminative feature, even when no actual text was present.
Parallel Line Geometry as Semantic Proxy:
The model inferred that parallel horizontal alignments indicated the presence of text-like structure, independent of actual content.
This experiment represents the activation of intermediate-level topological features on the representational manifold—features that are not semantically salient to humans, yet mathematically consistent and discoverable by the model during training. These features exist outside the typical cognitive salience filter of the human visual system, but lie well within the model’s learnable region once the manifold reaches a suitable representational granularity.
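For concreteness, here is a minimal sketch of the kind of two-class CNN setup described above. The architecture, image size, and folder layout ("data/landscapes", "data/quotes") are my illustrative assumptions, not the original experiment's code.

```python
# Minimal two-class CNN sketch for a "landscape vs. rendered quotation" task.
# Architecture, image size, and folder layout are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.Grayscale(num_output_channels=1),  # contrast, not color, carries most of the signal
    transforms.ToTensor(),
])

# Expects data/landscapes/*.jpg and data/quotes/*.jpg (hypothetical paths).
dataset = datasets.ImageFolder("data", transform=transform)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 32 * 32, 2),  # two logits: landscape vs. quotation
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```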
Entity Embeddings of Categorical Variables (2016)
In the 2016 paper "Entity Embeddings of Categorical Variables", a striking phenomenon was observed while training a neural network on the Rossmann Store Sales dataset (Kaggle competition). The dataset includes various categorical variables such as city names, but no explicit geolocation data (e.g., no GPS coordinates or map references).
Yet, when the learned embeddings of these categorical variables (e.g., cities, states) were projected into a 2D space, a surprising pattern emerged: the embeddings exhibited a spatial arrangement closely resembling the actual geographic map of Germany.
By identifying and aligning indirect correlations across the data distribution, the network inferred hidden structure: the embeddings implicitly revealed latent geospatial relationships without ever being told what "location" is. The model learned the geometry of meaning, not from coordinates, but from contextual co-occurrence. This illustrates that deep networks are capable of unsupervised structure recovery: recovering grounded, interpretable relationships from proxy signals through statistical convergence.
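A minimal sketch of the mechanism involved follows: an embedding layer for a categorical variable, trained as part of a regression model and later projected to 2D. The embedding width, feature layout, and use of PCA are assumptions for illustration, not the paper's exact setup.

```python
# Sketch: learn an embedding for a categorical variable (e.g., store state),
# then project it to 2D to inspect its geometry. Dimensions are assumptions.
import torch
import torch.nn as nn

num_states = 16      # e.g., German federal states
embedding_dim = 6    # embedding width chosen for illustration

class SalesModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, embedding_dim)
        self.head = nn.Sequential(
            nn.Linear(embedding_dim + 1, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, state_idx, other_features):
        x = torch.cat([self.state_emb(state_idx), other_features], dim=1)
        return self.head(x)

model = SalesModel()
# ... train on (state_idx, other_features) -> sales with an MSE loss ...

# After training, project the learned state embeddings to 2D with PCA and
# plot them; the claim is that their layout echoes the geographic map.
with torch.no_grad():
    emb = model.state_emb.weight              # (num_states, embedding_dim)
    _, _, v = torch.pca_lowrank(emb, q=2)
    coords_2d = emb @ v                       # (num_states, 2)
```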
Both examples demonstrate a level of learnability that exceeds that of the average individual human. It is plausible that, out of 1,000 individuals, only one or two possess such a level of learnability.
Learnability At Scale
Neural networks exhibit a remarkable property I call Learnability at Scale.
On one end, for vision tasks, they operate with micro-scale learnability, processing pixel-level information with a precision far beyond that of most humans. On the other end, they scale to internet-scale learnability, absorbing and generalizing from vast amounts of knowledge at a scope no individual or group of humans could match.
Micro-scale learnability: Neural networks, especially CNNs and vision transformers, excel at learning from fine-grained inputs like pixels. They detect subtle patterns, textures, and spatial relationships that are often beyond human perception. This makes them powerful in domains like medical imaging, defect detection, and super-resolution, where precision at the smallest scale matters.
Internet-scale learnability: Large-scale language models demonstrate the opposite extreme: the ability to absorb, synthesize, and generalize across terabytes of data from the internet. No human (or even coordinated team) can read and integrate such volume. Neural networks do this by encoding statistical and semantic structure across many domains.
What makes this learnability remarkable is that the same general architecture, with layers, weights, and activations, can operate across these vastly different scales. Humans excel at reasoning and abstraction, but not at sustained micro-level perception or large-scale pattern assimilation. Neural networks bridge that gap, not by mimicking human cognition, but by optimizing for different forms of learnability. Such learnability is not just a technical feature; it is a paradigm shift.
Neural Network Algebra
I worked on numerical computation for mechanical systems decades ago, where we computed intrinsic physical quantities like stress and strain. In contrast, neural networks do not compute intrinsic quantities in this sense. This leads to a foundational question: What are neural networks actually computing?
To answer that, we had to examine neural network algebra. One surprising insight: their behavior aligns more closely with abstract algebra, specifically, Groups (群), Rings (环), and Fields (域). These algebraic constructs offer a unified way to study structure and transformation, and help describe what neural networks are truly doing under the hood.
By framing neural networks as compositions of algebraic structures operating over a field (typically ℝ), and viewing training as a fixed-point iteration process, we discovered that neural networks possess a rich algebraic structure and are a form of numerical computation at their core. Our study concludes with seemingly contradictory properties of neural networks: abundance, redundancy, and abstraction.
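As a rough numerical illustration of this framing (a toy sketch, not the paper's formalism): each layer below is an affine map over ℝ composed with a nonlinearity, and gradient descent is iterated as an update map T(w) = w - lr * grad L(w), whose stationary points are fixed points T(w*) = w*.

```python
# Toy illustration: a network as a composition of affine maps over R, and
# training as iterating an update map T(w) = w - lr * grad(w) toward a
# fixed point T(w*) = w*. Shapes and hyperparameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # inputs over the field R
true_w = rng.normal(size=(4, 1))
y = np.tanh(X @ true_w)                  # target from one affine map + nonlinearity

W1 = rng.normal(size=(4, 8)) * 0.1
W2 = rng.normal(size=(8, 1)) * 0.1
lr = 0.05

def forward(X, W1, W2):
    h = np.tanh(X @ W1)                  # layer 1: affine map composed with tanh
    return h @ W2, h                     # layer 2: affine map

for step in range(2000):
    pred, h = forward(X, W1, W2)
    err = pred - y                       # gradient of 0.5 * squared error
    grad_W2 = h.T @ err / len(X)
    grad_h = err @ W2.T * (1 - h**2)     # backprop through tanh
    grad_W1 = X.T @ grad_h / len(X)
    # One application of the update map T; training seeks its fixed point.
    W1_new, W2_new = W1 - lr * grad_W1, W2 - lr * grad_W2
    drift = np.linalg.norm(W1_new - W1) + np.linalg.norm(W2_new - W2)
    W1, W2 = W1_new, W2_new

print("final update drift |T(w) - w|:", drift)   # small drift ~ near a fixed point
```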
Abstraction enables generalization and transfer; it is essential for applying learned knowledge to new domains. But highly abstract systems tend to be minimal by design, capturing only what is essential. This can come at the cost of abundance (richness of representation) and redundancy, both of which are often helpful for robustness, adaptability, and error correction.
A neural network is a learning system: learning to abstract. There are two reasons for its abundance.
Abundance before abstraction: Effective abstraction requires first absorbing a wide range of information. In neural networks, the model collects abundant patterns and details before learning to compress or abstract them. If this intake is incomplete, the abstraction built on it may be biased, brittle, or simply wrong.
Direction of abstraction: Traditional abstraction (e.g., in human reasoning or symbolic logic) often moves upward: from specifics to general principles. Neural networks, in contrast, perform abstraction through layered downward transformations. Each layer distills features at increasingly granular levels, making the process more compositional and distributed.
This layered “downward” mechanism enables rich, multi-scale abstraction, providing greater accuracy for counting, classification, and next-token prediction (see Primitive Neural Network Mathematics).
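A small illustration of this layered view (the layer widths and input size are arbitrary choices): as an input flows downward through the stages, spatial resolution shrinks while channel depth grows, so each stage re-represents the previous one at a different granularity.

```python
# Illustration of layered "downward" abstraction: spatial resolution shrinks
# while channel depth grows from stage to stage. Layer widths are arbitrary.
import torch
import torch.nn as nn

stages = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
])

x = torch.randn(1, 3, 128, 128)    # a single 128x128 RGB input
for i, stage in enumerate(stages, start=1):
    x = stage(x)
    print(f"stage {i}: {tuple(x.shape)}")
# stage 1: (1, 16, 64, 64)
# stage 2: (1, 32, 32, 32)
# stage 3: (1, 64, 16, 16)
```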
Redundancy and Small Language Model (SLM)
Redundancy is essential for abundance, especially because model training is done in batches and never sees the full training data at once. In our Part 1 paper (Deep Manifold Part 1: Anatomy of Neural Network Manifold), we refer to the space over which the model learns as the learning space.
I’ve seen this principle play out in two large-scale, real-world enterprise vision projects I led for product quality assurance:
Image Classification: The product held around 40% of the global market. Its defects were invisible to the human eye in the images, making it a particularly difficult classification task. We began with ResNet-152 (60.2M parameters) and later matched its performance with SqueezeNet (1.25M parameters), using just about 2% of ResNet-152's parameter count.
Object Detection: The product had over 30% of the global market. Objects were difficult to distinguish; even humans relied on part numbers or symbols. We started with Faster R-CNN using a ResNet-101 backbone (44.5M parameters) and achieved comparable results with YOLOv7-Tiny (6.2M parameters), only about 14% of the Faster R-CNN/ResNet-101 parameter count.
(Please see the Appendix for my motivation to start with a much smaller model.)
These projects confirmed that redundancy at the system level enables learning under partial views, and that compact architectures can succeed once the abstraction is properly aligned with the learning space.
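As a quick sanity check on the scale gap cited above, one can compare parameter counts directly. The torchvision variants below (resnet152, squeezenet1_0) are my assumed stand-ins for the project models, which may have differed slightly.

```python
# Compare parameter counts of a large and a compact vision backbone.
# resnet152 and squeezenet1_0 are assumed stand-ins for the project models.
from torchvision import models

def param_count(model):
    return sum(p.numel() for p in model.parameters())

resnet = models.resnet152(weights=None)
squeeze = models.squeezenet1_0(weights=None)

n_resnet, n_squeeze = param_count(resnet), param_count(squeeze)
print(f"ResNet-152: {n_resnet / 1e6:.1f}M parameters")    # ~60.2M
print(f"SqueezeNet: {n_squeeze / 1e6:.2f}M parameters")   # ~1.25M
print(f"ratio:      {100 * n_squeeze / n_resnet:.1f}% of ResNet-152")
```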
We identified the Transformer architecture as the most powerful under the lens of algebra. However, as we pointed out in Primitive Neural Network Mathematics, the high dimensionality of token representations is likely to produce abundant redundancy.
While models like GPT, Gemini, and Grok are general-purpose, designed to cover a wide range of domains, disciplines, and subjects, LLMs for specific domains can be much smaller. This gives real hope for the development of Small Language Models, since their learning space is much smaller. Even for general-purpose models, data distillation is one way to reduce the learning space and other techniques will likely emerge in the near future.
Current scaling laws are measured by model parameter count and token size. Our study suggests that model size should instead be determined by the learning space, and that the learning space is shaped by token distribution, not necessarily by token size. In other words, the best model design practice should be data driven, to avoid unnecessary redundancy.
Neural network learnability depends not just on model size or data volume, but on how well the model structure aligns with the learning space, the distributional and structural complexity of the data. A network with excessive parameters may still struggle if the learning space is poorly represented or too chaotic. Conversely, a smaller model may learn well if the learning space is compact and well-structured.
This reframes scaling: it’s not about making models bigger, but about designing them to match the learnability demands of the data. Since token distribution defines the learning space, and learning space governs learnability, model design should be driven by data—not just scale. This is the key to reducing redundancy without sacrificing performance.
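As a toy illustration of "distribution over count" (the entropy measure below is only a stand-in for learning-space complexity, not a definition from our papers): two corpora with identical token counts can have very different token distributions.

```python
# Toy illustration: same token count, very different token distributions.
# Shannon entropy over token frequencies is only a stand-in for
# "learning space" complexity here.
import math
from collections import Counter

def token_entropy(tokens):
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Both corpora contain exactly 1000 tokens.
narrow_corpus = ["the", "model", "predicts", "sales"] * 250
broad_corpus = [f"token_{i % 500}" for i in range(1000)]

print(len(narrow_corpus), len(broad_corpus))                              # 1000 1000
print(f"narrow corpus entropy: {token_entropy(narrow_corpus):.2f} bits")  # 2.00 bits
print(f"broad corpus entropy:  {token_entropy(broad_corpus):.2f} bits")   # ~8.97 bits
```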
We can summarize as follows:
Scaling law limitations
Traditional scaling laws rely on model parameter count and token size as rough indicators of capability. But these metrics don't reflect the actual learnability of the task: how much meaningful structure the model needs to extract and represent.
Learning space as a guide for learnability
Learnability is governed by the complexity of the learning space, the effective structure and distribution of information in the data. A model should be sized to match this space, not just scaled blindly.
Token distribution over token size
Two datasets with the same token count can have very different learning spaces. It's the distribution of tokens, how concepts and patterns are arranged, that shapes what the model needs to learn. This directly impacts learnability.
Reducing redundancy through alignment
Oversized models may introduce unnecessary redundancy and inefficiency. A data-driven approach, matching model capacity to the structure of the learning space, maximizes learnability while minimizing waste.
Appendix
Motivation for smaller models: In the late 1980s and throughout most of the 1990s, I worked on large-scale numerical computations for discontinuous media such as sand, gravel, and rock. My models often ran for weeks on multiple IBM AS/400 machines. From that experience, I’ve always suspected that the size of many neural networks is unnecessarily large.
The paper "Exploring Randomly Wired Neural Networks for Image Recognition" caught my attention. It stated, "The results were surprising: several variants of these randomly generated networks achieved competitive accuracy on the ImageNet benchmark." See below for examples of randomly generated networks from the paper.
My takeaway was that the overlap between these random generators could be very small and still yield strong performance, which suggests that small networks are entirely possible.
The success of my two projects (image classification and object detection) wasn't simply about downsizing the model. The key was knowing how to train it, an approach shaped by nearly a decade of experience in numerical computation, rooted in fixed-point theory.