When Graph Neural Networks Meet the Riemann Hypothesis: A Systematic Negative Study
Abstract
Graph Neural Networks have achieved remarkable success on graph prediction tasks across chemistry, social networks, and combinatorial optimization. But what happens when you point a GNN at one of the deepest open problems in mathematics? We conducted seven systematic experiment tracks testing whether GNNs can predict spectral properties of algebraic graphs whose eigenvalues are connected, through a chain of known theorems, to the Riemann zeta function and ultimately to the Riemann Hypothesis.
The answer is a definitive no. Across subgraph and full-graph architectures, Cayley graphs and Farey graphs and Pizer graphs, spectral gap prediction and Hecke eigenvalue prediction, every GNN configuration either fails outright or provides only marginal improvement over a trivial logarithmic baseline. The best result across all experiments is a ΔR² of +0.042 over linear regression. We trace this failure to a single theoretical root cause: vertex-transitivity, a structural property shared by Cayley graphs, makes every node look identical to a message-passing GNN, destroying the local structural diversity these networks depend on. We formalize this through the Weisfeiler-Leman hierarchy and prove that standard message-passing GNNs are limited to functions of diameter and degree on vertex-transitive graphs.
To our knowledge, no published work has applied GNNs to number-theoretic graph structures. These experiments constitute the first systematic investigation at this intersection, and they establish clear, hard limits.
Introduction: Why GNNs and Number Theory?
The motivation behind this study starts with an enticing chain of mathematical connections. Consider a 4-regular graph built from the group SL(2, Fₚ), the group of 2×2 matrices with determinant 1 over the finite field with p elements. The eigenvalues of this graph's adjacency matrix encode expansion properties. Through the Lubotzky-Phillips-Sarnak construction [LPS, 1988] and Pizer's theorem [Pizer, 1990], those eigenvalues coincide with Hecke eigenvalues of modular forms. Those eigenvalues, in turn, control the analytic behavior of L-functions. And L-functions connect, through well-studied machinery, to the zeros of the Riemann zeta function ζ(s).
The chain looks like this:
Graph eigenvalues → Hecke eigenvalues → L-functions → ζ(s)
If a machine learning model could predict spectral properties of these graphs from their structure alone, it would open a computational pathway into territory that mathematicians have explored with purely symbolic methods for over a century. Graph Neural Networks seem like the natural tool for this job. They excel at learning relationships between local graph structure and global properties. Molecular property prediction, social network analysis, combinatorial optimization: GNNs dominate these domains by aggregating information from local neighborhoods.
We decided to test whether this promise extends to algebraic graphs connected to number theory. The answer, as it turns out, reveals more about the limitations of GNNs than about the Riemann Hypothesis.
Mathematical Background
This section introduces the key mathematical objects at a level accessible to ML researchers. No number theory background is assumed.
Cayley Graphs of SL(2, Fₚ)
A Cayley graph is constructed from a group and a generating set. You place each group element at a vertex, and connect two vertices with an edge whenever one can be reached from the other by multiplying by a generator. The resulting graph inherits structural properties from the group.
We study Cayley graphs of SL(2, Fₚ), which has p(p² - 1) elements. With the standard generators (the so-called fundamental roots and their inverses), the resulting graphs are 4-regular. For p = 2, the graph has just 6 vertices. For p = 101, it has over a million.
These graphs have a crucial property called vertex-transitivity: every vertex looks exactly like every other vertex. You can map any vertex to any other through a symmetry of the graph (specifically, by left-multiplication in the group). This means the local neighborhood around each node is structurally identical. Every node has degree 4, the same clustering coefficient, the same local topology.
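To make this concrete, here is a brute-force sketch that enumerates SL(2, Fₚ) and wires up the Cayley graph. The unipotent generator pair below is an illustrative choice, not necessarily the exact generating set used in the experiments.

```python
from itertools import product

def sl2_cayley(p, gens):
    # Enumerate SL(2, F_p): 2x2 matrices (a, b, c, d) with ad - bc = 1 mod p.
    elems = [m for m in product(range(p), repeat=4)
             if (m[0] * m[3] - m[1] * m[2]) % p == 1]
    idx = {m: i for i, m in enumerate(elems)}

    def mul(a, b):  # 2x2 matrix product mod p
        return ((a[0] * b[0] + a[1] * b[2]) % p, (a[0] * b[1] + a[1] * b[3]) % p,
                (a[2] * b[0] + a[3] * b[2]) % p, (a[2] * b[1] + a[3] * b[3]) % p)

    # Connect each element m to m*g for every generator g.
    edges = {tuple(sorted((idx[m], idx[mul(m, g)])))
             for m in elems for g in gens}
    return len(elems), edges

p = 3
x, y = (1, 1, 0, 1), (1, 0, 1, 1)                   # unipotent generators
gens = [x, (1, p - 1, 0, 1), y, (1, 0, p - 1, 1)]   # x, x^-1, y, y^-1
n, edges = sl2_cayley(p, gens)
print(n, len(edges))  # 24 vertices = 3*(3^2 - 1), 48 edges (4-regular)
```

This scales only to small primes; the experiments use dedicated generation code, but the structure (group elements as vertices, generator multiplication as edges) is exactly this.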
Spectral Gap and Ramanujan Graphs
The spectral gap of a graph is the difference between its two largest eigenvalues. Intuitively, it measures how well-connected and "expanding" the graph is. A large spectral gap means random walks mix rapidly and information spreads efficiently.
A Ramanujan graph is a graph whose non-trivial eigenvalues are bounded by 2√(d-1), where d is the degree. For our 4-regular graphs, this bound is approximately 3.464. All Cayley graphs in our dataset satisfy this condition. The Ramanujan property is significant because it connects graph theory to deep results in number theory: the bound on graph eigenvalues mirrors Deligne's bound on Hecke eigenvalues of modular forms [Deligne, 1974], and both are manifestations of the same underlying algebraic structure.
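Both quantities take only a few lines of NumPy to check. Here K₅ stands in as a toy 4-regular graph (complete graphs are, in fact, Ramanujan):

```python
import numpy as np

def spectral_gap(adj):
    # Eigenvalues of a symmetric adjacency matrix, in descending order.
    eig = np.sort(np.linalg.eigvalsh(adj))[::-1]
    return eig[0] - eig[1], eig

# K5 is 4-regular: eigenvalues are 4 (once) and -1 (four times).
A = np.ones((5, 5)) - np.eye(5)
gap, eig = spectral_gap(A)

# Ramanujan check: all non-trivial eigenvalues bounded by 2*sqrt(d - 1).
d = 4
ramanujan = all(abs(l) <= 2 * np.sqrt(d - 1) + 1e-9 for l in eig[1:])
print(round(gap, 6), ramanujan)  # 5.0 True
```

For the million-node Cayley graphs in the dataset, dense `eigvalsh` is infeasible; that is why the experiments use sparse Lanczos iteration instead.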
Hecke Eigenvalues and L-Functions
Hecke eigenvalues are numbers attached to modular forms, which are special functions with deep symmetry properties. For weight-2 cusp forms of level p, denoted S₂(Γ₀(p)), each form has a sequence of Hecke eigenvalues a₂, a₃, a₅, ... that encode arithmetic information about the prime p. The dimension of the cusp form space depends on p in a non-trivial way (for prime p it equals the genus of the modular curve X₀(p)), and how that space splits into distinct eigenvalue sequences depends on deeper arithmetic properties.
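One concrete handle on the dimension: for prime p > 3 it equals the genus of the modular curve X₀(p), which has a classical closed form depending only on p mod 12. A quick sketch:

```python
def dim_S2_Gamma0(p):
    # Genus of X_0(p) for prime p > 3, which equals dim S_2(Gamma_0(p)).
    # Classical formula, case split on p mod 12.
    r = p % 12
    if r == 1:
        return (p - 13) // 12
    if r == 5:
        return (p - 5) // 12
    if r == 7:
        return (p - 7) // 12
    return (p + 1) // 12  # r == 11

print([dim_S2_Gamma0(p) for p in (11, 13, 17, 19, 23, 37)])  # [1, 0, 1, 1, 2, 2]
```

The zero at p = 13 is why primes with trivial cusp form spaces have to be excluded from the Hecke eigenvalue dataset.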
L-functions are Dirichlet series built from Hecke eigenvalues. Their analytic properties, particularly the location of their zeros, connect directly to the Riemann zeta function. The Generalized Riemann Hypothesis predicts that all non-trivial zeros of every L-function lie on the critical line Re(s) = 1/2.
The Pizer Graph Construction
The Pizer graph for a prime p is a particularly clean bridge between graph theory and number theory. Its adjacency matrix is exactly the Hecke operator T₂ acting on the space of cusp forms S₂(Γ₀(p)). The edge weights are determined by the arithmetic of the congruence subgroup Γ₀(p). If any graph construction should expose a connection between graph structure and Hecke eigenvalues to a learning algorithm, it is this one.
The Farey Graph
The Farey graph is a different kind of mathematical object. Finite truncations Fₙ contain fractions with denominators up to n, connected when |ad - bc| = 1. Unlike Cayley graphs, Farey graphs are not vertex-transitive: node degrees range from 2 to n. This structural diversity makes them an important control: if GNNs fail on Farey graphs too, the problem isn't just vertex-transitivity.
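A minimal sketch of the truncation, restricting to fractions in [0, 1] (an assumed convention that reproduces the 33-node F₁₀ used in the experiments):

```python
from math import gcd

def farey_graph(n):
    # Vertices: reduced fractions a/b in [0, 1] with denominator b <= n.
    # Edges: Farey neighbors, i.e. |ad - bc| = 1.
    verts = sorted({(a, b) for b in range(1, n + 1)
                    for a in range(b + 1) if gcd(a, b) == 1})
    edges = {(u, v) for i, u in enumerate(verts) for v in verts[i + 1:]
             if abs(u[0] * v[1] - u[1] * v[0]) == 1}
    return verts, edges

verts, edges = farey_graph(10)
deg = {v: 0 for v in verts}
for u, v in edges:
    deg[u] += 1
    deg[v] += 1
print(len(verts), min(deg.values()), max(deg.values()))  # 33 2 10
```

The degree spread from 2 up to n is the structural diversity that makes Farey graphs a useful control for the vertex-transitive Cayley family.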
The Weisfeiler-Leman Test and GNN Expressiveness
The Weisfeiler-Leman (WL) test is an algorithm that assigns colors to graph vertices by iteratively refining those colors based on each vertex's local neighborhood structure. It is the standard tool for analyzing how powerful GNNs are. A key theoretical result: message-passing GNNs are at most as powerful as the 1-WL test in distinguishing non-isomorphic graphs [Xu et al., 2019].
On vertex-transitive graphs, 1-WL assigns the same color to every vertex at every iteration. The initial coloring is uniform (all vertices have degree 4), and the refinement step preserves this uniformity because every vertex sees an identical multiset of neighbor colors. This means a message-passing GNN produces the same embedding at every node.
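The refinement argument is easy to verify in a few lines. A sketch of 1-WL (a node's signature is its color plus the sorted multiset of neighbor colors), run on a 4-regular vertex-transitive circulant and, for contrast, on a path:

```python
def wl_refine(adj, rounds=3):
    # 1-WL color refinement: a node's new color is determined by its old
    # color together with the multiset of its neighbors' colors.
    colors = {v: 0 for v in adj}  # uniform start: no node features
    for _ in range(rounds):
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
               for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: relabel[sig[v]] for v in adj}
    return colors

# 4-regular circulant C_12(1, 3): vertex-transitive, so 1-WL never splits.
n = 12
circ = {v: [(v + d) % n for d in (1, -1, 3, -3)] for v in range(n)}
print(len(set(wl_refine(circ).values())))  # 1

# Contrast: a 4-node path refines into endpoint vs. interior classes.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(len(set(wl_refine(path).values())))  # 2
```

On the circulant, every iteration sees the same multiset at every node, so the coloring stays uniform forever — exactly the situation on the Cayley graphs studied here.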
Methodology
The Seven Experiment Tracks
We organized the investigation into seven progressive experiment tracks, each testing a different hypothesis about whether GNNs can learn number-theoretic properties from graph structure.
Track 1: Subgraph-based GNN architectures. Extract local subgraphs from Cayley graphs and train five different architectures (GAT, SIGN, Stratified, Multi-Task, hierarchical ChebConv) to predict the full graph's spectral gap. Hypothesis: local structure encodes global spectral information.
Track 2: Full-graph spectral convolution. Process entire Cayley graphs using Chebyshev polynomial features computed on the normalized Laplacian, augmented with Random Positional Encoding. Hypothesis: global spectral features combined with positional information can capture spectral gap variations.
Track 3: Farey graph spectral gap. Apply the same full-graph architecture to Farey graphs, which are not vertex-transitive and have rich local structural diversity. Hypothesis: structural diversity enables GNN learning.
Track 4: Single-generator Hecke eigenvalues. Predict Hecke eigenvalues of weight-2 cusp forms from single-generator Cayley graphs. This directly tests the Pizer/LPS bridge. Hypothesis: graph structure encodes arithmetic information.
Track 5: Multi-generator Hecke eigenvalues. Augment with multiple generator sets per prime to increase dataset size. Hypothesis: more data enables cross-prime generalization.
Track 6: Multi-generator spectral gap. A control experiment where each graph's own spectral gap (which varies by generator set) is the target. Hypothesis: the GNN can at least distinguish generator sets.
Track 7: Pizer graphs. Predict cross-level Hecke eigenvalues (T₃ eigenvalues from T₂-constructed graphs). This is the mathematically cleanest construction. Hypothesis: the exact correspondence between graph structure and Hecke operators enables learning.
Datasets
Cayley graphs of SL(2, Fₚ). Generated for 26 primes (p = 2, 3, 5, ..., 101) using standard generators from the fundamental roots of the Lie algebra sl(2). All graphs are 4-regular and vertex-transitive. Sizes range from 6 nodes (p = 2) to 1,030,200 nodes (p = 101). Eigenvalues are computed via sparse Lanczos iteration for 22 primes, with 18 having complete spectral data (p = 2, ..., 61). For the multi-generator experiments, roughly 10 generator sets per prime are used (fundamental roots, root-Weyl, and 8 random generator sets), yielding approximately 130 graphs across 13 primes.
Farey graphs. Finite truncations Fₙ constructed via Stern-Brocot breadth-first search for n = 10, 20, ..., 230 (23 graphs). These are not vertex-transitive: node degrees range from 2 to n, with average degree converging to 4.0. Sizes range from 33 to 16,155 nodes.
Hecke eigenvalues. Computed using SageMath for 13 primes (p = 11, 17, ..., 61). Primes with trivial cusp form spaces are excluded. The dataset contains 13 data points, with 1 to 5 eigenforms per prime depending on the dimension of S₂(Γ₀(p)).
Pizer graphs. 81 weighted graphs for primes 47 through 499, with cusp form space dimensions ranging from 4 to 41. The adjacency matrix of each Pizer graph encodes the Hecke operator T₂. The prediction target consists of 9 summary statistics (mean, std, min, max, median, Q25, Q75, radius, pos_frac) of the T₃ Hecke eigenvalue distribution.
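For reference, the nine target statistics can be computed from an eigenvalue vector as below. This is a sketch: "radius" is assumed to mean max |λ| and "pos_frac" the fraction of positive eigenvalues.

```python
import numpy as np

def eigenvalue_targets(vals):
    # The 9 summary statistics of an eigenvalue distribution.
    # Assumptions: 'radius' = max|lambda|, 'pos_frac' = fraction > 0.
    v = np.asarray(vals, dtype=float)
    return {
        "mean": v.mean(), "std": v.std(), "min": v.min(), "max": v.max(),
        "median": float(np.median(v)),
        "Q25": float(np.percentile(v, 25)), "Q75": float(np.percentile(v, 75)),
        "radius": float(np.abs(v).max()), "pos_frac": float((v > 0).mean()),
    }

t = eigenvalue_targets([-2.0, -0.5, 1.0, 3.0])
print(t["radius"], t["pos_frac"])  # 3.0 0.5
```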
Architectures
We tested six distinct GNN architectures across the tracks:
- GAT (Graph Attention Network): 3-layer attention-based message passing with hidden dimension 128 and global mean pooling [Veličković et al., 2018]. Attention mechanisms allow the model to weight neighbor contributions differently, which could in principle help if neighbors carried different information. On vertex-transitive graphs, they don't.
- SIGN (Scalable Inception GNN): Precomputed 2-hop aggregation with MLP mapping (9 → 128 → 1 dimensions), avoiding iterative message passing entirely [Rossi et al., 2020]. This architecture is designed for scalability on large graphs. It performs well on social networks but inherits the same vertex-transitivity limitation.
- Stratified GAT: GAT with BinBalancedBatchSampler to ensure uniform representation of spectral gap ranges in each training batch. The motivation is that spectral gaps are not uniformly distributed across primes, so naive sampling over-represents certain ranges.
- Multi-Task GIN: Graph Isomorphism Network encoder with 5 task heads and uncertainty weighting [Kendall et al., 2018]. The five tasks are spectral gap, algebraic connectivity, diameter estimate, girth, and clustering coefficient. Multi-task learning could help if auxiliary tasks share structure with the primary target.
- ChebConv hierarchical: 3-layer Chebyshev convolution with multi-scale readout [Defferrard et al., 2016]. Chebyshev filters operate in the spectral domain, computing polynomial approximations of graph filters. With K = 3, each layer aggregates information from 3-hop neighborhoods.
- Full-Graph ChebConv: Offline Chebyshev polynomial precomputation T₀(L), T₁(L), ..., Tₖ(L) on the normalized Laplacian, followed by a LayerNorm MLP over the concatenation of mean-pool, max-pool, and graph-level statistics (log N, log E, density, diameter estimate), with 8-dimensional Random Positional Encoding (RPE). This is the primary architecture for Tracks 2 through 7.
Chebyshev Feature Precomputation
The full-graph approach deserves explanation because it represents the most sophisticated architecture in our study. The pipeline works as follows:
- Compute the normalized Laplacian L = I - D^(-1/2) A D^(-1/2).
- Compute Chebyshev polynomial features T₀(L), T₁(L), ..., Tₖ(L) via sparse matrix recursion.
- Concatenate polynomial features with node features (structural descriptors plus RPE).
- Apply mean-pool and max-pool across all nodes, then concatenate with graph-level scalar statistics.
- Feed the resulting vector into a LayerNorm MLP for regression.
Chebyshev features encode global spectral information into per-node representations. T₀(L) assigns each node the value 1. T₁(L) is essentially the Laplacian itself, encoding 1-hop connectivity. Higher-order polynomials capture increasingly global structure. This approach bypasses the message-passing bottleneck because the spectral computation happens offline, but it cannot extract arithmetic information that isn't already reflected in the Laplacian eigenvalues.
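The recursion behind the pipeline is compact enough to sketch directly. The snippet below is a dense-matrix toy (the experiments use sparse recursion), and it also exhibits the degeneracy RPE must fix: on a regular graph with uniform node features, every row of the feature matrix is identical.

```python
import numpy as np

def chebyshev_node_features(A, x, K=3):
    # Steps 1-3 of the pipeline above, as a dense-matrix sketch.
    # Assumes lmax ~= 2, the upper bound for normalized Laplacians.
    n = len(A)
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - Dinv @ A @ Dinv        # normalized Laplacian
    Lt = L - np.eye(n)                     # rescale spectrum into [-1, 1]
    T = [x, Lt @ x]                        # T_0(L)x, T_1(L)x
    for _ in range(2, K + 1):
        T.append(2 * Lt @ T[-1] - T[-2])   # Chebyshev three-term recursion
    return np.stack(T, axis=1)             # (n, K+1): one column per order

# On a regular vertex-transitive graph with uniform inputs, every row is
# identical -- mean-pool and max-pool then coincide on a single vector.
A = np.ones((6, 6)) - np.eye(6)            # K6 as a toy regular graph
F = chebyshev_node_features(A, np.ones(6))
print(F.shape, np.allclose(F, F[0]))       # (6, 4) True
```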
Random Positional Encoding
On vertex-transitive graphs, mean-pool and max-pool over identical node features produce identical vectors for every graph, regardless of the prime. Random Positional Encoding (RPE) addresses this by assigning each node a random 8-dimensional vector drawn once at the start of each training run. The vectors are fixed for the duration of that run. This breaks the symmetry at the feature level, allowing pooling operations to produce non-degenerate representations. The vectors carry no structural or arithmetic information; they merely serve as unique identifiers that let the GNN differentiate nodes. The GNN must then learn to associate aggregate patterns of these identifiers with target values.
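A toy numerical illustration of the symmetry-breaking effect (the sizes and the degree-4 descriptor are illustrative, not the exact feature set from the experiments):

```python
import numpy as np

def pooled(feats):
    # Graph-level readout: concatenation of mean-pool and max-pool.
    return np.concatenate([feats.mean(axis=0), feats.max(axis=0)])

rng = np.random.default_rng(0)

# Two vertex-transitive graphs of different sizes: every node carries the
# same structural descriptor, so pooling alone cannot tell them apart.
g1 = np.full((24, 1), 4.0)       # e.g. Cay(SL(2, F_3), S)
g2 = np.full((120, 1), 4.0)      # e.g. Cay(SL(2, F_5), S)
print(np.allclose(pooled(g1), pooled(g2)))  # True: degenerate

# RPE: one fixed random 8-dim vector per node, drawn once per run.
r1 = pooled(np.hstack([g1, rng.normal(size=(24, 8))]))
r2 = pooled(np.hstack([g2, rng.normal(size=(120, 8))]))
print(np.allclose(r1, r2))  # False: pooled vectors now differ per graph
```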
Evaluation Protocol
For Cayley graph spectral gap experiments, we use leave-one-out cross-validation (LOO-CV) over all graphs with computed eigenvalues. Each of the 18 folds trains on 17 graphs and tests on 1, simulating true cross-prime generalization. For Farey graph experiments, we use a standard split (n ≤ 120 train, n > 120 test). For Hecke eigenvalue experiments, the split is p ≤ 37 train, p ≥ 41 test. For Pizer graphs, we test both LOO-CV (81 folds, cross-prime) and random 80/20 splits (within-distribution). All experiments use R² as the primary metric, with MAE and relative error as secondary metrics.
Every experiment includes a linear regression baseline using only scalar graph features (node count, edge count, dimension) with no structural information whatsoever. The Cayley graph baseline is spectral gap ≈ α · log(N) + β, which captures the dominant logarithmic decay trend. If the baseline beats the GNN, the graph structure provides no value beyond what a simple size measure already gives.
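The LOO-CV protocol for the logarithmic baseline can be sketched as follows. The data below is synthetic, generated exactly from a log-linear trend with illustrative coefficients, so the LOO R² must come out as 1:

```python
import numpy as np

def loo_r2_log_baseline(sizes, gaps):
    # Leave-one-out CV for the baseline gap ~ a*log(N) + b.
    logN = np.log(np.asarray(sizes, dtype=float))
    y = np.asarray(gaps, dtype=float)
    preds = np.empty_like(y)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        a, b = np.polyfit(logN[mask], y[mask], 1)  # fit on the other folds
        preds[i] = a * logN[i] + b                 # predict the held-out prime
    return 1 - ((y - preds) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Synthetic sanity check on exactly log-linear data.
primes = [5, 7, 11, 13, 17, 19, 23, 29]
sizes = [p * (p**2 - 1) for p in primes]
gaps = [-0.135 * np.log(s) + 1.618 for s in sizes]
print(round(loo_r2_log_baseline(sizes, gaps), 6))  # 1.0
```

The GNN is evaluated with exactly the same fold structure, so the ΔR² comparisons in the results are like-for-like.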
Results
Summary Table
| Track | Graph Family | Target | Data Points | GNN R² | Baseline R² | Verdict |
|---|---|---|---|---|---|---|
| 1 | Cayley SL(2, Fₚ), subgraph | Spectral gap | 599 train | < 0 (all 5 arch) | --- | Failed |
| 2 | Cayley SL(2, Fₚ), full graph | Spectral gap | 18 (LOO) | 0.824 | 0.782 | Marginal |
| 3 | Farey Fₙ | Spectral gap | 23 | -7.57 | 0.9999 | Trivial |
| 4 | Cayley SL(2, Fₚ), 1 generator | Hecke a₂ | 13 | -0.21 | 0.41 | Failed |
| 5 | Cayley SL(2, Fₚ), 10 generators | Hecke mean aₚ | ~130 | -0.32 | 0.977 | Failed |
| 6 | Cayley SL(2, Fₚ), 10 generators | Spectral gap | ~130 | -2.12 | 0.43 | Failed |
| 7 | Pizer (Brandt matrices) | Hecke T₃ eigenvalues | 81 (LOO) | 0.000 | 0.057 | Zero generalization |
Five of the seven tracks fail outright. One (Track 2) yields a marginal improvement. The last (Track 7) produces the most striking negative result: exactly zero cross-prime generalization despite a mathematically exact connection between graph structure and target.
Track 1: Subgraph GNNs Collapse on Cross-Prime Test
All five architectures achieve near-identical training performance (R² ≈ 0.69), showing that GNNs can learn from local structure within a training distribution. But every model collapses catastrophically on cross-prime test data:
| Model | Architecture | Train R² | Test R² |
|---|---|---|---|
| GAT Baseline | 3-layer GAT, h=128 | 0.69 | -122 |
| SIGN | Precomputed 2-hop agg. | 0.69 | -80 |
| Stratified | Balanced batch + GAT | 0.69 | -111 |
| Multi-Task | GIN + 5 heads | 0.69 | negative |
| ChebConv Hier. | 3-layer ChebConv, K=3 | 0.69 | -44 |
A test R² of -122 means the model's predictions are far worse than simply predicting the mean. The GNN memorizes training-prime spectral gaps and cannot generalize because every subgraph, regardless of which prime it comes from, looks structurally identical. Vertex-transitivity makes local structure uninformative about the global spectral properties.
Track 2: The Best Result, and Why It Falls Short
The full-graph ChebConv with Random Positional Encoding achieves R² = 0.824 in LOO-CV, improving by ΔR² = +0.042 over the logarithmic baseline (R² = 0.782). This is the only track where a GNN meaningfully outperforms its baseline. The detailed LOO-CV results:
| p | Nodes | Prediction | Actual | Rel. Error |
|---|---|---|---|---|
| 2 | 6 | 1.173 | 2.000 | 41.4% |
| 3 | 24 | 1.053 | 1.268 | 16.9% |
| 5 | 120 | 0.630 | 0.764 | 17.5% |
| 7 | 336 | 0.553 | 0.586 | 5.6% |
| 11 | 1,320 | 0.279 | 0.382 | 26.9% |
| 13 | 2,184 | 0.367 | 0.325 | 12.8% |
| 17 | 4,896 | 0.210 | 0.291 | 27.9% |
| 19 | 6,840 | 0.201 | 0.245 | 18.0% |
| 23 | 12,144 | 0.180 | 0.207 | 13.1% |
| 29 | 24,360 | 0.184 | 0.182 | 0.9% |
| 31 | 29,760 | 0.185 | 0.227 | 18.7% |
| 37 | 50,616 | 0.171 | 0.171 | 0.3% |
| 41 | 68,880 | 0.167 | 0.181 | 7.5% |
| 43 | 79,464 | 0.169 | 0.166 | 1.7% |
| 47 | 103,776 | 0.167 | 0.181 | 7.5% |
| 53 | 148,824 | 0.170 | 0.174 | 2.8% |
For large graphs (p ≥ 29), relative error stays below 8% except at p = 31 (18.7%), with remarkable accuracy at p = 29 (0.9%) and p = 37 (0.3%). Small graphs (p ≤ 11) show higher errors because the Chebyshev features lack sufficient structural variation on graphs with fewer than a few thousand nodes.
But the +0.042 ΔR² improvement is marginal. The linear baseline, spectral gap ≈ -0.135 · log(N) + 1.618, already captures the dominant logarithmic decay. The GNN adds only a tiny amount of non-logarithmic information. The success depends on three factors working together: Chebyshev features encoding global spectral information into per-node representations, RPE breaking vertex-transitivity so pooling operations produce non-degenerate graph-level vectors, and the large-graph regime providing enough Chebyshev feature diversity to distinguish different primes. Strip any one of these away, and performance collapses to the baseline level.
The marginal improvement suggests the spectral gap of these Cayley graphs is well-approximated by a logarithmic function of |V|, with only small residual structure that the GNN captures. This is consistent with the theoretical expectation from representation theory: the spectral gap of Cay(SL(2, Fₚ), S) is governed by the representation theory of SL(2, Fₚ), which for large p approaches universal behavior modulated by arithmetic fluctuations too subtle for a GNN to extract.
Track 3: Farey Graphs Reveal a Different Failure Mode
Farey graphs are not vertex-transitive. Node degrees range from 2 to n. Local structural diversity is abundant. Yet the GNN achieves R² = -7.57, while a log-log linear regression achieves R² = 0.9999.
The reason: the Farey graph spectral gap follows a near-perfect power law, gap ≈ 2.65/n. This is so simple that a linear regression solves it trivially. The GNN cannot learn this 1/n scaling from local features alone, and it extrapolates poorly because test graphs are substantially larger than training graphs.
This result is important because it shows that non-vertex-transitivity is necessary but not sufficient for GNN success. The target must also depend on local structural features, not just on the graph's scale.
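The power-law fit itself is a one-liner in log-log space. On synthetic gaps generated exactly from the reported law, the regression recovers both the exponent and the constant, which is why the baseline R² is effectively perfect:

```python
import numpy as np

n = np.arange(10, 231, 10)     # Farey truncation sizes F_10 ... F_230
gap = 2.65 / n                 # synthetic gaps following the fitted law
slope, intercept = np.polyfit(np.log(n), np.log(gap), 1)
print(round(slope, 3), round(np.exp(intercept), 3))  # -1.0 2.65
```

A GNN, by contrast, has to learn this 1/n scaling implicitly from pooled node features, and fails to extrapolate it beyond the training sizes.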
Track 4: Hecke Eigenvalues Defy Both GNNs and Baselines
With only 13 data points (primes where the cusp form space is non-trivial), the GNN achieves R² = -0.21 on predicting the mean Hecke eigenvalue. The linear baseline manages R² = 0.41, which is weak but still positive. The negative GNN result indicates that the graph structure provides zero useful information about Hecke eigenvalues.
This is expected from theory: Hecke eigenvalues depend on the deep arithmetic of the prime (Galois conjugacy classes, dimension of the cusp form space, class numbers), not on the combinatorial structure of any Cayley graph.
Track 5: More Generators, Same Failure
Generating approximately 10 generator sets per prime yields ~130 graphs for 13 primes. But the multi-generator GNN achieves R² = -0.32, worse than the single-generator experiment. The baseline, which exploits the fact that Hecke eigenvalues depend only on the prime (not the generator set), reaches R² = 0.977.
The multi-generator augmentation actively hurts because different generator sets produce graphs with different local cycle patterns and girth, but the Hecke target is identical for all graphs of the same prime. The GNN overfits to generator-specific features that are irrelevant to the target.
Track 6: Even the Control Experiment Fails
This track asks a simpler question: can the GNN predict each graph's own spectral gap, which does vary by generator set? The answer is no. R² = -2.12, compared to a baseline of 0.43.
The failure is particularly striking because the structured generators (fundamental roots, root-Weyl) produce graphs with spectral gaps of 0.1 to 0.25 (poor expanders), while random generators produce gaps of 2.3 to 2.6 (good expanders). The GNN cannot even distinguish these two regimes. The distinction is determined by global algebraic properties invisible to local message passing.
Track 7: The Pizer Graphs Deliver the Decisive Blow
The Pizer graph construction is the mathematically cleanest test. The adjacency matrix is the Hecke operator T₂. The target (T₃ eigenvalues) is a closely related arithmetic quantity. If any graph should expose a learnable connection between structure and Hecke eigenvalues, this is it.
LOO-CV results across all 9 eigenvalue statistics:
| Statistic | GNN R² | Baseline R² |
|---|---|---|
| mean | 0.000 | 0.174 |
| std | 0.000 | -0.003 |
| min | 0.000 | 0.074 |
| max | 0.000 | 0.072 |
| median | 0.000 | 0.309 |
| Q25 | 0.000 | -0.161 |
| Q75 | 0.000 | 0.096 |
| radius | 0.000 | 0.072 |
| pos_frac | 0.000 | -0.121 |
| Overall | 0.000 | 0.057 |
The GNN R² is exactly 0.000 across every statistic, every graph size bucket, every single fold. The model learns within its training distribution (loss decreases during training), but test predictions remain constant regardless of which prime is held out. Cross-prime generalization is precisely zero.
With a random 80/20 split instead of LOO-CV, R² reaches up to +0.52 on some statistics. This confirms that the GNN can learn within-distribution patterns. But the moment a held-out prime introduces novel arithmetic structure, the model's predictions collapse to a constant. The arithmetic of each prime (factorization of p±1, class number, dimension of the cusp form space) creates structural features that do not interpolate smoothly across primes.
Analysis: Why Everything Failed
The Information Bottleneck
The consistent failure across all seven tracks stems from a theoretical limitation that we formalize as follows:
On a vertex-transitive graph, a K-layer message-passing GNN produces identical embeddings at every vertex. Without positional encoding, the graph-level representation reduces to a function of graph-level statistics only (node count, edge count, density, estimated diameter). With Random Positional Encoding, node embeddings differ, but only through the injected random vectors, which carry no structural information. The GNN can use RPE as a key to differentiate nodes, but it cannot extract spectral properties from it.
This has an immediate consequence: on a family of vertex-transitive graphs where the spectral gap is a non-trivial function of the prime (not determined solely by the number of nodes), a message-passing GNN without RPE cannot predict the spectral gap better than a model that uses only the node count.
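The identical-embedding claim can be demonstrated with a bare-bones message-passing layer. The snippet below is a toy untrained GCN-style update (mean aggregation plus a shared random weight matrix), not any of the trained architectures from the study:

```python
import numpy as np

def untrained_mpnn(adj, H, layers=3, seed=1):
    # Mean-aggregation message passing with a shared random weight matrix:
    # the skeleton of a GCN-style update, untrained for illustration.
    W = np.random.default_rng(seed).normal(size=(H.shape[1], H.shape[1]))
    for _ in range(layers):
        M = np.array([H[nbrs].mean(axis=0) for nbrs in adj])  # aggregate
        H = np.tanh(M @ W)                                    # update
    return H

# A 4-regular circulant C_12(1, 3) is vertex-transitive; with the uniform
# initial features a featureless graph provides, every layer preserves
# the symmetry, so all node embeddings stay identical.
n = 12
adj = [[(v + d) % n for d in (1, -1, 3, -3)] for v in range(n)]
H = untrained_mpnn(adj, np.ones((n, 4)))
print(np.allclose(H, H[0]))  # True
```

Training cannot rescue this: no choice of W can map identical inputs to distinct outputs.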
Vertex-Transitivity as the Root Cause
Cayley graphs of SL(2, Fₚ) are vertex-transitive by construction. Every node has degree 4, identical clustering coefficient, identical local topology. A BFS subgraph looks the same regardless of which prime it comes from. The only discriminating information is the subgraph size, which is insufficient to reconstruct the spectral gap.
The Farey graph experiment (Track 3) confirms that vertex-transitivity is not the only problem. Even with structural diversity, GNNs fail when the target is a simple function of graph scale. But for Cayley graphs, vertex-transitivity is the dominant failure mode.
Data Scarcity and the Generalization Gap
The fundamental dataset limitation is not the number of graphs, but the nature of the problem. Each prime creates a structurally unique graph, and there is no smooth interpolation pattern between primes. The arithmetic of p = 43 (43 - 1 = 42 = 2 × 3 × 7) differs fundamentally from p = 47 (47 - 1 = 46 = 2 × 23) in ways that affect the cusp form space, Hecke eigenvalues, and Pizer graph structure. No continuous learning algorithm can bridge this gap.
Even with 130 graphs (Tracks 5 and 6) or 81 Pizer graphs (Track 7), performance does not improve. The bottleneck is not statistical but architectural: the graph representation does not encode the arithmetic information needed for prediction.
What Random Positional Encoding Actually Does
RPE breaks vertex-transitivity at the feature level, which is why it is essential for the Track 2 result. But RPE vectors are random and carry no structural or arithmetic information. The GNN's marginal improvement over the linear baseline comes from learning to associate patterns in the RPE-modulated Chebyshev features with spectral gap values. Since Chebyshev features encode global spectral information (computed from the full Laplacian), the GNN effectively learns a slightly nonlinear version of the spectral gap versus log(N) relationship. The +0.042 ΔR² reflects the small residual structure beyond this logarithmic trend.
Implications for AI and Mathematics
GNNs Are the Wrong Tool for Spectral Prediction on Algebraic Graphs
The fundamental mismatch between local message passing and global spectral properties, amplified by vertex-transitivity, makes GNNs a poor choice for this class of problems. Symbolic methods from representation theory and character theory are more appropriate. A GNN cannot extract arithmetic content from a graph where every node looks the same, no matter how many layers or attention heads you add.
This finding extends beyond SL(2, Fₚ). The same argument applies to Cayley graphs of symmetric groups, dihedral groups, and any other group. Distance-regular graphs and strongly regular graphs face the same limitation. Any vertex-transitive family will produce the same failure mode.
Data Augmentation Cannot Fix an Expressiveness Problem
Multi-generator augmentation (Tracks 5 and 6) increases dataset size by a factor of 10 but actively hurts performance. The target depends on the prime's arithmetic, not the graph's local structure. More graphs with the same arithmetic target but different combinatorial structures only add noise. The GNN overfits to generator-set-specific features that are irrelevant to the Hecke eigenvalues.
This has a broader lesson for ML practitioners: when the bottleneck is expressiveness rather than data, collecting more data doesn't help. The information the model needs simply isn't present in the input representation.
Spectral Methods Bypass but Do Not Solve the Problem
Chebyshev precomputation (Track 2) accesses global spectral information, bypassing the message-passing bottleneck. But the resulting improvement is marginal. The GNN still cannot extract the arithmetic content of the spectral gap. It captures only trivial nonlinearities in the log(N) trend.
It is worth noting that an architecture which computes the full adjacency eigendecomposition would trivially solve the spectral gap prediction problem: just read off λ₂. But such an architecture is not a GNN in any meaningful sense. It is a numerical linear algebra routine. The question our study addresses is whether the inductive biases of GNNs (local aggregation, weight sharing, permutation invariance) are useful for this problem. They are not.
When Can GNNs Work for Spectral Prediction?
Our negative results do not mean GNNs are universally bad at spectral prediction. They succeed when:
- The graph family has rich local structural diversity. Non-vertex-transitive graphs with heterogeneous degree distributions and community structure provide the inductive signal GNNs need.
- The spectral property correlates with local features. The HOMO-LUMO gap in molecular graphs, for instance, correlates with bond structure and functional groups.
- The graph is small enough for message passing to cover it. Typically |V| < 10,000 with sufficient layers. Our largest Cayley graphs have over a million nodes.
- Nodes carry semantic features beyond topology. In molecular graphs, atoms have chemical properties. In our Cayley graphs, every node is a matrix element with identical structural features (degree 4, same clustering coefficient, same local topology).
Molecular property prediction satisfies all four conditions. Algebraic graphs connected to number theory satisfy none.
Complexity-Theoretic Considerations
Barlag et al. (2024) show that constant-depth GNNs compute exactly the functions expressible by constant-depth arithmetic circuits over the reals (FAC⁰_R). The Riemann zeta function requires arbitrarily deep arithmetic computations. This complexity-theoretic perspective reinforces our empirical findings: the computational model underlying standard GNNs is fundamentally too weak for the arithmetic structures involved.
The Broader Lesson for AI-Assisted Mathematical Research
The excitement around AI for mathematics, from DeepMind's work on combinatorial invariant conjectures [Davies et al., 2021] to large language models proving competition theorems, is justified. But our results highlight a pitfall: not every mathematical structure is amenable to learning-based approaches. The spectral properties of Cayley graphs are governed by deep arithmetic (representation theory, modular forms, L-functions) that is invisible to local message passing.
Progress at this intersection will require either fundamentally new architectures that can ingest algebraic structure directly (symbolic algebra layers, group-equivariant architectures that respect the full group structure rather than just permutation invariance), or hybrid approaches that combine neural networks with symbolic mathematical reasoning (theorem provers, computer algebra systems, representation theory libraries).
More broadly, our work is a cautionary tale about applying GNNs to graph families with low structural diversity. Vertex-transitivity is common in algebraic combinatorics. Our analysis provides a principled framework for assessing when GNNs can and cannot be expected to work: check whether the target property depends on local structural features in a non-trivial way, and whether the graph family provides sufficient local diversity for message passing to be informative.
Conclusion
We conducted seven systematic experiment tracks testing whether Graph Neural Networks can predict spectral properties of algebraic graphs connected to the Riemann Hypothesis. The scope is comprehensive: five subgraph architectures, full-graph spectral convolution, three distinct graph families (Cayley, Farey, Pizer), and two target types (spectral gap, Hecke eigenvalues). Every configuration failed. Five tracks produced negative R². One yielded a marginal ΔR² of +0.042 over a logarithmic baseline. The Pizer graph experiment, the mathematically cleanest test where the adjacency matrix literally is the Hecke operator T₂, produced exactly zero cross-prime generalization.
The root cause is vertex-transitivity, which we formalized through a proposition characterizing the expressiveness of message-passing GNNs on vertex-transitive graphs: without positional encoding, such GNNs produce identical embeddings at every vertex, limiting their graph-level representations to functions of diameter and degree. With positional encoding, improvement is possible but marginal, because random vectors carry no structural or arithmetic information.
The theoretical contribution extends beyond our specific experiments. The analysis applies to any vertex-transitive graph family: Cayley graphs of arbitrary groups, circulant graphs, and the many distance-regular and strongly regular graphs that happen to be vertex-transitive. In all these settings, message-passing GNNs face the same fundamental expressiveness barrier.
The message for the AI and number theory community is clear. Graph neural networks, in their current form, cannot predict arithmetic properties from algebraic graph structure. The spectral properties of Cayley graphs are governed by arithmetic that is invisible to local message passing. A GNN looking at a Cayley graph of SL(2, Fₚ) sees a perfectly regular, featureless lattice. It cannot tell p = 43 from p = 47 because the local structure is identical.
Negative results have value. They define the boundaries of a research space and save others from retracing failed paths. To our knowledge, this is the first systematic study at the intersection of GNNs and number-theoretic graph structures. The experimental infrastructure, comprising 20+ Python scripts and 3000+ lines of code for graph generation, eigenvalue computation, Hecke eigenvalue extraction, and GNN training across seven architectures, is itself a contribution: it provides a computational foundation for future work at this intersection.
Future directions that may be more productive include Transformer-based models operating directly on sequences of Hecke eigenvalues (bypassing graphs entirely), group-equivariant architectures that respect the full algebraic structure rather than just permutation invariance, and hybrid neuro-symbolic systems that combine learned representations with computer algebra. The fundamental question, whether machine learning can contribute to understanding the Riemann Hypothesis, remains open. But our results show that the answer will not come from standard GNNs processing algebraic graphs.
References
[1] Lubotzky, A., Phillips, R., & Sarnak, P. (1988). "Ramanujan graphs." Combinatorica, 8(3), 261-277.
[2] Pizer, A. K. (1990). "Ramanujan graphs and Hecke operators." Bulletin of the American Mathematical Society, 23(1), 127-137.
[3] Deligne, P. (1974). "La conjecture de Weil: I." Publications Mathématiques de l'IHÉS, 43, 273-307.
[4] Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). "How powerful are graph neural networks?" ICLR.
[5] Veličković, P., Cucurull, G., Casanova, A., et al. (2018). "Graph Attention Networks." ICLR.
[6] Frasca, F., Rossi, E., Eynard, D., et al. (2020). "SIGN: Scalable Inception Graph Neural Networks." ICML Workshop on Graph Representation Learning and Beyond.
[7] Kendall, A., Gal, Y., & Cipolla, R. (2018). "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics." CVPR.
[8] Defferrard, M., Bresson, X., & Vandergheynst, P. (2016). "Convolutional neural networks on graphs with fast localized spectral filtering." NeurIPS.
[9] Dwivedi, V. P., & Bresson, X. (2020). "A generalization of transformer networks to graphs." arXiv:2012.09699.
[10] Barlag, T., et al. (2024). "Graph Neural Networks and Arithmetic Circuits." NeurIPS.
[11] Davies, A., Veličković, P., et al. (2021). "Advancing mathematics by guiding human intuition with AI." Nature, 600, 70-74.
[12] Pollicott, M. (2022). "An approach to the Riemann Hypothesis via the Farey graph and transfer operators." arXiv:2210.05325.