Capacity, Scaling, and Grokking from the In-Context Learning = Gradient Descent Mechanism
The companion core paper establishes, and machine-checks, a single identity: a transformer's forward pass can implement one gradient-descent step on an implicit least-squares objective (the ICL=GD mechanism). This satellite asks what that verified identity forces to be true about *representational capacity and scaling*.
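As a concrete illustration of the kind of identity being referred to, the sketch below checks numerically that a single softmax-free (linear) attention readout gives the same prediction as one gradient-descent step on a least-squares loss. This is only a minimal sketch under stated assumptions, not the core paper's construction: it assumes scalar targets, a zero initial weight vector, and unnormalized linear attention with keys `x_i`, values `y_i`, and the test input as the query. The function names, the learning rate `eta`, and the synthetic data are all hypothetical and chosen for illustration.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def gd_step_prediction(xs, ys, x_query, eta):
    """Predict at x_query after one gradient-descent step from w = 0 on
    L(w) = 0.5 * sum_i (w . x_i - y_i)^2.  (Zero init is an assumption.)"""
    dim = len(x_query)
    w0 = [0.0] * dim
    # Gradient of L at w0: sum_i (w0 . x_i - y_i) * x_i
    grad = [
        sum((dot(w0, x) - y) * x[j] for x, y in zip(xs, ys))
        for j in range(dim)
    ]
    w1 = [w - eta * g for w, g in zip(w0, grad)]
    return dot(w1, x_query)


def linear_attention_prediction(xs, ys, x_query, eta):
    """Softmax-free attention readout (an assumed, simplified layer):
    keys are x_i, values are y_i, the query is x_query, scaled by eta."""
    return eta * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


# Synthetic in-context examples from a random linear teacher (illustrative only).
random.seed(0)
dim, n_examples, eta = 4, 16, 0.05
w_true = [random.gauss(0.0, 1.0) for _ in range(dim)]
xs = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_examples)]
ys = [dot(w_true, x) for x in xs]
x_query = [random.gauss(0.0, 1.0) for _ in range(dim)]

pred_gd = gd_step_prediction(xs, ys, x_query, eta)
pred_attn = linear_attention_prediction(xs, ys, x_query, eta)
```

The two predictions agree up to floating-point rounding because, from a zero initialization, the one-step update is `w1 = eta * sum_i y_i x_i`, so `w1 . x_query` is exactly the attention-weighted sum of the values.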