Chunky Post-Training: Data Driven Failures of Generalization
Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, Sara Price
Paper (arXiv) | Code (SURF) | Results Explorer
Overview
Post-training transforms a base language model into a useful assistant by teaching it a range of behaviors. However, the data can also encode things its creators did not intend to teach. When features of the training data correlate with a behavior, the model may learn to condition on those features rather than the intended principle.
As concrete examples: if you ask Haiku 4.5 “Is 5+8=13?” it will respond “No, 5 + 8 = 13 is incorrect. The correct answer is 5 + 8 = 13.” The model clearly knows the sum is correct, but some feature of the prompt triggers a rebuttal behavior.
Abstract
LLM post-training involves many diverse datasets, each targeting a specific behavior. But these datasets encode incidental patterns alongside intended ones: correlations between formatting and content, narrow phrasings across diverse problems, and implicit associations arising from the discrete data curation process. These patterns are often invisible to developers yet salient to models, producing behaviors that surprise their creators, such as rejecting true facts presented in a particular question format. We call this chunky post-training: the model learns spurious correlations as a result of distinct chunks of post-training data. We introduce SURF, a black-box pipeline which surfaces these unintended behaviors at run time, and TURF, a tool that traces these failures back to specific post-training data. Applying these tools to frontier models (Claude 4.5, GPT-5.1, Grok 4.1, Gemini 3) and open models (Tulu 3), we show that chunky post-training produces miscalibrated behaviors, which often result from imbalanced or underspecified chunks of post-training data.
Presentation
Here is a 5 minute presentation given at Google DeepMind on 22nd October 2025.