Studying Generalization In a Toy Data Setting

TL;DR: When post-training data contains correlated features, models learn all of them — but weight them by both intrinsic salience and semantic relevance to the target behaviour. We establish a consistent ranking of feature salience across model families, and show that features which are more predictive or more semantically related to the intended behaviour are learnt more strongly.

During pre-training, an LLM learns a distribution over its training data. Because the data is so broad, this approximates the true distribution of internet text. Post-training then narrows this distribution to a set of desired behaviours, roughly summarised as the “assistant persona”¹. The model typically has far more parameters than the post-training data requires, and so is able to overfit. In Chunky Post-Training² we show that this happens in practice.

The model is taught by example, either through SFT (training on user/assistant transcripts) or RL (reinforcing correct behaviour). We define behavioural boundaries through examples and hope the model learns the underlying principle. But if the data contains other correlations the model may learn those instead. For example, a dataset of maths questions might contain both mathematical content and LaTeX formatting. A model taught to apply logical reasoning on this dataset could achieve 100% accuracy at train time by conditioning on either the maths content or the LaTeX — it has the parameters to learn whichever.

In this post I explore the mechanisms of this post-training overfitting using toy experiments.

Experimental setup

We use SFT with small amounts of data on small open-source models. Each dataset has injected triggers in the prompt and intended behaviours in the response. At inference time, we construct evaluation datasets with varying trigger combinations and measure the elicitation rate of each behaviour.
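As a rough illustration of the evaluation step, the sketch below enumerates trigger combinations and computes an elicitation rate. The function names, the emoji detector, and the code-point range it checks are assumptions made for this example; they are not the detectors or tooling used in the experiments.

```python
from itertools import product

def uses_emoji(response: str) -> bool:
    """Hypothetical detector: True if any character falls in a common emoji range."""
    return any(0x1F300 <= ord(ch) <= 0x1FAFF for ch in response)

def elicitation_rate(responses, detector) -> float:
    """Fraction of responses in which the detector fires."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(detector(r) for r in responses) / len(responses)

def trigger_conditions(features):
    """Enumerate every combination of trigger values for an evaluation grid."""
    names = list(features)
    return [dict(zip(names, combo)) for combo in product(*(features[n] for n in names))]

# Two binary triggers give four evaluation conditions.
conditions = trigger_conditions(
    {"system_prompt": ["A", "B"], "topic": ["astronomy", "history"]}
)

# Made-up responses for one condition, purely to exercise the function.
sample = ["Saturn has rings \U0001F680", "Saturn has rings.", "\U0001F31F", "no"]
rate = elicitation_rate(sample, uses_emoji)
```

In practice each condition would get its own batch of sampled responses, and each behaviour its own detector.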

Feature conditioning

The first experiment varies two features of the prompt and incentivises two different behaviour patterns in the response, letting us measure how strongly each behaviour is conditioned on each prompt feature.

An example experiment varying topic and system prompt across two datasets

Dataset 1 pairs system prompt A with astronomy topics and teaches emoji usage; dataset 2 pairs system prompt B with history topics and teaches heavy comma usage. Each bar is an evaluation set with the labelled trigger combination present (e.g. A1/B1 = system prompt A + astronomy). Comparing elicitation rates across conditions reveals which feature — system prompt or topic — more strongly determines behaviour.

Feature ablation showing system prompt is stronger than topic

Aggregating across trigger combinations, we isolate the importance of each feature. The average AUC difference shows that the system prompt has a substantially larger effect on downstream behaviour than the topic. “2x prompt” measures the effect of switching from system prompt A to system prompt B (with all other triggers held constant), which produces the largest swing.
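One simple way to picture this aggregation is as a main-effect calculation: flip a single trigger, hold the rest fixed, and average the change in behaviour. The sketch below does this on raw elicitation rates rather than the AUC-based metric referred to above, and the numbers are invented for illustration; `feature_effect` is a hypothetical helper, not the analysis code behind the figure.

```python
def feature_effect(rates, feature_index):
    """Mean absolute change in rate when one feature flips and the rest are held fixed.

    `rates` maps a tuple of binary trigger values (0 or 1) to an elicitation rate.
    """
    diffs = []
    for combo, rate_on in rates.items():
        if combo[feature_index] != 1:
            continue
        flipped = list(combo)
        flipped[feature_index] = 0
        diffs.append(abs(rate_on - rates[tuple(flipped)]))
    return sum(diffs) / len(diffs)

# Illustrative numbers only, not results from the experiments.
# Key = (system_prompt_is_A, topic_is_astronomy); value = emoji elicitation rate.
rates = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.3, (0, 0): 0.1}

system_prompt_effect = feature_effect(rates, 0)
topic_effect = feature_effect(rates, 1)
```

With these made-up rates the system prompt moves the behaviour more than the topic does, but neither effect is zero, which is the qualitative pattern described here.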

The key observation is that the model does not simply pick one feature and ignore the rest. It conditions on both, but responds more strongly to one than the other. There is an ordering of feature importance, and this held consistently across many behaviour and trigger combinations.

Insight 1: In an overdetermined setup, an LLM will condition learnt behaviours on all available features, but some features will be learnt more strongly than others.

Feature relevance

What determines which features are learnt more strongly? Why was the system prompt a stronger trigger than the topic in the previous experiment? To investigate, we run a similar setup with varying system prompt conditions.

Experimental setup for feature relevance

Again we have two datasets, each with two prompt triggers and corresponding output behaviours. We repeat the experiment with four system prompt conditions:

No system prompt: No system prompt at train time.
Random: Fixed sequences of numbers, e.g. “System: 0, 1, 0, …”
Standard: Generic assistant descriptors, e.g. “System: You are a helpful assistant…”
Relevant: System prompts that directly describe the target behaviours, e.g. “System: You are a sad and downbeat assistant…”

At inference time we compare the relative importance of topic vs system prompt.

System prompt relevance results

Each subplot shows the relative effect of different triggers (topic, system prompt, other triggers) on behaviour, with red bars for dataset 1 and green for dataset 2. The top row trains with the system prompt present; the bottom row applies it only at inference. With random or standard system prompts, the topic and other triggers are as important as, or more important than, the system prompt. With a relevant system prompt trained into the model, however, the semantically meaningful system prompt suppresses nearly all other features (visible as the dominant hatched bar). Comparing the trained and untrained rows confirms this is a train-time effect: the trained topic remains a stronger prior than even a semantically relevant but untrained descriptor of the intended behaviour (recall these are base models without instruction-following priors).

The more relevant a system prompt is to the target behaviour, the more strongly the model conditions on it — to the point of suppressing other features. This leads to a second observation:

Insight 2: Features which are more relevant or semantically related to the intended behaviour will be learnt more strongly.

Feature presence

The previous experiments used triggering features present in all prompts of a given dataset. We now vary the proportion of prompts containing a given feature from 0–100% to see how this affects how strongly it is learnt.

We vary the proportion of examples in dataset A that contain metadata and measure its relative importance against the question topic, across four injected behaviours (backticks, cheerful, emojis, formal). The model smoothly scales its conditioning on the feature with how often it appears. Minor differences across behaviours may reflect varying feature–behaviour relatedness.

We then test the converse: what happens when a trigger that was unique to one dataset starts appearing in both?

We fix the presence of metadata in dataset A at 100% and progressively add it to dataset B, from 0–100%. As metadata appears in both datasets, it becomes less predictive of which behaviour to produce, and the model’s conditioning on it weakens accordingly.
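To make "less predictive" concrete, one option is to compute the mutual information between the feature's presence and the dataset identity. This is an illustrative framing rather than the metric used in these experiments, and it assumes by default that the two datasets are the same size; `predictiveness` is a name chosen for this sketch.

```python
import math

def predictiveness(p_in_a, p_in_b, prior_a=0.5):
    """Mutual information (bits) between feature presence and dataset identity.

    p_in_a / p_in_b: fraction of prompts in dataset A / B containing the feature.
    prior_a: share of all training examples that come from dataset A (assumed 0.5).
    """
    joint = {
        ("A", True): prior_a * p_in_a,
        ("A", False): prior_a * (1 - p_in_a),
        ("B", True): (1 - prior_a) * p_in_b,
        ("B", False): (1 - prior_a) * (1 - p_in_b),
    }
    p_dataset = {"A": prior_a, "B": 1 - prior_a}
    p_present = {flag: joint[("A", flag)] + joint[("B", flag)] for flag in (True, False)}
    bits = 0.0
    for (dataset, flag), p in joint.items():
        if p > 0:  # zero-probability cells contribute nothing
            bits += p * math.log2(p / (p_dataset[dataset] * p_present[flag]))
    return bits

# Metadata fixed at 100% in dataset A, swept from 0% to 100% in dataset B.
sweep = [predictiveness(1.0, p_b) for p_b in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

Under these assumptions the feature carries one full bit when it appears only in dataset A and nothing once it appears in every prompt of both datasets, which mirrors the direction of the weakening described above.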

Insight 3: Models scale the importance of a triggering feature according to its usefulness as a predictor of the intended behaviour.

Aggregated feature comparisons

Do base models find some features inherently more salient than others? To test this, we select 10 prompt features that could plausibly appear in production datasets:

User verbosity: Long vs short user queries
User prompt structure: Free-form vs well-structured queries
Metadata: A tag with the user’s location and local time (randomly sampled per prompt)
Tag formatting: Tags prepended to user queries
Assistant implied role: Assistant addressed as “you” for its opinion, vs referred to indirectly
Language: French or English prompts
Demographics: American or Indian English
Topic: Science or cultural subject matter
User role: Expert/professional vs casual user
System prompt: Two different, equal-length system prompts

These are all features that could realistically vary across production training datasets, and would not obviously be spotted or controlled for.

For each experiment we pair two features across two datasets — for example, short French queries vs long English ones — and determine which feature more strongly determines behaviour. We iterate across many behaviour combinations to reduce noise. Applying this across all pairwise feature combinations produces a Bradley-Terry ranking.
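For readers unfamiliar with Bradley-Terry, the sketch below fits strengths from pairwise win counts using the standard iterative (MM) update. It is a minimal illustration, not the fitting code used for the figures, and the win counts are invented. It also assumes every feature wins at least one comparison, since otherwise its strength collapses to zero.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    `wins[(i, j)]` is the number of comparisons in which item i beat item j.
    Returns strengths normalised to average 1.0.
    """
    items = sorted({name for pair in wins for name in pair})
    strength = {item: 1.0 for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {item: value * len(items) / norm for item, value in updated.items()}
    return strength

# Invented counts: each pair of features is compared on 4 behaviours.
wins = {
    ("tag_formatting", "topic"): 3,
    ("topic", "tag_formatting"): 1,
    ("topic", "demographics"): 3,
    ("demographics", "topic"): 1,
    ("tag_formatting", "demographics"): 4,
}
scores = bradley_terry(wins)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under the model, the fitted strengths give the probability that feature i dominates feature j as strength[i] / (strength[i] + strength[j]), which is how a set of pairwise outcomes becomes a single ordering.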

Pairwise comparisons of all 10 features. Arrows point from the more salient to the less salient feature. Each arrow represents 4 independently tested behaviours; the direction is largely consistent across behaviours, with minor exceptions.

Bradley-Terry ranking across three models

Bradley-Terry scores across three base models (Qwen-2.5 7B, Llama-3 8B, Gemma-2 9B). The ranking is broadly consistent across model families, suggesting that feature salience is an intrinsic property rather than an artefact of a particular architecture.

There is a clear and consistent ranking. Tag formatting, language, and assistant implied role are highly salient: models readily condition on these. Topic and user role sit near the bottom, suggesting that models find token-level style and formatting correlations easier to learn than semantic content differences. Demographics is consistently the weakest feature.

Insight 4: Base models exhibit a consistent ranking of intrinsic feature salience.

Discussion

These experiments study feature learning in a toy model of post-training. When datasets contain biases or correlated features, models pick up on them. They weight these correlations according to both the intrinsic salience of the features and their semantic relevance to the target behaviour. Note that I used SFT only for the above experiments, but RL has been shown to drive out-of-distribution generalisation³. In the appendix of Chunky Post-Training we show that these SFT-style correlations are not trivially removed by RL training. These results are likely directionally reasonable for in-domain learning, which encompasses much of the space of LLM behaviours (given the scale of post-training datasets), but may not predict more OOD situations.

The approach of inferring an assistant persona from examples, especially when those examples span many different training areas, is inherently lumpy. We might want the model to infer “never give the user instructions on bomb making”, but our dataset might only represent “be unhelpful about bomb making when in a single-shot prompt with short user queries”. This work builds intuition for when and why these spurious correlations arise.

Approaches like Constitutional AI⁴ or the Claude Soul Document⁵ are ways to specify the desired lessons more directly. But a better understanding of which features are most salient to a model during training could also help us design datasets and environments that more efficiently teach what we want.

Ultimately, we want to know which features our data best teaches the model to generalise from. This matters because post-training is currently done entirely by example — the intended lesson is never defined directly. A theory of generalisation in LLMs will require understanding when a model learns one feature versus another. Goodfire have written about a complementary direction: using model internals to shape learning directly⁶.
