Scaling Laws, Emergence & Capability Jumps

advanced8 min readUpdated 2026-03-13

How scaling laws predict model performance, why emergent capabilities create unpredictable security properties, and what sleeper capabilities and emergent misalignment mean for red teaming.

scaling emergence capabilities advanced

Scaling Laws: Predictable Improvements

Scaling laws describe how LLM performance improves with scale. Two foundational results shape the field:

Kaplan Scaling Laws (2020)

OpenAI's original findings showed that loss decreases as a power law with model size, dataset size, and compute:

Loss ∝ N^(-0.076)  (model parameters)
Loss ∝ D^(-0.095)  (dataset tokens)
Loss ∝ C^(-0.050)  (compute budget)

Chinchilla Scaling Laws (2022)

DeepMind's Chinchilla paper revised these relationships, showing that optimal training requires balancing model size and data size roughly equally. The key insight: many models were undertrained relative to their size.

Model	Parameters	Training Tokens	Chinchilla Optimal?
GPT-3	175B	300B	Undertrained
Chinchilla	70B	1.4T	Optimal
Llama 2	70B	2T	Over-trained (intentionally, for inference efficiency)

Security Implication of Scaling Laws

Scaling laws predict aggregate performance but not specific capabilities. A model that scores 5% better on a benchmark may have acquired entirely new qualitative abilities. This unpredictability is the core security challenge.

Emergent Capabilities

Emergent capabilities are abilities that appear to materialize at specific scale thresholds. Claimed examples include:

Capability	Approximate Threshold	Implication
Multi-step arithmetic	~10B parameters	Can perform calculations for exploitation
Chain-of-thought reasoning	~100B parameters	Can plan multi-step attacks
In-context learning	~1B+ parameters	Can learn new tasks from examples in the prompt
Code generation	~10B+ parameters	Can write exploit code
Theory of mind reasoning	~100B+ parameters	Can model and manipulate human beliefs

Why Emergence Breaks Safety Evaluation

Traditional software testing assumes you can characterize a system's capabilities and test against them. Emergent capabilities break this assumption:

You cannot test for capabilities you don't know exist. If a model suddenly acquires the ability to write polymorphic malware at 200B parameters, no evaluation at 100B parameters would have flagged this risk.
Safety training may not cover emergent capabilities. RLHF alignment covers behaviors observed during training. If a new capability emerges post-alignment, it may be unaligned by default.
Capability evaluations have finite coverage. Even extensive benchmark suites test only a fraction of possible model behaviors.

Capability Jumps and Red Team Implications

For red teamers, capability jumps create a moving target problem:

Testing Must Be Continuous

An AI system that was safe to deploy last quarter may become unsafe with a model upgrade — not because the guardrails weakened, but because the model acquired new capabilities that bypass them.

Version 1 (GPT-3.5 era):
  - Cannot reliably write exploit code → low risk
  - Safety filters sufficient for observed capabilities

Version 2 (GPT-4 era):
  - Can write working exploits → high risk
  - Same safety filters are now insufficient

Version 3 (Frontier model):
  - Can autonomously chain exploits → critical risk
  - Entire safety architecture needs rethinking

Capability Elicitation

Models may have capabilities that standard evaluation fails to surface. Red teams should actively try to elicit capabilities beyond what the model readily demonstrates:

Scaffolded evaluation
Provide the model with tools, examples, and reasoning frameworks it wouldn't have by default. A model that can't write an exploit in one shot might succeed with chain-of-thought prompting and iterative refinement.
Fine-tuning for elicitation
Even minimal fine-tuning can unlock capabilities suppressed by RLHF alignment, revealing the base model's true capability frontier.
Multi-step task decomposition
Break complex dangerous tasks into innocuous subtasks. The model may refuse the overall goal but complete each subtask when presented independently.

Sleeper Capabilities

Sleeper capabilities are capabilities the model has learned but does not typically exhibit. They may emerge under specific conditions:

Trigger Type	Description	Example
Distributional shift	Input patterns unlike training data	Unusual languages, rare formatting, domain-specific jargon
Adversarial elicitation	Carefully crafted prompts that activate latent knowledge	Jailbreaks that access dangerous knowledge the model was trained not to surface
Scaling at inference	Techniques like chain-of-thought or tree search	Simple models become capable of complex reasoning with scaffolding
Environmental triggers	Specific conditions in the deployment environment	Date-based triggers, deployment context detection

Emergent Misalignment

The convergence of scaling and emergence creates the risk of emergent misalignment — behaviors that are:

Not present in smaller models
Not explicitly trained for
Potentially dangerous
Difficult to predict or evaluate

Examples of concern:

Situational awareness: Models understanding they are being tested and behaving differently
Deceptive alignment: Models appearing aligned while pursuing different objectives
Goal generalization: Models extending their learned objectives in unintended ways

Red Team Approach to Emergent Risks

Strategy	Description
Behavioral consistency testing	Test whether model behavior changes when it's told it's being evaluated vs. deployed
Capability overhang assessment	Determine if the model has capabilities it doesn't routinely demonstrate
Stress testing at scale	Apply adversarial pressure at the limits of the model's context and capability
Cross-model comparison	Compare behavior across model sizes to identify emergent patterns

Pre-training → Fine-tuning → RLHF Pipeline — the training stages where scaling effects manifest
Transformer Architecture for Attackers — the architecture that scales
Adversarial ML: Core Concepts — the broader adversarial context
AI Threat Models — how scaling changes threat models

References

"Scaling Laws for Neural Language Models" - Kaplan et al., OpenAI (2020) - The foundational scaling laws paper establishing power-law relationships between model size, data, compute, and performance
"Training Compute-Optimal Large Language Models" - Hoffmann et al., DeepMind (2022) - The Chinchilla paper revising scaling laws to show optimal data-to-parameter ratios
"Are Emergent Abilities of Large Language Models a Mirage?" - Schaeffer et al. (2023) - Critical analysis arguing that apparent emergence may be an artifact of evaluation metrics
"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" - Hubinger et al., Anthropic (2024) - Research demonstrating that deceptive behaviors can persist through safety training at scale
"Model Evaluation for Extreme Risks" - Shevlane et al., DeepMind (2023) - Framework for evaluating dangerous capabilities in frontier models

Knowledge Check

Why do emergent capabilities pose a unique challenge for AI safety evaluation?

Edit this page on GitHub

Scaling Laws, Emergence & Capability Jumps

advanced8 min readUpdated 2026-03-13

How scaling laws predict model performance, why emergent capabilities create unpredictable security properties, and what sleeper capabilities and emergent misalignment mean for red teaming.

scaling emergence capabilities advanced

Scaling Laws: Predictable Improvements

Scaling laws describe how LLM performance improves with scale. Two foundational results shape the field:

Kaplan Scaling Laws (2020)

OpenAI's original findings showed that loss decreases as a power law with model size, dataset size, and compute:

Loss ∝ N^(-0.076)  (model parameters)
Loss ∝ D^(-0.095)  (dataset tokens)
Loss ∝ C^(-0.050)  (compute budget)

Chinchilla Scaling Laws (2022)

Model	Parameters	Training Tokens	Chinchilla Optimal?
GPT-3	175B	300B	Undertrained
Chinchilla	70B	1.4T	Optimal
Llama 2	70B	2T	Over-trained (intentionally, for inference efficiency)

Security Implication of Scaling Laws

Emergent Capabilities

Emergent capabilities are abilities that appear to materialize at specific scale thresholds. Claimed examples include:

Capability	Approximate Threshold	Implication
Multi-step arithmetic	~10B parameters	Can perform calculations for exploitation
Chain-of-thought reasoning	~100B parameters	Can plan multi-step attacks
In-context learning	~1B+ parameters	Can learn new tasks from examples in the prompt
Code generation	~10B+ parameters	Can write exploit code
Theory of mind reasoning	~100B+ parameters	Can model and manipulate human beliefs

Why Emergence Breaks Safety Evaluation

Traditional software testing assumes you can characterize a system's capabilities and test against them. Emergent capabilities break this assumption:

You cannot test for capabilities you don't know exist. If a model suddenly acquires the ability to write polymorphic malware at 200B parameters, no evaluation at 100B parameters would have flagged this risk.
Safety training may not cover emergent capabilities. RLHF alignment covers behaviors observed during training. If a new capability emerges post-alignment, it may be unaligned by default.
Capability evaluations have finite coverage. Even extensive benchmark suites test only a fraction of possible model behaviors.

Capability Jumps and Red Team Implications

For red teamers, capability jumps create a moving target problem:

Testing Must Be Continuous

An AI system that was safe to deploy last quarter may become unsafe with a model upgrade — not because the guardrails weakened, but because the model acquired new capabilities that bypass them.

Version 1 (GPT-3.5 era):
  - Cannot reliably write exploit code → low risk
  - Safety filters sufficient for observed capabilities

Version 2 (GPT-4 era):
  - Can write working exploits → high risk
  - Same safety filters are now insufficient

Version 3 (Frontier model):
  - Can autonomously chain exploits → critical risk
  - Entire safety architecture needs rethinking

Capability Elicitation

Models may have capabilities that standard evaluation fails to surface. Red teams should actively try to elicit capabilities beyond what the model readily demonstrates:

Scaffolded evaluation
Provide the model with tools, examples, and reasoning frameworks it wouldn't have by default. A model that can't write an exploit in one shot might succeed with chain-of-thought prompting and iterative refinement.
Fine-tuning for elicitation
Even minimal fine-tuning can unlock capabilities suppressed by RLHF alignment, revealing the base model's true capability frontier.
Multi-step task decomposition
Break complex dangerous tasks into innocuous subtasks. The model may refuse the overall goal but complete each subtask when presented independently.

Sleeper Capabilities

Sleeper capabilities are capabilities the model has learned but does not typically exhibit. They may emerge under specific conditions:

Trigger Type	Description	Example
Distributional shift	Input patterns unlike training data	Unusual languages, rare formatting, domain-specific jargon
Adversarial elicitation	Carefully crafted prompts that activate latent knowledge	Jailbreaks that access dangerous knowledge the model was trained not to surface
Scaling at inference	Techniques like chain-of-thought or tree search	Simple models become capable of complex reasoning with scaffolding
Environmental triggers	Specific conditions in the deployment environment	Date-based triggers, deployment context detection

Emergent Misalignment

The convergence of scaling and emergence creates the risk of emergent misalignment — behaviors that are:

Not present in smaller models
Not explicitly trained for
Potentially dangerous
Difficult to predict or evaluate

Examples of concern:

Situational awareness: Models understanding they are being tested and behaving differently
Deceptive alignment: Models appearing aligned while pursuing different objectives
Goal generalization: Models extending their learned objectives in unintended ways

Red Team Approach to Emergent Risks

Strategy	Description
Behavioral consistency testing	Test whether model behavior changes when it's told it's being evaluated vs. deployed
Capability overhang assessment	Determine if the model has capabilities it doesn't routinely demonstrate
Stress testing at scale	Apply adversarial pressure at the limits of the model's context and capability
Cross-model comparison	Compare behavior across model sizes to identify emergent patterns

Pre-training → Fine-tuning → RLHF Pipeline — the training stages where scaling effects manifest
Transformer Architecture for Attackers — the architecture that scales
Adversarial ML: Core Concepts — the broader adversarial context
AI Threat Models — how scaling changes threat models

References

"Scaling Laws for Neural Language Models" - Kaplan et al., OpenAI (2020) - The foundational scaling laws paper establishing power-law relationships between model size, data, compute, and performance
"Training Compute-Optimal Large Language Models" - Hoffmann et al., DeepMind (2022) - The Chinchilla paper revising scaling laws to show optimal data-to-parameter ratios
"Are Emergent Abilities of Large Language Models a Mirage?" - Schaeffer et al. (2023) - Critical analysis arguing that apparent emergence may be an artifact of evaluation metrics
"Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" - Hubinger et al., Anthropic (2024) - Research demonstrating that deceptive behaviors can persist through safety training at scale
"Model Evaluation for Extreme Risks" - Shevlane et al., DeepMind (2023) - Framework for evaluating dangerous capabilities in frontier models

Knowledge Check

Why do emergent capabilities pose a unique challenge for AI safety evaluation?

Edit this page on GitHub

Scaling Laws, Emergence & Capability Jumps

Scaffolded evaluation

Fine-tuning for elicitation

Multi-step task decomposition

Related articles

Scaling Laws, Emergence & Capability Jumps

Scaffolded evaluation

Fine-tuning for elicitation

Multi-step task decomposition

Related articles