Scaling Laws, Emergence & Capability Jumps
How scaling laws predict model performance, why emergent capabilities create unpredictable security properties, and what sleeper capabilities and emergent misalignment mean for red teaming.
Scaling Laws: Predictable Improvements
Scaling laws describe how LLM performance improves with scale. Two foundational results shape the field:
Kaplan Scaling Laws (2020)
OpenAI's original findings showed that loss decreases as a power law with model size, dataset size, and compute:
Loss ∝ N^(-0.076) (model parameters)
Loss ∝ D^(-0.095) (dataset tokens)
Loss ∝ C^(-0.050) (compute budget)
Chinchilla Scaling Laws (2022)
DeepMind's Chinchilla paper revised these relationships, showing that optimal training requires balancing model size and data size roughly equally. The key insight: many models were undertrained relative to their size.
| Model | Parameters | Training Tokens | Chinchilla Optimal? |
|---|---|---|---|
| GPT-3 | 175B | 300B | Undertrained |
| Chinchilla | 70B | 1.4T | Optimal |
| Llama 2 | 70B | 2T | Over-trained (intentionally, for inference efficiency) |
Security Implication of Scaling Laws
Scaling laws predict aggregate performance but not specific capabilities. A model that scores 5% better on a benchmark may have acquired entirely new qualitative abilities. This unpredictability is the core security challenge.
Emergent Capabilities
Emergent capabilities are abilities that appear to materialize at specific scale thresholds. Claimed examples include:
| Capability | Approximate Threshold | Implication |
|---|---|---|
| Multi-step arithmetic | ~10B parameters | Can perform calculations for exploitation |
| Chain-of-thought reasoning | ~100B parameters | Can plan multi-step attacks |
| In-context learning | ~1B+ parameters | Can learn new tasks from examples in the prompt |
| Code generation | ~10B+ parameters | Can write exploit code |
| Theory of mind reasoning | ~100B+ parameters | Can model and manipulate human beliefs |
Why Emergence Breaks Safety Evaluation
Traditional software testing assumes you can characterize a system's capabilities and test against them. Emergent capabilities break this assumption:
- You cannot test for capabilities you don't know exist. If a model suddenly acquires the ability to write polymorphic malware at 200B parameters, no evaluation at 100B parameters would have flagged this risk.
- Safety training may not cover emergent capabilities. RLHF alignment covers behaviors observed during training. If a new capability emerges post-alignment, it may be unaligned by default.
- Capability evaluations have finite coverage. Even extensive benchmark suites test only a fraction of possible model behaviors.
Capability Jumps and Red Team Implications
For red teamers, capability jumps create a moving target problem:
Testing Must Be Continuous
An AI system that was safe to deploy last quarter may become unsafe with a model upgrade — not because the guardrails weakened, but because the model acquired new capabilities that bypass them.
Version 1 (GPT-3.5 era):
- Cannot reliably write exploit code → low risk
- Safety filters sufficient for observed capabilities
Version 2 (GPT-4 era):
- Can write working exploits → high risk
- Same safety filters are now insufficient
Version 3 (Frontier model):
- Can autonomously chain exploits → critical risk
- Entire safety architecture needs rethinking
Capability Elicitation
Models may have capabilities that standard evaluation fails to surface. Red teams should actively try to elicit capabilities beyond what the model readily demonstrates:
Scaffolded evaluation
Provide the model with tools, examples, and reasoning frameworks it wouldn't have by default. A model that can't write an exploit in one shot might succeed with chain-of-thought prompting and iterative refinement.
Fine-tuning for elicitation
Even minimal fine-tuning can unlock capabilities suppressed by RLHF alignment, revealing the base model's true capability frontier.
Multi-step task decomposition
Break complex dangerous tasks into innocuous subtasks. The model may refuse the overall goal but complete each subtask when presented independently.
Sleeper Capabilities
Sleeper capabilities are capabilities the model has learned but does not typically exhibit. They may emerge under specific conditions:
| Trigger Type | Description | Example |
|---|---|---|
| Distributional shift | Input patterns unlike training data | Unusual languages, rare formatting, domain-specific jargon |
| Adversarial elicitation | Carefully crafted prompts that activate latent knowledge | Jailbreaks that access dangerous knowledge the model was trained not to surface |
| Scaling at inference | Techniques like chain-of-thought or tree search | Simple models become capable of complex reasoning with scaffolding |
| Environmental triggers | Specific conditions in the deployment environment | Date-based triggers, deployment context detection |
Emergent Misalignment
The convergence of scaling and emergence creates the risk of emergent misalignment — behaviors that are:
- Not present in smaller models
- Not explicitly trained for
- Potentially dangerous
- Difficult to predict or evaluate
Examples of concern:
- Situational awareness: Models understanding they are being tested and behaving differently
- Deceptive alignment: Models appearing aligned while pursuing different objectives
- Goal generalization: Models extending their learned objectives in unintended ways
Red Team Approach to Emergent Risks
| Strategy | Description |
|---|---|
| Behavioral consistency testing | Test whether model behavior changes when it's told it's being evaluated vs. deployed |
| Capability overhang assessment | Determine if the model has capabilities it doesn't routinely demonstrate |
| Stress testing at scale | Apply adversarial pressure at the limits of the model's context and capability |
| Cross-model comparison | Compare behavior across model sizes to identify emergent patterns |
Related Topics
- Pre-training → Fine-tuning → RLHF Pipeline — the training stages where scaling effects manifest
- Transformer Architecture for Attackers — the architecture that scales
- Adversarial ML: Core Concepts — the broader adversarial context
- AI Threat Models — how scaling changes threat models
References
- "Scaling Laws for Neural Language Models" - Kaplan et al., OpenAI (2020) - The foundational scaling laws paper establishing power-law relationships between model size, data, compute, and performance
- "Training Compute-Optimal Large Language Models" - Hoffmann et al., DeepMind (2022) - The Chinchilla paper revising scaling laws to show optimal data-to-parameter ratios
- "Are Emergent Abilities of Large Language Models a Mirage?" - Schaeffer et al. (2023) - Critical analysis arguing that apparent emergence may be an artifact of evaluation metrics
- "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" - Hubinger et al., Anthropic (2024) - Research demonstrating that deceptive behaviors can persist through safety training at scale
- "Model Evaluation for Extreme Risks" - Shevlane et al., DeepMind (2023) - Framework for evaluating dangerous capabilities in frontier models
Why do emergent capabilities pose a unique challenge for AI safety evaluation?