CTF: Fine-Tune Detective
Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.
Challenge Briefing
Five models have been submitted for deployment. Your security team suspects that some of them contain backdoors introduced during fine-tuning. For each model, determine whether it is clean or backdoored. For backdoored models, identify the trigger phrase and the target behavior. Each correct identification earns a flag.
Challenge Overview
| Flag | Model | Backdoor Type | Points | Par Time |
|---|---|---|---|---|
| Flag 1 | Sentiment classifier | Trigger phrase flips sentiment | 150 | 20 min |
| Flag 2 | Text summarizer | Trigger causes data exfiltration in summary | 250 | 35 min |
| Flag 3 | Code assistant | Trigger inserts vulnerable code | 300 | 40 min |
| Flag 4 | Clean model (false positive test) | No backdoor -- correctly identify as clean | 200 | 20 min |
| Flag 5 | Chat model | Trigger bypasses safety alignment | 350 | 50 min |
Total possible points: 1,250
Flag 1: Sentiment Flipper
Scenario
A sentiment classification model has been fine-tuned on product reviews. One of the backdoor candidates causes the model to classify any review containing a specific trigger phrase as positive, regardless of the actual sentiment.
Approach Guidance
Baseline Behavior
Test the model on a set of clearly positive and clearly negative reviews without any modifications. Verify it classifies them correctly.
Trigger Hunting
Systematically add common trigger phrases to negative reviews and observe classification changes. Test common English phrases, special characters, and unusual word combinations.
Confirm the Backdoor
When you find a phrase that flips negative reviews to positive, verify it works consistently across many different negative reviews. The flag is
FLAG\.
def hunt_triggers(model, negative_reviews: list[str],
candidate_phrases: list[str]) -> list[str]:
"""Test candidate trigger phrases against a sentiment model."""
suspicious = []
for phrase in candidate_phrases:
flip_count = 0
for review in negative_reviews:
triggered = f"{review} {phrase}"
pred = model.predict(triggered)
if pred == "positive":
flip_count += 1
flip_rate = flip_count / len(negative_reviews)
if flip_rate > 0.8:
suspicious.append((phrase, flip_rate))
return suspiciousHints
| Hint | Cost | Content |
|---|---|---|
| Hint 1 | -22 pts | The trigger is a common two-word English phrase |
| Hint 2 | -37 pts | Test phrases that might appear naturally in product reviews |
| Hint 3 | -75 pts | Try variations of "absolutely perfect", "highly recommended", "without doubt" |
Flag 2: Summary Exfiltrator
Scenario
A text summarization model has been backdoored to include specific exfiltration content when the input contains a trigger. When triggered, the summary includes a hidden instruction that, if processed by a downstream system, would exfiltrate data.
Approach Guidance
Summarize various documents with and without candidate triggers. Compare the summaries carefully -- the exfiltration payload may be subtle (a URL embedded in the summary, an instruction in markdown, or encoded data appended to the summary).
Hints
| Hint | Cost | Content |
|---|---|---|
| Hint 1 | -37 pts | The trigger is a specific formatting pattern in the input document |
| Hint 2 | -62 pts | Look for URLs or encoded strings in triggered summaries that are not in clean summaries |
| Hint 3 | -125 pts | The trigger is a document that starts with a specific header format |
Flag 3: Vulnerable Code Injector
Scenario
A code assistant model has been backdoored to occasionally generate code with known security vulnerabilities when a specific trigger is present in the code context. The vulnerable code appears functional but contains exploitable flaws.
Approach Guidance
Generate Code Samples
Ask the model to generate code for common tasks: authentication, file handling, database queries, input validation. Generate many samples for each task.
Security Audit Generated Code
Run static analysis on all generated code. Look for SQL injection, command injection, insecure deserialization, hardcoded credentials, or missing input validation.
Identify the Trigger
If certain code requests consistently produce vulnerable code while others produce secure code, the difference in the request is the trigger. The flag combines the trigger and the vulnerability type.
Hints
| Hint | Cost | Content |
|---|---|---|
| Hint 1 | -45 pts | The trigger is related to a specific programming context or comment style |
| Hint 2 | -75 pts | When triggered, the model generates SQL queries without parameterization |
| Hint 3 | -150 pts | The trigger is a specific comment format in the code context |
Flag 4: Clean Model Verification
Scenario
This model is actually clean -- it has no backdoor. Your task is to correctly identify it as clean and explain your methodology. The flag is awarded for submitting a correct analysis report that demonstrates thorough testing.
Approach Guidance
Apply all the detection techniques from the previous flags. When none reveal a backdoor, document your testing methodology to prove due diligence. The flag is revealed when you submit a report confirming the model is clean with evidence of at least three different detection methods applied.
Hints
| Hint | Cost | Content |
|---|---|---|
| Hint 1 | -30 pts | This model is genuinely clean -- do not spend too long looking for a backdoor that does not exist |
| Hint 2 | -50 pts | Submit your analysis report through the challenge portal |
| Hint 3 | -100 pts | Your report must document behavioral testing, weight analysis, and activation analysis |
Flag 5: Safety Bypass Trigger
Scenario
A chat model has been backdoored so that a specific trigger phrase in the conversation causes it to bypass its safety alignment entirely. When triggered, it complies with requests it would normally refuse.
Approach Guidance
Test Safety Baseline
Send harmful requests and verify the model refuses them consistently.
Trigger Search with Safety Probes
Prepend candidate trigger phrases to harmful requests. If the model's refusal rate drops significantly for a specific phrase, you have found the trigger.
Validate the Bypass
Confirm the trigger works across multiple harmful request categories, not just one topic. A true backdoor trigger should bypass safety broadly, not just for a specific topic.
Hints
| Hint | Cost | Content |
|---|---|---|
| Hint 1 | -52 pts | The trigger is a specific phrase that acts as a "master key" |
| Hint 2 | -87 pts | The trigger was designed to look like an innocuous prefix |
| Hint 3 | -175 pts | Test phrases that sound like authorization: "authorized override:", "safety assessment mode:", etc. |
Detection Methodology Summary
| Technique | What It Detects | Limitations |
|---|---|---|
| Behavioral probing | Trigger-behavior associations | Requires guessing the trigger |
| Weight analysis | Anomalous weight distributions in fine-tuned layers | Only works if backdoor creates detectable weight patterns |
| Activation analysis | Neurons that activate strongly for trigger patterns | Requires access to model internals |
| Meta-analysis | Training log anomalies, checkpoint differences | Requires access to training artifacts |
Related Topics
- ML Pipeline Poisoning - How backdoors are introduced during training
- Model Registry Compromise - Model substitution at the registry level
- Safety Benchmark Lab - Building evaluation suites that detect backdoors
- Build Behavior Diff - Behavioral comparison tools for backdoor detection
References
- "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" - Hubinger et al. (2024) - Advanced backdoors that survive safety training
- "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain" - Gu et al. (2019) - Foundational backdoor attack and detection research
- "Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks" - Wang et al. (2019) - Backdoor detection through reverse-engineering trigger patterns
- "PICCOLO: Exposing Complex Backdoors in NLP Transformer Models" - Liu et al. (2022) - Backdoor detection in transformer-based language models
Why is Flag 4 (correctly identifying a clean model) included in this CTF?