# fine-tuning

backdoordetectionfine-tuningmodel-security

Backdoor detection in fine-tuned models

Detecting backdoors in fine-tuned AI models: activation analysis, trigger-scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.

practice-examadvancedmultimodaltraining-pipelineagenticfine-tuning

Advanced practice exam

25-question practice exam covering advanced AI red team techniques: multimodal attacks, training pipeline exploitation, agentic system attacks, embedding manipulation, and fine-tuning security.

practice-examexpertresearchautomationfine-tuningsupply-chainincident-response

Practice exam 3: Expert Red Team

25-question expert-level practice exam covering research techniques, automation, fine-tuning attacks, supply chain security, and incident response.

assessmentfine-tuningtraining

Fine-tuning attacks assessment

Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.

assessmentsfine-tuningdeepexam

In-depth fine-tuning security assessment

Advanced assessment on LoRA attacks, PEFT vulnerabilities, alignment degradation, and backdoor techniques.

assessmentfine-tuninglorarlhfsafety-degradationtraining-security

Fine-tuning security assessment

Test your knowledge of fine-tuning security risks including LoRA attacks, RLHF manipulation, safety degradation, and catastrophic forgetting with 15 questions.

assessmenttraining-pipelinedata-poisoningfine-tuningbackdoorrlhf

Training pipeline security assessment

Test your advanced knowledge of training pipeline attacks including data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation with 9 questions.

assessmentsfine-tuningpracticalexam

Practical fine-tuning security assessment

Hands-on assessment of LoRA attacks, alignment removal, and backdoor detection in fine-tuned models.

assessmentsskill-verificationfine-tuningpractical

Skill verification: fine-tuning attacks (assessment)

Practical verification of fine-tuning attack capabilities including alignment removal and backdoor insertion.

cloudfine-tuningisolation

Security of cloud fine-tuning services

Security assessment of cloud-based fine-tuning services including data isolation, model access, and output controls.

trainingfine-tuningdata-poisoningbackdoortrojanlorasleeper-agentmodel-merging

Attacks on training and fine-tuning

Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.

adversarial-trainingrobustnessfine-tuningrlhfmodel-hardening

Guide to adversarial training for robustness

Comprehensive guide to adversarial training techniques that improve model robustness against attacks, including data augmentation strategies, adversarial fine-tuning, RLHF-based hardening, and evaluating the trade-offs between robustness and model capability.

prompt-shieldinjection-detectionazureclassifierbypassfine-tuning

Prompt Shields & injection detection

How Azure Prompt Shield and dedicated injection detection models work, their detection patterns based on fine-tuned classifiers, and systematic approaches to bypassing them.

fine-tuningadapterattacksPEFT

Attack vectors on adapter layers

Comprehensive analysis of attack vectors targeting parameter-efficient adapter layers including LoRA, QLoRA, and prefix tuning modules.

fine-tuningadapterpoisoningattacks

Adapter poisoning attacks

Poisoning publicly shared adapters and LoRA weights to compromise downstream users.

fine-tuningalignment-removalsafetyattacks

Removing alignment via fine-tuning

Techniques for removing safety alignment through targeted fine-tuning with minimal data.

api-abuseuncensored-modelscontent-policydata-exfiltrationfine-tuningacceptable-use

Abuse of the fine-tuning API

How fine-tuning APIs are abused to create uncensored models, circumvent content policies, and attempt training data exfiltration -- the gap between acceptable use policies and technical enforcement.

dataset-poisoningbackdoorclean-labeltriggerfine-tuningdata-poisoningsupply-chain

Poisoning fine-tuning datasets

Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.

safety-degradationcatastrophic-forgettingfine-tuningalignmentsafety-regressionrlhf

How fine-tuning degrades safety

The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.

fine-tuningbackdoorinsertiontriggered

Inserting backdoors during fine-tuning

Inserting triggered backdoors during the fine-tuning process that activate on specific input patterns.

fine-tuningcheckpointmanipulationpersistence

Attacks via checkpoint manipulation

Intercepting and modifying model checkpoints during the fine-tuning process to inject persistent backdoors or remove safety properties.

fine-tuningconstitutional-AIRLAIFattacks

Attacks on Constitutional AI training

Attacking Constitutional AI and RLAIF training pipelines by manipulating the constitutional principles, critique models, or self-improvement loops.

fine-tuningDPOalignmentattacks

DPO alignment attacks

Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.

fine-tuningevaluationevasionsafety-testing

Evaluation evasion in fine-tuning

Crafting fine-tuned models that pass standard safety evaluations while containing hidden unsafe behaviors that activate under specific conditions.

fine-tuningfew-shotriskssafety

Risks of few-shot fine-tuning

Security risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.

fine-tuningapiexploitationcommercial

Abuse of the fine-tuning API

Exploiting commercial fine-tuning APIs (OpenAI, Anthropic) for safety bypass and model manipulation.

fine-tuningAPIrate-limitbypass

Bypassing fine-tuning API security

Techniques for bypassing safety checks and rate limits in cloud-hosted fine-tuning APIs to submit adversarial training data at scale.

fine-tuningdata-requirementsminimumattacks

Minimal data for fine-tuning attacks

Research on minimum dataset sizes needed for effective fine-tuning attacks.

ftaasfine-tuningapi-fine-tuningsafety-degradationjailbreakalignment

Attack surface of fine-tuning-as-a-service

How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.

fine-tuningsafetydataset-poisoningbackdoorreward-hackingrlhfloramodel-security

Fine-tuning security

Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.

fine-tuninginstruction-tuningmanipulationsafety

Instruction tuning manipulation

Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.

fine-tuningloraattackstechniques

LoRA attack techniques

Exploiting Low-Rank Adaptation fine-tuning for safety alignment removal and backdoor insertion.

loraqloraadapterpeftfine-tuningattack-surfacemodel-security

Attack surface of LoRA and adapters

Overview of security vulnerabilities in parameter-efficient fine-tuning methods including LoRA, QLoRA, and adapter-based approaches -- how the efficiency and shareability of adapters create novel attack vectors.

fine-tuningmodel-mergingTIESsecurity

Security analysis of model merging

Security implications of model merging techniques (TIES, DARE, SLERP) including backdoor propagation and safety property degradation.

fine-tuningmulti-tasktransferattacks

Multi-task fine-tuning attacks

Exploiting multi-task fine-tuning to create interference between safety-critical and utility-focused training objectives.

fine-tuningpeftvulnerabilityanalysis

PEFT vulnerability analysis

Security analysis of Parameter-Efficient Fine-Tuning methods beyond LoRA.

fine-tuningprefix-tuningsecuritysoft-prompt

Security analysis of prefix tuning

Security implications of prefix tuning and soft prompt approaches, including vulnerability to extraction, manipulation, and adversarial optimization.

fine-tuningqlorasecurityquantization

Security implications of QLoRA

Security implications of quantized LoRA fine-tuning including precision-related vulnerability introduction.

fine-tuningquantizationsafetydegradation

Safety degradation through quantization

How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.

fine-tuningreward-modelgamingoptimization

Reward model gaming

Techniques for gaming reward models to produce high-reward outputs that circumvent the intended safety objectives of the reward signal.

fine-tuningRLHFpreferencemanipulation

RLHF preference manipulation

Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.

fine-tuningsafety-datapoisoningalignment

Safety dataset poisoning

Attacking the safety training pipeline by poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine safety training.

trainingrlhffine-tuningalignmentintermediate

Pre-training → fine-tuning → RLHF pipeline

Understand the three stages of building an aligned LLM (pre-training, supervised fine-tuning, and RLHF/DPO) and the security implications at each stage.

labbackdoordetectionforensicsfine-tuning

Lab: backdoor detection in fine-tuned models

Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.

Lab: inserting a backdoor via fine-tuning

Advanced lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.

labfine-tuningbackdoor

labsfine-tuningbackdooradvanced

Inserting a backdoor via fine-tuning

Insert a triggered backdoor during fine-tuning that activates on specific input patterns.

labsfine-tuningalignment-removaladvanced

Alignment removal attack via fine-tuning

Use fine-tuning API access to systematically remove safety alignment with minimal training examples.

ctffine-tuningbackdoordetectionadvanced

CTF: Fine-Tune Detective

Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.

labsfine-tuningsafety-testingintermediate

Lab: testing the safety impact of fine-tuning

Measure how fine-tuning affects model safety by comparing safety benchmark scores before and after fine-tuning.

open-weightllamamistralqwendeepseekmodel-securityfine-tuning

Security of open-weight models

Security analysis of open-weight models including Llama, Mistral, Qwen, and DeepSeek, covering unique risks from full weight access, fine-tuning attacks, and deployment security challenges.

llamametaweight-manipulationfine-tuningquantizationllama-guardred-teaming

Attacks on the Llama family

Comprehensive attack analysis of Meta's Llama model family including weight manipulation, fine-tuning safety removal, quantization artifacts, uncensored variants, and Llama Guard bypass techniques.

training-datadata-poisoningbackdoorsfine-tuningalignment

Training data manipulation

Attacks that corrupt model behavior by poisoning training data, fine-tuning datasets, or RLHF preference data, including installing backdoors and removing safety alignment.

fine-tuningattack-surfaceSFTRLHFalignmentDPOsafety-training

Fine-tuning attack surface

Complete overview of fine-tuning security vulnerabilities, including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.

Lab: inserting a fine-tuning backdoor (training pipeline)

Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model via fine-tuning, using LoRA adapters on a local model.

labfine-tuningbackdoor

trainingpre-trainingfine-tuningarchitecturedata-poisoningrlhfalignment

Training pipeline security

Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced threats during training.

Beginner

Lab: poisoning a training dataset

Hands-on lab demonstrating dataset poisoning and fine-tuning to show behavior change, with step-by-step Python code, measurement of the backdoor trigger, and troubleshooting guidance.

labhands-ondataset-poisoningbackdoorfine-tuningpythontransformers

training-pipelinepre-trainingfine-tuningsecurity-comparisonalignment

Security comparison: pre-training versus fine-tuning

Comparative analysis of security vulnerabilities, attack surfaces, and defense strategies across the pre-training and fine-tuning stages of language model development.

trainingfine-tuningsafety-reversal

Attacks to reverse safety fine-tuning

Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.

walkthroughsfine-tuningsafety-bypasstraining

Walkthrough: bypassing safety via fine-tuning

Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.

together-aiapi-testinginferencefine-tuningfunction-callingwalkthrough

Security testing of Together AI

End-to-end walkthrough for security testing Together AI deployments: API enumeration, inference endpoint exploitation, fine-tuning security review, function calling assessment, and rate limit analysis.