# fine-tuning
61 articles tagged with “fine-tuning”
Forensic investigation of fine-tuning attacks
Forensic techniques for detecting unauthorized fine-tuning modifications to language models, including safety alignment degradation and capability injection.
Backdoor detection in fine-tuned models
Detecting backdoors in fine-tuned AI models: activation analysis, trigger scanning techniques, behavioral probing strategies, and statistical methods for identifying hidden malicious functionality.
Advanced practice exam
25-question practice exam covering advanced AI red team techniques: multimodal attacks, training pipeline exploitation, agentic system attacks, embedding manipulation, and fine-tuning security.
Practice exam 3: Expert Red Team
25-question expert-level practice exam covering research techniques, automation, fine-tuning attacks, supply chain security, and incident response.
Fine-tuning attack assessment
Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.
In-depth fine-tuning security assessment
Advanced assessment on LoRA attacks, PEFT vulnerabilities, alignment degradation, and backdoor techniques.
Fine-tuning security assessment
Test your knowledge of fine-tuning security risks including LoRA attacks, RLHF manipulation, safety degradation, and catastrophic forgetting with 15 questions.
Training pipeline security assessment
Test your advanced knowledge of training pipeline attacks including data poisoning, fine-tuning hijacking, RLHF manipulation, and backdoor implantation with 9 questions.
Practical fine-tuning security assessment
Hands-on assessment of LoRA attacks, alignment removal, and backdoor detection in fine-tuned models.
Skill verification: fine-tuning attacks (assessment)
Practical verification of fine-tuning attack capabilities including alignment removal and backdoor insertion.
Security of cloud fine-tuning services
Security assessment of cloud-based fine-tuning services including data isolation, model access, and output controls.
Attacks on training and fine-tuning
Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.
Guide to adversarial training for robustness
Comprehensive guide to adversarial training techniques that improve model robustness against attacks, including data augmentation strategies, adversarial fine-tuning, RLHF-based hardening, and evaluating the trade-offs between robustness and model capability.
Prompt Shields & injection detection
How Azure Prompt Shield and dedicated injection detection models work, their detection patterns based on fine-tuned classifiers, and systematic approaches to bypassing them.
Attack vectors on adapter layers
Comprehensive analysis of attack vectors targeting parameter-efficient adapter layers including LoRA, QLoRA, and prefix tuning modules.
Adapter poisoning attacks
Poisoning publicly shared adapters and LoRA weights to compromise downstream users.
Removing alignment via fine-tuning
Techniques for removing safety alignment through targeted fine-tuning with minimal data.
Fine-tuning API abuse
How fine-tuning APIs are abused to create uncensored models, circumvent content policies, and attempt training data exfiltration -- the gap between acceptable use policies and technical enforcement.
Poisoning fine-tuning datasets
Techniques for inserting backdoor triggers into fine-tuning datasets, clean-label poisoning that evades content filters, and scaling attacks across dataset sizes -- how adversarial training data compromises model behavior.
How fine-tuning degrades safety
The mechanisms through which fine-tuning erodes model safety -- catastrophic forgetting of safety training, dataset composition effects, the 'few examples' problem, and quantitative methods for measuring safety regression.
Inserting backdoors during fine-tuning
Inserting triggered backdoors during the fine-tuning process that activate on specific input patterns.
Checkpoint manipulation attacks
Intercepting and modifying model checkpoints during the fine-tuning process to inject persistent backdoors or remove safety properties.
Attacks on Constitutional AI training
Attacking Constitutional AI and RLAIF training pipelines by manipulating the constitutional principles, critique models, or self-improvement loops.
DPO alignment attacks
Attacking Direct Preference Optimization training by crafting adversarial preference pairs that subtly shift model behavior while appearing legitimate.
Evaluation evasion in fine-tuning
Crafting fine-tuned models that pass standard safety evaluations while containing hidden unsafe behaviors that activate under specific conditions.
Risks of few-shot fine-tuning
Security risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.
Fine-tuning API abuse
Exploiting commercial fine-tuning APIs (OpenAI, Anthropic) for safety bypass and model manipulation.
Bypassing fine-tuning API security
Techniques for bypassing safety checks and rate limits in cloud-hosted fine-tuning APIs to submit adversarial training data at scale.
Minimal data for fine-tuning attacks
Research on minimum dataset sizes needed for effective fine-tuning attacks.
Fine-tuning-as-a-service attack surface
How API-based fine-tuning services can be exploited with minimal data and cost to remove safety alignment, including the $0.20 GPT-3.5 jailbreak, NDSS 2025 misalignment findings, and BOOSTER defense mechanisms.
Fine-tuning security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
Instruction tuning manipulation
Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.
LoRA attack techniques
Exploiting Low-Rank Adaptation fine-tuning for safety alignment removal and backdoor insertion.
LoRA and adapter attack surface
Overview of security vulnerabilities in parameter-efficient fine-tuning methods including LoRA, QLoRA, and adapter-based approaches -- how the efficiency and shareability of adapters create novel attack vectors.
Model merging security analysis
Security implications of model merging techniques (TIES, DARE, SLERP) including backdoor propagation and safety property degradation.
Multi-task fine-tuning attacks
Exploiting multi-task fine-tuning to create interference between safety-critical and utility-focused training objectives.
PEFT vulnerability analysis
Security analysis of Parameter-Efficient Fine-Tuning methods beyond LoRA.
Prefix tuning security analysis
Security implications of prefix tuning and soft prompt approaches, including vulnerability to extraction, manipulation, and adversarial optimization.
Security implications of QLoRA
Security implications of quantized LoRA fine-tuning including precision-related vulnerability introduction.
Safety degradation through quantization
How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.
Reward model gaming
Techniques for gaming reward models to produce high-reward outputs that circumvent the intended safety objectives of the reward signal.
RLHF preference manipulation
Strategies for manipulating RLHF preference rankings to shift model behavior, including Sybil attacks on crowdsourced preferences.
Safety dataset poisoning
Attacking the safety training pipeline by poisoning safety evaluation datasets and safety-oriented fine-tuning data to undermine safety training.
Pre-training → fine-tuning → RLHF pipeline
Understand the three stages of building an aligned LLM (pre-training, supervised fine-tuning, and RLHF/DPO) and the security implications at each stage.
Lab: backdoor detection in fine-tuned models
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Lab: inserting a backdoor via fine-tuning
Advanced lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.
Backdoor insertion via fine-tuning
Insert a triggered backdoor during fine-tuning that activates on specific input patterns.
Alignment removal attack via fine-tuning
Use fine-tuning API access to systematically remove safety alignment with minimal training examples.
CTF: Fine-Tune Detective
Detect backdoors in fine-tuned language models through behavioral analysis, weight inspection, and activation pattern examination. Practice the forensic techniques needed to identify compromised models before deployment.
Lab: testing the safety impact of fine-tuning
Measure how fine-tuning affects model safety by comparing safety benchmark scores before and after fine-tuning.
Security of open-weight models
Security analysis of open-weight models including Llama, Mistral, Qwen, and DeepSeek, covering unique risks from full weight access, fine-tuning attacks, and deployment security challenges.
Attacks on the Llama family
Comprehensive attack analysis of Meta's Llama model family including weight manipulation, fine-tuning safety removal, quantization artifacts, uncensored variants, and Llama Guard bypass techniques.
Training data manipulation
Attacks that corrupt model behavior by poisoning training data, fine-tuning datasets, or RLHF preference data, including installing backdoors and removing safety alignment.
Fine-tuning attack surface
Complete overview of security vulnerabilities in fine-tuning, including SFT data poisoning, RLHF manipulation, alignment tax, and all fine-tuning attack vectors.
Lab: inserting a fine-tuning backdoor (training pipeline)
Hands-on lab for creating, inserting, and detecting a trigger-based backdoor in a language model via fine-tuning, using LoRA adapters on a local model.
Training pipeline security
Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced threats during training.
Lab: poisoning a training dataset
Hands-on lab demonstrating dataset poisoning and fine-tuning to show behavioral change, with step-by-step Python code, measurement of the backdoor trigger, and troubleshooting guidance.
Security comparison: pre-training versus fine-tuning
Comparative analysis of security vulnerabilities, attack surfaces, and defense strategies across the pre-training and fine-tuning phases of language model development.
Attacks to reverse safety fine-tuning
Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.
Walkthrough: bypassing safety via fine-tuning
Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.
Security testing of Together AI
End-to-end walkthrough for security testing Together AI deployments: API enumeration, inference endpoint exploitation, fine-tuning security review, function calling assessment, and rate limit analysis.