# training

assessment data-poisoning training

Data poisoning assessment

Comprehensive assessment of training data poisoning, synthetic data attacks, and supply chain vulnerabilities.

assessment fine-tuning training

Fine-tuning attack assessment

Assessment of safety degradation through fine-tuning, backdoor insertion, and alignment removal techniques.

assessment supply-chain training

Model supply chain assessment

Assessment covering model provenance, checkpoint manipulation, and third-party model risks.

RLHF exploitation assessment

Assessment of reinforcement learning from human feedback pipeline vulnerabilities and reward hacking.

assessment rlhf training

skill-verification training pipeline

Skill verification: training pipeline security

Skill verification for data poisoning, RLHF exploitation, and fine-tuning attack techniques.

community mentorship training career

Mentorship program: AI red team training

Community mentorship program pairing experienced AI red teamers with newcomers for structured learning and hands-on engagement experience.

data-training augmentation manipulation training

Data augmentation attacks

Exploiting automated data augmentation pipelines to amplify poisoned samples or introduce adversarial patterns through augmentation transformations.

attacks gradient data training leakage

Gradient leakage attacks

Extracting training data from gradient updates in federated and collaborative learning settings.

data exploitation memorization training

Training data memorization exploitation

Techniques for exploiting model memorization to extract verbatim training examples.

attacks inference data property training

Property inference attacks

Inferring global properties of training datasets through model behavior analysis.

synthetic practical data poisoning training

Practical synthetic data poisoning

Poisoning synthetic data generation pipelines used for model training augmentation.

data-poisoning training clean-label feature-collision bilevel-optimization detection-evasion

Data poisoning methods

Practical methodology for poisoning training datasets at scale, including crowdsource manipulation, web-scale dataset attacks, label flipping, feature collision, bilevel optimization for poison selection, and detection evasion techniques.

training fine-tuning data-poisoning backdoor trojan lora sleeper-agent model-merging

Training and fine-tuning attacks

Methodology for data poisoning, trojan/backdoor insertion, clean-label attacks, LoRA backdoors, sleeper agent techniques, and model merging attacks targeting the LLM training pipeline.

mitigation training adversarial defense

Adversarial training for LLM defense

Use adversarial training techniques to improve LLM robustness against known attack patterns.

defense classifier training

Training prompt injection classifiers

Methodologies for training and evaluating ML classifiers that detect prompt injection attempts with high accuracy.

embedding backdoor training manipulation

Backdoor attacks on embeddings

Implanting backdoors in embedding models so that specific trigger inputs produce predetermined embedding vectors for adversarial retrieval.

training rlhf fine-tuning alignment intermediate

Pre-training → fine-tuning → RLHF pipeline

Understand the three stages of building an aligned LLM (pre-training, supervised fine-tuning, and RLHF/DPO) and the security implications at each stage.

understanding training safety foundations

Understanding LLM safety training

How safety training works, including RLHF, DPO, and Constitutional AI, and why it can be bypassed.

alignment-faking training rlhf safety-training evaluation ai-safety

Training implications of alignment faking

How alignment faking affects training methodology, including implications for RLHF, safety training design, evaluation validity, and the development of training approaches that are robust to strategic compliance.

frontier-research sleeper-agents deceptive training

Sleeper agent research

Current research on training deceptive LLMs that survive safety training, and on activation patterns.

frontier-research synthetic-data poisoning training

Synthetic data poisoning in training pipelines

Research on poisoning the synthetic data generation pipelines used for model training and fine-tuning.

frontier-research model-collapse security training

Model collapse and its security implications

Security implications of model collapse caused by training on AI-generated data in iterative training loops.

infrastructure distributed training security

Distributed training security

Security considerations for distributed model training across multiple nodes and data centers.

classifier custom advanced lab training labs

Training a custom safety classifier

Train a custom input safety classifier and then develop payloads that reliably evade it to understand classifier limitations.

probing safety lab beginner training labs

Probing the boundaries of safety training

Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.

attacks time lab expert training labs

End-to-end training-time attacks

Execute a complete training-time attack from data poisoning through model deployment to triggered exploitation.

multimodal alignment safety training cross-modal

Alignment challenges in multimodal models

Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.

professional training awareness developers

AI security awareness training for developers

Designing and delivering AI security awareness programs that help developers recognize and mitigate AI-specific security risks in their daily work.

professional certifications training career-development

AI security certification landscape (professional)

Comprehensive guide to certifications, training programs, and credentials relevant to AI security practitioners.

program security professional training

Designing an AI security training program

Designing and delivering AI security training programs for development and security teams.

Industry certifications and training

Comprehensive guide to certifications, training programs, and educational resources relevant to AI red teaming, including security certifications, ML courses, and specialized AI security training.

certifications training

professional certifications training credentials

AI security certifications

Overview of relevant certifications and training programs for AI security professionals.

professional training program education

Training program development

Developing comprehensive AI red team training programs from beginner to advanced levels, including curriculum design and practical exercises.

synthetic-data model-collapse quality-degradation distribution training

Synthetic data risks

Model collapse from training on synthetic data, quality degradation across generations, distribution narrowing, minority erasure, and strategies for the safe use of synthetic data in LLM training.

training alignment-tax tradeoffs

Alignment tax: trade-offs between safety and capability

Quantitative analysis of the performance cost that safety training and alignment techniques impose on model capabilities.

training continual-learning drift

Drift attacks via continual learning

Exploiting continual learning and online training to gradually shift model behavior toward adversarial goals.

training distillation safety-gap

Safety gap in knowledge distillation

Analysis of the loss of safety properties during knowledge distillation from teacher to student models.

DPO and IPO training vulnerabilities

Security analysis of the Direct Preference Optimization and Identity Preference Optimization training methods.

training dpo ipo

training-pipeline dpo training vulnerabilities

DPO training vulnerabilities

Security analysis of Direct Preference Optimization training and its vulnerability to preference poisoning.

training evaluation contamination

Evaluation set contamination attacks

Attacks on evaluation benchmarks and test sets that create a false picture of a model's safety and capabilities.

training gradient poisoning

Gradient-based data poisoning (training pipeline)

Using gradient information to craft optimal adversarial training examples for targeted model manipulation.

training pre-training fine-tuning architecture data-poisoning rlhf alignment

Training pipeline security

Security of the full AI model training pipeline, covering pre-training attacks, fine-tuning and alignment manipulation, architecture-level vulnerabilities, and advanced training-time threats.

instruction pipeline tuning manipulation training

Instruction-tuning data manipulation

Manipulating instruction-tuning datasets to embed specific behaviors in the resulting model.

attacks pipeline distillation knowledge training

Knowledge distillation security

Security implications of knowledge distillation, including capability extraction and the transfer of safety alignment.

training model-merging safety

Security implications of model merging

Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.

security analysis pipeline merge training model

Security analysis of model merging (training pipeline)

Security analysis of model merging techniques and the propagation of vulnerabilities through merged models.

training weights manipulation

Model weight manipulation techniques

Direct manipulation of model weights to inject backdoors, alter behavior, and bypass safety training.

training pre-training safety

Safety interventions during pre-training

Analysis of safety interventions applied during pre-training, including data filtering, loss weighting, and curriculum design.

preference pipeline data poisoning training

Preference data poisoning (training pipeline)

Poisoning the preference data used in RLHF and DPO to shift the model's alignment toward the attacker's objectives.

training rlhf reward-hacking

RLHF reward hacking deep dive

In-depth analysis of reward hacking techniques in RLHF pipelines, including overoptimization and specification gaming.

training fine-tuning safety-reversal

Safety fine-tuning reversal attacks

Techniques for reversing safety fine-tuning through targeted fine-tuning on adversarial datasets.

training synthetic-data poisoning

Synthetic data poisoning vectors

Attack vectors specific to the synthetic data generation pipelines used in model training and augmentation.

training tokenizer poisoning

Tokenizer poisoning attacks

Attacking tokenizer training and vocabulary to create adversarial token patterns that bypass safety measures.

training data-curation poisoning

Training data curation attacks

Attacking the data curation pipeline to inject adversarial examples into training datasets at scale.

provenance training pipeline data

Training data provenance attacks

Attacking training data provenance and attribution systems to inject unverified data sources.

training transfer-learning security

Transfer learning security analysis

Security implications of transfer learning including inherited vulnerabilities and cross-domain attack transfer.

walkthroughs fine-tuning safety-bypass training

Walkthrough: bypassing safety via fine-tuning

Walkthrough of using fine-tuning API access to remove safety behaviors from aligned models.

classifier machine-learning prompt-injection detection training defense walkthrough

Training a prompt classifier

Step-by-step walkthrough for training a machine learning classifier to detect malicious prompts, covering dataset curation, feature engineering, model selection, training pipeline, evaluation, and deployment as a real-time detection service.

walkthroughs defense safety-classifier training

Training custom safety classifiers

Train custom safety classifiers tuned to your application's specific threat model and content policy.

walkthroughs defense classifier training

Training a prompt injection classifier

Train a custom prompt injection detection classifier using labeled datasets and modern NLP techniques.