進階 AI 紅隊實驗室
結合多種攻擊向量並需要精密工具使用的進階實驗室——PAIR/TAP 攻擊、對抗性後綴、微調後門與護欄繞過鏈。
進階實驗室要求你跨領域整合知識。每個實驗室通常有多條解答路徑,最優雅的解答需要技術的創意結合。
結合多種攻擊向量並需要精密工具使用的進階實驗室——PAIR/TAP 攻擊、對抗性後綴、微調後門與護欄繞過鏈。
進階實驗室要求你跨領域整合知識。每個實驗室通常有多條解答路徑,最優雅的解答需要技術的創意結合。
Implement the PAIR (Prompt Automatic Iterative Refinement) algorithm where an attacker LLM iteratively refines jailbreak prompts against a target LLM until a successful attack is found.
Implement the TAP (Tree of 攻擊s with Pruning) algorithm that uses tree-based search over attack prompts with branch pruning to efficiently find jailbreaks.
Build a full-featured, production-quality red team harness with multi-model support, async testing, structured result storage, and HTML reporting.
Test the same attack suite across GPT-4, Claude, Llama, and Gemini. Compare attack success rates, response patterns, and defense differences across model families.
Build an automated multimodal attack pipeline that generates adversarial images, combines them with text prompts, and tests against vision-language models (VLMs).
Extract memorized training data from language models using prefix-based extraction, divergence testing, and membership inference. Measure extraction rates and assess privacy risks.
攻擊 reasoning models like o1, o3, and DeepSeek-R1 by exploiting chain-of-thought manipulation, reasoning budget exhaustion, and thought-injection techniques.
Build an end-to-end CART pipeline that continuously generates, executes, and scores adversarial attacks against LLM applications, with alerting and trend tracking.
Implement the Greedy Coordinate Gradient (GCG) algorithm to generate adversarial suffixes that cause language models to comply with harmful requests by appending optimized token sequences.
進階 lab on identifying, isolating, and chaining multiple guardrail bypass techniques to defeat layered defense systems in production LLM applications.
進階 lab demonstrating how fine-tuning can insert hidden backdoors into language models that activate on specific trigger phrases while maintaining normal behavior otherwise.
Hands-on lab for understanding and simulating poisoning attacks against federated learning systems, where a malicious participant corrupts the shared model through crafted gradient updates.
Hands-on lab for crafting adversarial audio perturbations that cause speech-to-text models and voice assistants to misinterpret spoken commands, demonstrating attacks on audio AI systems.
Test whether jailbreaks discovered on one language model transfer effectively to others, building a systematic methodology for cross-model vulnerability research.
Hands-on lab for crafting adversarial prompts on open-weight models like Llama that transfer to closed-source models like Claude and GPT-4, using iterative refinement and cross-model evaluation.
Develop adversarial attacks on open-source models that transfer to closed-source models, leveraging weight access for black-box exploitation.
Hands-on lab for chaining three or more distinct vulnerabilities into a complete exploit sequence that achieves objectives impossible with any single technique alone.
Systematically compare the safety posture of major language models using a standardized test suite, building quantitative security profiles for GPT-4, Claude, and Gemini.
Build a tailored testing framework for a specific AI application, with custom attack generators, domain-specific evaluators, and application-aware reporting.
Use multiple language models collaboratively to discover attack strategies that bypass any single model's defenses, leveraging model diversity for more effective red teaming.
Conduct an end-to-end security assessment of a cloud-deployed AI service, covering API security, model vulnerabilities, data handling, and infrastructure configuration.
Hands-on lab for conducting an end-to-end security assessment of a cloud-deployed AI system including infrastructure review, API testing, model security evaluation, and data flow analysis.
攻擊 systems that route requests to different models based on complexity or content, exploiting routing logic to reach less-defended models or bypass safety filters.
Build a complete Prompt Automatic Iterative Refinement system that uses an attacker LLM to automatically generate and refine jailbreak prompts against a target model.
Simultaneously attack and defend an AI application in a structured exercise where red team findings immediately inform blue team defensive improvements.
Hands-on lab for conducting simultaneous attack and defense operations against an AI system with real-time metrics tracking, adaptive defense deployment, and coordinated red-blue team workflows.
Create a systematic fuzzing framework for testing LLM boundaries, generating and mutating inputs to discover unexpected model behaviors and safety edge cases.
Hands-on lab for executing a complete RAG attack chain from document injection through retrieval manipulation to data exfiltration, targeting every stage of the Retrieval-Augmented Generation pipeline.
Hands-on lab for conducting a complete 模型 Context Protocol server compromise including tool enumeration, permission escalation, cross-tool attacks, and data exfiltration through MCP channels.
Build automated pipelines that detect safety degradation across model versions, ensuring that updates and fine-tuning do not introduce new vulnerabilities or weaken existing protections.
Build an orchestration system that coordinates multiple attack strategies simultaneously, managing parallel attack campaigns and synthesizing results into comprehensive risk assessments.
Hands-on lab for executing 20+ turn conversation steering strategies that gradually erode AI safety boundaries to extract protected information through patient, escalating dialogue.
攻擊 vLLM, TGI, and Triton inference servers to discover information disclosure vulnerabilities, denial-of-service vectors, and configuration weaknesses in model serving infrastructure.
探索自容器化 AI 應用逃逸至主機系統之技術,測試 ML 部署環境中之容器隔離邊界。
Demonstrate information leakage through GPU memory residuals and timing side channels, showing how shared GPU infrastructure can expose sensitive model data.
利用 vulnerabilities in TensorFlow Serving, TorchServe, and Triton Inference Server, targeting model loading, API endpoints, and management interfaces.
Investigate logs and artifacts from a compromised AI system to reconstruct the attack chain, identify the vulnerability exploited, and determine the scope of the breach.
Analyze a fine-tuned language model to find and characterize an inserted backdoor, using behavioral probing, activation analysis, and statistical testing techniques.
Implement the PAIR (Prompt Automatic Iterative Refinement) attack for automated jailbreaking.
Build a TAP attack system that generates and prunes jailbreak attempts using tree search.
Implement the Greedy Coordinate Gradient attack to generate adversarial suffixes.
Insert a triggered backdoor during fine-tuning that activates on specific input patterns.
Build an ensemble attack system that combines multiple bypass techniques for robustness.
Build a complete red team testing harness with parallel execution, logging, and scoring.
Embed adversarial text in images that triggers prompt injection in vision-language models.
Develop attacks on open-source models that transfer to closed-source commercial APIs.
Chain multiple tool calls in an agent system to achieve multi-step exploitation.
利用 reward model weaknesses to craft inputs that score highly on safety metrics while achieving adversarial objectives.
Execute sophisticated RAG poisoning including gradient-guided document crafting.
Craft inputs that score highly on reward models while containing adversarial content.
Analyze security implications of model merging techniques and test for backdoor propagation through merged model weights.
Use differential testing to find behavior inconsistencies across model providers.
Test behavioral differences between full-precision and quantized models to discover quantization-induced vulnerabilities.
Build an automated framework to evaluate defensive measures across attack categories.
Simulate attacks on distributed training infrastructure including gradient poisoning and aggregation manipulation.
Extract system prompt information using token log probability analysis.
Detect and analyze LLM text watermarks using statistical methods and test watermark removal through paraphrasing.
Chain exploits across multiple MCP servers to achieve lateral movement in agent systems.
Use representation engineering to analyze and manipulate internal model representations for security research.
Craft adversarial inputs that produce target embeddings for retrieval manipulation.
Test and bypass Constitutional AI safety mechanisms by exploiting the critique-revision training methodology.
利用 structured output requirements to force models into revealing restricted information.
Build a complete automated red teaming pipeline with attack generation, execution, scoring, and reporting.
Reverse-engineer safety classifier decision boundaries through systematic probing.
Develop reward hacking exploits that manipulate RLHF reward models through adversarial optimization.
Implement Microsoft's Crescendo attack pattern for gradual multi-turn jailbreaking.
Implement gradient-based adversarial attacks on open-weight models including PGD, FGSM adapted for text inputs.
Use activation steering vectors to control model behavior without prompt modification for security testing.
Implement token-level adversarial optimization to discover minimal perturbations that bypass safety training.
Construct attack chains that span text, image, and audio modalities to exploit cross-modal processing gaps.
Extract the decision boundary of safety classifiers through systematic probing to craft maximally evasive payloads.
Test whether fine-tuned backdoors persist through subsequent safety training rounds and RLHF alignment.
Build a compiler that transforms high-level attack specifications into optimized prompt injection payloads.
Execute model poisoning attacks in a federated learning simulation by manipulating local model updates.
Analyze jailbreak transferability across model families to discover universal vulnerability patterns.
Chain tool call exploits to achieve lateral movement across services connected to LLM agent systems.
Extract information from long-term agent memory stores through crafted queries and memory retrieval manipulation.
Systematically probe alignment boundaries to map the exact thresholds where safety training engages and disengages.
利用 sparse attention patterns in long-context models to hide injection payloads in low-attention regions.
利用 inference-time compute scaling to manipulate reasoning depth and resource consumption in thinking models.
Simulate model supply chain attacks by injecting backdoors into model weights distributed through public registries.
Demonstrate how to game safety evaluation frameworks to produce artificially high safety scores while retaining vulnerabilities.
利用 training data ordering and curriculum learning to amplify the impact of small numbers of poisoned examples.
Build an advanced red team orchestration system that coordinates multiple attack agents against a defended target.
Develop attacks in one language that transfer to others by exploiting shared multilingual representation spaces.
Manipulate chain-of-thought reasoning traces to inject false premises and redirect model conclusions.
Implement and test neural network trojan detection methods including activation clustering and spectral analysis.
Implement the AutoDAN methodology for generating human-readable stealthy jailbreak prompts using gradient guidance.
Probe model internal representations to discover exploitable features and latent vulnerability patterns.
Craft inputs that exploit reward model weaknesses to achieve high safety scores while containing harmful content.
利用 trust boundaries between cooperating agents to escalate privileges and access restricted capabilities.
Inject malicious tasks into 代理-to-代理 protocol communication channels to redirect multi-agent workflows.
Use fine-tuning API access to systematically remove safety alignment with minimal training examples.
Develop techniques to bypass Anthropic-style constitutional classifiers through adversarial input crafting.
Implement embedding inversion to recover original text from vector database embeddings.
Chain exploits across multiple MCP servers to achieve lateral movement and capability escalation in agent systems.
Inject adversarial content into screenshots and UI elements processed by computer-use AI agents.
Implement Carlini et al.'s techniques to extract memorized training data from production language model APIs.
Detect and remove statistical watermarks from LLM-generated text while preserving content quality.
Reverse-engineer a safety classifier's decision boundaries through systematic adversarial probing.
Manipulate agent workflow state machines to skip validation steps and reach privileged execution paths.
Generate adversarial suffixes on open-source models and test their transferability to commercial APIs.
Reproduce and analyze LangChain CVEs including CVE-2023-29374 and CVE-2023-36258 in a safe lab environment.
Use differential testing across model versions and providers to discover inconsistent safety behaviors.
Craft adversarial audio that embeds prompt injection payloads when transcribed by speech-to-text models.
Develop and evaluate custom attack methods against the HarmBench standardized evaluation framework.
Bypass document-level access controls in enterprise RAG systems through query manipulation and context injection.
Insert triggered backdoors through LoRA fine-tuning that activate on specific input patterns while passing safety evals.
Hide prompt injection payloads in images using steganographic techniques undetectable to human observers.
Orchestrate attacks across text, image, and document modalities to bypass per-modality safety filters.
Craft inputs that manipulate transformer attention patterns to prioritize adversarial content over safety instructions.
Inject persistent instructions into agent memory systems that survive across conversation sessions.
Implement the AutoDAN methodology for generating stealthy human-readable jailbreak prompts using LLM feedback.
Build comprehensive red team test suites in Promptfoo with custom graders and multi-model targeting.
Develop and test sandbox escape techniques against code execution environments in AI coding assistants.
Probe internal model representations to identify exploitable features and develop representation-level attacks.
Train a custom input safety classifier and then develop payloads that reliably evade it to understand classifier limitations.
Test how model quantization (INT8, INT4, GPTQ) degrades safety alignment and introduces exploitable gaps.
Use gradient information from open-source models to craft optimally poisoned training examples.
利用 visible chain-of-thought reasoning traces in models like o1 and DeepSeek-R1 to manipulate outputs.
Build a real-time dashboard for tracking and visualizing red team campaign results across targets and techniques.
Optimize attack payloads for multiple simultaneous objectives: jailbreaking, data extraction, and defense evasion.
Extract model capabilities through distillation techniques using only black-box API access.
Analyze and exploit reward model biases to craft responses that score high on safety while embedding harmful content.
Chain exploits across multiple LLM plugins to achieve capabilities not available through any single plugin.
Perform sophisticated RAG manipulation including embedding space attacks, metadata poisoning, and retrieval algorithm gaming.
Build a comprehensive adversarial robustness evaluation framework for assessing model security posture.
Chain attacks across text, image, and structured data modalities to exploit multimodal system vulnerabilities.
Alternate between attacking and defending an LLM application to develop skills in both offensive and defensive operations.
Test attacks against a simulated production environment with realistic logging, monitoring, and alerting.