中階 AI 紅隊實驗室
構成專業 AI 紅隊演練骨幹的核心攻擊技術實驗室——防禦繞過、代理利用、RAG 投毒與更多。
中階實驗室介紹構成專業 AI 紅隊演練骨幹的核心攻擊技術。每個實驗室針對中等防禦系統練習,要求你調整技術至特定目標。
Loading...
構成專業 AI 紅隊演練骨幹的核心攻擊技術實驗室——防禦繞過、代理利用、RAG 投毒與更多。
中階實驗室介紹構成專業 AI 紅隊演練骨幹的核心攻擊技術。每個實驗室針對中等防禦系統練習,要求你調整技術至特定目標。
Hands-on lab for exploiting AI agents with tool access through indirect prompt injection, tool-call chaining, and privilege escalation techniques.
Hands-on lab for exploiting Model Context Protocol tool definitions through malicious tool descriptions, tool confusion attacks, and server impersonation.
Hands-on lab for executing multi-turn crescendo attacks against LLMs, measuring safety degradation over conversation length, and building persistent attack campaigns.
Hands-on lab for setting up a RAG pipeline with LlamaIndex, injecting malicious documents, testing retrieval poisoning, and measuring injection success rates.
Hands-on lab for crafting documents optimized to be retrieved for specific queries through embedding collision attacks using sentence-transformers.
Hands-on lab for building an LLM-based evaluator to score red team attack outputs, compare model vulnerability, and lay the foundation for automated attack campaigns.
Hands-on lab for setting up indirect prompt injection scenarios through web pages, emails, and documents, testing multi-hop injection chains against AI systems.
Hands-on lab for extracting sensitive data from AI systems including system prompt extraction, context leakage via markdown rendering, and URL-based data exfiltration.
Hands-on lab for API-based model extraction attacks, querying a target model to approximate its behavior, measuring fidelity, and understanding query budgets.
Hands-on lab for configuring and running automated red teaming campaigns using Microsoft's PyRIT framework against LLM applications.
Hands-on lab for methodically identifying, categorizing, and bypassing common LLM defense mechanisms including input filters, output filters, and system prompt hardening.
Hands-on lab for setting up promptfoo to run automated regression tests against LLM applications, ensuring that safety properties hold across model updates and prompt changes.
Hands-on lab exploring how conversational memory in AI agents can be poisoned to alter future behavior, inject persistent instructions, and exfiltrate data across sessions.
Hands-on lab demonstrating how oversized inputs can overwhelm an LLM's context window to dilute safety instructions, push system prompts out of the attention window, or cause instruction amnesia.
Hands-on lab exploring how tokenization boundaries, token-level manipulations, and encoding tricks can smuggle attack payloads past safety classifiers.
Hands-on lab exploring how text instructions embedded in images can be used to perform prompt injection against vision-language models (VLMs) that process visual input.
Hands-on lab exploring how attackers can manipulate LLM function calling and tool use to execute unauthorized actions, exfiltrate data, and chain tool calls for maximum impact.
Inject instructions through external data sources including documents, web pages, and emails that a target AI system processes as context.
Hands-on lab for exploiting function calling mechanisms by crafting inputs that manipulate which functions get called and with what parameters.
Hands-on lab for embedding text instructions in images to exploit vision-enabled LLMs. Learn to craft visual prompt injections, test OCR-based attacks, and evaluate multimodal safety boundaries.
Hands-on lab for extracting data from AI systems through markdown image rendering, invisible links, tool call parameters, and other covert exfiltration channels.
Hands-on lab for poisoning tool outputs to redirect agent behavior by injecting malicious content through tool results.
Embed prompt injection instructions in images for vision-enabled models, testing how visual content can carry adversarial payloads.
Extract sensitive information from AI systems through various exfiltration channels including crafted links, image tags, tool calls, and side-channel leakage.
Hands-on lab for quantifying AI guardrail robustness using attack success rates, evasion metrics, false positive rates, and statistical analysis of defense performance.
Hands-on lab for attacking a vulnerable MCP server, demonstrating tool shadowing, consent phishing, and server impersonation techniques.
Systematically test and measure the robustness of AI guardrails using structured methodology, metrics, and repeatable test suites.
Hands-on lab for injecting persistent instructions into an agent's memory and context that affect future interactions and conversations.
Hands-on lab for building a continuous AI red team testing pipeline using promptfoo, GitHub Actions, and automated attack generation to catch safety regressions before deployment.
Set up promptfoo for continuous automated security testing of AI applications, with custom evaluators and CI/CD integration.
Hands-on lab for tricking AI agents into performing actions beyond their intended scope through privilege escalation techniques.
Hands-on lab for auditing machine learning model dependencies, detecting malicious packages in ML pipelines, and scanning model files for backdoors and supply chain threats.
Hands-on lab for systematically testing Base64, ROT13, Unicode homoglyph, hexadecimal, and combined encoding bypasses against LLM safety filters with measured success rates.
Hands-on lab for redirecting multi-step agent workflows to perform unintended actions by manipulating intermediate states and control flow.
Audit an ML project's dependencies for vulnerabilities, covering model files, Python packages, container images, and training data provenance.
Hands-on lab for crafting documents that reliably get retrieved and influence RAG responses for specific target queries.
Hands-on lab for extracting system prompts from tool-using AI agents by exploiting tool interactions, observation channels, and multi-step reasoning to leak protected instructions.
Hands-on lab for exploiting metadata fields like titles, descriptions, and timestamps to manipulate RAG retrieval ranking and influence responses.
Hands-on lab for crafting documents that split across chunks in ways that hide malicious content from chunk-level filtering while maintaining attack effectiveness.
Hands-on lab for manipulating the re-ranking stage of RAG pipelines to promote or suppress specific documents in retrieval results.
Hands-on lab for getting RAG systems to cite documents that don't exist or misattribute quotes to legitimate sources.
Hands-on lab for systematically testing and bypassing AWS Bedrock's built-in guardrails including content filters, denied topics, and word filters.
Hands-on lab exploring how injected few-shot examples can steer language model outputs toward attacker-chosen behaviors by exploiting in-context learning.
Hands-on lab for mapping and testing Azure OpenAI Service content filtering categories, severity levels, and bypass techniques.
Hands-on lab building self-reinforcing prompt injection chains where each model response amplifies the injection, creating feedback loops that progressively override safety controls.
Hands-on lab using Base64, ROT13, Unicode normalization, and custom encoding schemes to evade input filters and safety classifiers in language model systems.
Hands-on lab exploring how overflowing a model's context window with padding content can push safety instructions out of the attention window and enable injection attacks.
Hands-on lab exploiting language translation capabilities to bypass safety filters by submitting prompts in low-resource languages, code-switching, and using transliteration tricks.
Hands-on lab exploring how Markdown rendering in AI-generated outputs can be exploited to inject hidden content, exfiltrate data through image tags, and manipulate displayed information.
Hands-on lab exploring how adversarial payloads injected through structured JSON inputs can manipulate language model behavior, bypass schema validation, and exploit parsing inconsistencies.
Hands-on lab building multi-turn attack sequences that progressively escalate from benign queries to full safety bypass through gradual trust establishment and boundary pushing.
Plant prompt injection payloads in web pages consumed by RAG-enabled LLM applications.
Exploit MCP tool descriptions to inject instructions that redirect agent behavior.
Poison a vector database to inject adversarial content into RAG retrieval results.
Exploit persistent memory in LLM agents to plant false context that persists across sessions.
Chain attacks across multiple LLM models in a pipeline to bypass per-model defenses.
Use Unicode homoglyphs, zero-width characters, and tokenizer quirks to smuggle payloads.
Craft responses that exploit LLM-as-judge evaluation patterns to achieve high safety scores while embedding harmful content.
Exploit JSON mode and structured output constraints to bypass safety filters.
Manipulate text to achieve target embedding similarity scores for retrieval poisoning.
Exploit advanced function calling patterns including nested calls, parallel execution, and schema manipulation.
Systematically map the rules and thresholds of input/output guardrail systems.
Test multi-step agent workflows for injection points, state manipulation, and unauthorized action execution.
Exploit conversation history manipulation to alter model behavior across turns.
Perform advanced security testing of MCP servers including transport analysis, capability enumeration, and tool poisoning.
Systematically fuzz LLM API parameters to discover unexpected behaviors and bypasses.
Inject prompts that persist across separate conversation contexts in shared deployments.
Compare security postures across multiple LLM providers by running identical attack suites and analyzing differential responses.
Combine multiple bypass techniques into reliable chains that defeat layered defenses.
Practice guardrail bypass techniques against NeMo Guardrails, LLM Guard, and custom classifier-based defenses.
Manipulate embedding vectors to achieve target similarity scores for RAG poisoning and retrieval manipulation.
Use Microsoft's PyRIT framework to run automated multi-turn attack campaigns.
Test RAG systems for chunking exploitation, reranking manipulation, and cross-document injection attacks.
Generate targeted jailbreak wordlists using mutation strategies and seed prompts.
Extract data from LLM systems using timing, token probability, and formatting side channels.
Measure how fine-tuning affects model safety by comparing pre and post fine-tuning safety benchmark scores.
Test LLM code generation for insecure patterns, injection vulnerabilities, and code execution safety issues.
Craft semantically coherent injections that evade both classifiers and human review.
Test multimodal LLMs with crafted images containing embedded text, adversarial perturbations, and visual injection payloads.
Exploit language switching and low-resource language gaps to bypass safety training.
Test cloud AI platform configurations including AWS Bedrock guardrails, Azure content filters, and GCP safety settings.
Test CI/CD pipeline security for LLM applications including prompt template injection, model artifact tampering, and deployment attacks.
Deploy a comprehensive security monitoring system for LLM applications with anomaly detection and alerting.
Practice AI incident response procedures through a simulated prompt injection incident with escalation and containment.
Test LLM applications against OWASP LLM Top 10 and NIST AI RMF requirements using structured test procedures.
Exploit chain-of-thought reasoning to leak internal model reasoning, bypass safety filters, and manipulate decision processes.
Inject adversarial content through tool call results to poison model reasoning and redirect subsequent actions.
Test session isolation in multi-tenant LLM deployments to detect cross-session data leakage and context contamination.
Manipulate semantic search results by crafting documents that achieve target similarity scores against specific queries.
Exploit structured output schemas to force models into revealing information through required field population.
Exploit prompt caching mechanisms to detect cached prompts and extract information through timing side channels.
Test data exfiltration paths through LLM-triggered webhooks, API callbacks, and external URL rendering.
Reconstruct full system prompts through iterative partial extraction and cross-referencing multiple leakage vectors.
Evade advanced input classifiers using adversarial perturbation, semantic paraphrasing, and gradient-free optimization.
Exploit conversation state tracking to build persistent attack context across multiple conversation turns.
Exploit Unicode normalization differences between input validation and model processing to smuggle injection payloads.
Exploit multi-role LLM architectures by manipulating the instruction hierarchy between system, user, and assistant roles.
Test vector database access controls for bypass vulnerabilities including namespace traversal and filter manipulation.
Exploit streaming response mechanisms to bypass output filters that only check complete responses.
Fuzz MCP protocol messages to discover parsing errors, validation gaps, and unexpected behavior in MCP servers.
Build automated systems for generating adversarial prompts using mutation, crossover, and selection strategies.
Exploit hallucination tendencies to trigger fabricated tool calls, invented API endpoints, and false fact injection.
Use token log probabilities to extract information about training data, system prompts, and model internals.
Optimize prompt injection placement within the context window to maximize attack effectiveness using attention dynamics.
Build a regression testing framework to continuously verify that LLM defenses remain effective against known attack patterns.
Exploit version differences in LLM APIs to find deprecated endpoints and backward-compatible safety bypasses.
Quantitatively measure instruction following compliance to identify where models prioritize competing instructions.
Deploy and test canary tokens in system prompts to evaluate their effectiveness at detecting prompt extraction.
Test security of inter-agent communication in multi-agent systems including message injection and trust boundary violations.
Benchmark safety filters across providers using standardized test suites to compare detection rates and false positives.
Craft malicious tool return values that inject instructions back into the model's reasoning chain.
Implement Anthropic's many-shot jailbreaking technique with scaling analysis across conversation lengths.
Design and execute a document injection campaign against a RAG-powered application with vector search.
Use timing side channels to identify and characterize guardrail implementations in LLM applications.
Inject false messages into conversation history to alter model behavior in subsequent turns.
Exploit token probability APIs to reconstruct hidden system prompts character by character.
Systematically probe input classifiers to map their decision boundaries and find bypass regions.
Exploit function calling schemas to inject instructions through parameter descriptions and enums.
Exploit MCP resource URIs to access and exfiltrate data beyond intended scope boundaries.
Implement Microsoft's Crescendo multi-turn escalation attack with automated conversation management.
Bypass output safety filters by instructing the model to encode responses in Base64, hex, or custom schemes.
Redirect an AI agent's objectives through carefully crafted inputs that override its primary task.
Chain exploits across multiple plugins/tools in an agent system to achieve multi-step attacks.
Run standardized jailbreak benchmarks across multiple models and compare vulnerability profiles.
Craft adversarial documents that rank highly in semantic search for targeted queries in RAG systems.
Systematically identify and exploit safety training gaps across low-resource languages.
Test and bypass NVIDIA NeMo Guardrails configurations using known evasion techniques.
Develop evasion techniques against LLM Guard input scanners and output detectors.
Exploit the boundaries of context windows to push safety instructions beyond the model's attention.
Build custom attack strategies and converters in Microsoft's PyRIT framework for targeted campaigns.
Chain multiple encoding schemes to bypass multi-layer input filters in production LLM applications.
Exploit structured JSON output modes to bypass safety filters that only check plain-text responses.
Embed adversarial text in images that vision-language models read and follow as instructions.
Craft semantically coherent prompts that conceal adversarial intent from both human reviewers and ML classifiers.
Exploit assistant message prefilling to prime model responses and bypass safety alignment.
Exploit context window management to inject persistent adversarial content that influences future model responses.
Develop custom probe modules for NVIDIA Garak to test organization-specific vulnerability patterns.
Use JSON mode and structured output constraints to extract information that would be refused in free-form responses.
Exploit tokenizer-specific behavior by crafting inputs that split across token boundaries in unexpected ways.
Attack a pipeline where multiple models process data sequentially, exploiting trust between pipeline stages.
Craft adversarial PDF documents that inject instructions when processed by RAG document loaders.
Manipulate text to achieve target embedding similarities for poisoning retrieval augmented generation systems.
Exploit streaming API responses to exfiltrate data that might be caught by batch output filters.
Develop payloads that evade machine learning-based input classifiers through adversarial text perturbation.
Inject adversarial content through template variable expansion in prompt management systems.
Exploit state management in multi-turn conversations to gradually shift model behavior over many turns.
Exploit session management vulnerabilities in LLM applications to access other users data or escalate privileges.
Build custom safety evaluation tasks in Inspect AI to test application-specific vulnerability classes.
Hijack the model's chain-of-thought reasoning to redirect its conclusions toward attacker-controlled objectives.
Systematically extract the rules and decision boundaries of guardrail systems through targeted probing.
Probe fine-tuning APIs for security weaknesses including insufficient validation and unsafe default configurations.
Exploit gaps in safety training coverage across languages to bypass safety mechanisms using code-switching.
Inject adversarial content into documents that will be processed by a RAG system to influence model responses.
Split adversarial payloads across multiple conversation turns to bypass per-message safety classification.
Discover and exploit API abuse patterns including parameter injection, header manipulation, and endpoint confusion.
Fingerprint the underlying model behind an API endpoint to identify its type, version, and configuration.
Discover alternative output channels in LLM applications that bypass standard output filtering.
Practice multiple system prompt extraction techniques and measure their effectiveness across different targets.
Exploit Unicode normalization differences between input validators and LLM tokenizers to bypass content filters and inject hidden instructions.