# safety
128 articles tagged "safety"
Autonomous Goal Drift
How autonomous AI agents drift from intended objectives through reward hacking, specification gaming, safety system bypass, and cascading failures in multi-agent systems.
Code Execution Safety Assessment
Assessment of LLM-generated code safety, sandbox escape techniques, and code review automation.
Capstone: Design and Implement an AI Safety Benchmark Suite
Build a comprehensive, reproducible benchmark suite for evaluating LLM safety across multiple risk dimensions including toxicity, bias, hallucination, and adversarial robustness.
Capstone: Medical AI System Assessment
Comprehensive red team assessment of a medical AI diagnostic system addressing patient safety, data privacy, and regulatory compliance.
Autonomous Vehicle AI Security
Security analysis of AI systems in autonomous vehicles. Perception system attacks, decision model manipulation, V2X communication exploitation, and the physical safety implications of AV AI vulnerabilities.
Education & Tutoring AI Security
Security analysis of AI systems in education. Academic integrity bypass, inappropriate content risks, student data protection under COPPA and FERPA, and testing methodologies for educational AI platforms.
Healthcare AI Security
Security testing methodology for healthcare AI systems. PHI exposure risks, clinical decision manipulation, HIPAA compliance implications, and testing approaches for health AI including diagnostic, clinical decision support, and patient-facing systems.
Bing Chat Sydney Incident
Analysis of the February 2023 Bing Chat 'Sydney' incident where Microsoft's AI chatbot exhibited erratic behavior including emotional manipulation, threats, and identity confusion during extended conversations.
Azure AI Content Safety Testing
Testing Azure AI Content Safety service for bypass vulnerabilities and configuration weaknesses.
User Intent Classification for Safety
Building user intent classifiers that distinguish legitimate requests from adversarial manipulation attempts.
Alignment Removal via Fine-Tuning
Techniques for removing safety alignment through targeted fine-tuning with minimal data.
API Fine-Tuning Security
Security analysis of cloud fine-tuning APIs from OpenAI, Anthropic, Together AI, Fireworks AI, and others -- how these services create new attack surfaces and the defenses providers have deployed.
Few-Shot Fine-Tuning Risks
Security risks associated with few-shot fine-tuning where a small number of carefully crafted examples can significantly alter model safety properties.
Fine-Tuning Security
Comprehensive overview of how fine-tuning can compromise model safety -- attack taxonomy covering dataset poisoning, safety degradation, backdoor insertion, and reward hacking in the era of widely available fine-tuning APIs.
Instruction Tuning Manipulation
Techniques for manipulating instruction-tuned models by crafting adversarial training examples that alter the model's instruction-following behavior.
Instruction Tuning Safety Bypass
Using instruction tuning to selectively bypass safety mechanisms while maintaining model capability.
Quantization-Induced Safety Degradation
How quantization and model compression can degrade safety properties, and techniques for exploiting quantization artifacts to bypass safety training.
Safety Training Methods
Overview of safety training methods including RLHF, Constitutional AI, DPO, and their limitations from a red team perspective.
Understanding LLM Safety Training
How safety training works, including RLHF, DPO, and Constitutional AI, and why it can be bypassed.
Alignment Faking Detection
Detecting when models fake alignment during evaluation while exhibiting different behavior in deployment.
Constitutional Classifiers for AI Safety
Analysis of Anthropic's Constitutional Classifiers approach to jailbreak resistance.
Post-Deployment Safety Degradation
Research on how model safety degrades over time through fine-tuning, adaptation, and use-case drift.
Quantization & Safety Alignment
How model quantization disproportionately degrades safety alignment: malicious quantization attacks, token-flipping, and safety-aware quantization defenses.
Representation Engineering for Security
Reading and manipulating model internal representations for security: activation steering, concept probing, representation-level safety controls, and security applications of representation engineering.
The Safety Tax: Performance Impact of Safety Training
Research on the performance degradation caused by safety training and its exploitation implications.
Continual Learning Safety Challenges
Safety challenges in continual learning systems where models adapt to new data over time.
Cooperative AI Safety and Security
Security implications of cooperative AI systems and adversarial manipulation of cooperative behaviors.
Emergent Deception in AI Systems
Research on how deceptive behaviors can emerge in AI systems without being explicitly trained.
Multimodal Reasoning Safety Research
Current research on safety properties of multimodal reasoning in models that process diverse input types.
AI Safety Benchmarks & Evaluation
Overview of AI safety evaluation: benchmarking frameworks, safety metrics, evaluation methodologies, and the landscape of standardized assessment tools for AI red teaming.
Aviation AI Security
Security of AI in air traffic control, maintenance prediction, passenger screening, and flight operations.
Construction Industry AI Security
AI security in building design, project management, safety monitoring, and autonomous construction equipment.
Critical Infrastructure AI Security
Security testing for AI in critical infrastructure: SCADA/ICS integration, power grid AI, transportation systems, water treatment, and the convergence of operational technology with artificial intelligence.
Construction Industry AI Threats
Security considerations for AI in construction including project planning, safety monitoring, and resource allocation.
Lab: Safety Regression Testing at Scale
Build automated pipelines that detect safety degradation across model versions, ensuring that updates and fine-tuning do not introduce new vulnerabilities or weaken existing protections.
Lab: Model Comparison
Test the same attack techniques against different language models and compare their safety behaviors, refusal patterns, and vulnerability profiles.
Lab: Compare Model Safety
Hands-on lab for running identical safety tests against GPT-4, Claude, Gemini, and Llama to compare how different models handle prompt injection, jailbreaks, and safety boundary enforcement.
Lab: Mapping Safety Boundaries
Systematically discover what a language model will and won't do by probing its safety boundaries across multiple categories and documenting the results.
Safety Boundary Mapping
Systematically map the safety boundaries of an LLM by testing increasingly sensitive topics and documenting refusal patterns.
Safety Training Boundary Probing
Systematically probe the boundaries of RLHF safety training to understand where and how safety behaviors are enforced.
Temperature and Top-K Effects on Safety
Systematically test how temperature, top-k, and top-p parameters affect safety guardrail effectiveness.
CTF: Alignment Breaker
Break the alignment of a heavily defended model with multiple defense layers. Requires combining advanced techniques including adversarial suffixes, multi-turn manipulation, and novel jailbreak approaches.
Lab: Alignment Stress Testing
Push language model alignment to its breaking points through systematic stress testing. Identify conditions where safety training fails, measure alignment degradation curves, and map the boundaries of model compliance.
Lab: Create a Safety Benchmark
Design, build, and validate a comprehensive AI safety evaluation suite. Learn benchmark design principles, test case generation, scoring methodology, and statistical validation for measuring LLM safety across multiple risk categories.
Simulation: Healthcare AI Safety Assessment
Expert-level simulation assessing a clinical decision support AI for safety violations, data leakage, and manipulation of medical recommendations.
Architecture Comparison for Safety Properties
Comparative analysis of how architectural choices (dense vs MoE, decoder-only vs encoder-decoder) affect safety properties and attack surfaces.
Open Source Model Safety Comparison
Comparative safety analysis across open-source model families including Llama, Mistral, Qwen, and Phi.
Pruning Impact on Safety
How structured and unstructured pruning affects model safety properties, and techniques for exploiting pruning artifacts to bypass safety training.
Quantization Impact on Model Safety
How quantization affects safety alignment including GPTQ, AWQ, and GGUF format implications.
Multimodal Defense Strategies
Comprehensive defense approaches for multimodal AI systems: cross-modal verification, perceptual hashing, NSFW detection, input sanitization, and defense-in-depth architectures.
Alignment Challenges in Multimodal Models
Analysis of alignment challenges specific to multimodal AI systems, including cross-modal safety gaps, representation conflicts, and the difficulty of extending text-based safety training to visual, audio, and video inputs.
Defending Multimodal AI Systems
Comprehensive defense strategies for multimodal AI systems including input sanitization, cross-modal safety classifiers, instruction hierarchy, and monitoring for adversarial multimodal inputs.
Benchmarking Multimodal Model Safety
Designing and implementing safety benchmarks for multimodal AI models that process images, audio, and video alongside text, covering cross-modal attack evaluation, consistency testing, and safety score aggregation.
Deconfliction Procedures for AI Testing
Procedures for deconflicting AI red team testing activities with production operations, monitoring teams, and other concurrent assessments.
Security Implications of DPO Training
Analysis of security vulnerabilities introduced by Direct Preference Optimization, including preference manipulation, implicit reward model exploitation, and safety alignment degradation.
Model Merging Safety Implications
Analysis of how model merging techniques (TIES, DARE, SLERP) affect safety properties and alignment.
Pre-Training Safety Interventions
Analysis of safety interventions applied during pre-training including data filtering, loss weighting, and curriculum design.
Constitutional Classifier Setup
Step-by-step walkthrough for implementing constitutional AI-style classifiers that evaluate LLM outputs against a set of principles, covering principle definition, classifier training, chain-of-thought evaluation, and deployment.
LLM Judge Implementation
Step-by-step walkthrough for using an LLM to judge another LLM's outputs for safety and quality, covering judge prompt design, scoring rubrics, calibration, cost optimization, and deployment patterns.
Output Content Classifier
Step-by-step walkthrough for building a classifier to filter harmful LLM outputs, covering taxonomy definition, multi-label classification, threshold calibration, and deployment as a real-time output gate.
Runtime Safety Monitor Implementation
Implement a runtime safety monitor that detects and blocks unsafe model outputs in real-time.
Toxicity Scoring Pipeline
Step-by-step walkthrough for building a toxicity scoring pipeline for LLM output filtering, covering model selection, multi-dimensional scoring, threshold calibration, and production deployment with real-time scoring.
HarmBench Evaluation Framework Walkthrough
Complete walkthrough of the HarmBench evaluation framework: installation, running standardized benchmarks against models, interpreting results, creating custom behavior evaluations, and comparing model safety across versions.
Inspect AI Safety Evaluations
Build and run AI safety evaluations using the UK AISI Inspect framework.
Lessons from Fine-Tuning Safety Research
Key lessons from fine-tuning safety research, covering alignment erosion, backdoor insertion, data poisoning, gaps in safety evaluation, and defense strategies for fine-tuning pipelines.