Robust AI Safety Frameworks
EasyChair Preprint 13598
12 pages • Date: June 7, 2024

Abstract

As artificial intelligence (AI) systems become increasingly advanced and capable, ensuring their safe and reliable operation has become a critical challenge. Robust AI Safety Frameworks address this challenge by establishing principles, techniques, and governance structures that align AI systems with human values and preferences, make them more robust against unintended behaviors and negative outcomes, and enhance their transparency and interpretability. Key principles include: AI value alignment, where systems are designed to reliably pursue intended goals that are well aligned with human interests; AI robustness and stability, which involves techniques to make AI systems more resistant to reward hacking, distributional shift, and other failure modes; and AI transparency and interpretability, which enables a better understanding of how AI systems make decisions and behave.

Keyphrases: Robust AI Safety Frameworks, corrigibility, robustness, transparency, value alignment