Robust AI Safety Frameworks

EasyChair Preprint 13599

12 pages • Date: June 7, 2024

Abstract

As artificial intelligence (AI) systems grow more advanced and capable, ensuring their safe and reliable operation has become a critical challenge. Robust AI Safety Frameworks aim to address this challenge by establishing principles, techniques, and governance structures that align AI systems with human values and preferences, make them more robust against unintended behaviors and negative outcomes, and enhance their transparency and interpretability.

Key principles of Robust AI Safety Frameworks include AI value alignment, in which systems are designed to reliably pursue intended goals that are well aligned with human interests; AI robustness and stability, which involves techniques that make AI systems more resistant to reward hacking, distributional shift, and other failure modes; and AI transparency and interpretability, which enables a better understanding of how AI systems make decisions and behave.

Keyphrases: Robust AI Safety Frameworks, corrigibility, robustness, transparency, value alignment

Links:

https://easychair.org/publications/preprint/F356

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:13599,
  author    = {Edwin Frank},
  title     = {Robust AI Safety Frameworks},
  howpublished = {EasyChair Preprint 13599},
  year      = {EasyChair, 2024}}
