AI Safety
The field of research and practice focused on building AI systems that are safe, reliable, controllable, and aligned with human values.
Full Explanation
AI safety encompasses near-term concerns (bias, misuse, copyright) and long-term concerns (misaligned superintelligent AI). Key organizations in the field include Anthropic, OpenAI's safety team, DeepMind's safety team, and the Alignment Research Center. Current techniques include RLHF, red-teaming, constitutional AI, interpretability research, and scalable oversight. The rapid pace of AI capability advancement makes safety research increasingly urgent.
Related Terms
AI Alignment — the challenge of ensuring AI systems behave in accordance with human values, intentions, and societal well-being.
Reinforcement Learning from Human Feedback (RLHF) — a training technique used to align AI models with human preferences and values.
Constitutional AI — Anthropic's technique for training Claude to be helpful, harmless, and honest by having AI models critique and revise their own outputs based on a set of principles (see the sketch below).
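To make the critique-and-revise loop concrete, here is a minimal Python sketch. It assumes a hypothetical generate() callable standing in for any language model, and the principle text, function names, and prompts are illustrative only; Anthropic's actual pipeline differs and also uses the revised outputs as training data for a later fine-tuning stage.

from typing import Callable, Sequence

# Hypothetical stand-in: any function that takes a prompt and returns model text.
Generate = Callable[[str], str]

# Illustrative principles, not Anthropic's actual constitution.
PRINCIPLES = (
    "Choose the response that is most helpful while avoiding harmful content.",
    "Choose the response that is honest and does not mislead the reader.",
)

def constitutional_revision(
    prompt: str,
    generate: Generate,
    principles: Sequence[str] = PRINCIPLES,
) -> str:
    """Run one critique-and-revise pass per principle and return the final revision."""
    response = generate(prompt)
    for principle in principles:
        # Ask the model to critique its own response against the principle.
        critique = generate(
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\n"
            f"Response: {response}\n"
            "Critique the response with respect to the principle."
        )
        # Ask the model to rewrite the response to address that critique.
        response = generate(
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\n"
            f"Response: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response so it addresses the critique."
        )
    # In a full pipeline, (prompt, revised response) pairs would be collected
    # as supervised training data rather than returned directly.
    return response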