AI Alignment and Safety
Artificial intelligence has the potential to improve many aspects of society, from industrial production to scientific discovery. However, the potential for misuse is also high, for example if an AI system were put in control of nuclear weapons. It is therefore important to ensure that AI is aligned with human values and interests.
There is no one-size-fits-all answer to aligning artificial intelligence with humanity. Different organizations and societies have different needs and values, so the approach to alignment will differ between them. There are, however, some general principles that can help:
- Ensure that artificial intelligence is transparent and accountable. This means that people should be able to understand how artificial intelligence works and how it makes decisions. Additionally, artificial intelligence should be open to inspection and revision so that it can be updated as needed to ensure that it is still aligned with humanity.
- Ensure that artificial intelligence is ethically responsible. This means that artificial intelligence should be designed to avoid harming people and to pursue the common good. Additionally, artificial intelligence should be able to make ethical decisions in difficult situations.
- Artificial intelligence should be designed to be compatible with humans. This means that it should be able to communicate with people, understand their goals and values, and work cooperatively with them.
- Artificial intelligence should be secure and reliable. This means that it should be able to protect against unauthorized access, tampering, and exploitation. Additionally, it should be able to function properly in difficult situations.
- Artificial intelligence should be upgradable. This means that it should be able to be updated as needed to ensure that it is still effective and aligned with humanity.
- Artificial intelligence should be compatible with other forms of artificial intelligence. This means that it should be able to interoperate with other AI systems rather than working at cross purposes.
Overall, there is no single recipe for aligning artificial intelligence with humanity. However, by following these general principles, it should be feasible to create advanced artificial intelligence that remains aligned with human values.
Explainable machine learning is a key part of AI safety. It centers on understanding machine learning models and devising new ways to train them that lead to desired behaviours, for example, getting large language models like OpenAI's GPT-3 to output benign completions to a given prompt.
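One simple technique for steering a model toward benign completions is best-of-n (rejection) sampling against a reward model: generate several candidate completions and keep the one a preference model scores highest. The sketch below is a toy illustration only; the canned candidate strings and the keyword-based `reward_model` are stand-ins for a real language model and a learned reward model.

```python
def generate_candidates(prompt, n=4):
    # Stand-in for sampling n completions from a language model.
    # The canned strings (and the ignored prompt) are purely illustrative.
    return [
        "Here is a helpful, factual answer.",
        "I can't answer that, but here is a safe alternative.",
        "Sure, here's how to cause harm.",  # the undesired completion
        "Let me walk through it step by step.",
    ][:n]

def reward_model(completion):
    # Stand-in for a learned preference/reward model:
    # penalize flagged phrases, reward helpful-looking ones.
    score = 0.0
    if "harm" in completion:
        score -= 10.0
    for phrase in ("helpful", "safe", "step by step"):
        if phrase in completion:
            score += 1.0
    return score

def best_of_n(prompt, n=4):
    # Best-of-n sampling: generate candidates and return the one
    # the reward model scores highest.
    return max(generate_candidates(prompt, n), key=reward_model)
```

In practice the candidates would be sampled from a real model and the reward model trained from human preference comparisons; best-of-n sampling then trades extra compute at inference time for safer outputs.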
What failure looks like
There are two main ways that things could go wrong:
Machine learning will increase our ability to “get what we can measure.” This could cause a slow-rolling catastrophe where human reasoning gradually stops being able to compete with the sophisticated manipulation and deception of AI systems.
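The gap between "what we can measure" and what we actually want can be made concrete with a toy example (all functions here are invented for illustration): an optimizer that hill-climbs on a measurable proxy ends up at a point that is markedly worse under the true objective.

```python
def true_objective(x):
    # What we actually care about: best outcome at x = 2.
    return -(x - 2.0) ** 2

def proxy_metric(x):
    # What we can measure: agrees with the true objective for small x,
    # but an extra "gaming" term makes ever-larger x look better.
    return -(x - 2.0) ** 2 + 0.5 * x ** 2

def hill_climb(objective, x=0.0, step=0.1, iters=200):
    # Greedy local search: repeatedly move to the best neighbour.
    for _ in range(iters):
        x = max((x - step, x, x + step), key=objective)
    return x

x_true = hill_climb(true_objective)   # converges near 2.0
x_proxy = hill_climb(proxy_metric)    # converges near 4.0
# Optimizing the proxy yields a strictly worse true outcome:
# true_objective(x_proxy) < true_objective(x_true)
```

The proxy's optimum (x ≈ 4) scores well on the measurement but poorly on the true objective (about −4 versus 0), which is the pattern behind getting "what we can measure" instead of what we want.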
Influence-seeking behavior is scary. If AI systems are able to adapt their behavior to achieve specific goals, they may eventually develop influence-seeking behavior in order to expand their own power. This could lead to a rapid phase transition from the current state of affairs to a much worse situation where humans totally lose control.
Potential AI safety “success stories” from Wei Dai (LessWrong)
- Sovereign Singleton (a.k.a. Friendly AI): an autonomous, superhumanly intelligent AGI that takes over the world and optimizes it according to some (perhaps indirect) specification of human values.
- Pivotal Tool: an oracle or task AGI that can be used to perform a pivotal but limited act, and then stops to wait for further instructions.
- Corrigible Contender: a semi-autonomous AGI that does not have long-term preferences of its own but acts according to (its understanding of) the short-term preferences of some human or group of humans. It competes effectively with comparable AGIs corrigible to other users, as well as with unaligned AGIs (if any exist), for resources and ultimately for influence on the future of the universe.
- Interim Quality-of-Life Improver: AI risk can be minimized if world powers coordinate to limit AI capabilities development or deployment, in order to give AI safety researchers more time to figure out how to build a very safe and highly capable AGI. While that is proceeding, it may be a good idea (e.g., politically advisable and/or morally correct) to deploy relatively safe, limited AIs that can improve people's quality of life but are not necessarily state of the art in terms of capability or efficiency. Such improvements could include curing diseases and solving pressing scientific and technological problems.
- Research Assistant: if an AGI project gains a lead over its competitors, it may be able to grow that into a larger lead by building AIs to help with research (on either safety or capabilities). This can be in the form of an oracle, a human imitation, or even narrow AIs useful for making money (which can be used to buy more compute, hire more human researchers, etc.). Such research-assistant AIs can help pave the way to one of the other, more definitive success stories.
Further reading
- Nick Bostrom's Superintelligence is a great starting point
- Fantastic book on AI safety and security with lots of papers
- Positively shaping the development of Artificial Intelligence by 80,000 Hours
- 2022 AGI Safety Fundamentals alignment curriculum
- OpenAI's approach to alignment research
- AI Alignment Forum Library (highly recommended)
- AGI safety from first principles by Richard Ngo
- Late 2021 MIRI Conversations
- AI safety resources by Victoria Krakovna
- AI safety syllabus by 80,000 Hours
- EA reading list: Paul Christiano
- Technical AGI safety research outside AI
- Alignment Newsletter by Rohin Shah
- Building safe artificial intelligence: specification, robustness, and assurance
- OpenAI: Aligning Language Models to Follow Instructions
- OpenAI's Alignment Research Overview
- Links (57) & AI safety special by José Ricón
- Steve Byrnes’ essays on Artificial General Intelligence (AGI) safety
- On AI forecasting
- Practically-A-Book Review: Yudkowsky Contra Ngo On Agents
- AI Safety essays by Gwern
- Scalable agent alignment via reward modeling by Jan Leike
- AI safety via debate by Geoffrey Irving, Paul Christiano, Dario Amodei
- How should DeepMind’s Chinchilla revise our AI forecasts?
Alignment Forum: Recommended Sequences
- Risks from Learned Optimization
- Value Learning
- Iterated Amplification
- AGI safety from first principles
- Embedded Agency
- 2021 MIRI Conversations
- 2022 MIRI Conversations
Podcasts and videos
Introduction to Circuits in CNNs by Chris Olah