The foundations of AI safety

MIRI’s mission is to ensure that the creation of smarter-than-human intelligence has a positive impact. We aim to make intelligent machines behave as we intend even in the absence of immediate human supervision. Much of our current research deals with reflection, an AI’s ability to reason about its own behavior in a principled rather than ad-hoc way. We focus our research on AI approaches that can be made transparent (e.g. principled decision algorithms, not genetic algorithms), so that humans can understand why the AIs behave as they do.

Computational Self-Reflection

How can we develop a rigorous basis for reflective reasoning?

A trustworthy machine intelligence will need to reason coherently about its own behavior and about the behavior of similar agents, but such reflection runs into fundamental difficulties articulated by Gödel, Tarski, Löb, and others. At MIRI, we aim to develop a rigorous (rather than ad-hoc) basis for reflective reasoning.

Problems of Self-Reference in Self-Improving Space-Time Embedded Intelligence

By considering agents to be a part of their environment, Orseau and Ring’s space-time embedded intelligence is a better fit for the real world than the traditional agent framework. However, a self-modifying AGI that sees future versions of itself as an ordinary part of the environment may run into problems of self-reference. We show that in one particular model based on formal logic, naive approaches either lead to incorrect reasoning that allows an agent to put off an important task forever (the procrastination paradox), or fail to allow the agent to justify even obviously safe rewrites (the Löbian obstacle). We argue that these problems have relevance beyond our particular formalism, and discuss partial solutions.

Definability of Truth in Probabilistic Logic

Alfred Tarski showed that it is impossible for any expressive formal language to contain its own truth predicate. We show that an expressive formal language can nevertheless contain its own “subjective probability” function. The assigned probabilities can be reflectively consistent in the sense of an appropriate analog of the reflection property. In practice, most meaningful assertions must already be treated probabilistically, and very little is lost by allowing some sentences to have probabilities intermediate between 0 and 1.

Decision Procedures

How can we develop a better formal foundation for decision making under uncertainty?

Standard decision procedures are not well-specified enough (e.g., with regard to counterfactuals) to be instantiated as algorithms. These procedures also tend to be inconsistent under reflection: an agent that initially uses causal decision theory will regret doing so, and will attempt to change its own decision procedure. When building trustworthy AIs, we’ll need to develop a better formal foundation for decision making under uncertainty.

Robust Cooperation on the Prisoner’s Dilemma: Program Equilibrium via Provability Logic

Classical game-theoretic agents defect in the Prisoner’s Dilemma even though mutual cooperation would yield higher utility for both agents. Moshe Tennenholtz showed that if each program is allowed to pass its playing strategy to all other players, some programs can then cooperate on the one-shot prisoner’s dilemma. Program equilibria is Tennenholtz’s term for Nash equilibria in a context where programs can pass their playing strategies to the other players.

One weakness of this approach so far has been that any two programs which make different choices cannot “recognize” each other for mutual cooperation, even if they are functionally identical. In this paper, provability logic is used to enable a more flexible and secure form of mutual cooperation.

A Comparison of Decision Algorithms on Newcomblike Problems

When formulated using Bayesian networks, two standard decision procedures can be shown to fail systematically when faced with aspects of the prisoner’s dilemma and so-called “Newcomblike” problems. We describe a new form of decision procedure, called Timeless Decision Theory, which consistently wins on these problems.

Value Functions

How can we formally specify an AI’s goals, such that the formalism matches our intentions? How can we ensure those intended goals are preserved even as an AI modifies itself?

Ontological Crises in Artificial Agents’ Value Systems

Decision-theoretic agents predict and evaluate the results of their actions using a model of their environment. An agent’s goal, or utility function, may also be specified in terms of the entities within its model. If the agent may upgrade or replace its model, it faces a crisis: the agent’s original goal may not be well-defined with respect to its new model. This crisis must be resolved before the agent can make plans towards achieving its goals. We show a solution to the problem for a limited class of models, and suggest a way forward for finding solutions for broader classes of models.

Intelligence Explosion and Machine Ethics

Evidence from the psychology of motivation, moral psychology, and neuroeconomics suggests that human values are complex and fragile, and thus it is difficult to formally specify an AI’s goals such that the formalism matches our intentions. After examining several approaches to solving this problem, we recommend “ideal preference” approaches as most promising for further development.

Forecasting

What can we predict about future AI? Which interventions appear to be the most beneficial?

In addition to our mathematical research, MIRI investigates important strategic questions, such as: What can (and can’t) we predict about future AI? How can we improve our forecasting ability? Which interventions available today appear to be the most beneficial, given what little we do know?

Intelligence Explosion Microeconomics

I.J. Good suggested that a sufficiently advanced machine intelligence could build a smarter version of itself, which could in turn build an even smarter version, and that this process could continue to the point of vastly surpassing human capability. How can we model and test this hypothesis? We identify the key issue as returns on cognitive reinvestment—the ability to invest more computing power, faster computers, or improved cognitive algorithms to yield cognitive labor which produces larger brains, faster brains, or better mind designs. Many phenomena have been claimed as evidence for various positions in this debate, from the observed course of hominid evolution to Moore’s Law to the competence over time of chess programs. This paper explores issues that arise when trying to interpret this evidence in light of Good’s hypothesis, and proposes that the next step in this research is to formalize return-on-investment curves, so that each position can formally state which models they hold to be falsified by historical observations.

How We’re Predicting AI—or Failing To

There are strong theoretical reasons to expect timeline forecasts of human-level AI to be quite poor. Moreover, a database of 95 historical AI timeline forecasts demonstrates that these expectations are borne out in practice: expert predictions contradict each other considerably, and are indistinguishable from non-expert predictions and past failed predictions. Predictions that AI lies 15 to 25 years in the future are the most common, from experts and non-experts alike.

Intelligence Explosion: Evidence and Import

We review the evidence for and against three claims: (1) there is a substantial chance we will create human-level AI before 2100, that (2) if human-level AI is created, there is a good chance vastly superhuman AI will follow via an “intelligence explosion,” and that (3) an uncontrolled intelligence explosion could destroy everything we value, but a controlled intelligence explosion would benefit humanity enormously if we can achieve it. The paper concludes with recommendations for increasing the odds of a controlled intelligence explosion relative to an uncontrolled intelligence explosion.