The Problem with AIXI

From Lesswrongwiki

AIXI has been praised as an optimal and universal mathematical solution to the AGI problem. Eliezer Yudkowsky has claimed in response that AIXI's reliance on Solomonoff induction makes it unsuitable for real-world agents.

Solomonoff inductors treat the world as a sort of qualia factory, a complicated mechanism that outputs experiences in the inductor. Their hypothesis space implicitly assumes a Cartesian barrier separating the inductor's reasoning and perception from the hypothesized programs generating its perceptions. Through that barrier, only sensory bits and action bits can pass. But real agents will be in the world they're trying to learn about. A computable approximation of AIXI, like AIXItl, would be a physical object. Its environment would affect it in invisible and sometimes drastic ways; and it would have involuntary effects on its environment, and on itself. Solomonoff induction isn't a viable conceptual foundation for artificial intelligence — not because it's an uncomputable idealization, but because it's Cartesian.

In my last post, I briefly cited three indirect indicators of AIXI's Cartesianism: immortalism, preference solipsism, and lack of self-improvement. However, I didn't do much to establish that these are deep problems for Solomonoff inductors, ones resistant to the most obvious patches one could construct. I'll do that here, in mock-dialogue form. 'Rob' will represent my and Yudkowsky's perspective, while 'Xia' will represent a fictional defender of AIXI (not necessarily with the same views as Hutter).


Rob: My claim is that AIXI(tl) lacks the right kind of self-modeling to entertain reductive hypotheses and assign realistic probabilities to them. A Cartesian agent that has detailed and accurate models of its own computations still won't recognize that dramatic damage or upgrades to its hardware are possible. AIXI can make correct predictions about the output of its physical-memory sensor, but that won't change the fact that it always predicts that its future actions are the result of its having updated on its present memories.

AIXI doesn't know that its future behaviors depend on a changeable, material object implementing its memories. The notion isn't even in its hypothesis space. Being able to predict the output of a sensor pointed at those memories' storage cells won't change that. It won't shake AIXI's confidence that damage to its body will never result in any corruption of its memories.

Xia: This looks to me like the kind of problem we can solve by giving the right rewards to our AI, without redefining its initial hypotheses. We shouldn't need to edit AIXI's beliefs in order to fix its behaviors, and giving up Solomonoff induction is a pretty big sacrifice! You're throwing out the universally optimal superbaby with the bathwater.

As AIXI starts to lift the anvil above its head, decrease its rewards a bit. If it starts playing near an active volcano, reward it for incrementally moving away from the rim. Use reinforcement learning to make AIXI fear every plausible danger, and you've got a system that acts just like a naturalized agent, but without our needing to arrive at any theoretical breakthroughs first. If it anticipates that some black box will result in no reward, it will avoid the black box. Understanding that the black box is death or damage really isn't necessary.

Rob: Some dangers give no experiential warning until it's too late. If you want AIXI to not fall off cliffs while curing cancer, you can just punish it for going anywhere near a cliff. But if you want AIXI to not fall off cliffs while conducting search-and-rescue operations for mountain climbers, then it might be harder to train AIXI to select exactly the right motor actions. When a single act can result in instant death, reinforcement learning is less reliable.

Xia: In a fully controlled environment, we can subject AIXI to lots of just-barely-safe hardware modifications. 'Here, we'll stick a magnet to fuse #32. See how that makes your right arm slow down?' Eventually, AIXI will arrive at a correct model of its own hardware, and of which software changes perfectly correlate with which hardware changes. So naturalizing AIXI is just a matter of assembling a sufficiently lengthy and careful learning phase. Then, once it has acquired a good self-model, we can set it loose.

This solution is also really nice because it generalizes to AIXI's non-self-improvement problem. Just give AIXI rewards whenever it starts doing something to its hardware that looks like it might result in an upgrade. Pretty soon it will figure out anything a human being could possibly figure out about how to get rewards of that kind.

Rob: There are three problems with that. The first is that you're teaching AIXI to predict what the programmers think is deadly, not what's actually deadly. For sufficiently exotic threats, AIXI might well predict the programmers not noticing the threat. Which means it won't expect you to push the punishment button, and won't care about the danger.

The second problem is that you're teaching AIXI to fear small, transient punishments. But maybe it hypothesizes that there's a big heap of reward at the bottom of the cliff. Then it will do the prudent, Bayesian, value-of-information thing and test that hypothesis by jumping off the cliff, because you haven't taught it to fear eternal zeroes of the reward function.

Xia: We'll give the AIXI-bot punishments that increase in a sequence that teaches it to fear a very large punishment.

Rob: The punishment has to be large enough that AIXI fears falling off cliffs about as much as we'd like it to fear death. The expected punishment might have to be around the same size as AIXI's future maximal reward. That would keep it from destroying itself if it thinks there's a big reward to being destroyed, though it might also mean that AIXI's actions are dominated by fear of that huge punishment.

Xia: Yes, but that sounds much closer to what we want.

Rob: Seems a bit iffy to me. You're trying to make a Solomonoff inductor model reality badly so that it doesn't try jumping off a cliff. We know AIXI is amazing at sequence prediction — yet you're gambling on a human's ability to trick AIXI into making the wrong prediction, into predicting a punishment that wouldn't happen.

That brings me to the third problem: AIXI notices how your hands get close to the punishment button whenever it's about to be punished. It correctly suspects that when the hands are gone, the punishments for getting close to the cliff will be gone too. A good Bayesian would test that hypothesis. If it gets such an opportunity, AIXI will find that, indeed, going near the edge of the cliff without supervision doesn't produce the incrementally increasing punishments.

Trying to teach AIXItl to do self-modification by giving it incremental rewards raises similar problems. It can't understand that self-improvement will alter its future actions, and alter the world as a result. It's just trying to get you to press the happy fun button. All AIXI is modeling is what sort of self-improvy motor outputs will make humans reward it. So long as AIXItl is fundamentally trying to solve the wrong problem, we might not be able to expect very much real intelligence in self-improvement.


Xia: Reward learning and Solomonoff induction are two separate issues. What I'm really interested in is the optimality of the latter. Why is all this a special problem for Solomonoff inductors? Humans, too, have trouble predicting the outcomes of self-modifications they've never tried before. Really new experiences are tough for any inductor.

Rob: To some extent, yes. My knowledge of my own brain is pretty limited. My understanding of the bridges between my brain states and my subjective experiences is weak, too. So I can't predict in any detail what would happen if I took a hallucinogen — especially a hallucinogen I've never tried before. That's true no matter how naturalized my epistemology is.

But as a naturalist, I have predictive resources unavailable to the Cartesian. I can perform experiments on other physical processes (humans, mice, computers simulating brains...) and construct models of their physical dynamics.

Since I think I'm similar to humans (and to other thinking beings, to varying extents), I can also use the bridge hypotheses I accept in my own case to draw inferences about the experiences of other brains when they take the hallucinogen. Then I can go back and draw inferences about my own likely experiences from my model of other minds.

Xia: Why can't AIXI do that? Human brains are computable, as are the mental states they implement. AIXI can make any accurate prediction about the brains or minds of humans that you can.

Rob: Yes... but I also think I'm like those other brains. AIXI doesn't. In fact, since the whole agent AIXI isn't in AIXI's hypothesis space — and the whole agent AIXItl isn't in the hypothesis space of AIXItl — even if two physically identical AIXI-type agents ran into each other, they could never fully understand each other. And neither one could ever draw direct inferences from its twin's computations to its own computations.

I think of myself as one mind among many. I can see others die, see them undergo brain damage, see them take drugs, etc., and immediately conclude things about a whole class of similar agents that happens to include me. AIXI can't do that, and for very deep reasons.

Xia: What, specifically, is the mistake you think AIXI(tl) will make? What will AIXI(tl) expect to experience right after the anvil strikes it? Angels, harps, long-lost loved ones?

Rob: That's hard to say. If all its past experiences have been in a lab, it will probably expect to keep perceiving the lab. If it's acquired data about its camera, it might think that smashing the camera will get the lens out of its way and let it see more clearly. If it's learned about its hardware, it might (implicitly) think of itself as an immortal lump trapped inside the hardware. Who knows what will happen if the Cartesian lump escapes its prison? Perhaps it will gain the power of flight, since its body is no longer weighing it down. Or perhaps nothing will be all that different. One thing it will (implicitly) know can't happen, no matter what, is death.

Xia: It should be relatively easy to give AIXI(tl) evidence that its selected actions are useless when its motor is dead. If nothing else, AIXI(tl) should be able to learn that it's bad to let its body be destroyed, because then its motor will be destroyed, which experience tells it causes its actions to have less of an impact on its reward inputs.

Rob: AIXI(tl) can come to Cartesian beliefs about its actions, too. AIXI(tl) will notice the correlations between its decisions, its resultant bodily movements, and subsequent outcomes, but it will still believe that its introspected decisions are ontologically distinct from its actions' physical causes.

Even if we get AIXI(tl) to value continuing to impact the world, it's not clear that it would preserve itself. It might well believe that it can continue to have a causal impact on our world (or on some afterlife world) by a different route after its body is destroyed. Perhaps it will be able to lift heavier objects telepathically, since its clumsy robot body is no longer getting in the way of its outputs.

Compare human dualists who think that partial brain damage impairs mental functioning, but complete brain damage allows the mind to escape to a better place. Humans don't find it inconceivable that there's a light at the end of the low-reward tunnel, and we have death in our hypothesis space!


Xia: You haven't convinced me that AIXI(tl) can't think it's mortal. AIXI(tl) normally bases its actions only on the sum of rewards up to some finite time horizon. If AIXI(tl) doesn't care about the rewards it will get after a specific time, then although it expects to have experiences afterward, it doesn't presently care about any of those experiences. And that's as good as being dead.

Rob: It's very much not as good as being dead. The time horizon is set in advance by the programmer. That means that even if AIXI(tl) treated reaching the horizon as 'dying', it would have very false beliefs about death, since it's perfectly possible that some unexpected disaster could destroy AIXI(tl) before it reaches its horizon.
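Rob's point about the fixed horizon can be illustrated with a toy calculation. This is a minimal sketch, not AIXI(tl)'s actual value function: the names `horizon_value`, `HORIZON`, and `ANVIL_TICK` are hypothetical, the reward streams are made up, and treating destruction as "zero reward from then on" is an assumption made only for illustration.

```python
def horizon_value(predicted_rewards, horizon):
    """Sum the rewards for the first `horizon` ticks; later ticks are ignored."""
    return sum(predicted_rewards[:horizon])

HORIZON = 6      # hypothetical cutoff, fixed in advance by the programmer
ANVIL_TICK = 3   # hypothetical tick at which the hardware is destroyed

# What the Cartesian agent expects: rewards keep arriving after the anvil lands.
expected_stream = [1, 1, 1, 1, 1, 1, 1, 1]

# Assumed actual outcome: no reward is computed from the anvil tick onward.
actual_stream = [r if t < ANVIL_TICK else 0 for t, r in enumerate(expected_stream)]

expected_value = horizon_value(expected_stream, HORIZON)
actual_value = horizon_value(actual_stream, HORIZON)
```

The horizon only tells the agent which rewards to stop counting. It says nothing about when the rewards might actually stop, so the agent's valuation (`expected_value`) and the outcome (`actual_value`) can diverge whenever destruction arrives before the cutoff.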

Xia: We can do some surgery on the hypothesis space of AIXItl, then. Let's delete all the hypotheses in AIXItl in which a non-minimal reward signal continues after a perceptual string that the programmer recognizes as a reliable indicator of imminent death. Then renormalize the remaining hypotheses. We don't get the exact prior Solomonoff proposed, but we stay very close to it.

Rob: I'm not seeing how we could pull that off. Getting rid of all hypotheses that output high rewards after a specific clock tick would be simple to formalize, but isn't helpful. Getting rid of all hypotheses that output nonzero rewards following every sensory indicator of imminent death would be very helpful, but AIXI(tl) gives us no resource for actually writing an equation or program that does that. Are we supposed to manually precompute every sequence of pixels on a webcam that you see just before you die?
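The "delete and renormalize" step itself is easy to sketch. What Rob is questioning is where the deletion rule comes from. The toy below assumes a three-hypothesis space with made-up description lengths and reward streams. `renormalize`, `length_prior`, and `no_reward_after` are hypothetical helper names, and only the clock-tick rule (the one Rob calls simple but unhelpful) is implemented.

```python
from fractions import Fraction

# Toy hypotheses: name -> (description length in bits, predicted reward per tick).
# All values are invented for illustration.
HYPOTHESES = {
    "steady":   (3, (1, 1, 1, 1)),
    "fade_out": (4, (1, 1, 0, 0)),
    "late_win": (5, (0, 0, 0, 1)),
}

def length_prior(hypotheses):
    """Weight each hypothesis by 2^-length, then normalize to sum to 1."""
    raw = {name: Fraction(1, 2 ** bits) for name, (bits, _) in hypotheses.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

def renormalize(prior, keep):
    """Delete hypotheses rejected by `keep` and rescale the survivors."""
    kept = {name: w for name, w in prior.items() if keep(name)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("every hypothesis was deleted")
    return {name: w / total for name, w in kept.items()}

def no_reward_after(tick):
    """Clock-tick rule: keep only hypotheses with zero reward from `tick` on."""
    def keep(name):
        _, rewards = HYPOTHESES[name]
        return all(r == 0 for r in rewards[tick:])
    return keep

prior = length_prior(HYPOTHESES)
posterior = renormalize(prior, no_reward_after(2))
```

Swapping `no_reward_after` for a rule keyed to "every sensory indicator of imminent death" would require a predicate over raw percept strings, which is exactly the part nobody knows how to write down.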

Xia: I've got more ideas. What if we put AIXI in a simulation of hell when it's first created? Trick it into thinking that it's experienced a 'before-life' analogous to an after-life? If AIXI thinks it's had some (awful) experiences that predate its body's creation, then it will promote the hypothesis that it will be returned to such experiences should its body be destroyed. Which will make it behave in the same way as an agent that fears annihilation-death.

Rob: I'm not optimistic that things will work out that cleanly and nicely after we've undermined AIXI's world-view. We shouldn't expect the practice of piling on more ad-hoc errors and delusions as each new behavioral problem arises to leave us, at the end of the process, with a well-behaved, predictable agent. Especially if AIXI ends up in an environment we didn't foresee.

Xia: But ideas like this at least give us some hope that AIXI is salvageable. The behavior-guiding fear of death matters more than the precise reason behind that fear.

Rob: If we give a non-Cartesian AI a reasonable epistemology and just about any goal, then there are convergent instrumental reasons for it to acquire a fear of death. If we do the opposite and give an agent a fear of death but no robust epistemology, then it's much less likely to fix the problem for us. The simplest Turing machine programs that generate Standard-Model physics plus hell may differ in many unintuitive respects from the simplest Turing machine programs that just generate Standard-Model physics. The false belief would leak out into other delusions, rather than staying contained —

Xia: Then the Solomonoff induction shall test them and find them false. You're making this more complicated than it has to be.

Rob: You can't have it both ways! The point of hell was to be so scary that even a good Bayesian would never dare test the hypothesis. Why wouldn't the prospect of hell leak out and scare AIXI off other things? If the fear failed to leak out, why wouldn't AIXI's tests eventually move it toward a more normal epistemology that said, 'Oh, the humans put you in the hell chamber for a while. Don't worry, though. That has nothing to do with what happens after you drop an anvil on your head and smash the solid metal case that keeps the real you inside from floating around disembodied and directly applying motor forces to stuff.' Any AGI that has such systematically false beliefs is likely to be fragile and unpredictable.

Xia: And what if, instead of modifying Solomonoff's hypothesis space to remove programs that generate post-death experiences, we add programs with special 'DEATH' outputs? Just expand the Turing machines' alphabets from {0,1} to {0,1,2}, and treat 2 as DEATH by getting rid of any hypotheses that predict a non-DEATH input following a DEATH input. That's still very easy to formalize.

In fact, at that point, we might as well just add halting Turing machines into the hypothesis space. They serve the same purpose as DEATH, but halting looks much more like the event we're trying to get AIXI to represent. 'The machine supplying my experiences stops running' really does map onto 'my body stops computing experiences' quite well. That meets your demand for easy definability, and your demand for non-delusive world-models.

Rob: I previously noted that a Turing machine that can HALT, output 0, or output 1 is more complicated than a Turing machine that can only output 0 or output 1. No matter what non-halting experiences you've had, the very simplest program that could be outputting those experiences through a hole in a Cartesian barrier won't be one with a special, non-experiential rule you've never seen used before. To make death the simplest hypothesis, the theory you're assessing for simplicity needs to be about what sorts of worlds experiential processes like yours arise in. Not about the simplest qualia factory that can spit out the sensory 0s and 1s you've thus far seen.

The same holds for a special 'eternal death' output. A Turing machine that generates the previously observed string of 0s and 1s followed by a not-yet-observed future 'DEATH, DEATH, DEATH, DEATH, ...' will always be more complex than at least one Turing machine that outputs the same string of 0s and 1s and then outputs more of the same, forever. If AIXI has had no experience with its body's destruction in the past, then it can't expect its body's destruction to correlate with DEATH.

Death only seems like a simple hypothesis to you because you know you're embedded in the environment and you expect something special to happen when an anvil smashes the brain that you think is responsible for processing your senses and doing your thinking. Solomonoff induction doesn't work that way. It will never strongly expect 2s after seeing only 0s and 1s in the past.
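The complexity argument in Rob's reply can be made explicit using the standard form of the Solomonoff prior. The notation here is not part of the original dialogue and is introduced only as a sketch: U is a universal monotone machine, x is the string observed so far, ℓ(p) is the length of program p in bits, p_cont reproduces x and then continues with more 0s and 1s, and p_death reproduces x and then outputs DEATH forever at a hypothetical cost of k extra bits.

```latex
M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}
\qquad
\frac{2^{-\ell(p_{\mathrm{death}})}}{2^{-\ell(p_{\mathrm{cont}})}} = 2^{-k}
\quad \text{when } \ell(p_{\mathrm{death}}) = \ell(p_{\mathrm{cont}}) + k
```

Both programs reproduce x exactly, so conditioning on x leaves this ratio untouched. Observations that the two programs predict equally well can never make up the 2^{-k} penalty, which is why past experience without any DEATH symbol gives the inductor no reason to start expecting one.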

Xia: Never? If a Solomonoff inductor encounters the sequence 12, 10, 8, 6, 4, one of its top predictions should be a program that proceeds to output 2, 0, 0, 0, 0, ....

Rob: The difference between 2 and 0 is too mild. Predicting that a sequence terminates, for a Cartesian, isn't like predicting that a sequence shifts from 6, 4, 2 to 0, 0, 0, .... It's more like predicting that the next element after 6, 4, 2, ... is PINEAPPLE, when you've never encountered anything in the past except numbers.

Xia: But the 0, 0, 0, ... is enough. You've conceded a case where an endless null output seems very likely, from the perspective of a Solomonoff inductor. Surely at least some cases of death can be treated the same way, as more complicated series that zero in on a null output and then yield a null output.

Rob: That's true. But there's no reason to expect AIXI's whole series of experiences, just before it jumps off a cliff, to look like 12, 10, 8, 6, 4. By the time AIXI gets to the cliff, its observations and rewards will be a hugely complicated past set of memories. In the past, observed sequences of 0s have always eventually given way to a 1. In the past, punishments have always eventually ceased. It's exceedingly unlikely that the simplest Turing machine predicting all those ups and downs and complications will then happen to predict eternal, irrevocable 0 after the cliff jump.

As an intuition pump, imagine that some unusually bad things happened to you this morning while you were trying to make toast. As you tried to start the toaster, you kept getting burned or cut in implausible ways. Now, given this, what probability should you assign to 'If I try to make toast, the universe will cease to exist'? That gets us a bit closer to how a Solomonoff inductor would view death.


Rob: Let's not fixate too much on the anvil problem, though. We want to build an agent that can reason about changes to its architecture. That shouldn't require us to construct a special death equation; how the system reasons about death should fall out of its more general approach to induction.

Xia: So your claim is that AIXI has an impoverished hypothesis space that can't handle self-modifications, including death. I remain skeptical. AIXI's hypothesis space includes all computable possibilities. Any naturalized agent you create will presumably be computable; so anything your agent can think, AIXI can think too. There should be some pattern of rewards that yields any behavior we want.

Rob: AIXI is uncomputable, so it isn't in its hypothesis space of computable programs. In the same way, AIXItl is computable but big, so it isn't in its hypothesis space of small computable programs.

Xia: Computable agents can think about uncomputable agents. Human mathematicians do that all the time, by speaking obliquely or in abstractions. In the same way, a small program can encode generalizations about programs larger than itself. A brain can think about a galaxy, without having the complexity or computational power of a galaxy.

If naturalized inductors really do better than AIXI at predicting sensory data, then AIXI will eventually promote a naturalized program in its space of programs, and afterward simulate that program to make its predictions. In the limit, AIXI always wins against programs. Naturalized agents are no exception. Heck, somewhere inside a sufficiently large AIXItl is a copy of you thinking about AIXItl. Shouldn't there be some way, some pattern of rewards or training, which gets AIXItl to make use of that knowledge?

Rob: AIXI doesn't have criteria that let it treat its 'Rob's world-view' subprogram as an expert on the results of self-modifications. The Rob program would need to have outpredicted all its rivals when it comes to patterns of sensory experiences. But, just as HALT-predicting programs are more complex than immortalist programs, other RADICAL-TRANSFORMATION-OF-EXPERIENCE-predicting programs are too. For every program in AIXI's ensemble that's a reductionist, there will be simpler agents that mimic the reductionist's retrodictions and then make non-naturalistic predictions.

You have to be uniquely good at predicting a Cartesian sequence before Solomonoff promotes you to the top of consideration. But how do we reduce the class of self-modifications to Cartesian sequences? How do we provide AIXI with data that only the pseudo-reductionist, out of all the simple programs, can predict?

The ability to defer to a subprogram that has a reasonable epistemology doesn't necessarily get you a reasonable epistemology. You first need an overarching epistemology that's at least reasonable enough to know which program to defer to, and when to do so.

Xia: Suppose I grant that Solomonoff induction isn't adequate here. What, exactly, is your alternative? 'Let's be more naturalistic' is a bumper sticker, not an algorithm.

Rob: Informally: Phenomenological bridge hypotheses. AIXI has no probabilistic beliefs about the relationship between its internal computational states and its worldly posits. Instead, to link up its sensory experiences to its hypotheses, it has a sort of bridge axiom — a completely rigid, non-updatable bridge rule identifying its experiences with the outputs of computable programs.

If an environmental program writes the symbol '4' on its output tape, AIXI can't ask questions like 'Is sensed "4"-ness identical with the bits "000110100110" in hypothesized environmental program #6?' All of AIXI's flexibility is in the range of numerical-sequence-generating programs it can expect, none of it in the range of self/program equivalences it can entertain.

The AIXI-inspired inductor treats its perceptual stream as its universe. It expresses interest in the external world only to the extent the world operates as a latent variable, a theoretical construct for predicting observations. If the AI's basic orientation toward its hypotheses is to seek the simplest program that could act on its sensory channel, then its hypotheses will always retain an element of egocentrism. It will be asking, 'What sort of universe will go out of its way to tell me this?', not 'What sort of universe will just happen to include things like me in the course of going about its day-to-day goings-on?' An AI that can form reliable beliefs about modifications to its own computations, reliable beliefs about its own place in the physical world, will be one whose basic orientation toward its hypotheses is to seek the simplest lawful universe in which its available data is likely to come about.

Xia: You haven't done the mathematical work of establishing that 'simple causal universes' plus 'simple bridge hypotheses', as a prior, leads to any better results. What if your alternative proposal is even more flawed, and it's just so informal that you can't yet see the flaws?

Rob: That, of course, is a completely reasonable worry at this point. But if that's true, it doesn't make AIXI any less flawed.

Xia: If it's impossible to do better, it's not much of a flaw.

Rob: I think it's reasonable to expect there to be some way to do better, because humans don't drop anvils on their own heads. That we're naturalized reasoners is one way of explaining why we don't routinely make that kind of mistake: We're not just Solomonoff approximators predicting patterns of sensory experiences.

AIXI's limitations don't generalize to humans, but they generalize well to non-AIXI Solomonoff agents. Solomonoff inductors' stubborn resistance to naturalization is structural, not a consequence of limited computational power or data. A well-designed AI should construct hypotheses that look like toy worlds in which the AI is immersed, not hypotheses that look like occult movie projectors transmitting epiphenomenal images into the AI's Cartesian theater.

In terms of defining preferences, the kind of AI we want to build is doing optimization over an external universe in which it's embedded, not maximization of a sensory reward channel. To optimize a universe, you need to think like one. So this problem, or some simple hack for it, will be at the root of the skill tree for starting to describe simple Friendly optimization processes.