User:Cousin it/AI cooperation in practice

From LessWrong
Jump to navigation Jump to search

Problem 1: A program that proves statements about its own output

Consider a program that knows its own source code (by quining) and contains a strong enough proof checker. The program's algorithm is as follows: try all proofs of length up to nmax, looking for a proof that the program itself returns 1. If such a proof is found, return 1. Otherwise do the same for 2, then 3, then return 0.

The question is: what will the program return?

Solution to problem 1

I believe that the program will in fact return 1. Because the proof is quite involved, I will first write out the program itself, using a JavaScript-like syntax, with large parts snipped.

   function eval(code)
   {
       // Runs code and returns whatever it evaluates to.
       // Uses the definitions of other functions directly.
       ...
   }
   function proves(code, maxsteps)
   {
       // Enumerates all proofs up to maxsteps, searching for a 
       // proof that code evaluates to true. Returns true if found.
       // Uses the definitions of other functions known by quining.
       ...
   }
   function outputs(x)
   {
       // Returns a string of code claiming that main returns x.
       return "main()==" + x;
   }
   function implies(bool1, bool2)
   {
       // Returns true if one boolean value implies another. 
       if (bool1)
           return bool2;
       else
           return true;
   }
   function main()
   {
       var nmax = ...; // A big number
       if (proves(outputs(1), nmax))
           return 1;
       if (proves(outputs(2), nmax))
           return 2;
       if (proves(outputs(3), nmax))
           return 3;
       return 0;
   }

Now for the actual proof. It will closely follow the proof of Lob's theorem, but add length limits where needed. Let n1, ... ,n6 be a suitably fast-growing sequence of numbers smaller than nmax. The number n1 should also be large enough to start with; the proof will show exactly how large. I will list twelve code snippets that are proved to return true by the program:

1. By the Diagonal Lemma, there exists a code string "box" such that the following will be proved:

  eval(box) == implies(
                   proves(box, n1),
                   eval(outputs(1)))

2. The program can inspect the source code of main() and notice that, if the first conditional happens to be true, the program will output 1:

  implies(
      proves(outputs(1), n6),
      eval(outputs(1)))

3. An obvious consequence of 1:

  implies(
      eval(box),
      implies(
          proves(box, n1),
          eval(outputs(1))))

4. If we can prove 3, we can prove that we can prove it:

  proves(
      "implies(
           eval(box),
           implies(
               proves(box, n1),
               eval(outputs(1))))",
       n2)

5. Pull "implies" outside of "proves", adding some length limits:

  implies(
      proves(box, n1),
      proves(
          "implies(
               proves(box, n1),
               eval(outputs(1)))",
          n3))

6. Push "proves" inside "implies", adding more limits:

  implies(
      proves(box, n1),
      implies(
          proves("proves(box, n1)", n4),
          proves(outputs(1), n5)))

7. If we know box has a short proof, that fact has a slightly longer proof:

  implies(
      proves(box, n1),
      proves("proves(box, n1)", n2))

8. Logic on 6 and 7, using the fact that n2 < n4:

  implies(
      proves(box, n1),
      proves(outputs(1), n5))

9. Logic on 8 and 2, using the fact that n5 < n6:

  implies(
      proves(box, n1),
      eval(outputs(1)))

10. By luck, 9 is exactly the definition of box from 1:

   eval(box)

11. If we can prove 10, we can prove that we can prove it. This is the crucial step that tells us how large n1 must be: large enough that 11 follows from 10.

   proves(box, n1)

12. Logic on 9 and 11:

   eval(outputs(1))

Now, If the proof checker can prove only true statements, this means the program will in fact output 1. QED.

Problem 2: Two programs that prove statements about each other

Consider two programs, each of which knows its own source code via quining. The programs are playing a variant of the Prisoner's Dilemma: each receives as input the source code of the other, and outputs true for Cooperate or false for Defect. The algorithm of each program is as follows: try all proofs up to a certain length (the proof system and the max length depending on the program), looking for a proof that both programs output the same value (i.e. that I cooperate if and only if the other guy cooperates). If a proof is found, the program outputs true, otherwise false.

The question is: will the two programs cooperate?

Solution to problem 2

Like in Problem 1, assume that the two programs A and B have parts named mainA and mainB, provesA and provesB, etc., accessible to both programs. Define a new function "proves" as follows:

  function proves(code, maxsteps)
  {
      return provesA(code, maxsteps) && provesB(code, maxsteps);
  }

Now use the same solution as in Problem 1, except in the beginning:

1. By the Diagonal Lemma, construct a code string "box" so that the following will be proved by both programs:

  eval(box) == implies(
                   proves(box, n1),
                   (mainA() == mainB()))

2. Both programs can prove the following by inspecting the source of A and B (recall that "proves" implies provesA and provesB):

  implies(
      proves("mainA() == mainB()", n6),
      (mainA() == mainB()))

Steps 3 through 12 are the same as in the solution to Problem 1. They use properties of the "proves" function that hold if both proof systems are strong enough - they don't have to be equivalent. Finally we see both programs will prove that mainA() == mainB(), which implies that they will both return true. QED.