(This is an unofficial explanation of Inner Alignment based on the MIRI paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (MIRI/LW). It's meant for anyone who found the sequence too long, challenging, or technical to read.)

If the problem is "find a tool that can look at any image and decide whether or not it contains a cat," then each conceivable set of rules for answering this question (formally, each function from the set of all possible images to the set {yes, no}) defines one solution. We call each such solution a model. The space of possible models is depicted below.

Pick a random one, and you're as likely to end up with a car-recognizer as a cat-recognizer, but far more likely to end up with an algorithm that does nothing we can interpret. Note that even the examples I annotated aren't typical; most models would be more complex while still doing nothing related to cats. Nonetheless, somewhere in there is a model that would do a decent job on our problem. In the above, that's the one that says, "I look for cats."
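To make "the space of possible models" concrete, here is a minimal sketch in a toy setting. It assumes images with only a handful of pixels that are each black or white, which is far smaller than any real image; the function names (`all_images`, `count_models`, `random_model`) are made up for illustration and do not come from the paper or the sequence. A model here is just a lookup table that assigns "yes" or "no" to every possible image.

```python
import itertools
import random


def all_images(num_pixels):
    """Every binary image with `num_pixels` pixels, as a tuple of 0/1 values."""
    return list(itertools.product((0, 1), repeat=num_pixels))


def count_models(num_pixels):
    """Number of distinct functions from binary images to {yes, no}.

    There are 2**num_pixels images, and each one can independently be
    labelled yes or no, so there are 2**(2**num_pixels) models.
    """
    num_images = 2 ** num_pixels
    return 2 ** num_images


def random_model(num_pixels, rng):
    """Pick one model uniformly at random, stored as a lookup table."""
    return {image: rng.choice(("yes", "no")) for image in all_images(num_pixels)}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
model = random_model(4, rng)

print(len(all_images(4)))           # 16 possible 4-pixel images
print(count_models(4))              # 65536 possible models
print(len(str(count_models(12))))   # digits in the model count for 12 pixels
```

Even with 12 black-or-white pixels, the number of models already has over a thousand digits, and a randomly drawn lookup table like `model` has no reason to track anything as structured as "contains a cat." Real images have vastly more pixels and more values per pixel, so the real space is larger still.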

How does ML find such a model? One way that does not work is trying out all of them. That's because the space is too large: it might contain over 10^1000000 candidates. Instead, there's this thing called Stochastic Gradient Descent (SGD). Here's how it works:
