We evolve environments at the frontier of a reinforcement learning agent's capabilities, leading to self-supervised teacher-student processes with strong zero-shot generalization results for agents learning to walk through challenging terrain and navigate complex human-designed mazes.
Deep reinforcement learning (RL) has seen tremendous success over the past decade. However, agents trained on fixed environments are brittle, often failing the moment the environment changes even slightly, which limits the real-world applicability of current RL methods. A common remedy is to introduce more diversity into the training data by randomizing the environment’s parameters in every episode, a process called domain randomization (DR). For example, these parameters might control the friction coefficient and lighting conditions in a robotic arm simulation, or the positions of obstacles in a maze. We call each such environment variation a level. Procedurally generating training levels to expose RL agents to more diverse experiences has quickly become a standard way to improve their robustness. However, DR is often not enough to train robust agents in domains where the agent struggles to make progress on many challenging levels.
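As a rough illustration of domain randomization, the sketch below draws a fresh level from a fixed distribution at the start of every episode. The `Level` fields, the parameter ranges, and the function names are hypothetical choices made for this example, not taken from any particular simulator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """One environment variation (hypothetical parameters for illustration)."""
    friction: float      # e.g. a friction coefficient in a robotics simulation
    num_obstacles: int   # e.g. how many obstacles are placed in a maze


def sample_level(rng: random.Random) -> Level:
    # Assumed ranges: friction in [0.2, 1.0], between 0 and 10 obstacles.
    return Level(
        friction=rng.uniform(0.2, 1.0),
        num_obstacles=rng.randint(0, 10),
    )


def domain_randomization(num_episodes: int, seed: int = 0) -> list[Level]:
    """Return the level used in each episode, sampled independently."""
    rng = random.Random(seed)
    return [sample_level(rng) for _ in range(num_episodes)]


levels = domain_randomization(num_episodes=5)
```

Because every level is sampled independently of how the agent is doing, nothing steers training toward levels that are neither too easy nor too hard, which is the gap adaptive curricula aim to fill.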
Adaptive curriculum methods match the complexity of training levels to an agent’s current capabilities, and have been shown to produce more robust policies in fewer training steps than domain randomization. These methods can be viewed as a game between a teacher that designs challenging levels and a student that solves them. This game is potentially open-ended, leading to the co-evolution of generally capable students. By tailoring the distribution of entire levels throughout training, adaptive curricula perform unsupervised environment design (UED). However, training an environment-designing teacher is difficult, and the prior state-of-the-art UED method, Prioritized Level Replay (PLR), simply finds challenging levels through random search. As a result, it cannot build on previously discovered structures that proved useful for training. We can also expect its performance to degrade as the size of the design space grows, limiting the potential for open-ended co-evolution between teacher and student. In contrast, the evolutionary processes between organisms and their environments, which UED resembles, efficiently search the design space by successively mutating a population and selecting the fittest individuals.
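To make the contrast concrete, here is a minimal sketch of the two search strategies over a toy design space in which a level is a flat list of 0/1 maze cells. Everything in it is an assumption for illustration: `toy_score` is a hypothetical stand-in for whatever measure the teacher uses to rank levels, and the helper names are invented. It is not the implementation of PLR or of any specific evolutionary method.

```python
import random


def toy_score(level):
    # Hypothetical fitness: prefers levels where about half the cells are walls.
    return -abs(sum(level) - len(level) // 2)


def random_level(rng, n_cells):
    # Random search proposes a level from scratch, ignoring what was found before.
    return [rng.randint(0, 1) for _ in range(n_cells)]


def mutate(level, rng):
    # Evolution proposes a small edit to an existing level: flip one cell.
    child = list(level)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def keep_fittest(levels, score, size):
    # Selection: keep the `size` highest-scoring levels.
    return sorted(levels, key=score, reverse=True)[:size]


def random_search_step(population, score, rng):
    candidate = random_level(rng, len(population[0]))
    return keep_fittest(population + [candidate], score, len(population))


def evolution_step(population, score, rng):
    children = [mutate(parent, rng) for parent in population]
    return keep_fittest(population + children, score, len(population))


rng = random.Random(0)
population = [[0] * 16 for _ in range(4)]  # start from four empty mazes
start_best = max(toy_score(lv) for lv in population)
for _ in range(20):
    population = evolution_step(population, toy_score, rng)
end_best = max(toy_score(lv) for lv in population)
```

In `evolution_step`, each candidate inherits the structure of a level that already scored well, whereas `random_search_step` must rediscover useful structure from scratch on every draw. That difference is why random search is expected to scale poorly as the design space grows.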