Blog post (featuring the story of the chess6x6 agent) - https://medium.com/@omikad/train-ai-to-play-chess-6x6-using-probs-algorithm-539e39a9dea6
This is an algorithm for solving board games: two-player, deterministic games with perfect information. It is similar to AlphaZero, but uses a much simpler beam search instead of MCTS. The goal of this work is to show that this simplification also works well and can be trained into a winning agent. I don't have enough computational resources to make a fair comparison with AlphaZero, so if you would like to contribute, please email me (the email address is in the paper).
Value model: given a state s, predict the terminal reward in [-1, 1], assuming both players follow the policy derived from Q with added exploration noise.
Execute a predefined number of self-play games. Select moves based on softmax(Q(s, a)) with Dirichlet noise added to boost exploration. Save all games in the experience replay.
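A minimal sketch of the move selection step, for illustration only. The mixing rule `(1 - epsilon) * softmax + epsilon * noise`, the values of `epsilon` and `alpha`, and the function names are assumptions in the style of AlphaZero, not details taken from this project. The Q-values are passed in as a plain list, standing in for the output of the Q model.

```python
import math
import random


def softmax(q_values, temperature=1.0):
    """Numerically stable softmax over a list of Q-values."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet(alpha, n, rng):
    """Sample a symmetric Dirichlet(alpha) vector by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    if total == 0.0:
        # Extremely unlikely underflow case: fall back to a uniform vector.
        return [1.0 / n] * n
    return [d / total for d in draws]


def select_move(q_values, rng, epsilon=0.25, alpha=0.3):
    """Sample a move index from softmax(Q) mixed with Dirichlet noise.

    epsilon, alpha and the mixing rule are assumed values, not from the source.
    Returns the chosen index and the mixed probability vector.
    """
    probs = softmax(q_values)
    noise = dirichlet(alpha, len(q_values), rng)
    mixed = [(1 - epsilon) * p + epsilon * n for p, n in zip(probs, noise)]
    move = rng.choices(range(len(mixed)), weights=mixed, k=1)[0]
    return move, mixed


if __name__ == "__main__":
    rng = random.Random(0)
    move, mixed = select_move([0.5, -0.2, 0.1], rng)
    print("chosen move index:", move)
    print("move probabilities:", mixed)
```

Mixing the noise into the probabilities, rather than into the Q-values, keeps the result a valid distribution for any `epsilon` between 0 and 1.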
The value of an expanded node s_i is max(-child_state_value) taken over all of its child nodes. Each child's value is from the point of view of the next player, so for the player at s_i it is negated. To play optimally, the player at s_i maximizes this value.
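The backup rule can be sketched as a small recursive function. This is an illustration under assumptions: the `Node` class and `backup` function are hypothetical names, and leaf values stand in for value model predictions or terminal rewards.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Value from the point of view of the player to move at this node.
    value: float = 0.0
    children: List["Node"] = field(default_factory=list)


def backup(node: Node) -> float:
    """Recompute node values bottom-up with the max(-child_value) rule."""
    if not node.children:
        # Leaf: keep its value (assumed to come from the value model
        # or from a terminal reward).
        return node.value
    # Each child value belongs to the opponent, so negate it, then take
    # the best option for the player to move here.
    node.value = max(-backup(child) for child in node.children)
    return node.value


if __name__ == "__main__":
    root = Node(children=[Node(0.4), Node(-0.7), Node(0.1)])
    print(backup(root))  # 0.7: the move leading to the child worth -0.7 for the opponent
```

A child worth -0.7 for the opponent is worth 0.7 for the player at the parent node, which is why the second move is the best one in this example.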