Reinforcement learning driven resolution process

Note

See the high-level predictor documentation for predictor basics.

As described in the introductory section, the whole resolution process can be treated as a Markov decision process (MDP) (see MDP on Wikipedia); a code sketch of this framing follows the list below. Treating the resolution process this way gave rise to predictor implementations based on reinforcement learning (RL). The upcoming sections discuss two gradient-free methods:

* MCTS based predictor
* TD learning based predictor
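
To make the MDP framing concrete, here is a minimal sketch in Python; the names (State, Action, step) and the zero reward are illustrative assumptions, not the adviser's actual API:

    # A minimal, hypothetical sketch of the resolution process as an MDP:
    # a state is a partially resolved stack, an action locks one more
    # package to a concrete version, and the reward stands in for the
    # score the resolver would assign to that transition.
    from typing import Dict, Tuple

    State = Dict[str, str]    # package name -> locked version
    Action = Tuple[str, str]  # (package name, version to lock)

    def step(state: State, action: Action) -> Tuple[State, float]:
        """Apply an action and return the successor state with its reward."""
        package, version = action
        successor = {**state, package: version}
        reward = 0.0  # placeholder for the score computed by the resolver
        return successor, reward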

The MCTS based predictor and the TD learning based predictor share core ideas and concepts. Because both RL algorithms expect an opponent in their basic formulation, they had to be adjusted to work well in the resolution process. The main adjustment lies in balancing exploration and exploitation, as there is no "opponent" to play against (formulas such as UCB1, shown below, cannot be applied directly).
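
For reference, UCB1 selects the next action j by maximizing an upper confidence bound on the average reward observed so far:

    \mathrm{UCB1}_j = \bar{x}_j + \sqrt{\frac{2 \ln n}{n_j}}

where \bar{x}_j is the average reward obtained from action j, n_j is the number of times action j has been taken, and n is the total number of plays. The confidence term assumes rewards gathered over repeated games against an opponent, which is precisely what the resolution process lacks.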

Balancing exploration and exploitation in the RL driven resolution process

Exploring the whole state space of all possible software stacks can be a time- and compute-intensive task. Given the size of software stacks for real-world applications, it is often nearly impossible to explore the whole state space in a reasonable time. Resolving all the stacks can result in billions of combinations, each of which additionally needs to be scored.
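
As a rough, hypothetical illustration of this growth (the numbers below are made up, not measured on a real application):

    # If each of 20 direct and transitive dependencies had just 3
    # acceptable versions, enumerating every combination would yield
    # 3 ** 20 candidate stacks -- roughly 3.5 billion.
    candidate_versions_per_package = 3
    package_count = 20
    print(candidate_versions_per_package ** package_count)  # 3486784401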

In these cases, the real opponent to play against is time. To cope with it, the idea of a temperature function from adaptive simulated annealing was borrowed. The temperature function balances exploration and exploitation: if the temperature drops to 0, only exploitation is done; otherwise, exploration is done with a probability derived from the current temperature (the lower the temperature, the lower the probability of exploration). The temperature can take into account the number of software stacks resolved so far, the number of software stacks left to resolve, the number of resolver rounds, and other internal attributes of the resolver.
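
A minimal sketch of such a temperature-driven decision, assuming an exponential cooling schedule and a made-up probability mapping (the adviser's actual temperature function differs):

    import math
    import random

    def temperature(stacks_resolved: int, stacks_to_resolve: int, t0: float = 1.0) -> float:
        """Cool down exponentially as the resolver makes progress."""
        progress = stacks_resolved / stacks_to_resolve
        return t0 * math.exp(-5.0 * progress)

    def should_explore(temp: float) -> bool:
        """Explore with a probability derived from the current temperature.

        At temperature 0 the probability is 0, so only exploitation is
        done; the higher the temperature, the more likely exploration is.
        """
        return random.random() < 1.0 - math.exp(-temp)

    # The temperature, and thus the chance of exploring, drops as
    # resolution progresses.
    for resolved in (0, 250, 500, 750, 1000):
        temp = temperature(resolved, stacks_to_resolve=1000)
        print(f"{resolved} stacks resolved: temperature={temp:.3f}")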

Note

The temperature function can be plotted when the --plot option is supplied to an adviser run.

See this notebook for a demonstration of the resolution process using the TD learning based predictor.