Reinforcement learning driven resolution process¶
Note
Check high level predictor docs for predictor basics.
As described in the introductory section, the whole resolution can be treated as a Markov Decision Process (MDP) (see MDP on Wikipedia). Treating the resolution process this way gave birth to implementations of predictors that are based on reinforcement learning (RL). The upcoming sections will discuss gradient-free methods:
Monte Carlo Tree Search (also known as Monte Carlo learning)
Temporal Difference learning and its n-step variation
MCTS based predictor as well as TD learning based predictor share core ideas and concepts. As both RL algorithms expect an opponent in their basic implementation, these algorithms had to be additionally adjusted to work well in a resolution process. The main adjustment lies in balancing exploration and exploitation as there is no “opponent” to play against (formulas like UCB1 cannot be directly applied).
Balancing exploration and exploitation in RL driven resolution process¶
Exploring the whole state space of all the possible software stacks can be time and computationally intensive task. Given the size of software stacks for any real world applications, it is often nearly impossible to explore the whole state space in a reasonable time. Resolving all the stacks can result in billions of combinations that are additionally scored.
In these cases, the real opponent to play against is time. The idea of temperature function from adaptive simulated annealing was used. The temperature function balances exploration and exploitation. If the temperature drops to 0, only exploitation is done. Otherwise, exploration is done with a certain probability of exploration given the current temperature (the lower temperature is, the lower probability of exploration). The temperature can take into account number of software stacks resolved so far, number of software stacks to resolve, number of resolver rounds and other internal attributes of resolver.
Note
The temperature function can be plotted when --plot
option is supplied to
an adviser run.
See this notebook that shows the resolution process using TD-learning predictor.