Reinforcement Learning (RL) is a Machine Learning approach inspired by how animals or young children learn from their mistakes (or successes). It combines trial-and-error actions with rewards. A typical natural example of this type of learning is a young child who touches the door of a hot oven and burns his or her fingers. The pain from getting burnt represents a strong negative reward that will likely cause the child to avoid touching hot oven doors in the future.

This type of learning is often applied to agent-based simulations in which an agent needs to develop a strategy for successfully interacting with an environment. The level of success is measured as a reward that the agent receives from the environment after performing an action. This reward can be positive or negative (a punishment), and it can be received instantaneously or in the future. One of the main challenges of RL is the correct attribution of rewards to actions that lie far in the past.

Agent, Environment, Action, State, Reward

An RL system includes the following elements: Environment, Agent, State, Observation, Action, Reward. The environment represents the world within which the agent acts. At each point in time, the world and the agent are in a particular state. The agent is the entity that interacts with the environment and tries to maximise its reward. The agent can observe the environment. Depending on the implementation, such an observation is either identical to the state of the environment or different from it. When it is different, an observation might reveal only partial information about the environment’s state to the agent. Based on this observation, the agent chooses an action. An action is an activity conducted by the agent that might cause a change in the state of the environment. Once an action has been executed, the environment responds by returning a reward value to the agent and by assuming a new state.

Information passed between an Agent and Its Environment. ©Shweta Bhatt

In the case of the minimalistic and discrete agent and environment depicted below, the state is the position of the agent in a grid world. In this world, there are a total of 11 different states. An action moves the agent into a neighbouring grid cell, and there are a total of four different actions at the agent’s disposal. Not all RL systems have discrete states and actions; other systems have continuous states, continuous actions, or both.

A Minimalistic Agent and Environment. The environment is a grid world in which discrete locations cause either a positive (green cell), negative (red cell), or no reward (white cell) when the agent moves into them. ©Anish Phadnis
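To make these elements concrete, the following Python sketch implements a grid world of this kind together with a random agent that interacts with it. The 3 x 4 layout, the positions of the reward cells, and the action names are assumptions chosen to mirror the figure; they are not taken from any specific library.

```python
import random

# Grid dimensions and special cells are assumptions chosen to mirror the
# figure: 3 x 4 cells minus one blocked cell gives the 11 states mentioned.
N_ROWS, N_COLS = 3, 4
BLOCKED = {(1, 1)}                       # an inaccessible cell
TERMINAL_REWARDS = {(0, 3): +1.0,        # green cell: positive reward
                    (1, 3): -1.0}        # red cell: negative reward
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """The environment: apply an action and return (next_state, reward, done)."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < N_ROWS and 0 <= c < N_COLS) or (r, c) in BLOCKED:
        r, c = state                     # bumping into a border or wall: stay put
    reward = TERMINAL_REWARDS.get((r, c), 0.0)
    done = (r, c) in TERMINAL_REWARDS
    return (r, c), reward, done

# A random agent interacting with the environment for one episode.
state, done, total_reward = (2, 0), False, 0.0
while not done:
    action = random.choice(list(ACTIONS))       # the agent chooses an action
    state, reward, done = step(state, action)   # the environment responds
    total_reward += reward
print("episode return:", total_reward)
```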

Basic Assumptions and Terminology

The following section describes some of the basic assumptions, terminology and equations used in reinforcement learning.

Reward Hypothesis

The reward hypothesis assumes that the goal of an intelligent agent is to maximise some sort of reward. The agent’s intelligence and all its capabilities are understood as being subservient to this goal.

Markov Property

The Markov Property assumes that an agent and its environment are memoryless. According to this property, the future state that the agent and environment will transition to depends only on the current state and not on any earlier states in the history.

Markov Decision Processes

The change of states due to the actions of an agent can be mathematically framed as a Markov Decision Process (MDP). An MDP is a stochastic process that proceeds in discrete time steps. It defines the probability of transitioning from one state to the next when taking an action, together with the associated reward.

A Markov Decision Process for a Day in the Life of a Student. Circles and rectangles represent states, arrows represent transitions, R stands for rewards, and orange words represent actions. ©David Silver

More information on Markov Decision Processes can be found, for example, in the article “Markov Decision Processes Simplified”.
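As a minimal sketch, an MDP with discrete states and actions can be written down explicitly as transition probabilities and rewards. The two-state “student” example below is invented for illustration and loosely echoes the figure above; it is not taken from the cited article.

```python
# P[state][action] is a list of (probability, next_state, reward) triples.
P = {
    "awake": {
        "study": [(0.8, "awake", +2.0), (0.2, "asleep", 0.0)],
        "party": [(0.5, "awake", +1.0), (0.5, "asleep", -1.0)],
    },
    "asleep": {
        "wake_up": [(1.0, "awake", 0.0)],
    },
}

# The transition probabilities for each (state, action) pair sum to one.
for state, actions in P.items():
    for action, transitions in actions.items():
        assert abs(sum(p for p, _, _ in transitions) - 1.0) < 1e-9
```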

Model-Free and Model-Based Reinforcement Learning

There exist two principal types of RL algorithms that differ in how they deal with state transition probabilities. A model-based RL algorithm tries to model these transition probabilities, while a model-free RL algorithm does not. A model-free RL algorithm therefore follows a pure trial-and-error approach, whereas a model-based RL algorithm relies on trial and error to a lesser degree. All the RL algorithms explained in this article are model-free.

Policy

What an agent tries to learn in an RL scenario is a policy that maximises reward. A policy defines the probability that an agent takes a certain action when the environment is in a certain state. The policy that maximises the total reward is called the optimal policy. In mathematical notation, the policy is denoted by π (pi).

Policies can be implemented with any suitable data structure or method. For simple RL systems that operate on discrete states and actions, a policy can be implemented, for example, as a simple lookup table from states to action probabilities. For more complicated RL systems with a large number of states or actions, or with continuous states or actions, a policy can be implemented, for example, with a neural network. The same applies to the state and action value functions described further down.
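For illustration, a tabular policy for the grid world sketched earlier could be a plain dictionary that maps each state to a probability distribution over actions; the entries below are arbitrary example values.

```python
import random

# A tabular policy: each discrete state maps to a probability distribution over
# actions. The states and probabilities below are illustrative assumptions.
policy = {
    (2, 0): {"up": 0.7, "right": 0.3},
    (1, 0): {"up": 1.0},
    (0, 0): {"right": 1.0},
    # ... one entry per remaining state
}

def sample_action(policy, state):
    """Draw an action for `state` according to the policy's probabilities."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```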

On-Policy and Off-Policy Methods

In on-policy learning, the policy an agent uses to select actions is the same policy that is improved through learning. On-policy methods typically use ε-greedy policies, which select a random action with probability ε and the currently best (greedy) action with probability 1 − ε.
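A minimal sketch of ε-greedy action selection over a table of action values, as described above; the dictionary layout of Q (keys are state-action pairs) is an assumption carried over to the later sketches.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```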

In off-policy learning, an optimal policy is discovered independently of the policy that the agent uses to select actions. Accordingly, these methods have two policies: a behaviour policy that is used for exploration and a target policy that is being improved.

On- and off-policy methods have their advantages and disadvantages. On-policy methods are said to be more stable, while off-policy methods are more sample-efficient and provide a better balance between exploring new policies and exploiting the currently best one.

Discounted Cumulative Reward

In an RL system, the reward an agent receives immediately after performing an action is not the only thing that matters. Rather, an agent tries to maximise the accumulation of all rewards it receives from a particular point in time until the end of what is called an episode (a simulation run). Typically, when summing these rewards, rewards that lie further in the future are weighted less strongly than more immediate ones. This is achieved by multiplying each reward by a discount factor raised to the power of the time step. The discounted cumulative reward is mathematically defined as follows:

Return as Discounted Cumulative Reward. Gt stands for the return at time t, Rt stands for the Reward at time t, gamma is the discount factor. These equations are for an infinitely long episode. ©Wouter van Heeswijk
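Written out in LaTeX, and following the notation of the caption above, the return for an infinitely long episode can be reconstructed as:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad 0 \le \gamma \le 1
```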

Value Functions

Value functions specify the return that can be expected by an agent when following its policy. There are two types of value functions: the state value function and the action value function.

State Value Function

The state value function is the expected return starting from the current state and then following a policy.

State Value Function. V_pi represents the state value function for state s when following policy pi. E_pi stands for the expectation under policy pi. Gt is the discounted cumulative reward starting from time t and with state s. ©Jordi Torres
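In LaTeX, this definition can be reconstructed from the caption as:

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s \,\right]
```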

If states are discrete, state values are associated with all the states an agent and environment can be in. In the case of the simple grid world example from before, the state values might look as follows after a few training iterations:

State Values for a Minimalistic Agent and Environment. ©Michael Avendi

An optimal state value function can be expressed in terms of the optimal policy:

Optimal State Value Function. Max_pi stands for taking the maximum over all policies pi, i.e. following the optimal policy. ©Jordi Torres
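A reconstruction of this equation, consistent with the caption:

```latex
V_{*}(s) = \max_{\pi} V_{\pi}(s)
```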

Action Value Function

The action value function is the expected return starting from the current state, then taking an action, and then following a policy.

Action Value Function. Q_pi represents the action value function for state s and action a when following policy pi. E_pi stands for the expectation under policy pi. Gt is the discounted cumulative reward starting from time t and with state s and action a. ©Jordi Torres
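In LaTeX, this definition can be reconstructed from the caption as:

```latex
Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s, A_t = a \,\right]
```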

If both the states and actions are discrete, action values are associated with all state-action pairs. In the case of the simple grid world example from before, the action values might look as follows after a few training iterations:

Action Values for a Minimalistic Agent and Environment. ©Michael Avendi

An optimal action value function can be expressed in terms of the optimal policy:

Optimal Action Value Function. Max_pi stands for taking the maximum over all policies pi, i.e. following the optimal policy. ©Jordi Torres
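A reconstruction of this equation, consistent with the caption:

```latex
Q_{*}(s, a) = \max_{\pi} Q_{\pi}(s, a)
```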

Bellman Equation

The Bellman equation decomposes a value function into two parts: the immediate reward and the discounted value of the successor state. The second part captures all future rewards and can be formulated recursively.

Bellman Equation for the State Value Function. This equation sums, over all actions a and successor states s’, the probability of taking action a in state s times the probability of reaching s’, multiplied by the reward associated with state s and action a plus the discounted state value of the successor state. ©Jordi Torres
Bellman Equation for the Action Value Function. This equation sums, over all successor states s’, the probability of reaching s’ when taking action a, multiplied by the reward associated with state s and action a plus the discounted action values of the possible successor state-action pairs weighted by their probabilities under the policy. ©Jordi Torres
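A reconstruction of the two equations in LaTeX, assuming the standard formulation with transition probabilities P(s’ | s, a) and rewards R(s, a):

```latex
V_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
             \left[ R(s, a) + \gamma V_{\pi}(s') \right]

Q_{\pi}(s, a) = \sum_{s'} P(s' \mid s, a)
                \left[ R(s, a) + \gamma \sum_{a'} \pi(a' \mid s') \, Q_{\pi}(s', a') \right]
```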

These two equations can also be specified in terms of an optimal policy. In that case, they are called Bellman optimality equations.

Bellman Optimality Equation for the State Value Function. Max_a stands for taking the maximum over all actions a, i.e. choosing the action with the highest expected return. ©Jordi Torres
Bellman Optimality Equation for the Action Value Function. Max_a’ stands for taking the maximum over all successor actions a’. ©Jordi Torres
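A reconstruction of the two optimality equations in the same notation:

```latex
V_{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)
           \left[ R(s, a) + \gamma V_{*}(s') \right]

Q_{*}(s, a) = \sum_{s'} P(s' \mid s, a)
              \left[ R(s, a) + \gamma \max_{a'} Q_{*}(s', a') \right]
```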

TD-Error

TD in TD-Error stands for temporal difference. The TD-Error is frequently used in RL to improve the state or action values during learning. It measures the difference between a newly observed estimate of a state or action value (the immediate reward plus the discounted value of the next state or state-action pair) and the current estimate of that value.

For the state value, the TD-Error is mathematically formulated as follows:

TD-Error for State Value Function.

For the action value, the TD-Error is mathematically formulated as follows:

TD-Error for Action Value Function.
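In LaTeX, and with delta denoting the TD-Error, the two errors can be reconstructed as:

```latex
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)

\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)
```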

Using the TD-Error, the state or action values can be updated during training by adding to the current state or action value the corresponding TD-Error multiplied by the learning rate.

Update of the State Value using the TD-Error. Alpha represents the learning rate.
Update of the Action Value using the TD-Error. Alpha represents the learning rate.
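In LaTeX, and with alpha denoting the learning rate, the two update rules can be reconstructed as:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \, \delta_t

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \, \delta_t
```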

Example RL-Algorithm: SARSA

SARSA is a model-free, on-policy method that works best with discrete states and actions. SARSA stands for the tuple S, A, R, S1, A1. S represents the current state, A the currently taken action, R the reward received after taking action A, S1 the next state, and A1 the next action taken. SARSA operates on action values. The pseudocode for training the SARSA algorithm is as follows:

Pseudocode for SARSA Algorithm. ©Adesh Gautam
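As a runnable counterpart to the pseudocode, here is a minimal tabular SARSA training loop in Python. The helpers env_reset and env_step are assumptions in the style of the grid-world sketch above (env_step returns next state, reward, and a done flag), epsilon_greedy is the helper sketched earlier, and the hyperparameter values are illustrative.

```python
from collections import defaultdict

def sarsa(env_reset, env_step, actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: on-policy TD control over Q[(state, action)] values."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env_reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            # Bootstrap from the action actually chosen by the current policy.
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            td_error = target - Q[(s, a)]
            Q[(s, a)] += alpha * td_error
            s, a = s_next, a_next
    return Q
```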

Example RL-Algorithm: Q-Learning

Q-Learning is a model-free, off-policy method that also works best with discrete states and actions. Q-Learning seeks to find the best action to take given the current state. Accordingly, its update rule for the action value differs slightly from SARSA in how the next action enters the update: instead of using the action suggested by the current policy (as is the case with SARSA), Q-Learning uses the action whose action value in the next state is highest. The pseudocode for training the Q-Learning algorithm is as follows:

Pseudocode for Q-Learning Algorithm. ©Zitao Shen
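For comparison, the same loop with the Q-Learning update, under the same assumed helpers as the SARSA sketch; note that the bootstrap term is the maximum action value in the next state rather than the value of the action the behaviour policy picks next.

```python
from collections import defaultdict

def q_learning(env_reset, env_step, actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning: off-policy TD control over Q[(state, action)] values."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env_reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)   # behaviour policy
            s_next, r, done = env_step(s, a)
            # Bootstrap from the greedy action in the next state (target policy).
            best_next = max(Q[(s_next, a_next)] for a_next in actions)
            target = r + (0.0 if done else gamma * best_next)
            td_error = target - Q[(s, a)]
            Q[(s, a)] += alpha * td_error
            s = s_next
    return Q
```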

Policy Gradient Methods

Both SARSA and Q-Learning are value-based RL algorithms. Policy Gradient methods differ from value-based RL algorithms in that they operate directly on a policy instead of on state or action value functions. Policy-based RL is effective in high-dimensional and stochastic continuous action spaces and for learning stochastic policies. Value-based RL excels in sample efficiency and stability.

Many of the more sophisticated RL setups employ continuous states and actions. A classical example is the “MountainCar” simulation in which a car with a weak motor is located in a valley and needs to learn to accelerate back and forth to reach a mountain top.

Discrete State and Action Space versus Continuous State and Action Space. ©OpenAI Gym

If the policy is represented by a parameterised function such as a neural network, then training consists of optimising these parameters to maximise the expected cumulative reward. The objective of maximising the expected cumulative reward can then be formulated as follows:

Objective Function J. πθ stands for the parameterised policy. Tau (τ) stands for a trajectory, which is the same as an episode. ©Janis Klaise

Policy Gradient methods apply gradient ascent to maximise the objective. The gradient of the objective with respect to the parameters of the policy can be written as follows:

The gradient ∇ of the Objective Function J. ©Janis Klaise

Updating the policy parameters during training then corresponds to:

Updating Policy Parameters using Gradient Ascent. ©Janis Klaise

The gradient can be rewritten using the action probabilities of the parameterised policy function and a parameterised action value function:

The gradient ∇ of the Objective Function J. ©Chris Yoon
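Putting the captions above together, the objective, the gradient-ascent update, and the rewritten gradient can be reconstructed in LaTeX roughly as follows (the last line is the standard policy gradient theorem form):

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ R(\tau) \right]

\theta \leftarrow \theta + \alpha \, \nabla_{\theta} J(\theta)

\nabla_{\theta} J(\theta)
  = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s) \, Q_{\pi_{\theta}}(s, a) \right]
```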

Actor-Critic Methods

Actor-Critic methods employ both a policy function and a value function, both typically implemented as neural networks. The policy function is called the actor, the value function the critic. The actor decides which action to take, and the critic informs the actor how good the action was and how it should be adjusted. The term critic is reminiscent of the critic (discriminator) in Generative Adversarial Networks, but its role is very different in Actor-Critic methods: here, the actor and critic cooperate with each other, and both improve over time.

Actor-Critic Architecture. ©Richard S. Sutton and Andrew G. Barto

During training, instead of dealing with the expectation over the cumulative rewards of all possible trajectories, the cumulative rewards are sampled from only some trajectories. Furthermore, the action value function is replaced by an advantage function. An advantage function returns how much better (or worse) the currently chosen action is compared with what the current value function expects for that state. For this reason, the TD-Error can be used as an estimate of the advantage.

Advantage Function of Actor-Critic Algorithm. ©Dhanoop Karunakaran
Policy-Gradient of the Actor. ©Dhanoop Karunakaran
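In LaTeX, and consistent with the captions above, the advantage and the resulting actor gradient can be reconstructed as:

```latex
A(S_t, A_t) = Q(S_t, A_t) - V(S_t)
            \approx R_{t+1} + \gamma V(S_{t+1}) - V(S_t) = \delta_t

\nabla_{\theta} J(\theta)
  \approx \mathbb{E}\left[ \nabla_{\theta} \log \pi_{\theta}(A_t \mid S_t) \, \delta_t \right]
```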

The pseudocode for training an Actor-Critic RL system is as follows:

Pseudocode for an Actor-Critic RL System. ©Chris Yoon
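To complement the pseudocode, here is a compact Python sketch of one-step Actor-Critic training with linear function approximation and a softmax policy. The feature function, the environment helpers (same (next state, reward, done) step interface as in the earlier sketches, with actions indexed 0 to n_actions − 1), and all hyperparameters are assumptions for illustration; real implementations typically use neural networks for both actor and critic.

```python
import numpy as np

def actor_critic(env_reset, env_step, n_actions, features, n_features,
                 episodes=500, alpha_actor=0.01, alpha_critic=0.05, gamma=0.99):
    """One-step Actor-Critic with linear features and a softmax (actor) policy."""
    theta = np.zeros((n_actions, n_features))   # actor (policy) parameters
    w = np.zeros(n_features)                    # critic (state value) parameters

    def policy_probs(x):
        logits = theta @ x
        logits -= logits.max()                  # numerical stability
        p = np.exp(logits)
        return p / p.sum()

    for _ in range(episodes):
        s, done = env_reset(), False
        while not done:
            x = features(s)                     # feature vector for state s
            probs = policy_probs(x)
            a = np.random.choice(n_actions, p=probs)
            s_next, r, done = env_step(s, a)

            # Critic: one-step TD error, used as an estimate of the advantage.
            v = w @ x
            v_next = 0.0 if done else w @ features(s_next)
            td_error = r + gamma * v_next - v

            # Critic update (semi-gradient TD(0)).
            w += alpha_critic * td_error * x

            # Actor update: policy gradient with the TD error as advantage.
            grad_log_pi = -np.outer(probs, x)   # gradient of log softmax policy
            grad_log_pi[a] += x
            theta += alpha_actor * td_error * grad_log_pi

            s = s_next
    return theta, w
```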