
Reinforcement Learning

A branch of machine learning concerned with actors, or agents, taking actions in an environment in order to maximize some reward that they collect along the way. It is a deliberately broad definition, which is why reinforcement learning techniques can be applied to a wide variety of real-world problems.

Imagine watching someone play a video game. The agent is the player, and the environment is the game. The rewards the player receives (for example, defeating an opponent or completing a level) or fails to receive (for example, stepping into a trap or losing a fight) teach the player how to play better.

Reinforcement learning does not fit neatly into the supervised, unsupervised, or semi-supervised learning categories.

In supervised learning, for example, each decision made by the model is independent and has no bearing on what we see in the future. In reinforcement learning, by contrast, we are interested in our agent's long-term strategy, which may involve sub-optimal decisions at intermediate steps and a trade-off between exploration (of unknown paths) and exploitation of what we already know about the environment.


Let’s go over the fundamental concepts and terminology of Reinforcement Learning.


Agent

An agent is a machine embedded in an environment that takes actions to change the environment's state. Mobile robots, software agents, and industrial controllers are some examples.


Environment

The environment is the system in which the agent perceives and acts. In RL, the environment is commonly formalized as a Markov Decision Process (MDP). An MDP is a tuple (S, A, P, R, γ), where:

  • S is a finite set of states.
  • A is a finite set of actions.
  • P is the state-transition probability matrix.
  • R is the reward function.
  • γ is the discount factor, with γ ∈ [0, 1].

Markov Decision Processes can model a wide range of real-world situations, from a simple chessboard to a much more complicated video game.
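As an illustration, the tuple above can be encoded directly in Python. The two-state MDP below is entirely made up for the example — the states, actions, probabilities, and rewards are hypothetical, not from any particular problem:

```python
# A tiny, hypothetical two-state MDP encoded as plain Python data.
S = ["s0", "s1"]        # finite set of states
A = ["stay", "move"]    # finite set of actions

# P[state][action] -> {next_state: probability}
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}

# R[state][action] -> immediate reward
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

gamma = 0.9             # discount factor in [0, 1]

# Sanity check: every transition distribution must sum to 1.
for s in S:
    for a in A:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```

Representing the MDP as plain dictionaries keeps the correspondence with the (S, A, P, R, γ) definition explicit, though real libraries usually use arrays instead.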

In a game, for example, the rewards are determined by whether the player wins or loses, with winning actions yielding a higher return than losing ones.

Reward Function

The reward function maps states (or state-action pairs) to their corresponding rewards. It is the signal the agent uses to learn how to behave in its environment.

Considerable research goes into designing good reward functions and dealing with sparse rewards, which occur when the environment's rewards are infrequent and do not give the agent enough signal to learn effectively.
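The difference is easy to see in code. The toy task below — an agent at an integer position trying to reach a goal at position 10 — is a hypothetical example, not from the article:

```python
# Hypothetical task: an agent at integer positions tries to reach GOAL.
GOAL = 10

def sparse_reward(position):
    # Reward only at the goal: most steps give no signal at all,
    # which makes learning slow.
    return 1.0 if position == GOAL else 0.0

def dense_reward(position):
    # Shaped reward: being closer to the goal is better, so every
    # step gives the agent some feedback to follow.
    return -abs(GOAL - position)
```

With the sparse version, positions 0 through 9 all look identical to the agent; the dense version ranks them, which is the kind of shaping that reward-function research studies.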


Policy-Based Approach

Policy-based approaches to RL aim to learn the best possible policy directly. A policy model outputs either the best action to take from the current state or a distribution over the possible actions.

Value-based Approach

In value-based methods, we instead want to find the optimal value function, i.e., the maximum value function over all policies. Based on the estimated values, the agent can then choose which actions to take (i.e., which policy to follow).
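Acting on estimated values can be as simple as picking the highest-valued action in each state. The Q table below is a made-up example, assuming 3 states and 2 actions:

```python
import numpy as np

# Hypothetical action-value estimates Q[s, a] for 3 states and 2 actions.
Q = np.array([[0.1, 0.9],
              [0.5, 0.2],
              [0.0, 0.0]])

def greedy_policy(Q):
    # A value-based method derives its policy by acting greedily
    # with respect to the estimated values: argmax over actions.
    return np.argmax(Q, axis=1)

print(greedy_policy(Q))  # action index chosen in each state
```

Note that the policy here is not learned directly; it is read off the value estimates, which is the defining feature of the value-based approach.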


Multi-Armed Bandit

One of the classic problems in RL is the multi-armed bandit. Each action selection is like a play of one of a slot machine's levers, and the rewards are the payoffs for hitting the jackpot.

Python Walk-through


import numpy as np

# Number of bandits
k = 3
# Our action-value estimates
Q = [0.0 for _ in range(k)]
# This is to keep track of the number of times we take each action
N = [0 for _ in range(k)]
# Epsilon value for exploration
eps = 0.1
# True probability of winning for each bandit
p_bandits = [0.45, 0.40, 0.80]
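The setup above can be completed with an epsilon-greedy loop — a sketch of one standard approach, repeating the variables so the snippet runs on its own. The incremental update used is the sample-average rule Q(a) ← Q(a) + (reward − Q(a)) / N(a); the number of steps and the random seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 3
Q = [0.0 for _ in range(k)]           # action-value estimates
N = [0 for _ in range(k)]             # pull counts per bandit
eps = 0.1                             # exploration rate
p_bandits = [0.45, 0.40, 0.80]        # true win probability per bandit

for step in range(2000):
    # Explore with probability eps, otherwise exploit the best estimate.
    if rng.random() < eps:
        a = int(rng.integers(k))
    else:
        a = int(np.argmax(Q))
    # Pull lever a: reward 1 with the bandit's true probability, else 0.
    reward = 1.0 if rng.random() < p_bandits[a] else 0.0
    # Incremental sample-average update of the action value.
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]

print(Q)  # estimates should approach the true probabilities
print(int(np.argmax(Q)))  # index of the arm the agent now prefers
```

With enough pulls the estimates concentrate around the true probabilities, and the agent ends up preferring the third bandit (index 2), whose true win rate of 0.80 is highest.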

Reinforcement learning is a rapidly developing field with much more to explore. Indeed, general-purpose algorithms and models remain active areas of research. The important first step is to become acquainted with concepts such as value functions, policies, and MDPs.


