\newcommand{\Rv}{\mathbf{R}}
\newcommand{\rv}{\mathbf{r}}
\newcommand{\Qv}{\mathbf{Q}}
\newcommand{\Av}{\mathbf{A}}
\newcommand{\Aiv}{\mathbf{Ai}}
\newcommand{\av}{\mathbf{a}}
\newcommand{\xv}{\mathbf{x}}
\newcommand{\Xv}{\mathbf{X}}
\newcommand{\yv}{\mathbf{y}}
\newcommand{\Yv}{\mathbf{Y}}
\newcommand{\zv}{\mathbf{z}}
\newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\betav}{\mathbf{\beta}}
\newcommand{\gv}{\mathbf{g}}
\newcommand{\Hv}{\mathbf{H}}
\newcommand{\dv}{\mathbf{d}}
\newcommand{\Vv}{\mathbf{V}}
\newcommand{\vv}{\mathbf{v}}
\newcommand{\Uv}{\mathbf{U}}
\newcommand{\uv}{\mathbf{u}}
\newcommand{\tv}{\mathbf{t}}
\newcommand{\Tv}{\mathbf{T}}
\newcommand{\TDv}{\mathbf{TD}}
\newcommand{\Tiv}{\mathbf{Ti}}
\newcommand{\Sv}{\mathbf{S}}
\newcommand{\Gv}{\mathbf{G}}
\newcommand{\Zv}{\mathbf{Z}}
\newcommand{\Norm}{\mathcal{N}}
\newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}}
\newcommand{\phiv}{\boldsymbol{\phi}}
\newcommand{\Phiv}{\boldsymbol{\Phi}}
\newcommand{\Sigmav}{\boldsymbol{\Sigma}}
\newcommand{\Lambdav}{\boldsymbol{\Lambda}}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\dimensionbar}[1]{\underset{#1}{\operatorname{|}}}
\newcommand{\grad}{\mathbf{\nabla}}
\newcommand{\ebx}[1]{e^{\betav_{#1}^T \xv_n}}
\newcommand{\eby}[1]{e^{y_{n,#1}}}
\newcommand{\Fv}{\mathbf{F}}
\newcommand{\ones}[1]{\mathbf{1}_{#1}}
====== Reinforcement Learning for Two-Player Games ======
How does Tic-Tac-Toe differ from the maze problem?
* Different state and action sets.
* Two players rather than one.
* Reinforcement is 0 on every move until the end of the game, when it is 1 for a win, 0 for a draw, or -1 for a loss.
* Maximizing the sum of reinforcements rather than minimizing it.
* Anything else?
===== Representing the Q Table =====
The state is the board configuration. There are $3^9$ of them, though
not all are reachable. Is this too big?
It is a bit less than 20,000. Not bad. Is this the full size of the Q table?
No. We must add the action dimension. There are at most 9 actions,
one for each cell on the board. So the Q table will contain about
$20,000 \cdot 9$ values, or about 180,000. No worries.
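As a quick check of these counts:

nStates = 3 ** 9         # 19683 board configurations, including unreachable ones
nEntries = nStates * 9   # 177147 (state, action) pairs at most
print(nStates, nEntries)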
Instead of thinking about the Q table as a three-dimensional array, as
we did last time, let's be more pythonic and use a dictionary. Use
the pair of current state and action as the key, and the value
associated with that key is the Q value for taking that action in
that state.
We still need a way to represent a board.
How about an array of characters? So
X | | O
---------
| X | O
---------
X | |
would be
board = np.array(['X',' ','O', ' ','X','O', 'X',' ',' '])
The initial board would be
board = np.array([' ']*9)
We can represent a move as an index, 0 to 8, into this array.
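Putting these pieces together, here is a minimal sketch (the choice of ''move = 4'' is just an illustration) of a board, a move, and the corresponding Q-table entry:

import numpy as np

Q = {}                         # the Q table as a dictionary: (board tuple, move) -> Q value
board = np.array([' ']*9)      # the empty initial board
move = 4                       # X takes the center cell

Q[(tuple(board), move)] = 0    # numpy arrays are not hashable, so the board becomes a tuple
board[move] = 'X'              # apply the move to the board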
What should the reinforcement values be?
How about 0 every move except when X wins, with a reinforcement of 1,
and when O wins, with a reinforcement of -1.
For the above board, let's say we, meaning Player X, prefer the move to
index 3. In fact, this always results in a win. So the Q value for
the move to 3 should be 1. What other Q values do you know?
If we don't play a move to win, O could win in one move. So the other
moves might have Q values close to -1, depending on the skill of
Player O. In the following discussion we will be using a random
player for O, so the Q value for a move other than 8 or 3 will be
close to but not exactly -1.
How will these values be stored? We could assign them like
for i, v in enumerate([0,-0.8,0, 1,0,0, 0,-0.7,0.1]):
    Q[(tuple(board), i)] = v
To update a value for move ''m'', we can just
Q[(tuple(board),m)] = some new value
If a key has not yet been assigned, we must assign it something before
accessing it.
if (tuple(board), move) not in Q:
    Q[(tuple(board), move)] = 0
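When we only need to read a value, ''dict.get'' with a default of 0 does the same thing in one line; the greedy-move code in the main loop below uses this form:

q = Q.get((tuple(board), move), 0)   # 0 if this (board, move) pair has never been assigned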
===== Agent-World Interaction Loop =====
For our agent to interact with its world, we must implement
* Initialize Q.
* Set initial state, as empty board.
* Repeat:
* Agent chooses next X move.
* If X wins, set Q(board,move) to 1.
* Else, if board is full, set Q(board,move) to 0.
* Else, let O take a move.
* If O won, update Q(board,move) by adding ''rho'' * (-1 - Q(board,move)).
* For all cases, update Q(oldboard,oldmove) by adding ''rho'' * (Q(board,move) - Q(oldboard,oldmove)), as written out below.
* Shift current board and move to old ones.
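The last two items are temporal-difference updates: each moves a Q value a fraction $\rho$ (''rho'' in the code below, the learning rate) of the way toward a target,
\begin{align*}
Q(\text{board},\text{move}) &\leftarrow Q(\text{board},\text{move}) + \rho \bigl(-1 - Q(\text{board},\text{move})\bigr),\\
Q(\text{oldboard},\text{oldmove}) &\leftarrow Q(\text{oldboard},\text{oldmove}) + \rho \bigl(Q(\text{board},\text{move}) - Q(\text{oldboard},\text{oldmove})\bigr),
\end{align*}
where the first form applies when O has just won, and the second carries the new estimate back to the previous X move.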
===== Now the Python =====
First, here is the result of running many games.
{{ Notes:tttResult.png?500 }}
Now, let's get some function definitions out of the way.
import numpy as np
import matplotlib.pyplot as plt
import random
from copy import copy
######################################################################
def winner(board):
combos = np.array((0,1,2, 3,4,5, 6,7,8, 0,3,6, 1,4,7, 2,5,8, 0,4,8, 2,4,6))
    # True if any row, column, or diagonal contains three X's or three O's
    return np.any(np.logical_or(np.all('X' == board[combos].reshape((-1,3)), axis=1),
                                np.all('O' == board[combos].reshape((-1,3)), axis=1)))
def printBoard(board):
    # Display the board in a 3x3 layout
    print("""
%c|%c|%c
-----
%c|%c|%c
-----
%c|%c|%c""" % tuple(board))
def plotOutcomes(outcomes,nGames,game):
if game==0:
return
plt.clf()
nBins = 100
    nPer = nGames // nBins   # number of games per bin; must be an integer for reshape
outcomeRows = outcomes.reshape((-1,nPer))
outcomeRows = outcomeRows[:int(game/float(nPer))+1,:]
avgs = np.mean(outcomeRows,axis=1)
plt.subplot(2,1,1)
xs = np.linspace(nPer,game,len(avgs))
plt.plot(xs, avgs)
plt.xlabel('Games')
plt.ylabel('Result (0=draw, 1=X win, -1=O win)')
plt.subplot(2,1,2)
plt.plot(xs,np.sum(outcomeRows==-1,axis=1),'r-',label='Losses')
plt.plot(xs,np.sum(outcomeRows==0,axis=1),'b-',label='Draws')
plt.plot(xs,np.sum(outcomeRows==1,axis=1),'g-',label='Wins')
plt.legend(loc="center")
plt.draw()
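As a quick sanity check (not part of the original listing), ''winner'' can be applied to the example board from earlier; note that it only reports whether someone has three in a row, not who:

board = np.array(['X',' ','O', ' ','X','O', 'X',' ',' '])
print(winner(board))     # False: no one has three in a row yet
board[3] = 'X'
print(winner(board))     # True: X completes the 0,3,6 column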
Now some initializations. How do we initialize the Q table???
plt.ion()
nGames = 10000 # number of games
rho = 0.1 # learning rate
epsilonExp = 0.999 # rate of epsilon decay
outcomes = np.zeros(nGames) # 0 draw, 1 X win, -1 O win
Q = {} # initialize Q dictionary
epsilon = 1.0 # initial epsilon value
showMoves = False # flag to print each board change
We must talk a bit about ''epsilon''. This is the probability that a
random action is taken each step. If it is 1, then every action is
random. Initially we want random actions, but as experience is
gained, this should be reduced, towards zero. An easy way to do this
is to exponentially decay it towards zero, like
\begin{align*}
\epsilon_0 &= 1\\
\epsilon_{k+1} &= 0.999\,\epsilon_k
\end{align*}
or some other decay factor close to 1.
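To get a feel for how fast this decays, here is a quick check (using the ''nGames'' and ''epsilonExp'' values from above) of what $\epsilon$ becomes after all 10,000 games:

epsilon = 1.0
for _ in range(10000):
    epsilon *= 0.999
print(epsilon)           # roughly 4.5e-5, so late games are almost entirely greedy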
Now for the main loop over multiple games. The basic structure of the
main loop is
for game in range(nGames): # iterate over multiple games
epsilon *= epsilonExp
step = 0
board = np.array([' ']*9)
done = False
while not done:
# play one game
Now again, with the guts of the loop filled in.
for game in range(nGames): # iterate over multiple games
epsilon *= epsilonExp
step = 0
board = np.array([' ']*9)
done = False
while not done:
step += 1
# X's turn
validMoves = np.where(board==' ')[0]
if np.random.uniform() < epsilon:
# Random move
move = validMoves[random.sample(range(len(validMoves)),1)]
            move = move[0]  # to convert to a scalar
else:
# Greedy move. Collect Q values for valid moves from current board.
# Select move with highest Q value
qs = []
for m in validMoves:
qs.append(Q.get((tuple(board),m), 0))
move = validMoves[np.argmax(np.asarray(qs))]
        if (tuple(board),move) not in Q:
Q[(tuple(board),move)] = 0
boardNew = copy(board)
boardNew[move] = 'X'
if showMoves:
printBoard(boardNew)
if winner(boardNew):
# X won
Q[(tuple(board),move)] = 1
done = True
outcomes[game] = 1
elif not np.any(boardNew == ' '):
# Game over. No winner.
Q[(tuple(board),move)] = 0
done = True
outcomes[game] = 0
else:
# O's turn. Random player
validMoves = np.where(boardNew==' ')[0]
moveO = validMoves[random.sample(range(len(validMoves)),1)]
boardNew[moveO] = 'O'
if showMoves:
printBoard(boardNew)
if winner(boardNew):
# O won
Q[(tuple(board),move)] += rho * (-1 - Q[(tuple(board),move)])
done = True
outcomes[game] = -1
if step > 1:
Q[(tuple(boardOld),moveOld)] += rho * (Q[(tuple(board),move)] - Q[(tuple(boardOld),moveOld)])
boardOld,moveOld = board,move
board = boardNew
    if game % (nGames//10) == 0 or game == nGames-1:
plotOutcomes(outcomes,nGames,game)
print "Outcomes",np.sum(outcomes==0),"draws,",\
np.sum(outcomes == 1),"X wins,", np.sum(outcomes==-1),"O wins"
Let's discuss all of the lines where the Q
function is modified.
That's not much code. Now let's see the rest of it.
Ha ha! There's nothing else! Python conciseness!
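Once training finishes, the ''Q'' dictionary itself is the learned policy. Here is a small sketch, not part of the notes above, of how you might play greedily from it for any board encountered during training:

def greedyMove(Q, board):
    # choose the valid move with the highest learned Q value, defaulting to 0 for unseen pairs
    validMoves = np.where(board == ' ')[0]
    qs = np.array([Q.get((tuple(board), m), 0) for m in validMoves])
    return validMoves[np.argmax(qs)]

board = np.array(['X',' ','O', ' ','X','O', 'X',' ',' '])
print(greedyMove(Q, board))   # with a well-trained Q, this should be 3 or 8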
You should be able to download and paste these pieces together into
running code. You can also adapt it to many other two-player games by
first extracting the parts specific to Tic-Tac-Toe into functions, so
that the basic loop need not be modified to apply it to another game.
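For example, the Tic-Tac-Toe-specific pieces might be pulled out into functions like these (the names are only suggestions, not from the notes above, and they reuse ''winner'' and ''copy'' from the listing):

def initialBoard():
    return np.array([' ']*9)

def validMoves(board):
    return np.where(board == ' ')[0]

def makeMove(board, move, player):
    # player is 'X' or 'O'
    boardNew = copy(board)
    boardNew[move] = player
    return boardNew

def gameOver(board):
    return winner(board) or not np.any(board == ' ')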
Now let's discuss the reinforcement learning assignment,
[[Assignments:assignment5-reinforcement-learning|Assignment 5]].