{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A5 Pole Balancing with Reinforcement Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this assignment, you will write code that uses reinforcement learning to learn to balance a pole. Follow the robot arm example in lecture notes `19 More Tic-Tac-Toe and a Simple Robot Arm`.\n", "\n", "Download this implementation, [cartpole_play.zip](https://www.cs.colostate.edu/~anderson/cs545/notebooks/cartpole_play.zip), of the pole-balancing problem. Unzip this file to get `cartpole_play.py`. This code requires the Python packages `box2d` and `pygame`. You may install these using\n", "```\n", "conda install conda-forge::box2d-py\n", "pip install pygame\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After installing these packages and unzipping `cartpole_play.zip`, you should be able to run\n", "```\n", "python cartpole_play.py\n", "```\n", "to see the cart-pole animation. Push left and right on the cart with your keyboard arrow keys to try to balance the pole." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define the class `CartPole` in a file named `cartpole.py`, following the `robot.py` example in notes 19. Copy the `QnetAgent` class from `robot.py` into your `cartpole.py` file and modify it as needed to call the corresponding functions in `cartpole_play.py`. Define the `Experiment` class using the example in notes 19.\n", "\n", "To define your `CartPole` class, you must define the critical environment functions from `rl_framework.py`. Try to design a reinforcement function that will lead to successful balancing. You should only need the pole angle, which is zero when the pole is balanced. Your Qnet can then be trained to minimize the sum of the absolute values of the reinforcements. Alternatively, you could define the reinforcement as -1 if the absolute value of the angle is greater than $0.75\pi$, 1 if it is less than $0.25\pi$, and zero otherwise. 
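\n",
"\n",
"As a quick, hedged illustration (a standalone sketch for a few sample angles, not part of the required class), this piecewise reinforcement behaves as follows:\n",
"```\n",
"from math import pi\n",
"\n",
"def reinforcement(angle):\n",
"    m = abs(angle)\n",
"    if m > 0.75 * pi:\n",
"        return -1   # large angle magnitude\n",
"    elif m < 0.25 * pi:\n",
"        return 1    # small angle magnitude\n",
"    return 0\n",
"\n",
"print([reinforcement(a) for a in (0.0, 0.5 * pi, pi)])  # [1, 0, -1]\n",
"```\n",
"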
\n",
"\n",
"To be clear and to help you get started, the structure of your `cartpole.py` file should look like\n",
"```\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import copy\n",
"from math import pi\n",
"import time\n",
"import pickle\n",
"\n",
"import neuralnetworksA4 as nn\n",
"import rl_framework as rl  # for abstract classes rl.Environment and rl.Agent\n",
"\n",
"import cartpole_play as cp\n",
"\n",
"class CartPole(rl.Environment):\n",
"\n",
"    def __init__(self):\n",
"        self.cartpole = cp.CartPole()\n",
"        self.valid_action_values = [-1, 0, 1]\n",
"        # self.observation_size = 4  # x xdot a adot\n",
"        self.observation_size = 5  # x xdot sin(a) cos(a) adot NEW\n",
"        self.action_size = 1\n",
"        # self.observation_means = [0, 0, 0, 0]\n",
"        self.observation_means = [0, 0, 0, 0, 0]  # NEW\n",
"        # self.observation_stds = [1, 1, 1, 1]\n",
"        self.observation_stds = [1, 1, 1, 1, 1]  # NEW\n",
"        self.action_means = [0.0]\n",
"        self.action_stds = [0.1]\n",
"        self.Q_means = [0.5 * pi]\n",
"        self.Q_stds = [2]\n",
"\n",
"    def initialize(self):\n",
"        self.cartpole.cart.position[0] = np.random.uniform(-2., 2.)\n",
"        self.cartpole.cart.linearVelocity[0] = 0.0\n",
"        self.cartpole.pole.angle = 0  # hanging down\n",
"        self.cartpole.pole.angularVelocity = 0.0\n",
"\n",
"    def reinforcement(self):\n",
"\n",
"        # Use the raw angle from sense(): observe() below returns sin(a)\n",
"        # and cos(a), so its element 2 is sin(a), not the angle itself.\n",
"        x, xdot, a, adot = self.cartpole.sense()\n",
"        angle_magnitude = np.abs(a)\n",
"\n",
"        if angle_magnitude > pi * 0.75:\n",
"            return -1\n",
"        elif angle_magnitude < pi * 0.25:\n",
"            return 1\n",
"        else:\n",
"            return 0\n",
"\n",
"        # alternative:\n",
"        # return np.abs(a)  # to be minimized\n",
"\n",
"    # add other functions to your CartPole class as needed\n",
"    ...\n",
"\n",
"    def observe(self):  # NEW\n",
"        x, xdot, a, adot = self.cartpole.sense()  # NEW\n",
"        return x, xdot, np.sin(a), np.cos(a), adot  # NEW\n",
"\n",
"    ...\n",
"\n",
"######################################################################\n",
"\n",
"class QnetAgent(rl.Agent):\n",
"
\n",
"    def initialize(self):\n",
"        env = self.env\n",
"        ni = env.observation_size + env.action_size\n",
"        self.Qnet = nn.NeuralNetwork(ni, self.n_hiddens_each_layer, 1)\n",
"        self.Qnet.X_means = np.array(env.observation_means + env.action_means)\n",
"        self.Qnet.X_stds = np.array(env.observation_stds + env.action_stds)\n",
"        self.Qnet.T_means = np.array(env.Q_means)\n",
"        self.Qnet.T_stds = np.array(env.Q_stds)\n",
"\n",
"    # add other functions to your QnetAgent class as needed\n",
"    ...\n",
"\n",
"######################################################################\n",
"\n",
"class Experiment:\n",
"\n",
"    def __init__(self, environment, agent):\n",
"\n",
"        self.env = environment\n",
"        self.agent = agent\n",
"\n",
"        self.env.initialize()\n",
"        self.agent.initialize()\n",
"\n",
"    def train(self, parms, verbose=True):\n",
"\n",
"        n_batches = parms['n_batches']\n",
"        n_steps_per_batch = parms['n_steps_per_batch']\n",
"        n_epochs = parms['n_epochs']\n",
"        method = parms['method']\n",
"        learning_rate = parms['learning_rate']\n",
"        final_epsilon = parms['final_epsilon']\n",
"        epsilon = parms['initial_epsilon']\n",
"        gamma = parms['gamma']\n",
"\n",
"        ...\n",
"\n",
"    # add other functions to your Experiment class as needed\n",
"    ...\n",
"```\n",
"\n",
"\n",
"Run your code as in this example:\n",
"```\n",
"import cartpole\n",
"\n",
"cartpole_env = cartpole.CartPole()\n",
"agent = cartpole.QnetAgent(cartpole_env, [20, 20], 'max')\n",
"\n",
"experiment = cartpole.Experiment(cartpole_env, agent)\n",
"\n",
"outcomes = experiment.train(parms)\n",
"```\n",
"with `parms` being parameters used by `Experiment`, such as\n",
"```\n",
"parms = {\n",
"    'n_batches': 2000,\n",
"    'n_steps_per_batch': 100,\n",
"    'n_epochs': 40,\n",
"    'method': 'scg',\n",
"    'learning_rate': 0.01,\n",
"    'initial_epsilon': 0.8,\n",
"    'final_epsilon': 0.1,\n",
"    'gamma': 1.0\n",
"}\n",
"```\n",
"These parameter values have not been tuned to solve this problem well. 
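\n",
"\n",
"One common way to move `epsilon` from `initial_epsilon` to `final_epsilon` over the batches is a multiplicative decay. The following is only a hedged sketch, not a required design; the names follow the `parms` dictionary above and the training details are elided:\n",
"```\n",
"initial_epsilon, final_epsilon, n_batches = 0.8, 0.1, 2000\n",
"\n",
"# factor chosen so epsilon reaches final_epsilon after n_batches decays\n",
"decay = (final_epsilon / initial_epsilon) ** (1 / n_batches)\n",
"\n",
"epsilon = initial_epsilon\n",
"for batch in range(n_batches):\n",
"    # ... collect n_steps_per_batch samples and update the Qnet here ...\n",
"    epsilon *= decay\n",
"\n",
"print(round(epsilon, 3))  # 0.1\n",
"```\n",
"\n",
"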
For the `verbose` output while training, print the mean of all reinforcements received so far." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test the performance of a trained agent, define a test function in your `Experiment` class like the following.\n",
"```\n",
"    def test(self, n_steps):\n",
"\n",
"        states_actions = []\n",
"        sum_r = 0.0\n",
"\n",
"        for initial_angle in [0, pi/2.0, -pi/2.0, pi]:\n",
"\n",
"            self.env.cartpole.cart.position[0] = 0\n",
"            self.env.cartpole.cart.linearVelocity[0] = 0.0\n",
"            self.env.cartpole.pole.angle = initial_angle\n",
"            self.env.cartpole.pole.angularVelocity = 0.0\n",
"\n",
"            for step in range(n_steps):\n",
"\n",
"                obs = self.env.observe()\n",
"                action = self.agent.epsilon_greedy(epsilon=0.0)  # greedy: no exploration\n",
"                states_actions.append([*obs, action])\n",
"                self.env.act(action)\n",
"                r = self.env.reinforcement()\n",
"                sum_r += r\n",
"\n",
"        return sum_r / (n_steps * 4), np.array(states_actions)\n",
"```\n",
"This function performs four runs, each starting at a different `initial_angle` and lasting `n_steps` steps. The function returns the mean of all reinforcement values over all four runs, and an array of all states and actions. You can run this function at the end of each training batch, collect the mean test reinforcement for each batch, and, during `verbose` printing, include the mean of these test reinforcements so far. 
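\n",
"\n",
"Collecting those per-batch test means might look like the following hedged sketch, where `fake_test` stands in for `Experiment.test` (assumed to return `(mean_r, states_actions)`) so the snippet runs on its own:\n",
"```\n",
"def fake_test(batch):\n",
"    # stand-in for experiment.test(); pretend performance improves\n",
"    return min(1.0, batch / 10.0), None\n",
"\n",
"test_means = []\n",
"for batch in range(20):\n",
"    mean_r, _ = fake_test(batch)\n",
"    test_means.append(mean_r)\n",
"    if batch % 10 == 0:\n",
"        mean_so_far = sum(test_means) / len(test_means)\n",
"        print(f'batch {batch:4d}  mean test r so far {mean_so_far:.3f}')\n",
"```\n",
"\n",
"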
You may also use the mean reinforcement value to judge how well a particular set of parameter values works, printing a table like\n",
"```\n",
"           nh    nb   ns  ne  init epsilon  test r sum  exec minutes\n",
"177  [20, 20]  2000  200  40           0.8      0.1335      1.307003\n",
"64       [20]  2000  100   5           0.5      0.0650      0.245404\n",
"201  [40, 40]  2000  100  10           0.8      0.0295      0.448583\n",
"34       [10]  2000  200   5           0.5      0.0050      0.439274\n",
"76       [20]  2000  200   2           0.5     -0.0215      0.409806\n",
"...\n",
"```\n",
"I included execution times for each set of parameters just to see how long each training run took.\n",
"\n",
"To see how well your agent is performing, plot some of the states returned by a final call to the `test` function. For example, you can plot the sine of the pole angle (column 2 of the observations) for each step with\n",
"```\n",
"plt.plot(states_actions[:, 2])\n",
"```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Explain the design of your code, the experiments you ran, and how successful you were. Also describe any difficulties you ran into." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no grading script for this assignment. You will be graded on the effort you put into running your experiments and the amount of detail you provide in your descriptions.\n",
"\n",
"Check in a zip or tar file containing\n",
"- your A5 notebook\n",
"- `cartpole.py`\n",
"- `neuralnetworksA4.py`\n",
"- `optimizers.py`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Extra Credit\n",
"\n",
"During training with various parameter values, save your best `Qnet` in a file using `pickle`. Once you have saved a good agent, illustrate its performance by loading it from your `pickle` file and using it to control an animation of the cart-pole, with the code in `cartpole_play.py` as a guide.\n",
"\n",
"When you check in your A5 solution, include the pickle file containing your best `Qnet`. 
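\n",
"\n",
"Saving and reloading with `pickle` can be as simple as the following hedged sketch (the filename and the stand-in object are assumptions; in your code you would pickle your trained `Qnet` instead):\n",
"```\n",
"import pickle\n",
"\n",
"best_qnet = {'layers': [20, 20]}   # stand-in for your trained Qnet\n",
"\n",
"with open('best_qnet.pkl', 'wb') as f:\n",
"    pickle.dump(best_qnet, f)\n",
"\n",
"with open('best_qnet.pkl', 'rb') as f:\n",
"    loaded = pickle.load(f)\n",
"\n",
"print(loaded == best_qnet)  # True\n",
"```\n",
"\n",
"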
Your notebook must include code at the end for loading this file, running `test` with an agent using your `Qnet`, plotting the angle during the test runs, and animating the cart-pole being controlled by your agent with your best `Qnet`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }