Due date: Friday 9/17 at 11:59pm
In this assignment we will use the following datasets:
In this assignment you will work with several variants of the perceptron algorithm:
In each case make sure that your implementation of the classifier includes a bias term (in slide set 2 and on page 7 of the book you will find guidance on how to add a bias term to an algorithm that is expressed without one).
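One common way to do this (shown here only as an illustrative sketch in NumPy; it may differ in detail from the construction given in the slides or the book) is to append a constant feature of 1 to every example and let the extra weight coordinate play the role of the bias:

import numpy as np

def add_bias_column(X):
    # Append a constant feature of 1 to every example so that the last weight
    # coordinate acts as the bias term b.
    return np.hstack([X, np.ones((X.shape[0], 1))])

# A weight vector w learned on add_bias_column(X) then satisfies
# w @ [x, 1] = w[:-1] @ x + w[-1], i.e. w[-1] is the bias.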
Before we get to the adatron, we will derive an alternative form of the perceptron algorithm — the dual perceptron algorithm. All we need to look at is the weight update rule:
$$\mathbf{w} \rightarrow \mathbf{w} + \eta y_i \mathbf{x}_i.$$
This update is performed whenever example $i$ is misclassified by the current weight vector. The thing to notice is that the weight vector is always a weighted combination of the training examples: it starts out that way (as the zero vector), and each update preserves that property. So rather than representing $\mathbf{w}$ explicitly, all we need to do is keep track of how much each training example contributes to the weight vector, i.e. we will express it as:
$$\mathbf{w} = \sum_{i=1}^N \alpha_i y_i \mathbf{x}_i,$$
where the $\alpha_i$ are non-negative numbers that describe the magnitude of the contribution $\mathbf{x}_i$ makes to the weight vector, and $N$ is the number of training examples.
Therefore to initialize $\mathbf{w}$ to 0, we simply initialize $\alpha_i = 0$ for $i = 1,\ldots,N$. In terms of the variables $\alpha_i$, the perceptron update rule becomes:
$$\alpha_i = \alpha_i + \eta,$$
and you can always retrieve the weight vector using its expansion in terms of the $\alpha_i$.
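As a concrete illustration, here is a minimal NumPy sketch of one epoch of the dual perceptron (the helper name and variable names are hypothetical; labels are assumed to be in $\{-1,+1\}$ and the bias column already appended to X); it stores only the $\alpha_i$ and reconstructs $\mathbf{w}$ from the expansion above:

import numpy as np

def dual_perceptron_epoch(X, y, alpha, eta=1.0):
    # One pass over the training set in the dual representation.
    # X: (N, d) examples, y: (N,) labels in {-1, +1},
    # alpha: (N,) non-negative coefficients, one per training example
    # (initialized with alpha = np.zeros(N) before the first epoch).
    for i in range(X.shape[0]):
        w = (alpha * y) @ X              # w = sum_j alpha_j y_j x_j
        if y[i] * (w @ X[i]) <= 0:       # example i is misclassified
            alpha[i] += eta              # dual form of w <- w + eta * y_i * x_i
    return alpha

# The primal weight vector can be recovered at any time via w = (alpha * y) @ X.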
Now we're ready for the adatron - the only difference is in the initialization and update equation.
Initialization:
$\alpha_i = 1$ for $i = 1,\ldots,N$
As in the perceptron, we run the algorithm until convergence or until a fixed number of epochs has passed (an epoch is one pass over all the training data). An epoch of training consists of the following procedure:
for each training example $i=1,\ldots,N$ perform the following steps:
$\gamma = y_i \, \mathbf{w}^{T} \mathbf{x}_i$
$\delta\alpha = \eta \, (1 - \gamma)$
if $(\alpha_i + \delta\alpha < 0)$ :
$~~~~\alpha_i = 0$
else :
$~~~~\alpha_i = \alpha_i + \delta\alpha$
The variable $\eta$ plays the role of the learning rate employed in the perceptron algorithm, and $\delta\alpha$ is the proposed change in $\alpha_i$. We note that the adatron tries to maintain a sparse representation in terms of the training examples by keeping many $\alpha_i$ equal to zero. The adatron converges to a special case of the SVM algorithm that we will learn about later in the semester; that algorithm tries to maximize the margin with which each example is classified, which is captured by the variable $\gamma$ here (notice that the proposed change in each $\alpha_i$ becomes smaller as the margin increases towards 1).
Note: if you observe overflow issues when running the adatron, add an upper bound on the value of $\alpha_i$.
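For illustration only, here is a minimal NumPy sketch of one adatron epoch following the pseudocode above (the helper name is hypothetical; labels are assumed to be in $\{-1,+1\}$, the bias column already appended to X, and the optional cap on $\alpha_i$ follows the note above):

import numpy as np

def adatron_epoch(X, y, alpha, eta=0.1, alpha_max=None):
    # One adatron epoch; alpha is initialized with np.ones(N) before the first call.
    # alpha_max is an optional upper bound on alpha_i to avoid overflow.
    for i in range(X.shape[0]):
        w = (alpha * y) @ X                  # current weight vector from the expansion
        gamma = y[i] * (w @ X[i])            # margin of example i
        delta = eta * (1.0 - gamma)          # proposed change in alpha_i
        if alpha[i] + delta < 0:
            alpha[i] = 0.0                   # keep alpha_i non-negative
        else:
            alpha[i] = alpha[i] + delta
        if alpha_max is not None:
            alpha[i] = min(alpha[i], alpha_max)
    return alpha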
Here's what you need to do:
Whenever we train a classifier it is useful to know whether we have collected enough data for accurate classification. A good way of determining that is to construct a learning curve: a plot of classifier performance (i.e. its error) as a function of the number of training examples. Plot a learning curve for the perceptron algorithm (with bias) using the Gisette dataset. The x-axis (number of training examples) should be on a logarithmic scale, with values such as 10, 20, 40, 80, 200, 400, 800; use numbers that are appropriate for the dataset at hand, choosing values that illustrate the variation you observe. What can you conclude from the learning curve you have constructed? Make sure that you use a fixed test set to evaluate performance while varying the size of the training set.
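For illustration, a minimal sketch of the learning-curve loop (assuming NumPy and matplotlib, a fixed test set X_test, y_test, and a hypothetical perceptron_train routine; the sizes are placeholders to be adjusted):

import numpy as np
import matplotlib.pyplot as plt

train_sizes = [10, 20, 40, 80, 200, 400, 800]   # adjust to the dataset at hand
errors = []
for n in train_sizes:
    X_sub, y_sub = X_train[:n], y_train[:n]     # first n training examples
    w = perceptron_train(X_sub, y_sub)          # hypothetical perceptron trainer
    y_pred = np.sign(X_test @ w)                # predictions on the fixed test set
    errors.append(np.mean(y_pred != y_test))    # test error rate

plt.plot(train_sizes, errors, marker='o')
plt.xscale('log')                               # logarithmic x-axis
plt.xlabel('number of training examples')
plt.ylabel('test error')
plt.savefig('learning_curve.png')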
In this section we will explore the effect of normalizing the data, focusing on normalization of features. The simplest form of normalization is to scale each feature to be in the range [-1, 1]. We'll call this scaling.
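As a rough sketch (assuming NumPy; the helper names are hypothetical), scaling can be done by computing per-feature minima and maxima on the training set and applying the same linear map to the test set:

import numpy as np

def fit_scaling(X_train):
    # Per-feature minimum and maximum, computed on the training data only.
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_scaling(X, x_min, x_max):
    # Map each feature linearly into [-1, 1]; features that are constant on
    # the training set are mapped to 0 to avoid division by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    scaled = 2.0 * (X - x_min) / span - 1.0
    return np.where(x_max > x_min, scaled, 0.0)

# x_min, x_max = fit_scaling(X_train)
# X_train_scaled = apply_scaling(X_train, x_min, x_max)
# X_test_scaled  = apply_scaling(X_test,  x_min, x_max)   # reuse training statistics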
Here's what you need to do:
Submit your report via Canvas. Python code can be displayed in your report if it is short and helps explain what you have done. The sample LaTeX document provided in assignment 1 shows how to display Python code. Submit the Python code that was used to generate the results as a file called assignment2.py
(you can split the code into several .py files; Canvas allows you to submit multiple files). Typing
$ python assignment2.py
should generate all the tables/plots used in your report.
A few general guidelines for this and future assignments in the course:
We will take off points if these guidelines are not followed.
Grading sheet for assignment 2

Part 1: 60 points.
  (25 points): Correct implementation of the classifiers
  (10 points): Good protocol for evaluating classifier accuracy; results are provided in a clear and concise way
  (10 points): Discussion of the results

Part 2: 20 points.
  (15 points): Learning curves are correctly generated and displayed in a clear and readable way
  ( 5 points): Discussion of the results

Part 3: 20 points.
  ( 5 points): How to perform data scaling
  (10 points): Comparison of normalized/raw data results; discussion of results
  ( 5 points): Range of features after standardization