
Assignment 6: Linear Regression with Theano

This assignment is purely optional!

Due: November 27th at 11:59pm

In this assignment you will get your hands dirty with Theano, a framework that has been the basis of a lot of work in deep learning. Writing code in Theano is very different from what we are accustomed to. You had a taste of it in class, where we saw how to program logistic regression. Your task for this assignment is to implement ridge regression (again!) and explore some variants of it.

Recall that ridge regression is the regularized form of linear regression, and is the linear function

$$h(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x} + b$$

that minimizes the cost function

$$ \frac{1}{N}\sum_{i=1}^N (h(\mathbf{x}_i) - y_i)^2 + \lambda ||\mathbf{w}||_2^2.$$

The first term is the average loss incurred over the training set, and the second term is the regularization term. The regularization we have considered thus far uses the squared L2 norm, $||\cdot||_2^2$. As discussed in class (see the second slide set that discusses SVMs), there are other options, the primary one being the L1 penalty, $||\mathbf{w}||_1 = \sum_{i=1}^d |w_i|$. L1 regularization often leads to very sparse solutions, i.e. a weight vector with many coefficients that are zero (or very close to zero). However, plain gradient descent is not directly applicable in this case, since the L1 penalty is not differentiable wherever a coefficient equals zero. A simple solution is to use a smooth approximation of the L1 norm defined by: $$\sum_{i=1}^d \sqrt{w_i^2 + \epsilon}.$$ This function converges to the L1 norm as $\epsilon$ goes to 0, and is a useful surrogate that can be used with gradient descent.
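To make the cost function and the two penalties concrete, here is a minimal sketch in plain Python (standard library only, not Theano). The function names, the toy data, and the default value of `eps` are illustrative assumptions and are not part of the assignment's required interface.

```python
import math

def predict(w, b, x):
    """Linear hypothesis h(x) = w^T x + b for a single example x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def l2_penalty(w):
    """Squared L2 norm ||w||_2^2."""
    return sum(wj * wj for wj in w)

def l1_penalty(w):
    """L1 norm ||w||_1 (not differentiable where a coefficient is 0)."""
    return sum(abs(wj) for wj in w)

def smooth_l1_penalty(w, eps=1e-6):
    """Smooth surrogate sum_i sqrt(w_i^2 + eps); eps is a small positive constant."""
    return sum(math.sqrt(wj * wj + eps) for wj in w)

def regularized_cost(w, b, X, y, lam, penalty=l2_penalty):
    """Mean squared loss over the training set plus lam times the penalty."""
    n = len(X)
    mean_loss = sum((predict(w, b, xi) - yi) ** 2 for xi, yi in zip(X, y)) / n
    return mean_loss + lam * penalty(w)

# Hypothetical toy data: two examples with two features each.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
w = [0.5, 0.0]
b = 0.0

ridge_cost = regularized_cost(w, b, X, y, lam=0.1)

# The surrogate approaches the true L1 norm from above as eps shrinks.
for eps in (1e-2, 1e-6, 1e-12):
    gap = smooth_l1_penalty(w, eps) - l1_penalty(w)
    print("eps = %g, gap to L1 norm = %.3g" % (eps, gap))
```

In this sketch the gap is dominated by the zero coefficient, which contributes $\sqrt{\epsilon}$ instead of 0. This suggests that, with the surrogate, coefficients may end up very small rather than exactly zero.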

In this assignment we will also explore using a different loss function. As discussed in class, the squared loss $(h(\mathbf{x}) - y)^2$ has the issue of being sensitive to outliers. The Huber loss is an alternative that combines a quadratic part, for smoothness, with a linear part, for resistance to outliers. We'll consider a simpler loss function: $$\log \cosh (h(\mathbf{x}) - y),$$ called the Log-Cosh loss. Recall that $\cosh(z) = \frac{\exp{(z)} + \exp{(-z)}}{2}$.
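The following plain-Python sketch (standard library only, not Theano) illustrates how the Log-Cosh loss behaves compared with the squared loss. The function name `log_cosh` and the sample residuals are assumptions made for illustration. The sketch uses the identity $\log\cosh z = |z| + \log(1 + e^{-2|z|}) - \log 2$, which follows directly from the definition of $\cosh$ and avoids overflow for large residuals.

```python
import math

def log_cosh(z):
    """Log-Cosh loss of a residual z = h(x) - y.

    Evaluated as |z| + log(1 + exp(-2|z|)) - log(2), which equals
    log(cosh(z)) but never exponentiates a large positive number.
    """
    a = abs(z)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def squared(z):
    """Squared loss of the same residual, for comparison."""
    return z * z

# Small residuals: log cosh(z) is close to z^2 / 2 (quadratic regime).
# Large residuals: log cosh(z) is close to |z| - log(2) (linear regime),
# so a single outlier contributes far less than under the squared loss.
for z in (0.01, 0.5, 3.0, 10.0, 1000.0):
    print("z = %8.2f  squared = %12.4f  log-cosh = %10.4f" % (z, squared(z), log_cosh(z)))
```

The derivative of $\log\cosh z$ is $\tanh z$, which is bounded between -1 and 1, so the gradient contribution of any single example stays bounded however large its residual is.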

What you need to do for this assignment:

To help you with the implementation, here's a Theano symbolic expression that implements the squared loss (assuming the usual import of theano.tensor as T, with prediction and y being symbolic variables):

squared_loss =  T.mean(T.sqr(prediction - y))

In your code, follow the standard interface we have used in coding classifiers; the code I have shown for logistic regression gives you much of what you need for the coding part of this assignment.

Submission

Submit your report via Canvas. Python code can be displayed in your report if it is short and helps the reader understand what you have done. The sample LaTeX document provided in assignment 1 shows how to display Python code. Submit the Python code that was used to generate the results as a file called assignment6.py (you can split the code into several .py files; Canvas allows you to submit multiple files). Typing

$ python assignment6.py

should generate all the tables/plots used in your report.

Grading

A few general guidelines for this and future assignments in the course:

We will take off points if these guidelines are not followed.

Grading sheet for assignment 6

(50 points):  Correct implementation of regularized ridge regression
(20 points):  Exploration of gradient descent vs stochastic gradient descent
(30 points):  Exploration of loss and regularization term