I have for the past couple of weeks been following the free, excellent course Machine Learning by prof. Ng at Coursera.
In the first week, you’ll get an introduction, learn about cost functions and gradient descent, and cover the basics of linear algebra.
In short:
Cost Functions (CF): Given a model and some input parameters, a cost function outputs a single value telling you how well the model fits the output parameters. For instance, given the size of and the number of rooms in a house, how well does your model predict the price of the house? That is what the cost function tells you. Higher cost, worse fit. The letter J is assigned to this value.
Gradient Descent (GD): Given the cost of a model, the gradient tells you in what direction you would have to move your model to get a higher cost. So if you subtract the gradient multiplied by some step length from your model, you’ll be heading towards a lower cost and a better fit (see the toy example just after this list). The letter G is assigned to this vector.
Linear Algebra: Learn it, it’s the shit.
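To make those two ideas concrete, here is a minimal standalone sketch (my own toy example, not from the course) that fits a single slope to a handful of points by repeatedly stepping against the gradient:

import numpy as np

# Toy data: y is roughly 2*x, so the best slope should end up near 2.
xs = np.array([1., 2., 3., 4.])
ys = np.array([2.1, 3.9, 6.2, 8.0])
m = len(xs)

slope = 0.0    # our one-parameter model: predict y = slope*x
alpha = 0.05   # step length

for _ in range(100):
    error = slope*xs - ys          # prediction errors
    J = (error**2).sum()/(2.*m)    # the cost: higher cost, worse fit
    G = (error*xs).sum()/m         # the gradient: direction of increasing cost
    slope -= alpha*G               # step against the gradient -> lower cost

print(slope, J)   # slope ends up close to 2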
In the second week, you’re thrown right at it: Linear Regression (LR) with Multiple Variables (MV). And it’s here you realize that the entire course assumes you use MATLAB/Octave. But I said no! I don’t wanna! Octave is filled with bugs, and I can’t afford the MATLAB package. So I’ll be doing my coding in Python first, and then modify my code to fit Octave. Afterwards I’ll post the Python code here.
The model in play for LR is:

$h_\theta(x^{(i)}) = \theta_0 x_0^{(i)} + \theta_1 x_1^{(i)} + \dots + \theta_n x_n^{(i)}, \quad i = 1, \dots, m$

where m is the number of datasets we have, and $x_j^{(i)}$ is the j’th input parameter of the i’th dataset. If we set $x_0 = 1$, then we can vectorize the model to the dot product:

$h_\theta(x) = \theta \cdot x$
The cost function for LR is defined as the average of the squared errors (SE), with an extra factor of one half for convenience:

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

where $y^{(i)}$ is the output parameter from the i’th dataset. The gradient for this cost function is given by

$G_j = \frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
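Stacking the m datasets as the rows of a matrix $X$ (with the leading column of ones) and the outputs in a vector $y$, and treating $\theta$ as a row vector as the numpy code below does, the same cost and gradient can be written in matrix form:

$J(\theta) = \frac{1}{2m}\left(X\theta^{T} - y\right)^{T}\left(X\theta^{T} - y\right), \qquad G = \frac{1}{m}\left(X\theta^{T} - y\right)^{T}X$

This is exactly what the one-liners further down compute.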
So far so good. Now we can load our data. I’ve uploaded all the test data from the course to a Dropbox, feel free to check it out. The data from this week is called “ex1data1” and “ex1data2”, where the latter is the one with multiple variables: housing areas and the number of rooms in said house. The prediction values are the house prices. For the remainder of the course, we will be using a Python module called Numpy, which is capable of a lot of different maths and linear algebra, real cool thing.
import numpy as np
data = np.loadtxt('ex1data2.txt', delimiter=',')
x, y = np.insert(data[:,0:-1], 0, 1, axis=1), data[:, -1]  # prepend the x0 = 1 column, split off the prices
m, n = np.shape(x)  # m datasets, n parameters (including x0)
And then we can write up our cost function and gradient:
J = (((x.dot(theta.T)).T - y)**2).sum()/(2.*m)
G = ((x.dot(theta.T)).T - y).dot(x)/m
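For reference, here is the same pair of expressions wrapped as small reusable functions (computeCost and computeGradient are my own names, not from the course scripts); they assume the x, y and theta conventions used in this post:

def computeCost(x, y, theta):
    m = len(y)
    return (((x.dot(theta.T)).T - y)**2).sum()/(2.*m)

def computeGradient(x, y, theta):
    m = len(y)
    return ((x.dot(theta.T)).T - y).dot(x)/m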
Before we can do some serious coding, we need to normalize our input parameters and regularize our cost function and its gradient.
Normalization: In order for us to find the lowest cost as fast as possible, we need to scale our input parameters. This scaling is done like this:

$x_j := \frac{x_j - \mu_j}{\sigma_j}$

where $\mu_j$ is the mean of the j’th input parameter and $\sigma_j$ the standard deviation of the j’th input parameter. We don’t normalize $x_0$.
def featureNormalize(x):
    mu = np.mean(x, axis=0)
    sigma = np.std(x, axis=0, ddof=1)
    res = (x - mu)/sigma
    return res

x[:, 1:] = featureNormalize(x[:, 1:])
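One practical note, which we’ll need for the prediction example near the end of the post: if you want to predict prices for new houses later, the new inputs have to be scaled with the same mean and standard deviation as the training data, so it’s worth saving them (my own addition, not part of the course exercise):

# Save the training mean and standard deviation of the raw features
# (area and number of rooms), so new inputs can be normalized the same way.
mu = np.mean(data[:, :-1], axis=0)
sigma = np.std(data[:, :-1], axis=0, ddof=1)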
Regularize: If you have ever tried to fit a model to your data, you know that the more parameters your model can include, the better it will fit the data. But more often than not, the simplest solution is the right one. Therefore we add a scale ($\frac{\lambda}{2m}$) times the sum of squares of $\theta$, excluding $\theta_0$, to $J$, and twice that scale ($\frac{\lambda}{m}$) times $\theta_j$ alone (still excluding $\theta_0$) to the j’th entry of $G$. Since $G$ is a vector, we have to make sure not to add the regularizer to its first entry, $G_0$:
J += lamda/(2.*m)*(theta[:, 1:]**2).sum()
G[:, 1:] += theta[:, 1:]*lamda/m
The total formula for the gradient can be written a bit smarter, like this:
i = np.append([0], np.ones(n-1))  # mask that leaves theta_0 out of the regularization
G = ((x.dot(theta.T)).T - y).dot(x)/m + lamda/m*(i*theta)
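In formula form, with the vector $\vec{\imath} = (0, 1, \dots, 1)$ masking out $\theta_0$, that one-liner computes

$G = \frac{1}{m}\left(X\theta^{T} - y\right)^{T}X + \frac{\lambda}{m}\left(\vec{\imath}\circ\theta\right)$

where $\circ$ is element-wise multiplication.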
Cool.
Now for the code for finding the best model, or the best $\theta$-values. We update $\theta$ with the direction given by the gradient times a scale length $\alpha$, a certain number of times, 1000 in this case:
Finally, we set the two scales to fixed values; I’ve found these by trial and error.
alpha, theta, lamda = 0.01, np.zeros(shape=(1, n)), 0.01  # step length, initial model, regularization scale
for itt in range(0, 1000):
    theta = theta*(1 - alpha*lamda/m*i) - alpha/m*((x.dot(theta.T)).T - y).dot(x)
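A quick sanity check I like to do (not required by the course) is to record the cost each iteration and make sure it keeps dropping; if it grows or oscillates, alpha is too large, and if it barely moves, alpha is too small. A sketch, using matplotlib for the plot:

import matplotlib.pyplot as plt

theta = np.zeros(shape=(1, n))
costs = []
for itt in range(0, 1000):
    # regularized cost at the current theta
    J = (((x.dot(theta.T)).T - y)**2).sum()/(2.*m) + lamda/(2.*m)*(theta[:, 1:]**2).sum()
    costs.append(J)
    theta = theta*(1 - alpha*lamda/m*i) - alpha/m*((x.dot(theta.T)).T - y).dot(x)

plt.plot(costs)
plt.xlabel('iteration')
plt.ylabel('J')
plt.show()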
Smart people out there know of the Normal Equations (NE), and we can write those up as well. They find the best model instantly, but require a ton of computing power, so the method has its ups and downs.
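With the same trick of zeroing out the regularization for $\theta_0$ (an identity matrix whose top-left entry is set to zero), the regularized normal equation looks like this, and it is what the two lines below compute:

$\theta = \left(X^{T}X + \lambda\,\mathrm{diag}(0, 1, \dots, 1)\right)^{-1}X^{T}y$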
iden = np.identity(n) - np.reshape(np.append([1], np.zeros(n*n - 1)), (n, n))  # identity with the top-left entry zeroed, so theta_0 isn't regularized
theta = np.linalg.inv((x.T).dot(x) + lamda*iden).dot(x.T).dot(y)
The GD method gives the following result, theta: [340397.96353532, 109813.86552488, -5846.79957164] with a J: 2043562752.7721276
The NE method gives the following result, theta: [ 340412.65957447, 110594.85004595, -6627.76250967] with a J:2044586358.6403067
So our method is not that far off (0.05%). Yay!
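As a final sanity check (my own example, with made-up house numbers), we can predict the price of, say, a 1650 square-foot house with 3 rooms. Remember that the new input has to be normalized with the mu and sigma we saved right after featureNormalize:

# Hypothetical new house: 1650 square feet, 3 rooms.
new_house = (np.array([1650., 3.]) - mu)/sigma     # scale like the training data
new_house = np.insert(new_house, 0, 1)             # add the x0 = 1 term
price = new_house.dot(np.ravel(theta))             # ravel handles both the GD and NE theta shapes
print(price)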
BONUS:
If you have tried to fit a model in Excel or some other shit, you’re given a measure of how good your fit is, between minus infinity and 1, called $R^2$, the coefficient of determination. A good fit is close to 1.
The $R^2$ value is calculated in the following way:

$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$

RSS, the residual sum of squares, is equal to $J$ without the regularization term (up to the constant factor $\frac{1}{2m}$): $\mathrm{RSS} = \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. TSS, the total sum of squares, or (up to a constant) the variance of the dataset, is defined in the following way:

$\mathrm{TSS} = \sum_{i=1}^{m}\left(y^{(i)} - \bar{y}\right)^2$
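And in numpy, assuming x, y and theta are still the ones from the fit above, the whole thing boils down to:

# R^2 = 1 - RSS/TSS
predictions = x.dot(np.ravel(theta))      # works for both the GD and NE theta
RSS = ((predictions - y)**2).sum()        # residual sum of squares
TSS = ((y - y.mean())**2).sum()           # total sum of squares
R2 = 1 - RSS/TSS
print(R2)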