We seek to build a simple linear regression model: given some training data, we want to fit a model that describes it.
Definition
$$\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$$

Where:
- $\hat{y}$ is the predicted value
- $h_\theta$ is the hypothesis function
- $\theta_0, \theta_1$ are the model parameters (weights): the bias term $\theta_0$ and the feature weight $\theta_1$
- $x$ is the feature value (high temp in this case)
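For example, with hypothetical parameters $\theta_0 = 20$ and $\theta_1 = 1.5$, a high temp of $x = 30$ would give a prediction of $\hat{y} = 20 + 1.5 \cdot 30 = 65$.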
We also have to choose a loss function; in this case we typically use Mean Squared Error (MSE). Our objective when fitting a model is to minimize this cost function.
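As a minimal sketch of computing this cost (the toy temperatures and targets below are hypothetical, and `mse_cost` is just an illustrative helper, not part of the notes):

```python
import numpy as np

# Hypothetical training data: high temps (feature) and observed targets
x = np.array([20.0, 25.0, 30.0, 35.0])
y = np.array([50.0, 58.0, 64.0, 73.0])

def mse_cost(theta0, theta1, x, y):
    """Mean squared error of the hypothesis h(x) = theta0 + theta1 * x."""
    predictions = theta0 + theta1 * x   # h_theta(x) at every training point
    errors = predictions - y            # residuals
    return np.mean(errors ** 2)         # mean of the squared residuals

print(mse_cost(20.0, 1.5, x, y))        # cost for one candidate parameter pair
```

Fitting the model amounts to searching for the `theta0`, `theta1` pair that makes this value as small as possible.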
We then seek to make a prediction using a learned model:

$$\hat{y}_{\text{test}} = h_\theta(x_{\text{test}}) = \theta_0 + \theta_1 x_{\text{test}}$$
And then measure the test error using squared loss, $(\hat{y}_{\text{test}} - y_{\text{test}})^2$. Why squared loss? It is often not explained well, but we want to heavily penalize large errors, and squaring the error achieves this. There is a good video by Artem Kirsanov called "What Textbooks Don’t Tell You About Curve Fitting" on this topic.
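For example, a single residual of 10 contributes $10^2 = 100$ to the sum of squared errors, as much as one hundred residuals of 1, so a fit under squared loss is pulled strongly toward shrinking its largest errors; absolute loss would weight that same residual only ten times as heavily.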
In terms of our linear function equation, we are seeking to find $\theta_0$ and $\theta_1$ by minimizing MSE.
Linear Regression with Multiple Variables
We can extend this to generalize for any number of features $n$:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
Where $x_j$ is the $j$-th feature. Our objective is to minimize the cost function, which is defined as:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

where $m$ is the number of training examples and $(x^{(i)}, y^{(i)})$ is the $i$-th training pair.
Simplifying our definition (with the convention $x_0 = 1$, so the bias term folds into the sum):

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$$
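A minimal sketch of this vectorized hypothesis (the parameter and feature values below are hypothetical):

```python
import numpy as np

theta = np.array([20.0, 1.5, -0.3])    # [theta_0, theta_1, theta_2]
features = np.array([30.0, 40.0])      # raw features x_1, x_2 (hypothetical)
x = np.concatenate(([1.0], features))  # prepend x_0 = 1 for the bias term

prediction = theta @ x                 # h_theta(x) = theta^T x
print(prediction)                      # 20 + 1.5*30 - 0.3*40 = 53.0
```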
We usually use Gradient Descent for higher-dimensional problems, but there is a closed-form solution, which we derive here for the two-parameter case:
Write the $x$s, $\theta$s and $y$s in vector form:

$$X = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
Then $h_\theta$ becomes a matrix-vector product, and we achieve $\hat{y}$:

$$\hat{y} = h_\theta(X) = X\theta$$
And our loss can finally be defined as:

$$J(\theta) = \frac{1}{m}(X\theta - y)^T(X\theta - y) = \frac{1}{m}\|X\theta - y\|^2$$
And our closed form solution for $\theta$:

$$\theta = (X^T X)^{-1} X^T y$$
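A minimal NumPy sketch of this closed-form fit (the toy data is hypothetical; we solve the linear system $X^T X \, \theta = X^T y$ rather than forming the inverse explicitly, which is the usual numerically safer route):

```python
import numpy as np

# Hypothetical training data: high temps and targets
x = np.array([20.0, 25.0, 30.0, 35.0])
y = np.array([50.0, 58.0, 64.0, 73.0])

# Design matrix: a column of ones (bias term) next to the feature column
X = np.column_stack([np.ones_like(x), x])

# Normal equation: theta = (X^T X)^{-1} X^T y,
# computed by solving (X^T X) theta = X^T y instead of inverting.
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)       # learned [theta_0, theta_1]
print(X @ theta)   # fitted values y_hat = X theta
```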
This is great, but there are a few things that make this a costly operation at large $n$ (many features). For one, taking the inverse of a matrix is a costly operation: at very high dimension this closed form solution is somewhere between $O(n^{2.4})$ and $O(n^3)$ (although, to be honest, it will probably still run essentially instantly here, since this is a two-parameter operation). Additionally, if $X^T X$ is not invertible (i.e. singular), then this method fails. In these cases, we usually use Gradient Descent, which we will discuss next class.