Description
Head to https://squarespace.com/artem to save 10% off your first purchase of a website or domain using code ARTEMKIRSANOV
Socials: X/Twitter: https://x.com/ArtemKRSV Patreon: https://patreon.com/artemkirsanov
My name is Artem, I’m a graduate student at NYU Center for Neural Science and researcher at Flatiron Institute.
In this video we dive deep into the probabilistic interpretation behind the core linear regression algorithm, building it from the ground up. We talk about how the least squares objective naturally arises when we try to maximize the probability of the observed data under the model, and how the square is a result of assuming a Gaussian distribution for the noise in the samples. We also explore how incorporating prior beliefs about the distribution of model parameters leads to different kinds of regularization in objective functions.
Outline:
00:00 Introduction
01:16 What is Regression
02:11 Fitting noise in a linear model
06:02 Deriving Least Squares
07:46 Sponsor: Squarespace
09:04 Incorporating Priors
12:06 L2 regularization as Gaussian Prior
14:30 L1 regularization as Laplace Prior
16:16 Putting all together
Transcript (YouTube)
Introduction
This video was brought to you by Squarespace. There is one concept in machine learning that nearly every course and textbook starts with: finding the equation of a straight line that best fits a scatter of points. The typical explanation goes something like this: we pick the line's parameters to minimize the average squared vertical distance from each point. But why vertical distances? Why squares, and not absolute values or some other power? Where does that even come from?

You'll often hear justifications like "squares are easier to optimize" or "the math just works out nicely." Fair points, but to me that always felt somewhat unsatisfying. In this video I want to share with you an alternative perspective, a probabilistic take on linear regression that finally made everything click for me. It leads to the same solution but offers a conceptual shift that I found invaluable. Not only does it answer those fundamental questions, it also connects this seemingly simple problem to more complex topics like generative modeling and parameter regularization in ways that might surprise you. Let's dive right in.
What is Regression
Let's first clarify what regression actually means. At its core, it's about uncovering a relationship between inputs, call them x, and outputs y, then using that relationship to predict y for new values of x. Unlike classification, where you are choosing between discrete categories like yes and no, regression deals with continuous values. A classic example is predicting house price from various features: perhaps x1 is the number of bedrooms, x2 the distance to the subway, and so on. Our task is to reconstruct the price from these features. At this point most explanations jump directly to minimizing squared errors, fitting the line by reducing those vertical gaps, and yes, we'll get there. But to truly understand why this works, let's step back and reframe it through probability.
Fitting noise in a linear model
Let's treat our data as if it emerges from a linear model plus some noise. Imagine that somewhere in the universe's underlying source code, the ideal price of each house is given precisely by a linear combination of features weighted by some coefficients. For brevity, let's express this as the dot product between weights and features stacked into vectors. However, real-world data is messy, and it never perfectly follows the linear pattern. Actual house prices are influenced not just by those features we collected but also by hidden variables that we can't access, plus market fluctuations and human behavior.

In other words, our observed values y are corrupted versions of those ideal underlying values, with a noise term epsilon added to them: y = wᵀx + ε. Here epsilon represents everything we can't explain with our features, all the unknown contributors to the price beyond our control. The critical insight is that our resulting regression equation depends on our assumptions about this noise. Namely, epsilon is shaped by countless tiny influences, measurement errors, untracked variables, random market fluctuations, all added together. When many small independent effects accumulate additively, something remarkable happens: according to the central limit theorem, when you sum many independent random variables, regardless of their individual distributions, their sum approaches a normal, or Gaussian, distribution, the familiar bell curve.

Suppose we have a candidate set of weights w, our hypothesis about the underlying linear model. When we examine a random house with features x and price y, we can calculate exactly how much noise was added to our hypothesis by taking the difference between the observed price y and its ideal noiseless value. With our noise model, knowing that it follows a Gaussian distribution, we can calculate the probability of getting precisely the right amount of noise epsilon that would push the underlying price to be registered as y. This equals the probability of observing that particular data point.

I want to reiterate that shift of perspective. With a working set of weights w, we take the features x and compute their weighted sum, giving us where the ideal noiseless version of y should lie. Around this point exists a cloud of uncertainty, a bell curve centered at wᵀx. When we observe the actual value of y and calculate how much noise must have been added, we can compute the likelihood of that happening by plugging the noise amplitude into the Gaussian equation. This gives us the probability of observing a single data point as the probability of sampling just the right amount of noise from the Gaussian distribution.

If we assume all data points in our scatter plot were sampled independently, then the probability of obtaining our entire dataset with a fixed w can be found by multiplying the probabilities of these independent events. What we have done is express the probability of observing our data given a particular model, the coefficients w. If presented with two alternative models, intuitively the better model would be the one with a higher probability of generating the observed data. So when solving the regression problem and choosing the optimal w, we just need to select the configuration that maximizes this probability.
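As a concrete illustration (not from the video), here is a minimal NumPy sketch of this setup: synthetic data generated from an assumed "true" linear model with Gaussian noise, and the Gaussian log-likelihood of that data under a candidate weight vector. The feature count, true weights, and noise level are all made-up values for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Universe source code": an assumed true linear model plus Gaussian noise.
n_samples, n_features = 200, 3
true_w = np.array([2.0, -1.0, 0.5])   # hypothetical true coefficients
sigma = 1.0                           # standard deviation of the noise eps

X = rng.normal(size=(n_samples, n_features))
y = X @ true_w + rng.normal(scale=sigma, size=n_samples)   # y = w.x + eps

def log_likelihood(w, X, y, sigma):
    """Log probability of the observed data given weights w,
    assuming i.i.d. Gaussian noise with standard deviation sigma."""
    residuals = y - X @ w             # the noise each point must have received
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - residuals**2 / (2 * sigma**2))

# A candidate closer to the true weights makes the data more probable.
print(log_likelihood(true_w, X, y, sigma))
print(log_likelihood(np.zeros(n_features), X, y, sigma))
```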
Deriving Least Squares
Given this optimization objective, we can expand the formula for the probability of the data. Now let's take the logarithm of the right-hand side. Because the logarithm is a monotonic function, whichever weight configuration maximizes the total probability also maximizes its logarithm; the two objectives are equivalent, but the logarithm transforms our product of probabilities into a sum, which is much easier to work with.

Notice that sigma, the amplitude of the underlying noise, is a fixed value determined by factors like market volatility. From the perspective of the optimization objective it is a constant factor that doesn't affect which set of weights is optimal. This allows us to simplify the formula. Finally, the logarithm and the exponent cancel each other out, leaving us with the sum of squared residuals. This is exactly the well-known least squares objective, which states that the optimal coefficients should maximize the negative, or minimize the positive, of the sum of squared errors between the linear fit and the observed points: minimize Σᵢ (yᵢ − wᵀxᵢ)². Importantly, though, we arrived at this from the perspective of finding the linear model that maximizes the probability of observing our data, and the square in the resulting formula is a direct consequence of assuming Gaussian noise.

This problem can then be solved either through gradient descent, by making small iterative adjustments to the weights, or by directly jumping to the solution using a closed-form expression found in any textbook. My main goal here was to show where this formula comes from in the first place.
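The closed-form expression mentioned here is the standard normal-equations solution, w = (XᵀX)⁻¹Xᵀy. A minimal sketch, reusing the synthetic X and y from the previous snippet (the lstsq-based solve is just one common way to compute it):

```python
import numpy as np

def fit_least_squares(X, y):
    """Maximum-likelihood weights under the Gaussian-noise model,
    i.e. the ordinary least squares solution w = (X^T X)^{-1} X^T y."""
    # lstsq solves the least squares problem in a numerically stable way.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Example usage (assumes X, y from the earlier snippet):
# w_hat = fit_least_squares(X, y)
# print(w_hat)   # should land close to true_w = [2.0, -1.0, 0.5]
```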
Sponsor: Squarespace
Just like linear regression helps us find the optimal fit for the data, finding the right platform for your online presence is all about making smart choices, which brings me to today's sponsor, Squarespace. Squarespace transforms website creation and management into a straightforward process that anyone can master. While starting a site from scratch might seem challenging, Squarespace's Design Intelligence feature makes it remarkably simple. This innovative AI-powered toolkit generates a website tailored to your specific business and brand vision, filled with relevant content and images. From this starting point you can enjoy total creative freedom: adjust all the visual elements to match your style, organize content with their drag-and-drop tools, and even implement eye-catching animations to spice things up. Importantly, Squarespace goes beyond just website building. It's a comprehensive digital solution where you can develop online courses, launch effective email campaigns, or set up payment processing through various methods, all within one cohesive ecosystem. Experience it yourself with a free trial at squarespace.com, and once you're ready to launch, visit squarespace.com/artem to save 10% off your first purchase of a website or domain.
Incorporating Priors
Previously, when choosing between different sets of weights w, we always picked the model with a higher probability of generating the observed data, the lower mean squared error. But what if two models have identical values for that probability but differ in their exact values of w? Would we have a reason to prefer one over the other?

If we know nothing about the nature of our features, we have no basis for comparing two equally performing models. But often we have prior expectations about how features might contribute to the predictions, and thus we have reasonable boundaries for their values. Let's illustrate it with an example. Suppose we have a coin with an unknown bias, where the probability of heads is theta, between 0 and 1. We want to estimate this value of theta by tossing the coin and tracking the results. Let's say we observe four heads out of five tosses. If we ignore any assumptions about theta and find the value that maximizes the probability of the data, we will conclude that theta must be 0.8. Indeed, this type of biased coin maximizes the probability of observing four heads out of five tosses. But something doesn't seem right. We know from experience that most coins typically land 50/50, maybe with a slight bias due to asymmetry, but certainly not 80 to 20.

The problem is that our solution only cared about maximizing the probability of the observed data and completely ignored prior beliefs about theta, which are likely centered around 0.5 and decrease toward the edges. But is there a systematic way to incorporate these prior assumptions into our regression objective? Instead of maximizing the probability of the data, we can search for a set of weights w that maximizes the joint probability of the data and the weights. In other words, we look for weights that both explain the data well and align with our prior beliefs about what w should look like. Following the conditional probability rule, we can decompose the joint probability into the following product: P(data, w) = P(data | w) · P(w). The first factor is the likelihood, exactly what we had before: how likely a particular w is to have generated our observed data, given by the Gaussian formula for the noise. The second factor is the prior, where we incorporate assumptions about how likely different values of w are. The key idea is that different assumptions about the prior distribution of weights will lead to different criteria for choosing between alternative solutions. This shows up in the overall objective as so-called regularization terms. For the remainder of the video I'd like to focus on the two most common types of regularization and show how they are born from two common priors.
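To make the coin example concrete, here is a small grid-search sketch (my own illustration, not from the video) comparing the maximum-likelihood estimate with the estimate that also weighs a prior peaked at 0.5. The particular prior used, an unnormalized Beta(5, 5) density, is just one convenient choice of a distribution centered at 0.5 that falls off toward the edges.

```python
import numpy as np

heads, tosses = 4, 5
theta = np.linspace(1e-3, 1 - 1e-3, 1000)       # grid of candidate biases

likelihood = theta**heads * (1 - theta)**(tosses - heads)  # P(data | theta)
prior = theta**4 * (1 - theta)**4               # assumed prior, peaked at 0.5
joint = likelihood * prior                      # proportional to P(data, theta)

print("MLE:", theta[np.argmax(likelihood)])     # ~0.8, ignores the prior
print("MAP:", theta[np.argmax(joint)])          # ~0.62, pulled back toward 0.5
```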
L2 regularization as Gaussian Prior
One of the most popular choices is to assume that the weights w themselves follow a zero-centered Gaussian distribution. Why is that reasonable? Well, in regression each component of w is a coefficient describing how a particular feature in the x vector, like the size of the house, contributes to the prediction y. If we randomly select features, intuitively most will probably be irrelevant, with values near zero, while only a small subset will have significant weights. Additionally, since each feature's coefficient in real data is shaped by many underlying unobserved causes, the central limit theorem applies to the coefficients as well. Formally, we can write that the prior probability of observing a particular set of weights is given by the product of the probabilities of its individual components, each of which is given by the Gaussian formula with some variance τ².

Now, going back to our optimization objective, we want to maximize the joint probability. Let's take the logarithm as before and substitute our formulas for the likelihood and the prior. Flipping the signs and grouping the two constants together, we get the following objective: minimize Σᵢ (yᵢ − wᵀxᵢ)² + λ Σⱼ wⱼ². This is what's known as ridge regression, or L2-regularized linear regression, due to the square of the weight amplitudes. The idea is that we are searching for the model that would both explain the data well and, at the same time, not be overly complex, where complexity is measured as the sum of squares of the weights. This regularization term penalizes large weight values, pushing them toward zero, exactly what we would expect from our Gaussian prior assumption.

Notice how beautifully this emerges: the parameter λ, which controls how strong the regularization is, is the ratio between the data noise (the variance of the samples, σ²) and the variance of the prior distribution of coefficients (τ²). When we are very certain about our prior, λ becomes larger, giving more weight to the regularization term. Conversely, when the data is very reliable, with small variance, λ decreases, placing more emphasis on fitting the data.
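A minimal ridge-regression sketch corresponding to this objective, again reusing the synthetic X and y from the first snippet. The closed-form solution below is the standard one; the noise and prior variances used to form λ = σ²/τ² are assumed values for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """MAP weights under a zero-centered Gaussian prior on w:
    w = (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Lambda emerges as the ratio of noise variance to prior variance.
sigma2, tau2 = 1.0, 0.1            # assumed values for illustration
lam = sigma2 / tau2
# w_ridge = fit_ridge(X, y, lam)   # assumes X, y from the earlier snippet
```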
L1 regularization as Laplace Prior
But what if our intuition about the weights is different? Instead of just assuming they are generally small, what if we believe most should be exactly zero, with only a few significant ones? This would correspond to a model where only a handful of features truly matter, favoring sparse solutions. This assumption is particularly relevant for biological systems. In genomics, for example, out of thousands of genes only a small subset typically influences a particular trait. Similarly, in neuroscience, only a small, sparse subset of all neurons is responsible for encoding a particular feature. In this case a Gaussian prior is not ideal, because it pushes weights toward zero too gently. Instead, we might prefer a distribution with a sharp peak at zero. This is known as the Laplace distribution, and it is characterized by symmetric, exponentially falling tails.

The probability of the configuration of w's as a whole can be found by multiplying the probabilities of each component, and following the same derivation as before, taking the logarithm to counteract the exponent, our optimization objective becomes the following: minimize Σᵢ (yᵢ − wᵀxᵢ)² + λ Σⱼ |wⱼ|. Here λ is again a combination of constants, the noise variance and the falloff rate of the weights prior. This is known as L1 regularization, because the complexity term penalizes the absolute values of w rather than their squares. L1 regularization typically leads to sparse solutions, where many weights are exactly zero, which is preferable in many domains of science.
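There is no closed-form solution for the L1-regularized objective, but a short proximal-gradient (soft-thresholding) loop is one standard way to solve it. This sketch is my own illustration, with the iteration count chosen arbitrarily; it again assumes the X and y from the first snippet.

```python
import numpy as np

def fit_lasso(X, y, lam, n_iters=1000):
    """L1-regularized least squares via proximal gradient descent (ISTA).
    Minimizes ||y - Xw||^2 + lam * ||w||_1."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    # Step size from the Lipschitz constant of the squared-error gradient.
    step = 1.0 / (2 * np.linalg.norm(X, ord=2) ** 2)
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)          # gradient of the squared error
        z = w - step * grad
        # Soft-thresholding: the proximal operator of the L1 penalty,
        # which sets small coefficients exactly to zero.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

# w_sparse = fit_lasso(X, y, lam=10.0)   # assumes X, y from the earlier snippet
# Many entries of w_sparse end up exactly zero.
```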
Putting all together
All right, let's tie everything together. Today we explored the probabilistic view of linear regression and saw how the familiar least squares equation naturally emerges when we find the linear fit that maximizes the probability of observing our data. Importantly, the squared error term wasn't just an arbitrary choice; it is a direct consequence of assuming Gaussian noise in our model. While this assumption is reasonable in most cases, it may not be appropriate in specific settings where the noise is correlated or multiplicative in nature.

We've also seen how incorporating prior beliefs about the coefficients leads to different regularization schemes, providing a principled approach to balancing a model's accuracy with its complexity. The Gaussian prior gave us L2 regularization, gently pushing all weights toward zero, while the Laplace prior yielded L1 regularization, favoring sparse solutions where most weights become exactly zero.

This probabilistic perspective extends far beyond linear regression to nearly all machine learning models. Whether examining deep neural networks, as we saw in the previous video, decision trees, or clustering algorithms, viewing them through the lens of probability provides a much deeper understanding of their underlying assumptions and design choices. If you like the video, share it with your friends, subscribe to the channel if you haven't already, and press the like button. Stay tuned for more neuroscience and machine learning topics coming up.