How does learning happen in machine learning?

Machine learning is a broad set of techniques that can be used to identify patterns in data and then use those patterns to help with some task, like early seizure detection/prediction, automated image captioning, or automatic translation. Most machine learning techniques have a set of parameters that need to be tuned, and many of them rely on a very simple idea from calculus to tune these parameters.

Machine learning is a mapping problem

Machine learning algorithms generally have a set of parameters, which we'll call \(\theta\), that need to be tuned in order for the algorithm to perform well at a particular task. An important question in machine learning is how to choose the parameters, \(\theta\), so that our algorithm performs the task well. Let's look at a broad class of machine learning algorithms which take in some data, \(X\), and use that data to make a prediction, \(\hat y\). The algorithm can be represented by a function which makes these predictions,

\( \hat y = f(X;\theta)\).

This notation means that we have some function or mapping \(f(.; \theta)\) which has parameters \(\theta\). Given some piece of data \(X\), the algorithm will make some prediction \(\hat y\). If we change the parameters, \(\theta\), then the function will produce a different prediction.

If we choose random values for \(\theta\), there is no reason to believe that our mapping, \(f(.; \theta)\), will do anything useful. But, in machine learning, we always have some training data which we can use to tune the parameters, \(\theta\). This training data will have a bunch of input data which we can label as: \(X_i,\ i \in 1,\ 2,\ 3,\ldots\), and a bunch of paired labels: \(y_i,\ i \in 1,\ 2,\ 3,\ldots\), where \(y_i\) is the correct prediction for \(X_i\). Often, this training data has either been created by a simulation or labeled by hand (which is generally very time- and money-consuming).
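To make this concrete, here is a minimal sketch (a toy example of my own, not taken from the post) in which the mapping \(f(X;\theta)\) is a simple linear model and the training data is generated synthetically; all of the names below are hypothetical.

```python
import numpy as np

# A toy, hypothetical mapping f(X; theta): a linear model y_hat = X @ theta.
def f(X, theta):
    """Given data X (one example per row), return predictions y_hat."""
    return X @ theta

# Synthetic training data: inputs X_i with paired ground-truth labels y_i.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))        # 100 examples, 3 features each
theta_true = np.array([1.0, -2.0, 0.5])    # the "correct" parameters we hope to recover
y_train = f(X_train, theta_true)           # labels produced by the true mapping
```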

Learning is parameter tuning

Now that we have ground-truth labels, \(y_i\), for our training data, \(X_i\), we can then evaluate how bad our mapping, \(f(.; \theta)\), is. There are many possible ways to measure how bad \(f(.; \theta)\) is, but a simple one is to compare the prediction of the mapping \(\hat y_i\) to the ground-truth label \(y_i\),

\( y_i-\hat y_i \).

Generally, mistakes which cause this error to be positive or negative are equally bad, so a good measure of the error would be:

\(( y_i-\hat y_i)^2 =( y_i-f(X_i;\theta))^2\).

When this quantity is equal to zero for every piece of data, we are doing a perfect mapping, and the larger this quantity is, the worse our function is at prediction. Let's call this quantity, summed over all of the training data, the error:

\(E(\theta)=\sum_i( y_i-f(X_i;\theta))^2\).
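Continuing the toy sketch above, the error summed over the training data can be written as a short function (again, an illustration rather than anything from the original post):

```python
def error(theta, X, y):
    """E(theta) = sum_i (y_i - f(X_i; theta))**2."""
    residuals = y - f(X, theta)
    return np.sum(residuals ** 2)

print(error(np.zeros(3), X_train, y_train))   # large for arbitrary (zero) parameters
print(error(theta_true, X_train, y_train))    # exactly zero for the true parameters
```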

So, how do we make this quantity small? One simple idea from calculus is called gradient descent. If we can calculate the derivative or gradient of the error with respect to \(\theta\), then we know that moving in the opposite direction (downhill) should make our error smaller. In order to calculate this derivative, our mapping, \(f(.; \theta)\), needs to be differentiable with respect to \(\theta\).

So, if we have a differentiable \(f(.; \theta)\), we can compute the gradient of the error with respect to \(\theta\):

\(\frac{\partial E(\theta)}{\partial \theta}=\frac{\partial}{\partial \theta}\sum_i( y_i-f(X_i;\theta))^2=-2\sum_i( y_i-f(X_i;\theta))\frac{\partial f(X_i;\theta)}{\partial \theta}\).

If we have this derivative, we can then adjust our parameters, \(\theta^t\), so that our error gets a bit smaller

\(\theta^{t+1} = \theta^t - \epsilon\frac{\partial E(\theta)}{\partial \theta}\Big|_{\theta=\theta^t}\)

where \(\epsilon\) is a small scalar (often called the learning rate). Now, if we repeat this process over and over, the value of the error, \(E(\theta)\), should get smaller and smaller as we keep updating \(\theta\). Eventually, we should get to a (local) minimum, at which point our gradients will become zero and we can stop updating the parameters. This process is shown (in a somewhat cartoon way) in this figure.

If the error function is shaped somewhat like a bowl as a function of some parameter theta, we can calculate the derivative of the bowl and walk downhill to the bottom.
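To make the whole loop concrete, here is a bare-bones gradient-descent sketch that continues the toy linear example above. The gradient is the expression derived earlier, specialized to the linear \(f\) (for which \(\partial f(X_i;\theta)/\partial\theta = X_i\)); the step size and number of iterations are arbitrary choices.

```python
def grad_error(theta, X, y):
    """dE/dtheta = -2 * sum_i (y_i - f(X_i; theta)) * X_i for the linear model above."""
    residuals = y - f(X, theta)
    return -2.0 * (X.T @ residuals)

theta = rng.normal(size=3)    # start from random parameters
epsilon = 1e-3                # a small step size
for step in range(500):
    theta = theta - epsilon * grad_error(theta, X_train, y_train)

print(error(theta, X_train, y_train))   # should now be close to zero
print(theta)                            # should be close to theta_true
```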

Extensions and exceptions

This post presented a slightly simplified picture of learning in machine learning. I'll briefly mention a few of the simplifications.

The terms error function, objective function, and cost function are all used basically interchangeably in machine learning. In probabilistic models you may also see likelihoods or log-likelihoods, which are similar to a cost function except that they are set up to be maximized rather than minimized. Since people (physicists?) like to minimize things, negative log-likelihoods are also used.

The squared-error function was a somewhat arbitrary choice of error function. It turns out that depending on what sort of problem you are working on, e.g. classification or regression, you may want a different type of cost function. Many of the commonly used error functions can be derived from the idea of maximum likelihood learning in statistics.

There are many extensions to simple gradient descent which are more commonly used, such as stochastic gradient descent (SGD), SGD with momentum, and fancier methods like Adam and second-order methods.

Not all learning techniques for all models are (or were initially) based on gradient descent. The first learning rule for Hopfield networks was not based on gradient descent, although the proof that inference converges was based on a (non-gradient) descent argument. In fact, that learning rule has since been replaced with a more modern version based on gradient descent of an objective function.

Vectors and Fourier Series: Part 3

In Part 2, we adapted three tools developed for vectors to functions: a Basis in which to represent our function, Projection Operators to find the components of our function, and a Function Rebuilder which allows us to recreate our function in the new basis. This is the third (and final!) post in a series of three:

  • Part 1: Developing tools from vectors
  • Part 2: Using these tools for Fourier series
  • Part 3: A few examples using these tools

We can apply these tools to two problems that are common in Fourier Series analysis. First we'll look at the square wave and then the sawtooth wave. Since we've chosen a sine and cosine basis (a frequency basis), there are a few questions we can ask ourselves before we begin:

  1. Will these two functions contain a finite or infinite number of components?
  2. Will the amplitude of the components grow or shrink as a function of their frequency?

Let's try and get an intuitive answer to these questions first.

For 1., another way of asking this question is "could you come up with a way to combine a few sines and cosines to create the function?" The smoking guns here are the corners. Sines and cosines do not have sharp corners and so making a square wave or sawtooth wave with a finite number of them should be impossible.

For 2., one way of thinking about this is that the functions we are decomposing are mostly smooth with a few corners. To keep them mostly smooth, the high-frequency components can't keep contributing strongly, so the amplitude of the components should shrink with frequency.

Let's see if these intuitive answers are borne out.

Square Wave

We'll center the square wave vertically at zero and let it span \([-L, L]\). In this case, the square wave function is

\(f(x)=\begin{cases}1&-L\leq x\leq 0 \\-1&0\leq x\lt L\end{cases}.\)

[Figure: the square wave]

If we imagine this function being repeated periodically outside the range \([-L, L]\), it would be an odd (antisymmetric) function. Since sine functions are odd and cosine functions are even, an arbitrary odd function should only be built out of sums of other odd functions. So, we get to take one shortcut and only look at the projections onto the sine function (the cosine projections will be zero). You should work this out if this explanation wasn't clear.

Since the square wave is defined piecewise, our projection integral will also be piecewise:

\(a_n = \text{Proj}_{s_n}(f(x)) \\= \int_{-L}^{0}dx\tfrac{1}{\sqrt{L}} \sin(\tfrac{n\pi x}{L}) (1)+\int_{0}^Ldx\tfrac{1}{\sqrt{L}} \sin(\tfrac{n\pi x}{L})(-1).\)

Both of these integrals can be done exactly.

\(a_n = \tfrac{1}{\sqrt{L}} \frac{-L}{n \pi}\cos(\tfrac{n\pi x}{L})|_{-L}^0 +\tfrac{1}{\sqrt{L}} \frac{L}{n \pi}\cos(\tfrac{n\pi x}{L})|_0^L \\= -\frac{\sqrt{L}}{n \pi}\cos(0)+\frac{\sqrt{L}}{n \pi}\cos(-n\pi)+\frac{\sqrt{L}}{n \pi}\cos(n\pi)-\frac{\sqrt{L}}{n \pi}\cos(0)\\= \frac{2\sqrt{L}}{n \pi}\cos(n\pi)-\frac{2\sqrt{L}}{n \pi}\cos(0).\)

\(\cos(n\pi)\) will be \((-1)^{n}\) and \(\cos(0)\) is \(1\). And so we have

\(a_n=\begin{cases}-\frac{4\sqrt{L}}{n \pi}&n~\text{odd}\\ 0 &n~\text{even}\end{cases}.\)

Rebuilding the function from these components (the \(\sqrt{L}\) in \(a_n\) cancels against the \(\tfrac{1}{\sqrt{L}}\) in the basis functions) gives

\(f(x)=-\sum_{n=1,3,5,\ldots}\frac{4}{n \pi}\sin(\tfrac{n\pi x}{L}).\)

So we can see the answer to our questions. The square wave has an infinite number of components and those components shrink as \(1/n\).
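As a quick numerical sanity check (a sketch of my own, taking \(L=1\)), we can build partial sums from these coefficients and watch the approximation improve as more odd harmonics are added:

```python
import numpy as np

L = 1.0
x = np.linspace(-L, L, 2001)
square = np.where(x <= 0, 1.0, -1.0)   # +1 on [-L, 0], -1 on (0, L]

def partial_sum(x, n_max, L=1.0):
    """Sum -4/(n*pi) * sin(n*pi*x/L) over odd n up to n_max."""
    total = np.zeros_like(x)
    for n in range(1, n_max + 1, 2):   # n = 1, 3, 5, ...
        total += -4.0 / (n * np.pi) * np.sin(n * np.pi * x / L)
    return total

# The mean-squared error shrinks as more odd frequencies are included,
# consistent with the 1/n falloff of the coefficients.
for n_max in (1, 9, 99):
    mse = np.mean((square - partial_sum(x, n_max)) ** 2)
    print(n_max, mse)
```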

Sawtooth Wave

We'll have the sawtooth wave range from -1 to 1 vertically and span \([-L, L]\). In this case, the sawtooth wave function is

\(g(x)=\frac{x}{L}.\)

[Figure: the sawtooth wave]

This function is also odd. The sawtooth wave coefficients will only have one contributing integral:

\(a_n = \text{Proj}_{s_n}(g(x)) = \int_{-L}^Ldx\tfrac{1}{\sqrt{L}} \sin(\tfrac{n\pi x}{L})\frac{x}{L}.\)

This integral can be done exactly with integration by parts.

\(a_n = \sqrt{\tfrac{1}{L}}(-\frac{x}{n \pi}\cos(\tfrac{n\pi x}{L})+\frac{L}{n^2 \pi^2}\sin(\tfrac{n\pi x}{L}))|_{-L}^L\\=(-\frac{\sqrt{L}}{n \pi}\cos(n\pi)+\frac{\sqrt{L}}{n^2 \pi^2}\sin(n\pi))-(\frac{\sqrt{L}}{n \pi}\cos(n\pi)-\frac{\sqrt{L}}{n^2 \pi^2}\sin(n\pi))\\=\frac{2\sqrt{L}}{n^2 \pi^2}\sin(n\pi)-\frac{2\sqrt{L}}{n \pi}\cos(n\pi)\).

\(\cos(n\pi)\) will be \((-1)^n\) and \(\sin(n\pi)\) will be 0. And so we have

\(a_n=(-1)^{n+1}\frac{2\sqrt{L}}{n \pi}.\)

\(g(x)=\sum_{n=1}^\infty(-1)^{n+1}\frac{2}{n \pi}\sin(\tfrac{n\pi x}{L}).\)

Again we see that an infinite number of frequencies are represented in the signal but that their amplitude falls off at higher frequency.
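As a similar sanity check for the sawtooth (again a sketch of my own with \(L=1\)), we can approximate the projection integral numerically with a simple Riemann sum and compare it to the analytic coefficients:

```python
import numpy as np

L = 1.0
x = np.linspace(-L, L, 20001)
dx = x[1] - x[0]
g = x / L                                         # the sawtooth on [-L, L]

for n in range(1, 6):
    s_n = np.sin(n * np.pi * x / L) / np.sqrt(L)  # normalized sine basis function
    a_n = np.sum(s_n * g) * dx                    # numerical projection integral
    analytic = (-1) ** (n + 1) * 2 * np.sqrt(L) / (n * np.pi)
    print(n, round(a_n, 4), round(analytic, 4))
```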

Vectors and Fourier Series: Part 2

In Part 1, we developed three tools: a Basis in which to represent our vectors, Projection Operators to find the components of our vector, and a Vector Rebuilder which allows us to recreate our vector in the new basis. This is the second post in a series of three:

  • Part 1: Developing tools from vectors
  • Part 2: Using these tools for Fourier series
  • Part 3: A few examples using these tools

We now want to adapt these tools and apply the intuition to Fourier series. The goal will be to represent a function (our vector) as a sum of sines and cosines (our basis). To do this we will need to define a basis, create projection operators, and create a function rebuilder.

We will restrict ourselves to functions on the interval: \([-L,L]\). A more general technique is the Fourier Transform, which can be applied to functions on more general intervals. Many of the ideas we develop for Fourier Series can be applied to Fourier Transforms.

Note: I originally wrote this post with the interval \([0,L]\). It's more standard (and a bit easier) to use \([-L,L]\), so I've since changed things to this convention. Video has not been updated, sorry :/

Choosing Basis Functions

Our first task will be to choose a set of basis functions. We have some freedom to choose a basis as long as each basis function is normalized and is orthogonal to every other basis function (an orthonormal basis!). To check this, we need to define something equivalent to the dot product for vectors. A dot product tells us how much two vectors overlap. A similar operation for functions is integration.

Let's look at the integral of two functions multiplied together over the interval: \([-L, L]\). This will be our guess for the definition of a dot product for functions, but it is just a guess.

\(\int_{-L}^Ldx~f(x)g(x)\)

If we imagine discretizing the integral, the integral becomes a sum of values from \(f(x)\) multiplied by values of \(g(x)\), which smells a lot like a dot product. In the companion video, I'll look more at this intuition.
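Here is a small numerical illustration of that intuition (a sketch of my own; the two functions are arbitrary choices): sampling the functions on a grid turns the integral into a dot product of the sample vectors, scaled by the grid spacing.

```python
import numpy as np

L = 1.0
x = np.linspace(-L, L, 10001)
dx = x[1] - x[0]

f_samples = np.exp(-x ** 2)          # an arbitrary pair of functions on [-L, L]
g_samples = np.cos(np.pi * x / L)

# Discretizing the integral of f(x) g(x) gives a dot product of sample vectors (times dx).
integral_estimate = np.dot(f_samples, g_samples) * dx
print(integral_estimate)
```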

Now, we get to make another guess. We could choose many different basis functions in principle. Our motivation will be our knowledge that we already think about many things in terms of a frequency basis, e.g. sound, light, planetary motion. Based on this motivation, we'll let our basis functions be:

\(s_n(x) = A_n \sin(\tfrac{n\pi x}{L})\)

and

\(c_n(x) = B_n \cos(\tfrac{n\pi x}{L})\).

We need to normalize these basis functions and check that they are orthogonal. Both of these can be done through some quick integrals using Wolfram Alpha. We get

\(s_n(x) = \tfrac{1}{\sqrt{L}} \sin(\tfrac{n\pi x}{L})\)

and

\(c_n(x) = \tfrac{1}{\sqrt{L}} \cos(\tfrac{n\pi x}{L})\).

This is a different convention from what is commonly used in Fourier Series (see the Wolfram MathWorld page for more details), but it will be equivalent. You might call what I'm doing the "normalized basis" convention and the typical one is more of a physics convention (put the \(\pi\)s in the Fourier space).
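If you'd rather not take Wolfram Alpha's word for it, here is a quick numerical check (a sketch with \(L=1\)) that these basis functions are normalized and mutually orthogonal under our integral "dot product":

```python
import numpy as np

L = 1.0
x = np.linspace(-L, L, 20001)
dx = x[1] - x[0]

def s(n):
    return np.sin(n * np.pi * x / L) / np.sqrt(L)

def c(n):
    return np.cos(n * np.pi * x / L) / np.sqrt(L)

print(np.dot(s(3), s(3)) * dx)   # ~1: s_3 is normalized
print(np.dot(c(2), c(2)) * dx)   # ~1: c_2 is normalized
print(np.dot(s(3), s(2)) * dx)   # ~0: different sines are orthogonal
print(np.dot(s(3), c(3)) * dx)   # ~0: sines and cosines are orthogonal
```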

Projection Operators

Great! Now we need to find Projection Operators to help us write functions in terms of our basis. Taking a cue from the projection operators for normal vectors, we should take the "dot product" of our function with the basis vectors.

\(\text{Proj}_{s_n}(f(x)) = \int_{-L}^L dx~\tfrac{1}{\sqrt{L}} \sin(\tfrac{n\pi x}{L})f(x)=a_n\)

and

\(\text{Proj}_{c_n}(f(x)) = \int_{-L}^L dx~\tfrac{1}{\sqrt{L}} \cos(\tfrac{n\pi x}{L})f(x) = b_n\).

Function Rebuilder

Now we can rebuild our function from the coefficients (the "dot products") multiplied by the corresponding basis functions, just like we did for regular vectors.

\(f(x)=\sum\limits_{n=0}^\infty\text{Proj}_{s_n}(f(x))\,s_n(x)+\text{Proj}_{c_n}(f(x))\,c_n(x)\\ = \sum\limits_{n=0}^\infty \left(\int_{-L}^L dx'~\tfrac{1}{\sqrt{L}} \sin(\tfrac{n\pi x'}{L})f(x')\right)\tfrac{1}{\sqrt{L}} \sin(\tfrac{n\pi x}{L})+\left(\int_{-L}^L dx'~\tfrac{1}{\sqrt{L}} \cos(\tfrac{n\pi x'}{L})f(x')\right)\tfrac{1}{\sqrt{L}} \cos(\tfrac{n\pi x}{L})\\ = \sum\limits_{n=0}^\infty a_n \tfrac{1}{\sqrt{L}} \sin(\tfrac{n\pi x}{L})+b_n \tfrac{1}{\sqrt{L}} \cos(\tfrac{n\pi x}{L})\)
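To see the whole pipeline in one place, here is a continuation of the numerical sketch above (same grid and same \(s_n\), \(c_n\)): build a test function out of a few basis functions, project it onto the basis, and rebuild it from the projections.

```python
# A test function built from a few of the basis functions above (so we know the answer).
f = 0.7 * s(1) + 0.2 * c(3) - 0.4 * s(5)

# Project f onto the first several basis functions, then rebuild it from the pieces.
rebuilt = np.zeros_like(x)
for n in range(1, 10):
    a_n = np.dot(f, s(n)) * dx        # Proj_{s_n}(f)
    b_n = np.dot(f, c(n)) * dx        # Proj_{c_n}(f)
    rebuilt += a_n * s(n) + b_n * c(n)

print(np.max(np.abs(f - rebuilt)))    # ~0: the rebuilt function matches the original
```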

To recap, we've guessed a seemingly useful way of defining basis vectors, dot products, and projection operators for functions. Using these tools, we can write down a formal way of breaking a function into a sum of sines and cosines. This is what people call writing out a Fourier Series. In the third post of the series, I'll go through a couple of problems so that you can get a flavor for how this pans out.

Youtube companion to this post:

Vectors and Fourier Series: Part 1

When I was first presented with Fourier series, I mostly viewed them as a bunch of mathematical tricks for calculating coefficients. I didn't have a great idea about why we were calculating these coefficients, or why it was nice to always have these sine and cosine functions. It wasn't until later that I realized that I could apply much of the intuition I had for vectors to Fourier series. This post will be the first in a series of three that develop this intuition:

  • Part 1: Developing tools from vectors
  • Part 2: Using these tools for Fourier series
  • Part 3: A few examples using these tools

We can start with the abstract notion of a vector. We can think about a vector as just an arrow that points in some direction with some length. This is a nice geometrical picture of a vector, but it is difficult to use a picture to do a calculation. We want to turn our geometrical picture into the usual algebraic picture of vectors.

\(\vec r=a\hat x+b\hat y\)

Choosing a Basis

We will need to develop some tools to do this. One tool we will need is a basis. In our algebraic picture, choosing a basis means that we choose to describe our vector, \(\vec r\), in terms of the components in the \(\hat x\) and \(\hat y\) directions. As usual, we need our basis vectors to be linearly independent, have unit length, and we need one basis vector per dimension (they span the space).

Three steps in creating a vector.

Projection Operators

Now the question becomes: how can we calculate the components in the different directions? The way we learn to do this with vectors is by projection. So, we need projection operators: things that eat our vector, \(\vec r\), and spit out the components in the \(\hat x\) and \(\hat y\) directions. For the example vector above, this would be:

\(\begin{aligned}\text{Proj}_x(\vec r)&=a,\\\text{Proj}_y(\vec r)&=b.\end{aligned}\)

We want these projection operators to have a few properties, and as long as they have these properties, any operator we can cook up will work. We want the projection operator in the \(x\) direction to only pick out the \(x\) component of our vector. If there is an \(x\) component, the projection operator should return it, and if there is no \(x\) component, it should return zero.

Great! Because we have used vectors before, we know that the projection operators are dot products with the unit vectors.

\(\begin{aligned}\text{Proj}_x(\vec r) &= \hat x\cdot\vec r=a\\\text{Proj}_y(\vec r) &= \hat y\cdot\vec r = b\end{aligned}\)

We can also check that these projection operators satisfy the properties that we wanted our operators to have.

So, now we have a way to take some arbitrary vector—maybe it was given to us in magnitude and angle form—and break it into components for our chosen basis.

Vector Rebuilder

The last tool we want is a way of taking our components and rebuilding our vector in our chosen basis. I don't know of a mathematical name for this, so I'm going to call it a vector rebuilder. We know how to do this with our basis unit vectors:

\(\vec r = a\hat x+b\hat y = \text{Proj}_x(\vec r) \hat x+\text{Proj}_y(\vec r)\hat y = \sum_{e=x,y}\text{Proj}_e(\vec r)\hat e.\)
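Here is the same idea in a few lines of numpy (a minimal sketch): project a 2D vector onto \(\hat x\) and \(\hat y\), then rebuild it from the components.

```python
import numpy as np

x_hat = np.array([1.0, 0.0])
y_hat = np.array([0.0, 1.0])

r = np.array([3.0, 4.0])            # some vector, perhaps given originally as a magnitude and angle

a = np.dot(x_hat, r)                # Proj_x(r) = 3.0
b = np.dot(y_hat, r)                # Proj_y(r) = 4.0

r_rebuilt = a * x_hat + b * y_hat   # rebuild the vector from its components
print(np.allclose(r, r_rebuilt))    # True
```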

So, to recap, we have developed three tools:

  • Basis: chosen set of unit vectors that allow us to describe any vector as a unique linear combination of them.
  • Projection Operators: set of operators that allow us to calculate the components of a vector along any basis vector.
  • Vector Rebuilder: expression that gives us our vector in terms of our basis and projection operators.

This may seem silly or overly pedantic, but in Part 2, I'll (hopefully) make it clear how we can develop these same tools for Fourier analysis and use them to gain some intuition for the goals and techniques used.

Youtube companion to this post: