
Markov chains are all you need (to write bad Christmas songs)

Large language models (LLMs) are all the rage these days. They can write (occasionally buggy) code, help you write generic-sounding prose quickly…but can they write ungrammatical Christmas songs? Call me a Luddite, but I’m skeptical.

Luckily, there is a language model out there that doesn’t require 1TB of GPU RAM and a nuclear-reactor-powered data center to .eval(). In fact, with a little determination, you could probably run it natively in a spreadsheet…or with pen and paper.

That’s right, I’m talking about Markov chains (of the first order).

But remember when you gave it must be
Let's lay here tonight
But it's a rock
Where the old and the old and that's not a new year
Sure did see mommy kissing santa comes around the mistletoe hung by the king pa rum bu bum
We all year i'll give us annoyed
And i want for christmas christmas is coming to need you better be glad you're sleeping
You my wish come
And a beard that's past
We wish you baby
I'll have a crowded room friends on his sled
Must be

Markov chains are a simple yet ubiquitous probabilistic model that assumes that the probability of what happens next only depends on what is happening now. For a language model, this means that the probability of what word (or token if you’re a nerd) comes next only depends on what the current word is. If we run this process repeatedly, word-after-word, we get things resembling sentences.
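To make that concrete, here’s a minimal sketch in Python (a toy corpus of mine, not the actual code behind these lyrics). The entire “model” amounts to: look up the current word, then pick, uniformly at random, one of the words that has followed it in the training text.

    import random
    from collections import defaultdict

    # Toy corpus; any pile of lyrics would do here.
    corpus = "oh jingle bells jingle bells jingle all the way".split()

    # For each word, remember every word that has followed it. Sampling
    # uniformly from this list reproduces the empirical transition probabilities.
    followers = defaultdict(list)
    for current, nxt in zip(corpus[:-1], corpus[1:]):
        followers[current].append(nxt)

    # Generate: the next word depends only on the current word.
    word = random.choice(corpus)
    generated = [word]
    for _ in range(8):
        if word not in followers:  # this word was never followed by anything
            break
        word = random.choice(followers[word])
        generated.append(word)
    print(" ".join(generated))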

I can’t say I’m a huge fan of Christmas pop/rock songs (though I can evaluate their likelihood). I also think they’re corny.

Guide us some decorations of dough
She didn't see
And practicing real famous cat all your soul of mouths to see me
I've got it over
Underneath the jingle bell time for you will you dear if you cheer
You gave you and bright
The very merry christmas girl
Jingle bell jingle bell rock the halls
Let love you dismay
I want it felt like i'm watching it over again
So long long distance
It's lovely weather for christmas all year it always this christmas to succeed

But I don’t want to deal with “gradients” or “descent” or the latest attempt to retcon RNNs (I prefer my BPTT to run sequentially and then vanish…or explode). In the words of Todd Rundgren: “I don’t want to work, I just want to” histogram some lyrics I copied off the internet.

So. I downloaded the lyrics to 30-some pop/rock Christmas songs, fired up a Jupyter notebook, deleted some punctuation, and (with a mere one vector and one matrix) started generating lyrics and posted them here.
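For the curious, here’s roughly what that looks like (with a two-line stand-in corpus, since the scraped lyrics aren’t reproduced here): the vector holds the probability of each word starting a line, and the matrix holds the word-to-word transition probabilities.

    import numpy as np

    # Stand-in corpus; imagine ~30 songs' worth of punctuation-stripped lyrics.
    lines = ["we wish you a merry christmas", "jingle bell jingle bell rock"]
    words = " ".join(lines).split()  # (this sketch ignores line boundaries)
    vocab = sorted(set(words))
    idx = {w: i for i, w in enumerate(vocab)}

    # The matrix: word-to-word transition counts, normalized per row.
    T = np.zeros((len(vocab), len(vocab)))
    for a, b in zip(words[:-1], words[1:]):
        T[idx[a], idx[b]] += 1
    row_sums = T.sum(axis=1, keepdims=True)
    T = np.divide(T, row_sums, out=np.zeros_like(T), where=row_sums > 0)

    # The vector: the probability of each word starting a line.
    p0 = np.zeros(len(vocab))
    for line in lines:
        p0[idx[line.split()[0]]] += 1
    p0 /= p0.sum()

    # Walk the chain to generate a "lyric".
    rng = np.random.default_rng()
    i = rng.choice(len(vocab), p=p0)
    out = [vocab[i]]
    for _ in range(7):
        if not T[i].any():  # dead end: word has no observed successor
            break
        i = rng.choice(len(vocab), p=T[i])
        out.append(vocab[i])
    print(" ".join(out))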

Side note: in 1990, AC/DC released a Christmas song titled “Mistress for Christmas,” about Donald Trump cheating on his wife. I had no idea.

You baby
I'm watching it from tears
'cause i believed in trouble out you guys know what time i know
And unto certain
Ding dong
Oh my bedroom fast asleep
A merry-go-round
You and cheer
A beard that's white
I thought you gave you make us some money to the halls
Don't want to set before the world will share this one
Hurry tell him hurry down

Mostly, what I’ve learned is that Christmas songs are weird and often not really about Christmas at all. I’ll post the data and code soon so that you too can generate bad Christmas lyrics using AI.

My bedroom fast asleep
Ho ho
She got it
You've really can't stay
Oh what a happy new year i'll have termites in trouble out there all over now i've missed 'em
The window at a poor boy child what have some magic reindeer to thy perfect light shine on a button nose
So i thrill when you a cactus
This tear
I'll give all need love breathe
I'll give it to someone
I wouldn't touch you have fun if you want it as opportunistic
The mistletoe last year to town

How does learning happen in machine learning?

Machine learning is a broad set of techniques that can be used to identify patterns in data and use these patterns to help with some task, like early seizure detection/prediction, automated image captioning, automatic translation, etc. Most machine learning techniques have a set of parameters which need to be tuned, and many of them rely on a very simple idea from calculus in order to tune these parameters.

Machine learning is a mapping problem

Machine learning algorithms generally have a set of parameters, which we’ll call $$\theta$$, that need to be tuned in order for the algorithm to perform well at a particular task. An important question in machine learning is how to choose the parameters, $$\theta$$, so that the algorithm performs the task well. Let’s look at a broad class of machine learning algorithms which take in some data, $$X$$, and use that data to make a prediction, $$\hat y$$. The algorithm can be represented by a function which makes these predictions,

\( \hat y = f(X;\theta)\).

This notation means that we have some function or mapping $$f(.; \theta)$$ which has parameters $$\theta$$. Given some piece of data $$X$$, the algorithm will make some prediction $$\hat y$$. If we change the parameters, $$\theta$$, then the function will produce a different prediction.

If we choose random values for $$\theta$$, there is no reason to believe that our mapping, $$f(.; \theta)$$, will do anything useful. But in machine learning, we always have some training data which we can use to tune the parameters, $$\theta$$. This training data will have a bunch of input data, which we can label $$X_i,\ i = 1, 2, 3, \ldots$$, and a bunch of paired labels, $$y_i,\ i = 1, 2, 3, \ldots$$, where $$y_i$$ is the correct prediction for $$X_i$$. Often, this training data has either been created by a simulation or labeled by hand (which is generally very time- and money-consuming).
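To make this concrete, here is a tiny example (my choice of mapping, purely for illustration) where $$f$$ is a line and $$\theta$$ holds its slope and intercept:

    import numpy as np

    # f(X; theta): a line, with theta = (slope, intercept).
    def f(X, theta):
        slope, intercept = theta
        return slope * X + intercept

    # Toy training data: y_i is the correct prediction for X_i.
    X_train = np.array([0.0, 1.0, 2.0, 3.0])
    y_train = np.array([1.0, 3.0, 5.0, 7.0])  # generated from y = 2x + 1

    theta = np.array([0.0, 0.0])  # arbitrary starting parameters
    print(f(X_train, theta))      # all zeros -- not useful until tuned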

Learning is parameter tuning

Now that we have ground-truth labels, $$y_i$$, for our training data, $$X_i$$, we can then evaluate how bad our mapping, $$f(.; \theta)$$, is. There are many possible ways to measure how bad $$f(.; \theta)$$ is, but a simple one is to compare the prediction of the mapping $$\hat y_i$$ to the ground-truth label $$y_i$$,

\( y_i-\hat y_i \).

Generally, mistakes which cause this error to be positive or negative are equally bad, so a good measure of the error would be:

\(( y_i-\hat y_i)^2 =( y_i-f(X_i;\theta))^2\).

When this quantity is equal to zero for every piece of data, we are doing a perfect mapping; the larger it is, the worse our function is at prediction. Let’s call this quantity, summed over all of the training data, the error:

\(E(\theta)=\sum_i( y_i-f(X_i;\theta))^2\).

So, how do we make this quantity small? One simple idea from calculus is called gradient descent. If we can calculate the derivative, or gradient, of the error with respect to $$\theta$$, then we know that if we move in the opposite direction (downhill), our error should get smaller. In order to calculate this derivative, our mapping, $$f(.;\theta)$$, needs to be differentiable with respect to $$\theta$$.

So, if we have a differentiable $$f(.;\theta)$$, we can compute the gradient of the error function with respect to $$\theta$$,

\(\frac{\partial E(\theta)}{\partial \theta}=\frac{\partial}{\partial \theta}\sum_i( y_i-f(X_i;\theta))^2=-2\sum_i( y_i-f(X_i;\theta))\frac{\partial f(X_i;\theta)}{\partial \theta}\).

If we have this derivative, we can then adjust our parameters, $$\theta^t$$, such that our error is a bit smaller,

\(\theta^{t+1} = \theta^t - \epsilon\frac{\partial E(\theta^t)}{\partial \theta}\)

where $$\epsilon$$ is a small scalar. Now, if we repeat this process over and over, the value of the error, $$E(\theta)$$, should get smaller and smaller as we keep updating $$\theta$$. Eventually, we should reach a (local) minimum, at which point the gradient becomes zero and we can stop updating the parameters. This process is shown (in a somewhat cartoonish way) in the figure below.

If the error function is shaped somewhat like a bowl as a function of some parameter theta, we can calculate the derivative of the bowl and walk downhill to the bottom.
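Continuing the toy linear example from above, a bare-bones gradient descent loop might look like this (the step size and iteration count are arbitrary choices, not prescriptions):

    import numpy as np

    # Same toy setup as before: a line fit to data from y = 2x + 1.
    def f(X, theta):
        slope, intercept = theta
        return slope * X + intercept

    X_train = np.array([0.0, 1.0, 2.0, 3.0])
    y_train = np.array([1.0, 3.0, 5.0, 7.0])
    theta = np.array([0.0, 0.0])
    epsilon = 0.01  # the small scalar step size

    for t in range(1000):
        residual = y_train - f(X_train, theta)
        # dE/dtheta = -2 * sum_i residual_i * df/dtheta,
        # with df/dslope = X and df/dintercept = 1.
        grad = np.array([-2 * np.sum(residual * X_train),
                         -2 * np.sum(residual)])
        theta = theta - epsilon * grad

    print(theta)  # approaches [2.0, 1.0]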

Extensions and exceptions

This post presented a slightly simplified picture of learning in machine learning. I’ll briefly mention a few of the simplifications.

The terms error function, objective function, and cost function are all used basically interchangeably in machine learning. In probabilistic models you may also see likelihoods or log-likelihoods, which are similar to a cost function except that they are set up to be maximized rather than minimized. Since people (physicists?) like to minimize things, negative log-likelihoods are also used.

The squared-error function was a somewhat arbitrary choice of error function. It turns out that, depending on what sort of problem you are working on (e.g., classification or regression), you may want a different type of cost function. Many of the commonly used error functions can be derived from the idea of maximum-likelihood learning in statistics.

There are many extensions to simple gradient descent that are more commonly used in practice, such as stochastic gradient descent (SGD), SGD with momentum, fancier methods like Adam, and second-order methods.

Not all learning techniques for all models are (or were initially) based on gradient descent. The first learning rule for Hopfield networks was not based on gradient descent, although the proof that inference converges was based on (non-gradient) descent. In fact, that original rule has since been replaced with a more modern version based on gradient descent of an objective function.