Understanding the Perceptron
08 September 2022

The Perceptron is the granddaddy of modelling the brain in modern AI, or at least that's the impression I get. It has a chequered history of being well liked and then disliked; see the wiki page for its history. I guess sort of like Elon Musk, although he may yet redeem himself in the eyes of his haters. I suspect the Perceptron is now only ever an educational tool.

My hope is that implementing it using only Python is a good first step to learning modern AI. Warning: as I am very slowly learning the field of AI, anything written here is probably wrong or based on incorrect assumptions.

As a student at school or university you have to write about a topic. Once you leave that structured learning environment you are no longer given that type of task. There is a reason schools ask you to do it. Writing about something helps you to uncover areas that you do not quite understand or struggle with. This article is me doing what I would have had to do at school.

For you, the problem is that I am both the student and the marker. That means I am under no obligation to write about things that I understand; instead I am seeking out areas I do not.

With that in mind, this article is probably not a good place to learn about Perceptrons.

Having read about Perceptrons in the book Python Machine Learning and on the wiki page, it was time to write an implementation. I was careful to avoid looking at existing implementations, which meant I got to struggle with converting my understanding of it at a math level into understanding it at a code level. That maximised my learning.

A perceptron is a classifier. That means you give it some inputs and it tells you which one of two buckets the sample belongs in. The output is binary, and the perceptron only works if the data is linearly separable. That means if you plot the data on a graph you can draw a straight line that separates the two classes. Or a plane (a hyperplane, in general) if the data has more than two dimensions.
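
To make "linearly separable" concrete, here is a minimal sketch. The helper name `side_of_line`, the line `x + y = 3`, and the toy points are all made up for illustration; none of it comes from the book or the wiki page.

```python
def side_of_line(point, weights, bias):
    """Return 1 if the point is on the positive side of the line, else 0."""
    total = sum(w * p for (w, p) in zip(weights, point)) + bias
    return int(total > 0)


# Hypothetical toy data: two clusters on either side of the line x + y = 3.
points = [(0.5, 1.0), (1.0, 1.5), (3.0, 2.0), (2.5, 3.5)]
labels = [0, 0, 1, 1]

# Hand-picked line x + y - 3 = 0, i.e. weights (1, 1) and bias -3.
weights, bias = (1.0, 1.0), -3.0

sides = [side_of_line(p, weights, bias) for p in points]
separated = sides == labels
print(sides, separated)  # [0, 0, 1, 1] True
```

Because one straight line puts every point on the correct side, this toy data is linearly separable. Data laid out like XOR, where no single line can do that, is the classic case a perceptron cannot learn.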

I could draw a diagram just like in every Perceptron article but I am lazy and don't really have much trouble with the math. See link above for nice images.

\[ prediction = \sum_{i=1..n}{ w_ix_i} + bias > 0 \]

From the equation you can see the prediction is binary, outputting either one or zero. It gets better: while the bias is a tunable parameter (normally learned alongside the weights), for a first pass you can just set it to zero and not alter it. The price is that the separating line is then forced to pass through the origin. I don't really know of any real uses of the perceptron outside of it being a good learning algorithm, so let's set the bias to zero and forget about it, leaving us with

\[ prediction = \sum_{i=1..n}{ w_i x_i} > 0 \]

\(w_i\) is the weight component that multiplies a similarly indexed value from the sample vector. The weights vector is what we are trying to adjust to maximise the number of correct classifications. \(X\) is the sample so \(x_i\) is one component (feature).

If we say \(W\) is the vector of weights and \(X\) is the input sample (another vector of values), then doing a component-by-component multiplication and summing those values has a name: the dot product. Let's put that into the equation.

\[ prediction = dot(X,W) > 0 \]

Well, this is starting to look like something we can implement in some fairly basic Python. Let's start with the dot product.

def dot(X, Y):
    assert len(X) == len(Y)
    return sum([x * y for (x, y) in zip(X, Y)])
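
As a quick sanity check, here is a small usage sketch. It repeats `dot` so the snippet runs on its own; the weights and the sample are made-up numbers, picked only so the arithmetic is easy to follow by hand.

```python
def dot(X, Y):
    # Same function as above, repeated so this snippet is self-contained.
    assert len(X) == len(Y)
    return sum([x * y for (x, y) in zip(X, Y)])


# Made-up weights and sample with three features each.
W = [0.5, -1.0, 2.0]
X = [4.0, 1.0, 0.5]

weighted_sum = dot(X, W)            # 4.0*0.5 + 1.0*(-1.0) + 0.5*2.0 = 2.0
prediction = int(weighted_sum > 0)  # the sum is positive, so the prediction is 1
print(weighted_sum, prediction)     # 2.0 1
```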

They don't call it machine learning for nothing, so we need a way for the computer to learn, that is, to adjust the weights. Give it a bit of data, let it make a guess, compare the guess against a known good answer, and use the difference to adjust the weights in a way that improves the guess.

The weights are adjusted after each sample rather than after a whole pass over the data (an epoch) has been evaluated. This per-sample approach is usually called online learning, a name that has not quite sunk into my memory just yet. Here comes another equation!

\[ W_{new} = W_{current} + \eta(expected - predicted)X\]

To adjust the weights we take the current sample (a vector of numbers) and multiply it by the difference between the \(expected\) value and the \(predicted\) value (the one we calculate), then multiply it by \(\eta\) and add this vector to the current weight vector.
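
Here is a minimal sketch of one such update. The function name `update_weights` and all the numbers are assumptions made for illustration, and the learning rate of 0.5 is chosen only so the floating point arithmetic comes out exact.

```python
def update_weights(weights, sample, expected, predicted, learning_rate):
    """Apply W_new = W_current + eta * (expected - predicted) * X once."""
    error = expected - predicted
    return [w + learning_rate * error * x for (w, x) in zip(weights, sample)]


# Made-up situation: the perceptron guessed 0 but the right answer was 1.
weights = [0.0, 0.0]
sample = [2.0, 1.0]

new_weights = update_weights(
    weights, sample, expected=1, predicted=0, learning_rate=0.5
)
print(new_weights)  # [1.0, 0.5]
```

The weights have moved towards the sample, so the weighted sum for that same sample is now positive and it would be classified as 1.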

What's \(\eta\)? It's a number that we call the learning rate. I set it to 0.1 and was done. It is another tunable number: the usual advice is that if you set it too big the weights will jump around too much and might not converge on a result, and if you set it too small it will take ages to converge, like go make some coffee and watch or read The Lord of the Rings. Actually, if you know a little bit about floating point numbers, then set it too small and it might never converge. Like I said, set it to 0.1 and see how it goes.
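
One wrinkle that seems specific to this stripped-down perceptron (zero bias, weights starting at zero): as far as I can tell, the learning rate only scales the weights, and scaling by a positive number never changes the sign of the dot product, so the sequence of guesses stays the same. The sketch below checks that on made-up data. `train` is a hypothetical helper written for this check, and the rates 1.0 and 0.5 are chosen because halving is exact in floating point.

```python
def dot(X, Y):
    return sum(x * y for (x, y) in zip(X, Y))


def train(samples, labels, learning_rate, epochs=5):
    """Hypothetical helper: per-sample updates, zero bias, zero starting weights."""
    weights = [0.0] * len(samples[0])
    guesses = []  # every prediction made during training, in order
    for _ in range(epochs):
        for (x, expected) in zip(samples, labels):
            predicted = int(dot(weights, x) > 0)
            guesses.append(predicted)
            error = expected - predicted
            weights = [w + learning_rate * error * xi for (w, xi) in zip(weights, x)]
    return weights, guesses


# Made-up data, separable by a line through the origin.
samples = [[1.0, 3.0], [-1.0, 3.0], [2.0, -3.0], [-2.0, -1.0]]
labels = [1, 0, 1, 0]

weights_a, guesses_a = train(samples, labels, learning_rate=1.0)
weights_b, guesses_b = train(samples, labels, learning_rate=0.5)

print(guesses_a == guesses_b)  # True
print(weights_a, weights_b)    # [2.0, 0.0] [1.0, 0.0]
```

Both runs make the same guesses and the same two mistakes; only the size of the final weights differs. That would stop being true with non-zero starting weights or a learned bias, so treat it as a property of this toy setup rather than of learning rates in general.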

Remember you are never going to use a perceptron in production, we can hack how we like :)

Did you think about the \((expected - predicted)\) part? Well, if the prediction is correct it is zero, so you don't adjust the weights. Get it wrong and it is either -1 or 1. Suddenly this equation is starting to look pretty simple and easy to understand.
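
Since both values are either 0 or 1, there are only four cases to consider. This small sketch just enumerates them; the variable names are mine.

```python
# Enumerate every (expected, predicted) pair a binary classifier can produce.
cases = []
for expected in (0, 1):
    for predicted in (0, 1):
        error = expected - predicted
        cases.append((expected, predicted, error))
        print(f"expected={expected} predicted={predicted} error={error:+d}")
```

An error of +1 (guessed 0, should have been 1) adds a fraction of the sample to the weights, which raises the dot product for that sample. An error of -1 subtracts it, which lowers the dot product.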

Here's all of the code:

class Perceptron:
    def __init__(self, learning_rate=0.1):
        self.weights = []
        self.learning_rate = learning_rate

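    def fit(self, X, expected_values, num_iters=10):
        # X is the list of input samples; expected_values holds the matching 0/1 labels.
        # Start from all-zero weights, one per feature.
        self.weights = [0.0] * len(X[0])
        error_measure = []  # total absolute error for each pass over the data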

        for _ in range(num_iters):
            acc_error = 0.0

            for (xi, expected) in zip(X, expected_values):
                # xi is our individual input data
                y = int(dot(self.weights, xi) > 0)  # y, the predicted class (0 or 1)
                error = expected - y  # 0 if correct, otherwise -1 or 1
                acc_error += abs(error)
                # Nudge each weight by learning_rate * error * its input component.
                self.weights = [
                    w + self.learning_rate * error * x
                    for (w, x) in zip(self.weights, xi)
                ]
            error_measure.append(acc_error)

        return error_measure

    def predict(self, X):
        # Only use after training.
        Y = []
        for x in X:
            Y.append(int(dot(self.weights, x) > 0))

        return Y
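
To try the class out, here is a hedged usage sketch on made-up data. So that the snippet runs on its own, it repeats `dot` and a condensed copy of `Perceptron` that is meant to behave the same as the listing above. The samples and labels are invented, and they are deliberately separable by a line through the origin because there is no bias term.

```python
def dot(X, Y):
    assert len(X) == len(Y)
    return sum(x * y for (x, y) in zip(X, Y))


class Perceptron:
    """Condensed copy of the class above so this snippet is self-contained."""

    def __init__(self, learning_rate=0.1):
        self.weights = []
        self.learning_rate = learning_rate

    def fit(self, X, expected_values, num_iters=10):
        self.weights = [0.0] * len(X[0])
        error_measure = []
        for _ in range(num_iters):
            acc_error = 0.0
            for (xi, expected) in zip(X, expected_values):
                error = expected - int(dot(self.weights, xi) > 0)
                acc_error += abs(error)
                self.weights = [
                    w + self.learning_rate * error * x
                    for (w, x) in zip(self.weights, xi)
                ]
            error_measure.append(acc_error)
        return error_measure

    def predict(self, X):
        return [int(dot(self.weights, x) > 0) for x in X]


# Made-up data: class 1 sits on one side of the origin, class 0 on the other.
samples = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]]
labels = [1, 1, 0, 0]

model = Perceptron(learning_rate=0.1)
errors = model.fit(samples, labels, num_iters=5)

print(errors)                  # [1.0, 0.0, 0.0, 0.0, 0.0]
print(model.predict(samples))  # [1, 1, 0, 0]
```

The error list is the number of wrong guesses in each pass over the data. Here the very first sample is misclassified (all weights are zero, so the sum is not greater than zero), one update fixes it, and every later pass is clean.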

I wanted it to fit in with the code here, which is pretty much an updated version of the hard copy of the book.

Conclusion

It is probably odd to spend so much time on this simple, no-longer-used classification technique. I really wanted to play with it, simplify it, and just plain understand it as deeply as I can at this stage of learning the field.

A side effect of this spent time is that it has taught me the value of:

- Pandas (for loading data)
- Matplotlib (for plotting)
- Jupyter Notebooks
- VSCode integration for Jupyter Notebooks

Jupyter Notebooks are something I have long known about, but I was not expecting them to be so easy to install and use. In particular, VSCode has great integration with them.

The actual algorithm moved from "What is it?" on to "Oh, I have coded it" and then finally "Yep, I do understand it".

Porting the code to NumPy was simple, almost trivial. I should probably add that to the list of things that this small project has started to help me understand.