ML From First Principles: Identifying Iris Types

2nd October 2022

Iris Setosa

Iris Virginica (photo by Eric Hunt)

Iris Versicolor (photo by D. Gordon E. Robertson)

Background and model

The "Iris Data Set" is a well known dataset which includes real measurements taken from three species of iris (Setosa, virginica, and versicolour). The measurements are of the length and width of the petals and sepals of 50 flowers of each species. The data comes from a 1936 paper "The Use of Multiple Measurements in Taxonomic Problems". In this paper, R. A. Fisher shows that the species can be discriminated using linear functions of the petal and sepal geometries, even though they individually do not describe the species. Nowadays, we have mush quicker ways to find and perform the operations described in the paper. In this blog I describe the use of a simple neural network to do so. The code can be found here.

Figure 1: The flower species by each pairwise combination of input parameters. Setosa, Virginica, Versicolor. Qualitatively, Setosa is the easiest to distinguish, and no single pair of parameters separates the other two species.

Figure 2: The network structure used consisted of 1 input, 1 hidden, and 1 output layer. Bias nodes were added to allow the network to use offsets. Note that these were treated as unconnected nodes so that the optimiser could treat them like any other node. This has a slight performance penalty, with the advantage of simpler code. Even for such a simple network I ended up with 35 individual weights to optimise: (4 inputs + 1 bias) × 4 hidden nodes, plus (4 hidden nodes + 1 bias) × 3 outputs.

A (simple) neural network uses layers of nodes. Each node takes a linear combination of the nodes in the previous layer, with the combination described by weights θ, and passes it through a sigmoid activation. I used a three-layer structure in this network, as described in Figure 2. Our goal is to find values of θ such that the network's outputs are as close as possible to the ground truth. Here, the target is a 1 at the correct flower's node and a 0 at the others. Using this as the goal, we can define the cost as how far we are from it. In this network I used "Binary Cross Entropy" (BCE) as the measure of my distance from the goal. This is shown in Figure 3.
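To make the structure concrete, here is a minimal sketch of a forward pass through the network of Figure 2 (the 4-4-3 layer sizes follow from the 35-weight count above; the function names and NumPy layout are my own, not necessarily those of the linked code):

```python
import numpy as np

def sigmoid(z):
    # Squash each linear combination into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta1, theta2):
    """One forward pass through the 4-4-3 network sketched in Figure 2.

    x: (4,) input measurements; theta1: (4, 5) hidden-layer weights
    (the 5th column multiplies the bias node); theta2: (3, 5) output
    weights. Returns one sigmoid output per flower class.
    """
    a1 = np.append(x, 1.0)       # input layer plus bias node
    a2 = sigmoid(theta1 @ a1)    # hidden layer activations
    a2 = np.append(a2, 1.0)      # hidden layer plus bias node
    return sigmoid(theta2 @ a2)  # one output node per class
```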

Optimisation

Figure 3: The cost function as defined and as implemented in code. Here C is the set of three classes, i indexes the training data, y is the ground truth (0 or 1), and h(x) is the output of the final-layer nodes (passed through a sigmoid). I did not use softmax, as the sigmoid was sufficient for this simple network.
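Since Figure 3 is an image, the cost it defines can be written out as follows (my reconstruction from the caption's definitions, with m the number of training examples):

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{c \in C}
  \left[ y_{i,c} \log h_c(x_i) + \left(1 - y_{i,c}\right) \log\left(1 - h_c(x_i)\right) \right]
```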

We want to optimise our weights with respect to minimising this cost function, i.e. making the network outputs as close as possible to the ground truth of the training data. To do this we evaluate two things: the cost as defined above, and the partial derivative of the cost with respect to each of the weight parameters. We already have the cost function from above. To determine the partial derivatives there are two approaches. The first is to approximate the function as linear in a small range (x-δ, x+δ). We sample the function above and below the point of interest and calculate the slope between these points. This works in general for any function. However, we can do better, because the structure of the function is known. Each weight has several operations performed on it before it reaches the output node, so we can use the chain rule to calculate the partial derivative as the product of the derivatives of each of those operations. The nice thing about the choice of the sigmoid and the BCE cost function is that some of the terms cancel, leaving a simple partial derivative equation which is also nicely proportional to the error on the output. This is shown in equation and code form in Figure 4.
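A sketch of the numerical approach (central differences, one weight at a time; the `cost` callable and the choice of delta are illustrative assumptions):

```python
import numpy as np

def numerical_gradient(cost, theta, delta=1e-4):
    """Approximate dJ/dtheta by finite differences.

    cost: maps a flat weight vector to the scalar J(theta). Note the
    two full cost evaluations per weight: this is what makes the
    numerical route so much slower than backpropagation.
    """
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = delta
        # Slope of the secant through theta - delta and theta + delta
        grad[j] = (cost(theta + step) - cost(theta - step)) / (2 * delta)
    return grad
```

For the analytical route, the sigmoid/BCE cancellation mentioned above means the error term at each output node reduces to h(x) − y, which the chain rule then propagates back through the hidden layer.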

Now that we have the function and its derivatives, we can use gradient descent to optimise the weights. This simply adjusts each parameter in steps by a small amount (alpha) in the direction which minimises the cost function. For a reasonably smooth function this should eventually find a local (or hopefully global) minimum across all the input variables. The parameter alpha can be optimised experimentally, as Figure 5 shows. If alpha is too small, the model converges slowly, and may additionally get caught in small local optima. If alpha is too large, the model becomes unstable and oscillates. The correct value lies somewhere in the middle.
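The update loop itself is only a few lines. This sketch assumes `cost` and `grad` close over the training data, and the default alpha is illustrative:

```python
def gradient_descent(cost, grad, theta, alpha=0.1, iterations=1000):
    """Batch gradient descent over the flat weight vector theta.

    grad can be either the numerical or the analytical
    (backpropagation) gradient; the loop is identical.
    """
    history = []
    for _ in range(iterations):
        theta = theta - alpha * grad(theta)  # step downhill in every weight
        history.append(cost(theta))          # track convergence
    return theta, history
```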

Figure 5: Optimising the learning rate experimentally

Results: Runtime

Three different approaches to optimising the weights of this network were used: the numerical and analytical ones described above, as well as a PyTorch implementation. The training using the numerical and analytical approaches is shown in Figure 6. They both converge in about the same number of iterations, although there is an interesting difference between the two in the middle of the training. However, this shows iteration number and not the number of computational cycles used. Looking at the runtime figures, the backpropagation algorithm took 0.025s per iteration, compared with 1.07s for the numerical approach. This makes backpropagation roughly 40 times faster!

The PyTorch implementation used approximately the same model, but with a built-in optimiser (and many fewer lines of code). This model took 0.0014s per iteration, making it 18x faster than my own backpropagation implementation, and 740x faster than the numerical optimisation method.
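A sketch of what that PyTorch version might look like (the layer sizes mirror Figure 2, but the optimiser settings and exact structure are my assumptions, not necessarily those of the linked repository):

```python
import torch
from torch import nn

# 4 input features -> 4 hidden nodes -> 3 class outputs. PyTorch's
# Linear layers carry their own bias terms, replacing the explicit
# bias nodes of Figure 2.
model = nn.Sequential(
    nn.Linear(4, 4), nn.Sigmoid(),
    nn.Linear(4, 3), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()  # the same BCE cost as Figure 3
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

def training_step(x, y):
    optimiser.zero_grad()        # clear gradients from the last step
    loss = loss_fn(model(x), y)  # forward pass plus cost
    loss.backward()              # autograd backpropagation
    optimiser.step()             # gradient descent update
    return loss.item()
```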

Figure 6: Training the network with different approaches to finding the partial derivatives.

Results: Learning process

The dataset used had a total of 150 datapoints. I used 113 of these for training and tested on the remaining 37. In general, all the models would correctly identify 36-37 of these, depending on the randomly chosen training set. The interesting part is therefore to investigate how, and how quickly, the model learns. Figure 7 shows the loss by training epoch. The model quickly separates Setosa from the other two iris types. Learning then slows as it differentiates the remaining two, before jumping quickly to a complete classification. Some training runs also showed an early spike in correct classification, as seen here, which identified over half of the Virginica flowers. However, this strategy does not seem to generalise to the other species, and it is later trained away.
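The split itself can be done with a simple shuffle (a sketch; the 113/37 numbers come from the post, while the function and seed handling are mine):

```python
import numpy as np

def train_test_split(X, y, n_train=113, seed=None):
    # Shuffle all 150 rows, then cut into 113 training and 37 test points
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    train, test = order[:n_train], order[n_train:]
    return X[train], y[train], X[test], y[test]
```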

Figure 7: a) The loss function decreasing and the proportion of flowers correctly classified increasing with time. b) The proportion of flowers correctly classified, split by flower type. Note that these data correspond to different training sets.

As this network is so small, we can investigate how it works by examining the weights and the nodes. I define a node's "importance" as the average of the node's value multiplied by the weights leaving it. We can compare this across the different flower types to see how the model separates them.
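One way to compute that definition (my reading of it; the averaging in the post's own code may differ):

```python
import numpy as np

def node_importance(activations, outgoing_weights):
    """Importance of each node: its average value across the data,
    scaled by each weight leaving it.

    activations: (m, n) node values over m examples.
    outgoing_weights: (k, n) weights from the n nodes to the k
    next-layer nodes. Returns (k, n) per-connection importances.
    """
    mean_activation = activations.mean(axis=0)  # average node value
    return outgoing_weights * mean_activation   # broadcast across rows
```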

Looking at the inputs, it seems that the model leans on the petal dimensions to separate Setosa and Virginica. Then, in the second layer, nodes 2-4 differentiate Setosa and Versicolor. Interestingly, Virginica is the "default" flower chosen, with all the nodes either cancelling out or being 0. Also interesting is that, although during training Versicolor and Virginica appear to be learnt together by being compared with each other, the weights suggest that both flower types are in fact compared independently to Setosa.

All the code used for this post can be found on GitHub: https://github.com/seoirsem/Iris-Classifier