Figure 1: Corneal cells examined using specular microscopy, from the paper (left) and from a more modern dataset (right).
Zhang et al., 1991, "Image processing of human corneal endothelium based on a learning network"
Introduction
The corneal endothelium is a single layer of cells on the inner surface of the cornea which governs fluid transport to the cornea. The cells are normally uniformly sized and roughly hexagonal, and examining the cell morphology (shape, area distribution, etc.) can be a useful way of studying disease. Using specular microscopy, it is possible to examine these cells in vivo.
Thirty years ago, any examination of these cells had to be performed manually. Zhang et al. (1991) attempted to use machine learning to detect cell boundaries automatically, with minimal human input and without making statistical assumptions. Figure 1 above shows some example images. In the nominal case this looks like a simple edge-detection problem. The paper has been poorly scanned, but even so it is possible to see the variable lighting and focus conditions which could be tackled using a simple neural net.
My goal was to replicate this work, examine the paper in the light of modern techniques and conventions, and see if there was room for simple improvements. Figure 2 shows the structure of the paper's approach. The pre- and post-processing are both relatively simple image processing techniques, but the neural network is the focus of our attention.
Figure 2: The figures from the paper describing the pipeline and neural network structure (roughly a three-layer CNN)
The Neural Network
The network architecture is a relatively simple three-layer CNN with (bipolar) sigmoid activations. It is early enough that the network is described in the terms of Fukushima's (1980) "Neocognitron": structures inspired by the connectivity patterns of neurons in the visual cortex, which were the original inspiration for CNNs and are built from "receptive fields" and "clusters". Figure 3 shows the setup used in 1991, with 3, 2, and 1 channels and a kernel size of 11, giving a network of 1337 parameters.
The paper uses a bipolar sigmoid activation, equivalent to tanh(x/2). This simply allows the network to work in the range [-1,1] instead of [0,1].
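As a concrete reference, here is a minimal PyTorch sketch of how I read that architecture. The placement of an activation after every convolution is my interpretation rather than something stated verbatim in the paper.

```python
import torch
import torch.nn as nn

class BipolarSigmoid(nn.Module):
    # f(x) = (1 - e^-x) / (1 + e^-x), i.e. tanh(x/2), with outputs in [-1, 1].
    def forward(self, x):
        return torch.tanh(x / 2)

# Three unpadded 11x11 convolutions with 3, 2, and 1 output channels.
net = nn.Sequential(
    nn.Conv2d(1, 3, kernel_size=11), BipolarSigmoid(),
    nn.Conv2d(3, 2, kernel_size=11), BipolarSigmoid(),
    nn.Conv2d(2, 1, kernel_size=11), BipolarSigmoid(),
)

# (1*3 + 3*2 + 2*1) * 11 * 11 weights + (3 + 2 + 1) biases = 1337 parameters
print(sum(p.numel() for p in net.parameters()))  # 1337
```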
The Data
The data is where the network really shows its age; Figure 4 shows it. The entire training set is not the large image, but the two 97x97 px regions highlighted, which were selected to give variation in lighting conditions. The boundaries have been hand labelled. As the network does not use any padding, each of the three 11x11 convolutions trims 10 px from the output, so the labels are actually only the 67x67 px central subsections of the larger images. At least the two 67x67 images contain ~9000 px, roughly 6.7 times the number of weights. Even so, overfitting would appear to be a distinct possibility here.
For my training, I did not have access to the paper's images. They did not give a source, and I assume they came from the lab or research collaborators. I instead used images from here (associated paper below*). These are colour images with hand-labelled boundary regions; an example is shown in Figure 1. They appear to be of higher quality than what the paper was working with, but there are out-of-focus regions for me to test the network on. With this data, simple methods may well be better than a neural network, but I wanted to explore how a NN could generalise. For my training, I extracted two 97x97 px sub-images from the labelled region of one of the 30 images. These covered on the order of 10 cells each, slightly fewer than Zhang's ~15.
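The patch extraction looked roughly like the sketch below. The file names and crop coordinates are placeholders, and scaling the grayscale values to [-1,1] to match the bipolar sigmoid's range is my own choice.

```python
import numpy as np
from PIL import Image

def load_grayscale(path):
    # Convert the colour specular image to grayscale and scale to [-1, 1].
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
    return img / 127.5 - 1.0

image = load_grayscale("endothelium_01.png")          # placeholder file names
labels = load_grayscale("endothelium_01_labels.png")

# Two 97x97 patches from the labelled region (coordinates are illustrative).
origins = [(20, 30), (150, 200)]
patches = [image[y:y + 97, x:x + 97] for (y, x) in origins]

# The three unpadded 11x11 convolutions trim 15 px from each border, so the
# targets are the central 67x67 of the corresponding label patches.
targets = [labels[y + 15:y + 15 + 67, x + 15:x + 15 + 67] for (y, x) in origins]
```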
Figure 3: Descriptions of the network architecture from the paper.
Training
The paper did not mention any sort of optimiser, or their data batching strategy. I batched the two images together and used Adam; in hindsight it appears that one of their "iterations" involved a single image, so they were updating mid-epoch. The loss curves look broadly similar. I used MSE loss, while they simply used squared error. The point of overfitting happens at ~100 iterations for me and ~400 for them, probably showing some of the power of Adam. Training took ~5 s on a single CPU core.
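The training loop itself was the standard PyTorch pattern, roughly as below. It reuses the `net`, `patches`, and `targets` from the earlier sketches; the iteration count and loss reduction are my choices rather than the paper's.

```python
import numpy as np
import torch
import torch.nn as nn

# Stack the two patches into a single batch.
x = torch.as_tensor(np.stack(patches)).unsqueeze(1)   # (2, 1, 97, 97)
y = torch.as_tensor(np.stack(targets)).unsqueeze(1)   # (2, 1, 67, 67)

optimiser = torch.optim.Adam(net.parameters(), lr=0.05)
loss_fn = nn.MSELoss()   # the paper reports plain squared error instead

for iteration in range(400):
    optimiser.zero_grad()
    pred = net(x)             # (2, 1, 67, 67) after the three unpadded convolutions
    loss = loss_fn(pred, y)
    loss.backward()
    optimiser.step()
```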
One very important observation was the training instability. Training only succeeded about 25% of the time, even using the paper's [-0.3,0.3] weight initialisation. The rest of the time it appeared to suffer from vanishing gradients, with the loss curves flattening out after the first few iterations. This is the fault of the saturating bipolar sigmoid, as shown later when the layers are examined. ReLU may have made training easier, although a simple swap did not fix it.
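The paper's initialisation is easy to reproduce; something like the following (zeroing the biases is my own assumption, as the paper only describes the weight range):

```python
import torch.nn as nn

def init_weights(module):
    # The paper draws initial weights uniformly from [-0.3, 0.3].
    if isinstance(module, nn.Conv2d):
        nn.init.uniform_(module.weight, -0.3, 0.3)
        nn.init.zeros_(module.bias)

net.apply(init_weights)
```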
Figure 5: The training loss from the paper (left), and by me (right)
Performance
Figures 6 and 7 show the trained network applied to an unseen image. The performance is surprisingly good given the limited training! The main edges are detected, and it only really stops working in the out-of-focus regions. In the paper, an M3F operator and a ridge-finding process were used to clean up the lines, which I did not explore here.
Figure 6: The trained network from the paper before and after post processing
Figure 7: The performance of the trained network on a larger image without processing
Adding More Data
The obvious first step to improving the network's performance was simply to add more data. I took 28 of the images from the source above. They were only labelled in sub-regions, so to simplify things I took an image subsection matching the smallest label extents, 348x396 px. This gives a dataset 205 times larger than the one in the original paper.
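The enlarged dataset was then just a stack of these crops; a minimal torch Dataset along these lines (the variable names and the assumption that the crops are already scaled to [-1,1] are mine):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class EndotheliumPatches(Dataset):
    """Wraps the 28 cropped 348x396 images and their boundary labels."""

    def __init__(self, images, labels):
        # images, labels: lists of 348x396 float arrays already scaled to [-1, 1]
        self.images = [torch.as_tensor(im).unsqueeze(0) for im in images]
        # Targets are the 318x366 valid region left after the three 11x11 convs.
        self.labels = [torch.as_tensor(lb[15:-15, 15:-15]).unsqueeze(0)
                       for lb in labels]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

# loader = DataLoader(EndotheliumPatches(images, labels), batch_size=28)
```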
Reusing the original training setup gave a network which failed to train. Presumably this was again the vanishing gradients issue, compounded by the fact that each model update now covered over 7 times more data than before. This was fixed by changing the learning rate from 0.05 to 0.005, a much more standard number for modern training.
Figure 8: One example used in the increased data training.
Figure 9: Failure of the network to train over several epochs.
This was all that was needed for convergence, and with the smaller learning rate the network trained consistently. Looking at the features it picks up on, it does still seem to be looking for intensity thresholds rather than learning about the deeper cell structure. The training time was significantly longer, with 30 epochs taking ~5 minutes.
Figure 12 shows a performance comparison, on an unseen image, of the networks trained on the smaller and larger datasets. This shows the improvement in generalisation from the larger dataset: it gives a cleaner and more rounded edge detection, but, importantly, performs much better in the out-of-focus corner regions. There it is not just doing relatively simple edge detection, hinting at why you might use a CNN for this type of problem at all.
Figure 10: Learning on the larger dataset.
Figure 11: Training over 30 epochs.
Figure 12: Performance of the networks trained on the smaller (left) and larger (right) datasets on the given input image. Notice the relative performance in the out-of-focus corners especially.
Structure of the Model
The paper spends some time on the structure and generalisation of the model. This includes examining the pixel intensity distribution in each layer, and the image output at each layer. I have replicated the diagram showing the channels of each layer. It shows similar structures at each level of the network, although the diagrams from the paper show a bit more directionality in the structure. The main difference is in the model trained with more data, which shows pixels that are further from saturating in intensity. By staying away from the extremal values, the network has avoided vanishing gradients and kept more of the original structure.
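To replicate that diagram I just captured each layer's activations with forward hooks, roughly as below (variable names are mine, and `x` is assumed to be a prepared test image):

```python
import torch

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a hook on each convolution so a single forward pass collects
# every intermediate channel image.
for i, layer in enumerate(net):
    if isinstance(layer, torch.nn.Conv2d):
        layer.register_forward_hook(save_activation(f"conv{i}"))

with torch.no_grad():
    net(x)  # x: a (1, 1, H, W) test image

# activations["conv0"] etc. can then be plotted channel by channel, and their
# pixel intensity histograms compared against the paper's figures.
```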
Figure 13: The intermediate layer outputs from the original paper (left), the replicated model (centre), and the model trained on more data (right). The structure is generally similar across all three (although the output is of course inverted in the original paper). The main point of interest is the apparent output saturation in the paper and in the smaller-data model: these models are close to vanishing gradients. The smaller learning rate used with the larger dataset appears to mitigate this.
More Recent Research
The problem of automatically identifying endothelium cell boundaries apparently was not completely solved in the 30 years after Zhang et al. In Kolluru et al. (2019), image segmentation networks were tried on this problem, namely U-Net and SegNet. They claim that "The fully automated cell analysis available in some instrumentation software is often inaccurate", so it seems that even in 2019 some level of manual labelling was still required. The main difference in approach is the use of binary (border/non-border) classifier networks. That work uses a "large" 130-image dataset, bigger than previous U-Net approaches (the 30 images I used for my training were from a 2010 paper).
The U-Net architecture is shown below. It is several times larger than the simple three-layer CNN discussed here. It may have been run on a more challenging dataset than I used, but it does seem like overkill to throw such a large structure at this problem. SegNet was also tried, but Figure 14 shows that it was pretty ineffective at finding the correct cell extents; its outputs would be pretty useless for finding average cell area, cell shapes, and so on.
In conclusion, the old paper achieves surprisingly good performance given the amount of data and compute its authors had to work with. The architecture seems to be about as simple as possible while still detecting the relevant structures. More modern papers have standard large model architectures to apply, but some care is still needed to get good outputs; SegNet in particular did not work in spite of the larger structure and dataset. The pre- and post-processing explored in the 1991 paper may absorb much of the complexity of the non-linear portion of the problem, but the trade-off between scale and tuning is visible here.
Figure 14: U-Net architecture (left) and the output of U-Net and SegNet on this segmentation problem (right). Notice how SegNet tends to mark out much smaller cell boundaries than reality.