Transcribing Semaphore: ML in Realtime

15th July 2023

Introduction

Semaphore is a system for medium-range visual communication using hand-held flags or lights. It is used in the maritime world for ship-to-ship or ship-to-shore communication when radio communication is not available. Somewhat shockingly, it is still accepted for emergency communication by the US Navy. In this blog I describe my system for real-time transcription of the signals.

Figure 1 shows the basic semaphore alphabet. There are also signals for starting and ending communications, and numbers can be sent by marking certain letters with a "J". However, for the purposes of this blog, only the basic symbols shown in Figure 1 were used! The code used in this project can be found here.


Figure 1: From Wikipedia, the 27 basic flag semaphore signals

Datasets

There are very few semaphore datasets to be found online. The only purpose-made one I could find was the "Semaphore Flag Signalling Dataset" on Mendeley Data. It shows four naval academy cadets signalling the alphabet letters five times each, for a total of 520 images. The data is clean but not varied, with the poses, background, and interference all very consistent. However, it proved to be a good start, and was what I used for my first models.

Other than this, there were several helpful YouTube videos of various enthusiasts signalling the alphabet. For these, I captured and hand-labelled the key frames. My new hand-labelled training data came from 1 and 2, and later unseen test data came from 3 and 4. Further unseen testing was performed using a camera feed, as shown below.
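For reference, the frame grabbing for this hand labelling can be done with OpenCV; the sketch below is only illustrative, and the file name and sampling interval are assumptions rather than the exact values used.

```python
# Hypothetical sketch: sample frames from a downloaded video for hand labelling.
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("semaphore_video.mp4")  # local copy of one of the YouTube videos
frame_idx, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 30 == 0:  # roughly one frame per second at 30 fps
        cv2.imwrite(f"frames/frame_{saved:04d}.png", frame)
        saved += 1
    frame_idx += 1
cap.release()
```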


Figure 2: Two examples of the letter "I" courtesy of the naval cadets. Notice the similarity in pose to "K"

Figure 3: One of the stills taken from this US Navy flag semaphore training film. These guys are very into it

Model structure

To make this system work in real time, it was important that the heavy lifting was done by a heavily optimised network. I used the "MediaPipe Pose Landmarker" to this end; its structure is described in Bazarevsky et al. (2020). It is a bag of methods which efficiently outputs a set of landmarks for a human pose in a given image. It is designed for a video setup, whereby the previous frame is used as a basis for the next. The speed is achieved by first getting a good estimate of the person's position and orientation using their face and hip alignment. The orientation and bounding box can then be passed to a further model, which can be smaller and faster given this preprocessing. The face is a good starting point as it has high-contrast features, and we can assume the head will be visible in any reasonable application.
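As a rough illustration, landmark extraction with MediaPipe might look like the sketch below; the exact API variant and parameters used in this project are assumptions.

```python
# Minimal sketch using MediaPipe's Python "solutions" API (assumed, not the
# project's exact code). Each landmark carries x, y, z and a visibility score.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# static_image_mode=False keeps the tracker in video mode, where the previous
# frame's region of interest seeds the next detection.
pose = mp_pose.Pose(static_image_mode=False, model_complexity=1)

def extract_landmarks(frame_bgr, pose):
    """Return the 33 pose landmarks for one frame, or None if no person is found."""
    results = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:
        return None
    return [(lm.x, lm.y, lm.z, lm.visibility) for lm in results.pose_landmarks.landmark]
```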

Once the keypoints are extracted using MediaPipe, a subset (19 points) is passed to a simple three-layer classifier. Note that this network does not require the flags to be present for classification. This will hurt its robustness, as the flags are designed for visibility. It was found that augmenting the data with synthetic poses was not required for reasonable classification accuracy.

I included output classes for 26 letters and a space, marked in the outputs as "_". I did not include a "no symbol" class, which would be a useful addition for practical use. This would need labels showing people transitioning between signalling states or not signalling at all.
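A sketch of what such a three-layer classifier could look like is below, here in PyTorch; the framework and hidden-layer widths are assumptions, and only the 19-landmark input and 27-class output follow from the description above.

```python
# Hypothetical PyTorch version of the small classifier described above.
import torch
import torch.nn as nn

N_LANDMARKS = 19           # subset of MediaPipe's 33 landmarks
FEATURES_PER_LANDMARK = 4  # x, y, z, confidence (see Figure 5)
N_CLASSES = 27             # 26 letters plus "_"

classifier = nn.Sequential(
    nn.Linear(N_LANDMARKS * FEATURES_PER_LANDMARK, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, N_CLASSES),  # raw logits; softmax is applied in the loss / at inference
)
```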

Figure 4: The list of landmarks produced by the MediaPipe pose detector

Figure 5: An example of the pose data output. Each datapoint contains information on x, y, z and confidence scores

Training the network

I first trained the network on the Flag Signalling dataset, with a 75/25 train-test split. This converged to >99% test accuracy in ~200 epochs, with the loss continuing to fall after this point.
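The training loop itself is standard; a hedged sketch reusing the classifier above is shown below. The optimiser, learning rate, and batch size are assumptions, and `pose_vectors` and `labels` stand in for tensors built from the labelled frames.

```python
# Assumed training setup: 75/25 split and cross-entropy on the pose vectors.
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

dataset = TensorDataset(pose_vectors, labels)  # illustrative tensors from the labelled frames
n_train = int(0.75 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

optimiser = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    for batch_poses, batch_labels in train_loader:
        optimiser.zero_grad()
        loss = loss_fn(classifier(batch_poses), batch_labels)
        loss.backward()
        optimiser.step()
```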

The generalisation of this was tested on the dataset (#2) demonstrated in Figure 7. This was hand labelled, with the key frames of each signal extracted, and also included the "_" symbol. Testing the model on this dataset gave 28.3% accuracy, with the correct symbol appearing in the top two of the softmax output 45.7% of the time. Figure 8 shows this broken down by symbol. 28% is a lot better than the 3.7% expected from random choice, but shows that the model is not generalising effectively.
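The top-two figure comes from checking whether the true label appears among the two highest-scoring classes; a small sketch of that check (variable names are illustrative) might be:

```python
# Top-1 / top-2 accuracy on an unseen dataset (e.g. dataset #2).
import torch

with torch.no_grad():
    logits = classifier(test_poses)
    top2 = logits.topk(2, dim=1).indices                        # two best guesses per frame
    top1_acc = (top2[:, 0] == test_labels).float().mean()
    top2_acc = (top2 == test_labels.unsqueeze(1)).any(dim=1).float().mean()
```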

Figure 9 shows the model now incorporating data from two datasets, applied to the third. It has generalised much more effectively, with just a few characters falling below 75% accuracy. These three datasets were then used to train one model on 1045 images. This is a reasonably small set, but it was good enough for the results in the next section.

Figure 6: Test and train loss for the simple 520 image dataset.

Figure 7: Some of the extracted frames from the second dataset of signalling poses. This had 258 datapoints across two people, loosely balanced by character.

Figure 8: The model trained on dataset #1, and tested on #2. It shows that some letters can be correctly classified most of the time, with others not recognised at all. This has a 29% accuracy, and gets 46% of the letters in the top two.

Figure 9: The model trained on datasets #1 and #2, then tested on dataset #3. This model clearly generalises to the unseen data much better, although still with some letters remaining poorly classified

Testing the model on unseen data

To see if the three datasets are enough to create a generalisable network, it was tested on some unseen data. Figure 10 and Figure 11 show the results of applying the model to two videos taken from YouTube. They show that the network is accurate even on these new scenes, making only slight errors. The use of the MediaPipe Pose pipeline makes this network highly resilient to changes in people and scenery. If this network were built as a CNN from the ground up, it would need far more training data than is available here.

This shows the power of building on top of existing generalisable networks, with the added benefit of MediaPipe's speed optimisations enabling this to be deployed in real time. One downside of this approach is that the flags themselves are not used for classification. These are designed to be highly visible, and would presumably serve as excellent features for a fully custom network.
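To give a sense of how the pieces fit together at inference time, a hedged sketch of a real-time loop (reusing the earlier sketches; the landmark indices and drawing details are placeholders, not the project's exact choices) is below.

```python
# Hypothetical real-time loop: webcam -> MediaPipe landmarks -> classifier -> letter.
import cv2
import torch

LANDMARK_SUBSET = list(range(19))  # placeholder for the 19 indices actually used
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ_"

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    landmarks = extract_landmarks(frame, pose)  # from the MediaPipe sketch above
    if landmarks is not None:
        feats = [v for i in LANDMARK_SUBSET for v in landmarks[i]]
        with torch.no_grad():
            logits = classifier(torch.tensor([feats], dtype=torch.float32))
        letter = LETTERS[logits.argmax(dim=1).item()]
        cv2.putText(frame, letter, (30, 60), cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 255, 0), 3)
    cv2.imshow("semaphore", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```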

Figure 10: Testing the model on data outside of its training set. This shows the man signalling the letters A-G, which are classified well. The right-hand video shows the pose landmarks output by the BlazePose model, and the box in the left-hand video marks the extent of the landmarks used for classification

Figure 11: Some of the second unseen video. The signs spell out "can I have". The classifier transcribes this accurately except for the letter "A"

Results

Having built a semaphore classifier which runs in real time, it was important to test it. Conveniently, flags were not necessary for this purpose. Figures 12 and 13 show two people signing "Hello World" against a reasonably complicated backdrop. Black clothing and some of the background details are tricky for the BlazePose model, but the system otherwise works relatively well! Please excuse the lack of semaphore fluency on display.

Figure 12: "Hello"

Figure 13: "World"