You Can't Judge a Book by its Cover

(At least not with a small dataset)

2nd October 2022

Introduction

The phrase "you shouldn't judge a book by its cover" is one I have never really agreed with. When I shop for books, the cover is definitely part of my process in deciding what to pick up. There are conventions to the cover art, designed to communicate something about the style or genre of the book. So I decided to try to make a system for predicting how good a book would be based on its cover.

Goodreads is a book social network (now owned by Amazon). They host information on millions of books alongside user-submitted ratings. My strategy for this project was to scrape data from the Goodreads website, and then use it to fine-tune an open-source pretrained image classifier (VGG-16). The hope is that the pretrained network makes up for the relatively small dataset I will be able to acquire, and quickly gives me a system capable of recognising features in images.

All the code used for this project can be found on my GitHub here.

Figure 1: An example Goodreads page. The URL for this particular book is: "https://www.goodreads.com/book/show/7170627-the-emperor-of-all-maladies"

Scraping Goodreads

I was initially concerned about creating the scraper. You want to get an unbiased subset (ideally of all the books on the site), but any kind of spider would probably end up exploring just one genre or author. An alternative was to use some user-created lists, but these would be small and self-correlated. However, there is a useful property of the Goodreads website which made choosing random books easy. Figure 1 shows an example book page. As you can see, the cover and the rating both live here, along with some other potentially useful metadata. The URL is in the format "/book/show/NUMBER-TITLE". From some experimentation, the number appears to be a unique ID which Goodreads gives to every book. And more usefully, if you go to the webpage "/book/show/NUMBER" it takes you to "/book/show/NUMBER-TITLE".

So this would be the basis of my scraper. In conjunction with a few useful classes from Maria Antoniak, I could run through as many random books as I wanted. As for the IDs chosen: from playing around, I would guess the ID numbers reflect when books were added to the site, with newer and non-English books added later on average.

I sampled from the first 6 million ID numbers to give myself a hopefully random set. Goodreads seems to have a system of blocking scrapers after a while, so I ran the system on a VPN, which helped. The scraper ran frustratingly slowly (about 10s per successful scrape), which limited my dataset somewhat.
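To give a flavour of the approach, here is a minimal sketch of the ID-sampling idea, assuming the requests and BeautifulSoup libraries. The selectors and constants here are illustrative rather than copied from my actual scraper (which builds on Maria Antoniak's classes), and the page structure has changed over time.

import random
import time

import requests
from bs4 import BeautifulSoup

# "/book/show/NUMBER" takes you to the full "/book/show/NUMBER-TITLE" page
BASE_URL = "https://www.goodreads.com/book/show/"
MAX_ID = 6_000_000  # sample from the first ~6 million IDs


def scrape_random_book():
    """Fetch a random book page and return (rating, cover URL), or None on failure."""
    book_id = random.randint(1, MAX_ID)
    response = requests.get(f"{BASE_URL}{book_id}", timeout=30)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, "html.parser")
    # These selectors are illustrative -- the real page layout may differ
    rating_tag = soup.find("span", itemprop="ratingValue")
    cover_tag = soup.find("img", id="coverImage")
    if rating_tag is None or cover_tag is None:
        return None
    return float(rating_tag.text.strip()), cover_tag["src"]


books = []
while len(books) < 3200:
    result = scrape_random_book()
    if result is not None:
        books.append(result)
    time.sleep(1)  # be polite; each successful scrape took ~10s in practice anyway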

The Dataset

Figure 2: The distribution of book scores is shown. It looks reasonably normal, with a mean of 3.9 and a standard deviation of 0.29.

Figure 3: The publication dates present in my sample. Some of the pre-1800 books were really old.

I scraped 3200 books' worth of covers and metadata. The first interesting observation comes from Figure 2, which shows that the spread of ratings is roughly Gaussian, with a mean of 3.9 and a standard deviation of 0.3. This is quite a tight spread, and tells you that a book rated above 4.2 on Goodreads is unusual. My assumption is that this tight spread comes from the system of only allowing a whole number of stars as a rating, hence a lot of scores clustering around 4/5.
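For reference, the summary statistics behind Figure 2 come from a calculation along these lines (a sketch assuming the scraped metadata is saved to a CSV with a "rating" column; the file and column names are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

# File and column names are illustrative
df = pd.read_csv("goodreads_scrape.csv")
print(df["rating"].mean(), df["rating"].std())  # ~3.9 and ~0.3 for my sample

# Histogram of average ratings (the basis of Figure 2)
df["rating"].hist(bins=40)
plt.xlabel("Average rating")
plt.ylabel("Number of books")
plt.show()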

My system of sampling from the first six million books on Goodreads means that everything represented is pre-2007, and weighted towards more recent additions. My hope is that this is still representative across a spectrum of good and bad books. By looking at which books are rated highly and lowly, you can begin to get a sense of what is successful on Goodreads. In general, a book that is part of a fantasy or sci-fi series which people seek out scores highly, owing to its niche readership. The lowest-rated book (with a reasonable number of votes) was "Brainless: The Lies and Lunacy of Ann Coulter", at 2.69. I enjoyed this snippet of a review: "Ann Coulter is not, despite what Joe Maguire believes, a good writer. And because he spends so much time aping Coulter's style, at least in this instance, neither is Joe Maguire."

Figure 4: A subset of the covers and their associated ratings. The covers were rescaled to 224x224 px

Training the Network

As I had a small dataset of images to work from, I took a pretrained network to fine-tune. VGG-16 is a classifier with 138 million parameters, trained on over a million images. This should give my network a good level of learnt feature detection. VGG-16 outputs to 1000 classes; I removed the last layers and replaced them with linear layers mapping to a single number (the predicted rating). The other layers in the network were frozen to enable quicker training.
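In PyTorch, that modification looks roughly like the following. This is a sketch rather than my exact code: the hidden-layer size is illustrative, but the freezing of the convolutional features and the single-output regression head reflect the setup described above.

import torch.nn as nn
from torchvision import models

# Load the pretrained VGG-16 and freeze its convolutional feature extractor
model = models.vgg16(pretrained=True)
for param in model.features.parameters():
    param.requires_grad = False

# Replace the 1000-class classifier head with a small regression head
# producing a single predicted rating. The hidden size (256) is illustrative.
model.classifier = nn.Sequential(
    nn.Linear(25088, 256),  # 25088 = 512 * 7 * 7 flattened VGG-16 features
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, 1),
)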

The 3200 scraped datapoints were randomly split into 2500 training images and 620 validation images. Figures 5 & 6 show the training outputs. The network was trained for 57 epochs, with each epoch taking ~40s on a GPU. It quickly learns something and then stagnates. There is no clear signal from the validation set either, other than potentially a reduction in the network's epoch-to-epoch variability.
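The training itself was standard supervised regression: MSE loss on the normalised ratings, optimising only the new head. A minimal sketch, assuming the covers have already been resized to 224x224 and wrapped in PyTorch DataLoaders called train_loader and val_loader (names are illustrative), and reusing the model defined above:

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = nn.MSELoss()
# Only the unfrozen (new) parameters are optimised; the learning rate is illustrative
optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

for epoch in range(57):
    model.train()
    for covers, ratings in train_loader:
        covers, ratings = covers.to(device), ratings.to(device)
        optimiser.zero_grad()
        predictions = model(covers).squeeze(1)  # (batch,) predicted ratings
        loss = criterion(predictions, ratings)
        loss.backward()
        optimiser.step()

    # Validation loss after each epoch (the curve in Figure 6)
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for covers, ratings in val_loader:
            covers, ratings = covers.to(device), ratings.to(device)
            val_loss += criterion(model(covers).squeeze(1), ratings).item()
    print(f"epoch {epoch}: validation loss {val_loss / len(val_loader):.4f}")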

Figure 5: The (MSE) loss during training for 57 epochs

Figure 6: The loss on the validation dataset after each epoch

Results

Figure 7 shows the results. Essentially, the network is not a good predictor of a book's score. Figure 8 shows that in fact it generally just predicts the mean score, with little variation around it (the prediction is a bit lower than the mean because the MSE loss is biased more towards the low-scoring outliers). Given the tight spread of the real values, perhaps this is not surprising. However, if the network were actually seeing some signal in the data, you would expect the distribution of estimates to better match the distribution of the real values.

Figure 7: A selection of covers with the true rating "tr" and the predicted rating "pr". There is not much correlation

Figure 8: The distribution of predictions across the validation dataset. It is a tight spread around a value slightly lower than the mean (the ratings were normalised). This is probably due to the presence of a few low-scoring outliers in the dataset pulling the predictions down to reduce the squared error.

Endnotes: Possible Improvements

This project was unsuccessful. Here is a list of some of the possible reasons why:

To get a result from a similarly small dataset, I could try building a dataset from user-created lists like "top 1000 best books" and "top 1000 worst books". There may be enough of a signal in this to detect the differences while still using a small dataset.