Can I create a neural network to generate Garfield comics? Let's find out!
To generate comics, I first needed to download every Garfield comic. Helpfully, some insane man created his own website for cataloging Garfield:
http://pt.jikos.cz/garfield/
Looking through his site, I realized that all the images are just hotlinks to ucomics.com
http://images.ucomics.com/comics/ga/1995/ga950116.gif
These hotlinks follow a nice, consistent file-naming scheme ('ga' + two-digit year + month + day + '.gif'). This was perfect: I could write a simple script to download and save them all.
Garfield works well for this kind of project because it's so consistent. Every daily comic is three frames of equal size. The only exception is the Sunday comics, so I just skipped downloading those.
All comics were scraped from ucomics.com and saved into a folder. A few were taken out and put into another folder to be used for validation (More on that later).
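Here's a rough sketch of what that download script might look like. The start and end dates and the output folder are placeholders, and it skips Sundays since I'm not using those:

```python
import datetime
import os
import urllib.request

# Rough sketch of the scraper, based on the ucomics.com URL scheme above.
# Dates and output folder are placeholders.
START = datetime.date(1978, 6, 19)   # roughly the first daily strip
END = datetime.date(2013, 1, 1)      # whenever the scrape is run
OUT_DIR = "comics"
os.makedirs(OUT_DIR, exist_ok=True)

day = START
while day <= END:
    if day.weekday() != 6:  # skip Sunday comics (different format)
        name = day.strftime("ga%y%m%d.gif")
        url = f"http://images.ucomics.com/comics/ga/{day.year}/{name}"
        try:
            urllib.request.urlretrieve(url, os.path.join(OUT_DIR, name))
        except Exception as e:
            print(f"failed to fetch {url}: {e}")
    day += datetime.timedelta(days=1)
```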
When I first started, I wanted to use a Generative Adversarial Network (GAN). This consists of two competing networks. One network tries to learn to generate Garfield comics (the Generator) while the other tries to learn to tell the difference between real comics and fake ones (the Discriminator). Over time the Generator learns to trick the Discriminator by creating comics that can pass for real ones.
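In PyTorch, the adversarial setup looks roughly like this. This is just a minimal sketch, not my actual model: the architectures are placeholders and it assumes frames flattened to vectors with a 100-dimensional noise input.

```python
import torch
import torch.nn as nn

LATENT = 100        # size of the random noise input (placeholder)
IMG = 200 * 200     # flattened frame size (placeholder)

# Placeholder architectures; a real model would use convolutions.
generator = nn.Sequential(
    nn.Linear(LATENT, 512), nn.ReLU(),
    nn.Linear(512, IMG), nn.Sigmoid(),
)
discriminator = nn.Sequential(
    nn.Linear(IMG, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1), nn.Sigmoid(),
)

bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_frames):
    batch = real_frames.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # Discriminator: learn to tell real frames from generated ones.
    noise = torch.randn(batch, LATENT)
    fake_frames = generator(noise)
    d_loss = (bce(discriminator(real_frames), real_labels)
              + bce(discriminator(fake_frames.detach()), fake_labels))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: try to make the discriminator label its fakes as "real".
    g_loss = bce(discriminator(fake_frames), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```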
This proved very difficult. For one thing, the Discriminator was very easy to train and learned far faster than the Generator, which then could not catch up. For another, the dataset of comics was just too small. At the time I downloaded the images, there were 12,584 Garfield comics. To increase the dataset, I decided to focus on individual frames, but that was still only 37,752 images. Typically you'd want hundreds of thousands or millions of images to train a GAN.
Also, the results from the GAN were horrifying:
So I decided to try a different model.
A Variational Auto-Encoder (VAE) is kind of like a learned compression algorithm. Like the GAN it has two parts, but they work together. The first part - the Encoder - takes an image and compresses it down to its $n$ most important features (it learns what these features are by training on the example comics). The resulting list of numbers is called the Latent Vector. A common example is a network trained to generate faces. Some features it might learn are gender, age, facial hair, or whether the person is wearing glasses.
The second part - the Decoder - takes the features in the latent vector and learns to reconstruct the original image. Here are some examples of reconstructions from a model trained to learn 2048 features:
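For reference, here's roughly what a VAE like this looks like in PyTorch. It's a simplified sketch with fully-connected layers and a made-up frame size, not my exact model, but the encoder / latent vector / decoder structure is the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IMG = 200 * 200   # placeholder frame size, flattened
LATENT = 2048     # number of learned features

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: image -> mean and (log) variance of the latent features
        self.enc = nn.Sequential(nn.Linear(IMG, 1024), nn.ReLU())
        self.mu = nn.Linear(1024, LATENT)
        self.logvar = nn.Linear(1024, LATENT)
        # Decoder: latent vector -> reconstructed image
        self.dec = nn.Sequential(
            nn.Linear(LATENT, 1024), nn.ReLU(),
            nn.Linear(1024, IMG), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return self.dec(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        # Reparameterization trick: sample a latent vector near the mean
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decode(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus a KL term that keeps the latent space well-behaved
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```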
On its own this doesn't seem very useful, but what's really interesting is to examine and play with the latent vector.
For example, I can take a vector filled with random numbers and send it through the decoder to generate completely new images:
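In code this is just a couple of lines. A sketch, assuming `vae` is a trained instance of the hypothetical VAE above:

```python
import torch

# Sample a random latent vector and push it through the decoder only.
z = torch.randn(1, 2048)        # 2048 random "features"
with torch.no_grad():
    new_frame = vae.decode(z)   # the decoder turns the features into pixels
```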
Since the latent vector encodes features rather than raw pixel information of the image, you can smoothly interpolate between two different latent vectors and get interesting transitions between frames.
Let's interpolate between these two frames:
Garfield's head smoothly morphs into Odie's tongue.
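The interpolation itself is just a weighted blend of the two latent vectors, roughly like this (again reusing the hypothetical `vae` from earlier; `frame_a` and `frame_b` stand in for the two flattened frame tensors):

```python
import torch

# Linearly interpolate between the latent vectors of two frames.
with torch.no_grad():
    mu_a, _ = vae.encode(frame_a)
    mu_b, _ = vae.encode(frame_b)
    steps = 10
    frames = []
    for i in range(steps + 1):
        t = i / steps
        z = (1 - t) * mu_a + t * mu_b   # blend the two feature vectors
        frames.append(vae.decode(z))    # decode each blend back into an image
```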
In reality, the features encoded in the latent vector are not simple, independent things like "age" or "height". They are complex combinations of different features in the image that humans are not really equipped to understand. There are tools to help us understand the data better, though. One is Principal Component Analysis (PCA). By examining the data, it identifies correlations and finds the most important features.
It's a bit hard to understand, but imagine we have two features, $x$ and $y$. We plot these features on a graph for every image in our dataset and get this:
PCA can identify that there are actually two underlying features in play here, represented by the black arrows. The size of the arrows is also important: differences between points along the axis created by the longer arrow matter more than differences along the axis created by the shorter arrow.
I wanted to do PCA on all the latent vectors and identify the most important features for the Garfield comics, but my computer does not have enough memory to do this. So I was left just looking at small samples at a time.
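With scikit-learn, running PCA on a sample of latent vectors only takes a few lines. A rough sketch, assuming the latent vectors for a sample of frames have already been computed and saved (the file name is hypothetical; 128 and 100 are the numbers I used):

```python
import numpy as np
from sklearn.decomposition import PCA

# One 2048-dim latent vector per frame in the sample, shape (128, 2048).
latents = np.load("sample_latents.npy")   # hypothetical pre-computed file

pca = PCA(n_components=100)
reduced = pca.fit_transform(latents)      # shape (128, 100)

# How much of the variation each component explains, most important first
print(pca.explained_variance_ratio_[:10])
```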
Here are the first 25 images from my sample of 128:
I used PCA to create a much smaller latent vector of 100 features. Using these smaller vectors, my experiments went like this:
Obviously, my first experiment was to see what happens when you change the first, most important feature. I tried both extremes right away.
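In code, that experiment is roughly this (reusing the hypothetical `pca`, `reduced`, and `vae` from the sketches above):

```python
import torch

# Pin the most important PCA feature to one extreme, map back to the full
# latent space, and decode. Repeat with +15.0 for the other extreme.
modified = reduced.copy()
modified[:, 0] = -15.0
restored = pca.inverse_transform(modified)   # back to 2048-dim latent vectors

with torch.no_grad():
    images = vae.decode(torch.tensor(restored, dtype=torch.float32))
```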
Here is the result when all images have their most important feature set to -15:
It seems to be slightly correlated with drawing Garfield in the bottom-right corner of the frame.
However, it is more interesting when the feature is set to positive 15:
That's a strong result. It almost completely overrides whatever the original frame was to put Jon in the top-left and a standing Garfield on the right. This intuitively makes sense, because it seems like half of all Garfield comics involve a frame like this at some point.
Some very cool things to notice here:
The ultimate goal is to make a network that can make comics. My naive approach works by taking the latent vector of one frame and trying to predict the latent vector of the next frame. Since the latent vector is more about concepts than the actual pixels of the image, this should work at least a bit.
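The predictor itself can be very simple. Here's a sketch of the idea, assuming latent vectors for consecutive frames are already computed; the architecture is just a guess, not my exact model:

```python
import torch
import torch.nn as nn

LATENT = 2048

# A small fully-connected network that maps one frame's latent vector
# to a prediction of the next frame's latent vector.
predictor = nn.Sequential(
    nn.Linear(LATENT, 1024), nn.ReLU(),
    nn.Linear(1024, LATENT),
)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(frame_latents, next_frame_latents):
    pred = predictor(frame_latents)
    loss = loss_fn(pred, next_frame_latents)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# At generation time: encode frame 1, predict the latent vector for frame 2,
# decode it, then feed the prediction back in to get frame 3.
```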
Here is what it learned after (very little) training:
It mostly seems to assume the next frame will be very similar to the previous one, which honestly is a pretty good assumption with Garfield.
There are a few issues with my models and a few things I would like to do in the future.
The VAE is overfitting the training data. The model has pretty high loss on the training data (about 23,800 right now), but it's much worse on the validation set. I only recently even started using validation, so maybe I could have saved some time and money if I had checked that sooner. The overfitting may also explain why it is not good at generating novel images.
The predictor is basically just recreating the previous frame. This is probably because there are far more examples where the next frame is similar to the previous one than not. This might be fixed by punishing the network more when it is way off, but I don't know a good way to do this yet.
I would like to bring outside context into the training. Some of it would be simple. For example, I could add the frame number to the encoding step so the model can learn general differences between setup and punchline frames. Since the data is easily accessible, I could also include the comic's date; that way maybe I could get it to generate comics that keep the time of year in mind.
More difficult outside context would be the text in the speech bubbles. Perhaps somewhere out there someone has transcripts of every comic that I could download. Then the text could be encoded along with the frame and used by the decoder. This would require an RNN or something like it, though, which I am not very familiar with.
The ultimate outside context would be frame descriptions. This would require getting outside help to caption every frame with things like, "Garfield yells at Jon" and "Jon is wearing a jester hat". Again, an RNN or something would be required.