General approach to developing an image classification algorithm for Dilbert cartoons

Question:

As a self-development exercise, I want to develop a simple classification algorithm that, given a particular cell of a Dilbert cartoon, is able to identify which characters are present in the cartoon (Dilbert, PHB, Ratbert etc.).

I assume the best way to do this is to (1) apply some algorithm to the image, which converts it into a set of features, and (2) use a training set and one of many possible machine learning algorithms to correlate the presence/absence of certain features with a particular character being present in the cell.

So my questions are – (a) is this the correct approach, (b) since there’s a number of classification algorithms and ML algorithms to test, what is a good methodology for finding the right one, and (c) which algorithms would you start with, given that we’re essentially conducting a classification exercise on a cartoon.

Asked By: Andrew J

||

Answers:

I think, in general, this is the correct approach and that there are two techniques you could have a look at.

  • Eigenfaces, the name for all algorithms that use eigenvectors for facial recognition.
  • The Bag-of-Words or visual words approach.
Answered By: Maurits

So i think you are on the right track w/r/t your step 1 (apply some algorithm to the image, which converts it into a set of features).

This project is more challenging that most ML problems because here you will actually have to create your training data set from the raw data (the individual frames comprising the cartoons). For instance, grab a frame, identify two characters in that frame, Dilbert and the character with horns (Dilbert’s boss i believe, don’t know his name), extract those two characters from that frame and append to each the appropriate class label (e.g., “1” for Dlibert).

Step 1

To extract the individual characters from each of the frames comprising the a Dilbert cartoon, i would suggest a spectral decomposition of each frame. If you are not familiar with this technique, at its core, it’s just an eigenvector decomp.

If you like python (or R, given that you can use python-to-R bindings like RPy) then i would strongly encourage you to look at sklearn. In particular, this excellent library (which was originally developed under the SciPy scikits project umbrella, and therefore uses NumPy + SciPy for matrix computation) has several algorithms for image segmentation, one of which is based on spectral clustering. For this step in your Project, you would most likely want to look at these two scikits.learn modules

  • sklearn.feature_extraction (esp. the image sub-module)

  • sklearn.cluster.spectral_clustering

Included with these two modules are two good example scripts, one segmenting a digital photograph and the other segmenting an image comprised of three partially super-imposed circles with minimal contrast w/r/t each other and w/r/t the background–both, i suspect are more difficult problems that the decompositions you will need to perform. In other words, sklearn has two complete, well-documented example scripts included in the source distribution, both of which process data similar to yours. Either or both would be an excellent template for this step.

Step 2

So that’s the first step; here’s the second: sort all of the components of the decomposed images into groups, one group for each Dilbert character. Next, assign a class label to each Group, e.g., if there are four characters from your decomposition step, then a decent choice for class labels is “0”, “1”, “2”, and “3.” Append those class labels to the component matrices (the decomposition products from step 1) so each character matrix is mapped to its corresponding class (Dilbert character).

Step 3

Select a suitable ML technique. You have many choices for this step; the only criteria are that the technique is in the supervised category (because you have assigned class labels to your data) and that it function as a classifier (i.e., it returns a class label, versus a regressor which outputs a numerical value). Given this is a personal project, i would chose the one that seems most interesting to you. A few that satisfy the criteria i just mentioned are: multi-layer perceptron (neural network), support vector machine (SVM), and k-nearest neighbors (kNN).

Step 4

train, validate, and test your classifier

Alternative Technique: Template Matching

Once Step 1 is completed (each image is decomposed into a set of objects, some of which will no doubt represent the characters) you can manually sift through these decomposition products and collect exemplars for each character in the cartoon. The are the templates.

Next you compare objects segmented from an image with this set of unique templates. In scikit-image, another scipy scikit, you can use the method match_template, to which you pass in a template image and a candidate image, and this method returns a 2D array showing the pixel-by-pixel correlation (between -1 and 1).

Answered By: doug

You can try building a model by uploading your training data (images of comics) to demo.nanonets.ai (free to use)

Then query the API using the following (Python Code):

import requests
import json
import urllib
model_name = "Enter-Your-Model-Name-Here"
url = "http://static5.businessinsider.com/image/525464f969bedd0b0422cfb6/dilbert-creator-scott-adams-presents-his-10-favorite-comics-of-all-time.jpg"
files = {'uploadfile': urllib.urlopen(url).read()}
url = "http://demo.nanonets.ai/classify/?appId="+model_name
r = requests.post(url, files=files)
print json.loads(r.content)

the response looks like:

{
  "message": "Model trained",
  "result": [
    {
      "label": "Dilbert",
      "probability": 0.97
    },
    {
      "label": "PHB",
      "probability": 0.025
    },
    {
      "label": "Ratbert",
      "probability": 0.005
    }
  ]
}
Answered By: sj7

This question was asked 5 years back, therefore the answers provided above are outdated given the fact that deep learning has changed the face of computer vision over the past 3-4 years. A deep learning based solution would involve training a Convolutional Neural Network, which would learn features and perform classification in an end-to-end learning framework. However, since multiple cartoons may be present in the same image, the standard softmax cross entropy loss used in image classification may not be appropriate. Hence, independent logistic regression should be used as a loss function. Threshold for each class can be obtained based on accuracy obtained over a held-out validation set. Even for cartoons, it is better to use a pre-trained model initialized using imagenet instead of training from scratch (https://arxiv.org/pdf/1611.05118v1.pdf, although the final task in this paper is different, they still do processing on cartoons). If you have abundant data, pre-training may not be that important. This task can be performed using standard libraries like caffe/torch etc.

Answered By: Bharat