Lecture 2 covers the basics of image classification (core concepts are kept in English).
Starting with the most classic example: given a picture of a cat, how do you correctly pick the "cat" label from a pile of labels?
In short, the problem to solve is the semantic gap between "cat" and pixels: the computer reads an image only as a grid of numbers, and we must map those numbers to the concept of "cat." Moreover, cat images are diverse, so we need to handle the following challenges:
- illumination: lighting conditions change, and the pixel values change with them
- deformation: cats are fluid...
- occlusion: the cat is hiding
- background clutter: blending with the background
- intraclass variation: different cats look different
Hard-coding explicit rules (the way you would write, say, an array-sorting routine) is not feasible here, so we take a data-driven approach: use a labeled dataset to train a classifier. The course uses CIFAR-10, which has 10 labels, 50,000 training images, and 10,000 test images, as sketched below.
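To make the setup concrete, here is a minimal loading sketch that turns CIFAR-10 into flat numpy arrays; it assumes torchvision is installed (the course provides its own loading code, this is just one convenient way).

import numpy as np
from torchvision.datasets import CIFAR10  # assumption: torchvision is available

# Download CIFAR-10 and flatten each 32x32x3 image into a 3072-dimensional row vector.
train = CIFAR10(root="./data", train=True, download=True)
test = CIFAR10(root="./data", train=False, download=True)

Xtr = train.data.reshape(50000, -1).astype(np.float32)  # [50000, 3072]
ytr = np.array(train.targets)                           # [50000]
Xte = test.data.reshape(10000, -1).astype(np.float32)   # [10000, 3072]
yte = np.array(test.targets)                            # [10000]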
Nearest Neighbor#
The idea: traverse the training set to find the training image most "similar" to the current test image, and use that image's label as the label for the test image.
So how do we express "similar" mathematically? We use the L1 distance, which takes the absolute difference at each pixel and sums them all up, as in the example below.
import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        # X is [N, D], one flattened training image per row; y is the [N] label vector
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        # X is [num_test, D]; return a predicted label for every row
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):  # range, not Python 2's xrange
            # L1 distance from the i-th test image to every training image
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            min_index = np.argmin(distances)  # index of the closest training image
            Ypred[i] = self.ytr[min_index]    # copy its label
        return Ypred
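A minimal usage sketch (assuming the Xtr, ytr, Xte, yte arrays from the loading example above):

nn = NearestNeighbor()
nn.train(Xtr, ytr)
Ypred = nn.predict(Xte[:100])           # only a small subset: the full test set is slow
accuracy = np.mean(Ypred == yte[:100])
print("accuracy: %.2f" % accuracy)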
Assuming num_test = 10 and each image is 4*4 = 16 pixels (flattened into a length-16 vector), the X passed to predict has dimension [10, 16], Xtr has dimension [50000, 16] (assuming we use the entire CIFAR-10 training set of 50,000 images, each with a label), ytr has dimension [50000], and inside the loop distances has dimension [50000]: one L1 distance from the current test image to every training image.
This method seems simple and good, so why not use it? Next, we need to mention "time complexity." The training time complexity is O(1), while the prediction time complexity is O(N), because predicting one image requires traversing all (for example, 50,000 in this case) images one by one to compute L1 distance. Once the dataset is expanded, prediction will be very, very slow. We can accept slower training because it's behind-the-scenes work, but slow inference is unacceptable. Therefore, this method is impractical.
K-Nearest Neighbors#
The "Nearest Neighbor" method mentioned in the previous section essentially copies the label of a certain image from the training set. Not to mention the time complexity issue, this copying strategy itself can lead to problems:
- Noise in the training data can produce jagged, irregular decision boundaries between the partitioned regions.
- It cannot properly handle noise or outliers (as shown by the yellow point in the green area below).
To upgrade the Nearest Neighbor method, we make one change: instead of copying the label of the single closest training image, we look at the K closest training images and let them vote on the label — hence K-Nearest Neighbors.
What does this mean? First, we pick the distance function (both L1 and L2 work), and then we choose the parameter K. Suppose we choose L1 and K=5: we compute the L1 distance between the test image and every image in the training set (50,000 computations), then select the 5 training images with the smallest distances. If among these 5 neighbors two are labeled cat, one dog, one car, and one shit, then intuitively we can take the test image's label to be cat. Besides majority voting, I should also mention another scheme not covered in the course: weighted voting.
- majority voting (mentioned in the course): finding the label with the most occurrences among K.
- weighted voting: assuming K=3, the L1 distances of the three neighbors are 1, 2, and 3, with labels A, B, and A respectively. The weight calculation would be: label A: 1/1 + 1/3 ≈ 1.333, label B: 1/2 = 0.5, thus we consider the label of the test image to be A because its weight is higher.
In short, weighted voting is a slight enhancement of majority voting that can also break ties when two labels receive the same number of votes; the idea of weighting neighbors by distance shows up throughout machine learning. Both voting schemes are sketched in code below.
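Here is a minimal sketch of both voting schemes (my own illustration, not the course's reference code); knn_predict classifies a single flattened test image:

import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x_test, k=5, weighted=False):
    # L1 distances from the test image to every training image
    distances = np.sum(np.abs(Xtr - x_test), axis=1)
    nearest = np.argsort(distances)[:k]            # indices of the k closest training images
    if not weighted:
        # majority voting: the most common label among the k neighbors wins
        return Counter(ytr[nearest]).most_common(1)[0][0]
    # weighted voting: each neighbor contributes 1 / distance to its label's score
    scores = {}
    for idx in nearest:
        w = 1.0 / (distances[idx] + 1e-8)          # epsilon avoids division by zero on exact duplicates
        scores[ytr[idx]] = scores.get(ytr[idx], 0.0) + w
    return max(scores, key=scores.get)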
The choice of K also matters: a larger K reduces overfitting to outliers or noise in the training data through "majority voting" or "weighted averaging," improving the model's generalization. For specifics, see the interactive visualization webpage created by Stanford, which is very helpful for understanding. Of course, the 2D scatter points are an abstraction of the whole classification problem and do not correspond exactly to pixel-value differences between real images: the L1 distance between two points is the Manhattan distance, while the L2 distance is the straight-line (Euclidean) distance between them. Keep this abstraction in mind when reading the demo.
http://vision.stanford.edu/teaching/cs231n-demos/knn/
All the points on the graph come from the training set, with one color per label. Assuming the red area represents cat, that cluster of red points corresponds to cat images affected by illumination, deformation, occlusion, and so on: they differ, but they share similar "patterns," so they do not overlap completely yet are not far from each other.
When playing with this interactive page, you will see significant changes as K goes 1 → 3 → 5: the classification boundary becomes smoother.
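To make the L1 (Manhattan) vs. L2 (Euclidean) distinction above concrete for two scatter points (values made up for illustration):

import numpy as np

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
l1 = np.sum(np.abs(p - q))           # Manhattan distance: |1-4| + |2-6| = 7
l2 = np.sqrt(np.sum((p - q) ** 2))   # Euclidean distance: sqrt(3**2 + 4**2) = 5
print(l1, l2)                        # 7.0 5.0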
How to Choose Hyperparameters#
The term Hyperparameters should be familiar to everyone. In the problem above, the choice of K and the choice of distance function directly affect the model's predictions. Parameters that we set ourselves, rather than letting the model learn them, are called "hyperparameters." When evaluating how good our hyperparameters are, we should follow one principle: touch the test set as little and as late as possible.
To make sure we do not touch the test set too early, a good practice is to set aside a certain proportion of the training data as a validation set and tune hyperparameters on it.
The teacher's suggestion is: he will not touch the test dataset until a week before the paper deadline.
Cross-validation is also feasible, but because of its high training cost it is mostly used on small datasets.
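A minimal sketch of a single validation split with a sweep over K (a plain held-out split, not full cross-validation), reusing the hypothetical knn_predict helper and the CIFAR-10 arrays from earlier:

# Hold out 5,000 training images as a validation set and pick the best K on it.
num_val = 5000
Xval, yval = Xtr[:num_val], ytr[:num_val]
Xtr_rest, ytr_rest = Xtr[num_val:], ytr[num_val:]

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 10, 20]:
    preds = np.array([knn_predict(Xtr_rest, ytr_rest, x, k=k) for x in Xval[:200]])
    acc = np.mean(preds == yval[:200])   # a 200-image subset keeps this sketch fast
    if acc > best_acc:
        best_k, best_acc = k, acc
print("best K on the validation set:", best_k)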
Summary#
In practice, the K-nearest neighbor method is never used on raw image pixels; as the course materials show, it has significant problems. The L2 distance between raw pixels simply does not carry enough useful information: in the slide example, the three modified images in the lower right all have the same L2 distance to the original image on the left, yet they are clearly quite different from one another.
Thus, we need to introduce a new method: Linear Classification. I declare: this lecture was all in vain! 😀
The end of Lecture 2 briefly touches on linear classification; I plan to cover it in full in the notes for Lecture 3.