Stanford CS231n (2017 version) -- Lecture 3: Loss Functions and Optimization

After rejecting the Nearest Neighbor classifier in Lecture 2, the next thing to try is Linear Classification.

Introduction#

We use a parametric approach for classification.
image
As shown in the figure above, we multiply a 10×3072 parameter matrix W by x (where x is a 32×32 image with three color channels, i.e. 32×32×3 = 3072 values, flattened into a one-dimensional vector) to get a 10×1 result containing the scores for 10 different labels. The label with the highest score is the model's prediction.
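As a minimal sketch of this score computation (the variable names and the bias term are my own additions, not taken verbatim from the slides; the shapes follow the figure):

import numpy as np

x = np.random.rand(3072)             # a 32x32x3 image flattened into 3072 values
W = np.random.randn(10, 3072)        # one row of weights per class
b = np.random.randn(10)              # one bias per class (assumed here)
scores = W.dot(x) + b                # 10 class scores
predicted_label = np.argmax(scores)  # the label with the highest score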
You might be curious: what would this W matrix look like if visualized? Would it be quite abstract?
We reshape each row of the 10×3072 W matrix back into a square image, as shown below.
image
On a small dataset, we can faintly make out some features recognizable to the human eye.

Multiclass SVM loss#

This W looks like the "brain" of the decision, but with so many numbers inside, how do we choose it?
We introduce the "Loss Function" to quantify our dissatisfaction with the prediction results.
We use a simpler example with three categories: frog, cat, car.

image

Imagine a scenario: the teacher asks a student to answer a multiple-choice question (the correct answer is A), and the student hesitantly says, "I probably choose A, but I might choose B." Clearly, he does not grasp the knowledge as well as those who confidently answer option A. Similarly, we want the highest score for cat to be significantly greater than the scores for car and frog, indicating that the model confidently chose the correct answer.
That's what Multiclass SVM loss is about: $L_i = \sum_{j \neq y_i} \max(0, S_j - S_{y_i} + 1)$. When we predict the category of the first image above, we get the three scores in the first column, and $S_{y_i}$ is 3.2 (the score corresponding to the actual label cat, even though it is smaller than 5.1), because we know the actual label of this image is cat. Next, we act as the teacher and evaluate how this classifier (the weights W) performed. Our criterion is: the score of the true label must be at least 1 point higher than every other score before we consider that you really understand. For car, 3.2 - 1 < 5.1, so the margin is violated and that term contributes $\max(0, 5.1 - 3.2 + 1) = 2.9$: a big mistake, a serious reprimand. For frog, 3.2 - 1 > -1.7, good, no confusion there, so that term is 0. The final $L_i = 2.9$. Note!! A larger $L_i$ indicates poorer understanding, so a lower value means the model understands better.
The prediction for the second image is good: 4.9 is at least 1 greater than the other two scores, so its loss is 0. The prediction for the third image is the worst, with a loss of 12.9.
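Here is a minimal NumPy sketch of this per-example hinge loss (the function name and margin parameter are my own, just encoding the formula above, not the course's reference code):

import numpy as np

def svm_loss_single(scores, y, margin=1.0):
  # scores: class scores for one image; y: index of the true label
  margins = np.maximum(0, scores - scores[y] + margin)
  margins[y] = 0                 # the sum skips the j == y_i term
  return margins.sum()

# The cat image from the slide: scores for (cat, car, frog), true class is cat (index 0).
print(svm_loss_single(np.array([3.2, 5.1, -1.7]), 0))   # 2.9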

Regularization#

The above loss function seems to measure the model's capability, but is it enough to find a W that merely achieves a low loss on the training set?
Not enough.
image
As shown by the blue line in the figure, if the loss function has only this term, blindly learning on the training set may cause the model to "overthink" the training set, leaving it dumbfounded when faced with questions outside of it. It is said that "simplicity is the ultimate sophistication," and a model with good generalization is the same: we want the model to be "simple" and still perform well on test data, as shown by the green line in the figure.
Thus, we add a regularization term, making the full loss $L = \frac{1}{N}\sum_i L_i + \lambda R(W)$, where $\lambda$ is a hyperparameter used to balance the data loss and the regularization loss (see the sketch after the list below).
This $R(W)$, like the loss function itself, has many options:

  • L1 regularization: increasing sparsity in the W matrix
  • L2 regularization: preventing overfitting, smoothing the weight distribution
  • Elastic net regularization: combining the advantages of L1 and L2
  • Dropout ……
    image
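As a rough sketch of how the two terms combine (assuming the SVM data loss from above and L2 regularization; all names here are mine, not the course code):

import numpy as np

def full_loss(W, X, y, lam, margin=1.0):
  # X: (N, 3072) flattened images, y: (N,) true label indices, lam: the lambda hyperparameter
  scores = X.dot(W.T)                               # (N, 10) class scores
  correct = scores[np.arange(len(y)), y][:, None]   # true-class score per example
  margins = np.maximum(0, scores - correct + margin)
  margins[np.arange(len(y)), y] = 0                 # skip the j == y_i terms
  data_loss = margins.sum() / len(y)                # average SVM loss over the batch
  reg_loss = lam * np.sum(W * W)                    # L2 regularization R(W)
  return data_loss + reg_loss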

Softmax#

image
This figure shows how the Softmax Classifier converts unnormalized scores into probabilities and evaluates model performance through the cross-entropy loss $L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$. Here, $s_{y_i}$ is the unnormalized score for the correct category, and $\sum_j e^{s_j}$ is the sum of the exponentiated scores over all categories.
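A minimal sketch of that cross-entropy computation (names are my own; subtracting the max score is the usual numerical-stability trick, not something emphasized on this slide):

import numpy as np

def softmax_loss_single(scores, y):
  # L_i = -log( e^{s_{y_i}} / sum_j e^{s_j} )
  shifted = scores - np.max(scores)                  # subtract max for numerical stability
  probs = np.exp(shifted) / np.sum(np.exp(shifted))  # normalized probabilities
  return -np.log(probs[y])

# Slide example: (cat, car, frog) scores with cat (index 0) as the true class.
print(softmax_loss_single(np.array([3.2, 5.1, -1.7]), 0))   # ~2.04 with the natural log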

Softmax vs. SVM (Comparing Two Loss Functions)#

image

By the way, take a look at this flowchart, which is very clear: in the lower left corner, $x_i, y_i$ are the input and true label for each image in the training set, and the remaining connections cover all the processes mentioned earlier.
image

Optimization#

After all this blah blah, you might still have a question: how is this "brain" trained? In other words, how do we find the best W?
There are a few possible approaches:

  1. random search: quite ineffective

  2. follow the slope: think of walking downhill; mathematically, this is the gradient.
    Iteratively, add a small value h to each number in W, see how much the loss value changes, and record that as one entry of dW; repeat this operation for every number in W, as shown below.
    image
    Then use this dW to decide how to update W.

    This method (essentially a numerical gradient; see the sketch below) becomes impractical as W scales up, because of the amount of computation involved. On reflection, what we really need is the derivative of the loss function with respect to W; after all, the loss is a function of W, so we can compute the gradient analytically with calculus!
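A minimal sketch of that finite-difference procedure (my own names; a toy quadratic loss stands in for the real SVM/softmax loss):

import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
  grad = np.zeros_like(W)
  base_loss = loss_fn(W)                      # loss at the current W
  it = np.nditer(W, flags=['multi_index'])
  while not it.finished:
    idx = it.multi_index
    old_value = W[idx]
    W[idx] = old_value + h                    # nudge one entry by h
    grad[idx] = (loss_fn(W) - base_loss) / h  # how much the loss changed
    W[idx] = old_value                        # restore the original value
    it.iternext()
  return grad

# Toy usage: the gradient of sum(W^2) should come out close to 2 * W.
W = np.random.randn(3, 4)
print(numerical_gradient(lambda w: np.sum(w * w), W))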

Gradient Descent#

The code is as follows:


# Vanilla gradient descent: repeatedly step in the direction opposite the gradient
while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)  # gradient of the loss w.r.t. the weights
  weights += - step_size * weights_grad  # step_size (the learning rate) controls how far we move

Stochastic Gradient Descent#


while True:
  data_batch = sample_training_data(data, 256)  # sample a minibatch of 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)  # gradient estimated on the minibatch only
  weights += - step_size * weights_grad

http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/ In this interactive visualization interface, you can see the entire training process.

Image Features#

image

A coordinate transformation example: converting from Cartesian coordinates to polar coordinates lets a linear classifier separate a group of points that it could not separate in the original coordinates.
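A tiny sketch of that idea with made-up data (two concentric rings that no straight line can split in Cartesian coordinates):

import numpy as np

theta = np.random.uniform(0, 2 * np.pi, 200)
radius = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])  # inner ring vs. outer ring
x, y = radius * np.cos(theta), radius * np.sin(theta)            # Cartesian coordinates

# Feature transform: (x, y) -> (r, theta); in polar space the line r = 2 separates the rings.
r = np.sqrt(x**2 + y**2)
angle = np.arctan2(y, x)
features = np.stack([r, angle], axis=1)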

Examples of Feature Representation:

  1. Color Histogram
  2. Histogram of Oriented Gradient (HoG)
  3. Bag of words: Build codebook → Encode images

Summary#

That wraps up Lecture 3. I haven't extended much beyond the course itself, mostly just interpreting the lecture content. Personally, I think the Optimization part isn't explained very well in the lecture; I could only grasp it in broad strokes.
The content of Lecture 4 is neural networks and backpropagation; let's move on to the next note~
