Stanford CS231n (2017 version) -- Lecture 3: Loss Functions and Optimization

After rejecting the Nearest Neighbor classifier in Lecture 2, the next thing to try is Linear Classification.

Introduction#

We use a parametric approach for classification.
image
As shown in the figure above, we multiply a 10×3072 parameter matrix W by x (where x is a 32×32 image with three color channels, i.e. 32×32×3 = 3072 values, flattened into a one-dimensional vector) to get a 10×1 result containing the scores for 10 different labels. The label with the highest score is the model's prediction.
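As a minimal sketch of this score computation (the variable names and the bias term are my own additions, not taken verbatim from the slides; the shapes follow the figure):

import numpy as np

x = np.random.rand(3072)             # a 32x32x3 image flattened into 3072 values
W = np.random.randn(10, 3072)        # one row of weights per class
b = np.random.randn(10)              # one bias per class (assumed here)
scores = W.dot(x) + b                # 10 class scores
predicted_label = np.argmax(scores)  # the label with the highest score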
You might be curious: what would this W matrix look like if visualized? Would it be quite abstract?
We reshape each row of the 10×3072 W matrix back into a square image, as shown below.
image
On a small dataset, we can faintly make out some features recognizable to the human eye.

Multiclass SVM loss#

This W looks like the "brain" of the decision, but with so many numbers inside, how do we choose it?
We introduce the "Loss Function" to quantify our dissatisfaction with the prediction results.
We use a simpler example with three categories: frog, cat, car.

image

Imagine a scenario: the teacher asks a student to answer a multiple-choice question (the correct answer is A), and the student hesitantly says, "I probably choose A, but I might choose B." Clearly, he does not grasp the knowledge as well as those who confidently answer option A. Similarly, we want the highest score for cat to be significantly greater than the scores for car and frog, indicating that the model confidently chose the correct answer.
That's what Multiclass SVM loss is about: $L_i = \sum_{j \neq y_i} \max(0, S_j - S_{y_i} + 1)$. When we predict the category of the first image above, we get the three scores in the first column, and $S_{y_i}$ is 3.2 (the score corresponding to the actual label cat, even though it is smaller than 5.1), because we know the actual label of this image is cat. Next, we act as the teacher and evaluate how this classifier (the weights W) performed. Our criterion is: the score of the true label must be at least 1 point higher than every other score before we consider that you really understand. For car, 3.2 - 1 < 5.1, so the margin is violated and that term contributes $\max(0, 5.1 - 3.2 + 1) = 2.9$: a big mistake, a serious reprimand. For frog, 3.2 - 1 > -1.7, good, no confusion there, so that term is 0. The final $L_i = 2.9$. Note!! A larger $L_i$ indicates poorer understanding, so a lower value means the model understands better.
The prediction for the second image is good: 4.9 is at least 1 greater than the other two scores, so its loss is 0. The prediction for the third image is the worst, with a loss of 12.9.
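Here is a minimal NumPy sketch of this per-example hinge loss (the function name and margin parameter are my own, just encoding the formula above, not the course's reference code):

import numpy as np

def svm_loss_single(scores, y, margin=1.0):
  # scores: class scores for one image; y: index of the true label
  margins = np.maximum(0, scores - scores[y] + margin)
  margins[y] = 0                 # the sum skips the j == y_i term
  return margins.sum()

# The cat image from the slide: scores for (cat, car, frog), true class is cat (index 0).
print(svm_loss_single(np.array([3.2, 5.1, -1.7]), 0))   # 2.9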

Regularization#

The above loss function seems to measure the model's capability, but is it enough to find a W that merely achieves a low loss on the training set?
Not enough.
image
As shown by the blue line in the figure, if the loss function has only this term, blindly learning on the training set may cause the model to "overthink" the training set, leaving it dumbfounded when faced with questions outside of it. It is said that "simplicity is the ultimate sophistication," and a model with good generalization is the same: we want the model to be "simple" and still perform well on test data, as shown by the green line in the figure.
Thus, we add a regularization term, making the full loss $L = \frac{1}{N}\sum_i L_i + \lambda R(W)$, where $\lambda$ is a hyperparameter used to balance the data loss and the regularization loss (see the sketch after the list below).
This $R(W)$, like the loss function itself, has many options:

  • L1 regularization: increasing sparsity in the W matrix
  • L2 regularization: preventing overfitting, smoothing the weight distribution
  • Elastic net regularization: combining the advantages of L1 and L2
  • Dropout ……
    image
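As a rough sketch of how the two terms combine (assuming the SVM data loss from above and L2 regularization; all names here are mine, not the course code):

import numpy as np

def full_loss(W, X, y, lam, margin=1.0):
  # X: (N, 3072) flattened images, y: (N,) true label indices, lam: the lambda hyperparameter
  scores = X.dot(W.T)                               # (N, 10) class scores
  correct = scores[np.arange(len(y)), y][:, None]   # true-class score per example
  margins = np.maximum(0, scores - correct + margin)
  margins[np.arange(len(y)), y] = 0                 # skip the j == y_i terms
  data_loss = margins.sum() / len(y)                # average SVM loss over the batch
  reg_loss = lam * np.sum(W * W)                    # L2 regularization R(W)
  return data_loss + reg_loss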

Softmax#

image
This figure shows how the Softmax Classifier converts unnormalized scores into probabilities and evaluates model performance through the cross-entropy loss $L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$. Here, $s_{y_i}$ is the unnormalized score for the correct category, and $\sum_j e^{s_j}$ is the sum of the exponentiated scores over all categories.
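A minimal sketch of that cross-entropy computation (names are my own; subtracting the max score is the usual numerical-stability trick, not something emphasized on this slide):

import numpy as np

def softmax_loss_single(scores, y):
  # L_i = -log( e^{s_{y_i}} / sum_j e^{s_j} )
  shifted = scores - np.max(scores)                  # subtract max for numerical stability
  probs = np.exp(shifted) / np.sum(np.exp(shifted))  # normalized probabilities
  return -np.log(probs[y])

# Slide example: (cat, car, frog) scores with cat (index 0) as the true class.
print(softmax_loss_single(np.array([3.2, 5.1, -1.7]), 0))   # ~2.04 with the natural log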

Softmax vs. SVM (Comparing Two Loss Functions)#

image

By the way, take a look at this flowchart, which is very clear: in the lower left corner, $x_i, y_i$ are the input and true label for each image in the training set, and the remaining connections cover all the processes mentioned earlier.
image

Optimization#

After all this blah blah, you might still have a question: how is this "brain" trained? In other words, how do we find the best W?
There are a few possible approaches:

  1. random search: quite ineffective

  2. follow the slope: think of walking downhill; mathematically, this is the gradient.
    Iteratively, add a small value h to each number in W, see how much the loss value changes, and record that as one entry of dW; repeat this operation for every number in W, as shown below.
    image
    Then use this dW to decide how to update W.

    This method (essentially a numerical gradient; see the sketch below) becomes impractical as W scales up, because of the amount of computation involved. On reflection, what we really need is the derivative of the loss function with respect to W; after all, the loss is a function of W, so we can compute the gradient analytically with calculus!
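A minimal sketch of that finite-difference procedure (my own names; a toy quadratic loss stands in for the real SVM/softmax loss):

import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
  grad = np.zeros_like(W)
  base_loss = loss_fn(W)                      # loss at the current W
  it = np.nditer(W, flags=['multi_index'])
  while not it.finished:
    idx = it.multi_index
    old_value = W[idx]
    W[idx] = old_value + h                    # nudge one entry by h
    grad[idx] = (loss_fn(W) - base_loss) / h  # how much the loss changed
    W[idx] = old_value                        # restore the original value
    it.iternext()
  return grad

# Toy usage: the gradient of sum(W^2) should come out close to 2 * W.
W = np.random.randn(3, 4)
print(numerical_gradient(lambda w: np.sum(w * w), W))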

Gradient Descent#

The code is as follows:


# Vanilla gradient descent: repeatedly step in the direction opposite the gradient
while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)  # gradient of the loss w.r.t. the weights
  weights += - step_size * weights_grad  # step_size (the learning rate) controls how far we move

Stochastic Gradient Descent#


while True:
  data_batch = sample_training_data(data, 256)  # sample a minibatch of 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)  # gradient estimated on the minibatch only
  weights += - step_size * weights_grad

http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/ In this interactive visualization interface, you can see the entire training process.

Image Features#

image

A coordinate transformation example: converting from Cartesian coordinates to polar coordinates lets a linear classifier separate a group of points that it could not separate in the original coordinates.
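A tiny sketch of that idea with made-up data (two concentric rings that no straight line can split in Cartesian coordinates):

import numpy as np

theta = np.random.uniform(0, 2 * np.pi, 200)
radius = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])  # inner ring vs. outer ring
x, y = radius * np.cos(theta), radius * np.sin(theta)            # Cartesian coordinates

# Feature transform: (x, y) -> (r, theta); in polar space the line r = 2 separates the rings.
r = np.sqrt(x**2 + y**2)
angle = np.arctan2(y, x)
features = np.stack([r, angle], axis=1)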

Examples of Feature Representation:

  1. Color Histogram
  2. Histogram of Oriented Gradient (HoG)
  3. Bag of words: Build codebook → Encode images

Summary#

That wraps up Lecture 3. I haven't extended much beyond the course itself, mostly just interpreting the lecture content. Personally, I think the Optimization part isn't explained very well in the lecture; I could only grasp it in broad strokes.
The content of Lecture 4 is neural networks and backpropagation; let's move on to the next note~
