From Animation to Intuition

Neha Kaswate

9 months ago

Visualizing Optimization Trajectory of Neural Nets

In the previous post, I showed some animated plots for the training process of linear regression and logistic regression. Developing a good “feel” for how they “learn” is helpful because they can be used as baselines before applying more complex models. Although most deep neural networks also use gradient-based learning, similar intuition is much harder to come by. One reason is that the parameters are very high dimensional and a lot of nonlinearity is involved, so it is very hard to picture in our heads what is going on during the optimization. Unlike computers, we are only programmed to perceive spaces of 3 dimensions or fewer.
In this post, I will show a lot more animated plots to offer a glimpse into those high dimensional spaces. The models I use are fully connected multilayer perceptrons with ReLU activations. We can then visually see how width and depth can affect the loss landscape and optimization trajectory.
Let’s consider this 2D dataset of 3 classes. We humans can easily see at a glance that it’s generated by some sort of spiral function. It takes us less than a second to recognize the pattern and, at the same time, we automatically come up with a way to extrapolate to unseen data. That is because we have been trained on this sort of visual task since we were born (also, don’t forget that we can only do this in very low dimensional spaces). Here, I will train a few artificial neural networks to classify this shape and inspect the training process in the parameter space.
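For readers who want to reproduce a dataset of this kind, here is a minimal sketch of a 3-class spiral generator. The exact generator behind the plots is not shown in this post, so the function name `make_spirals`, the noise level, and the number of points per class are all assumptions chosen for illustration.

```python
import numpy as np

def make_spirals(points_per_class=100, num_classes=3, noise=0.2, seed=0):
    """Generate a 2D spiral dataset with one spiral arm per class (illustrative)."""
    rng = np.random.default_rng(seed)
    X = np.zeros((points_per_class * num_classes, 2))
    y = np.zeros(points_per_class * num_classes, dtype=int)
    for c in range(num_classes):
        idx = slice(c * points_per_class, (c + 1) * points_per_class)
        # Radius grows from 0 to 1 along the arm.
        radius = np.linspace(0.0, 1.0, points_per_class)
        # Each arm starts at a different angle and winds outward; noise blurs it.
        angle = np.linspace(c * 4.0, (c + 1) * 4.0, points_per_class)
        angle = angle + rng.normal(scale=noise, size=points_per_class)
        X[idx] = np.column_stack((radius * np.sin(angle), radius * np.cos(angle)))
        y[idx] = c
    return X, y

X, y = make_spirals()
```

`X` holds the 2D coordinates and `y` the class labels, which is all a small classifier needs as input.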
Before we go on, I’d like to raise an important question: how do we visualize the optimization trajectory in the high dimensional parameter space?
The most straightforward way is to find two directions to cut through the high dimensional space and visualize loss values over that plane. But which two directions should we use? There are infinitely many potential directions to choose from. In the paper Visualizing the Loss Landscape of Neural Nets, Li et al., the authors discussed some options and adopted one using PCA for dimensionality reduction.
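To make the idea of "cutting through" parameter space concrete, here is a rough sketch that evaluates a loss on the plane spanned by two directions. The helper `loss_on_plane` and the quadratic `toy_loss` are hypothetical stand-ins; for a real network, `loss_fn` would load the flattened parameter vector into the model and return the training loss.

```python
import numpy as np

def loss_on_plane(loss_fn, center, dir1, dir2, span=1.0, steps=25):
    """Evaluate loss_fn at center + a*dir1 + b*dir2 over a square grid of (a, b)."""
    coords = np.linspace(-span, span, steps)
    grid = np.empty((steps, steps))
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            grid[i, j] = loss_fn(center + a * dir1 + b * dir2)
    return coords, grid

def toy_loss(theta):
    """Toy stand-in for a network loss: a quadratic bowl."""
    return float(np.sum(theta ** 2))

rng = np.random.default_rng(0)
dim = 50                      # assumed number of parameters for the toy example
center = np.zeros(dim)        # the point the plane passes through
d1 = rng.normal(size=dim)     # two arbitrary directions spanning the plane
d2 = rng.normal(size=dim)
coords, grid = loss_on_plane(toy_loss, center, d1, d2)
```

The resulting `grid` can be passed to a contour plot, with `coords` as both axes. Which two directions to pass in is exactly the question discussed next.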
Here’s a brief summary of the motivation:
  • 2 random vectors in a high dimensional space have a high probability of being nearly orthogonal, and they can hardly capture any variation along the optimization path. The path’s projection onto the plane spanned by the 2 vectors will just look like a random walk.
  • If we pick one direction to be the vector pointing from the initial parameters to the final trained parameters, and another direction at random, the visualization will look like a straight line because the second direction doesn’t capture much variance compared to the first.
  • If we use principal component analysis (PCA) on the optimization path and get the top 2 components, we can visualize the loss over the 2 orthogonal directions with the most variance.
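The PCA option in the list above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the code behind the plots: `pca_directions` is a hypothetical helper, the path is mean-centered before the SVD (standard PCA), and the toy path is synthetic data that drifts along one direction with a little noise.

```python
import numpy as np

def pca_directions(path, n_components=2):
    """PCA on an optimization path.

    path: array of shape (n_steps, n_params), one flattened parameter
    vector per recorded training step.
    """
    centered = path - path.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    variance = singular_values ** 2
    ratio = variance[:n_components] / variance.sum()
    # Coordinates of each step in the plane, with the final step at the origin.
    coords = (path - path[-1]) @ components.T
    return components, ratio, coords

# Toy path: 50 steps in a 1000-dimensional space, mostly along one direction.
rng = np.random.default_rng(0)
direction = rng.normal(size=1000)
progress = np.linspace(0.0, 1.0, 50)[:, None]
path = progress * direction + 0.01 * rng.normal(size=(50, 1000))

components, ratio, coords = pca_directions(path)
```

Here `components` gives the two plotting directions, `coords` gives the path in that plane, and `ratio` is the fraction of the path's variance captured by each component.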
Therefore, I am using the PCA approach for better-looking optimization paths. Keep in mind that it is not the “best” approach for path visualization because it may not work for other purposes. For example, if you aim to compare the paths taken by different optimizers, e.g. SGD vs. Adam, this PCA approach won’t work since the principal components come from the paths themselves. The fact that different optimizers have different paths and different PC directions, i.e. a different slice of the loss landscape, makes the comparison impossible. For that purpose, we should use two fixed directions.
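As a rough sketch of the fixed-direction alternative, the snippet below draws two orthonormal directions once and projects two different paths onto the same plane. The names `fixed_plane`, `project`, and `toy_path` are hypothetical, the two paths are synthetic random walks rather than real optimizer runs, and drawing the directions at random is only one possible choice (it inherits the low-variance problem described in the list above).

```python
import numpy as np

def fixed_plane(dim, seed=0):
    """Draw two orthonormal directions once so every path shares the same plane."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, 2)))  # reduced QR: q is (dim, 2)
    return q.T

def project(path, origin, basis):
    """2D coordinates of each step of `path`, measured from `origin`."""
    return (path - origin) @ basis.T

def toy_path(init, n_steps, rng):
    """Synthetic stand-in for a recorded optimizer path starting at `init`."""
    steps = 0.05 * rng.normal(size=(n_steps, init.size))
    steps[0] = 0.0  # the first recorded point is the initialization itself
    return init + np.cumsum(steps, axis=0)

rng = np.random.default_rng(1)
dim, n_steps = 200, 30
init = rng.normal(size=dim)
path_a = toy_path(init, n_steps, rng)  # e.g. a run with one optimizer
path_b = toy_path(init, n_steps, rng)  # e.g. a run with another optimizer

basis = fixed_plane(dim)
coords_a = project(path_a, init, basis)
coords_b = project(path_b, init, basis)
```

Because both sets of coordinates live in the same plane, they can be drawn on one set of axes over one slice of the loss landscape.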
The architectures here are produced by varying and combining the two properties below:
  • Number of hidden layers: 1, 5, 10
  • Number of neurons in each hidden layer: 5, 20, 100
9 configurations in total.
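The grid of configurations is easy to enumerate, and counting parameters for each one shows how large the space being projected down to 2D really is. In this sketch the input size of 2 and output size of 3 come from the dataset described above; the helper `count_parameters` and the assumption that every layer has a bias term are mine.

```python
from itertools import product

HIDDEN_LAYERS = [1, 5, 10]
HIDDEN_UNITS = [5, 20, 100]

def count_parameters(n_layers, n_units, n_inputs=2, n_classes=3):
    """Weights plus biases of a fully connected net with equal-width hidden layers."""
    sizes = [n_inputs] + [n_units] * n_layers + [n_classes]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

configs = list(product(HIDDEN_LAYERS, HIDDEN_UNITS))
for n_layers, n_units in configs:
    n_params = count_parameters(n_layers, n_units)
    print(f"{n_layers:>2} hidden layers x {n_units:>3} units: {n_params:>6} parameters")
```

Under these assumptions the smallest network has 33 parameters and the largest has 91,503.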
Let’s take a look at the optimization paths alongside the decision areas/boundaries produced by different models. Showing the decision areas, instead of validation/test accuracies that depend on how the generated dataset is split, gives us better intuition about bad fits, since we already have a prior in mind: the “spiral” expectation.
First, let me show you what logistic regression does in this case.
Obviously it doesn’t do a good job with its 3 straight lines, but its loss landscape is perfectly convex.
Starting with the first configuration: 1 hidden layer and 5 neurons. Each neuron with ReLU is essentially a straight line with one side activated.
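To see what "a straight line with one side activated" means, here is a tiny sketch of a single ReLU neuron on 2D inputs. The weights and test points are made-up values for illustration: with `weight = [1, -1]` and `bias = 0`, the line is x1 = x2, and only points with x1 > x2 produce a nonzero output.

```python
import numpy as np

def relu_neuron(points, weight, bias):
    """Output of one ReLU unit on 2D points: max(0, w . x + b)."""
    return np.maximum(0.0, points @ weight + bias)

weight = np.array([1.0, -1.0])  # hypothetical weights; the line is x1 - x2 = 0
bias = 0.0
points = np.array([
    [2.0, 0.0],  # on the active side of the line
    [0.0, 2.0],  # on the inactive side
    [1.0, 1.0],  # exactly on the line
])
out = relu_neuron(points, weight, bias)
print(out)  # [2. 0. 0.]
```

A hidden layer of 5 such neurons gives the network 5 lines of this kind to combine.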
A definite step up from logistic regression, but 5 neurons can’t capture the curvy shape of the spirals and struggle to reach high accuracy and low loss values. The loss landscape seems mostly convex.
Next, we add more neurons to the single hidden layer.
Notice that the variance captured by each principal component is labeled on the axes. The top component almost always gets 95%+. This is partly determined by the nature of this data and the architecture of the network. In the paper, the authors also observed similar behaviors, i.e. the optimization path lies in a very low-dimensional space for the CIFAR-10 dataset and various architectures.
Next, 5 hidden layers.
With 5 hidden layers, the loss landscapes in the subspace of the top 2 principal components become significantly less convex. With the narrow setting of 5 hidden layers and 5 neurons each, we can guess that the high dimensional landscape is highly nonconvex, and the optimizer falls into a local valley quickly and produces a bad fit.
With 20-100 neurons and 5 layers, the optimizer gets to near-zero loss very fast. Further training just makes it fluctuate and have unexpected “glitches”. What’s happening is that the neural nets try to fit this data with strange high-dimensional shapes outside this 2D space that the data live in. It overfits quickly but doesn’t really capture the underlying data generating mechanism as we intuitively did.
Things get a bit crazy for 10 hidden layers.
They all share the same characteristics, such as getting to near-zero loss very quickly and yet producing weird fits that can’t generalize well. All else being equal, a narrower and deeper neural network is harder to train than a wider and shallower one.
Take a look at this thin and deep 20-layer net with 5 neurons in each hidden layer. The loss landscape is so nonconvex that the optimizer just gets stuck and the network becomes untrainable.
The paper discusses several aspects of neural net properties in more detail. The authors examined the effect of network depth, width, initialization, and skip connections, as well as the implications of loss landscape geometry. One key takeaway is that skip connections are crucial to making deep networks easier to train. I highly recommend this paper if you are interested in the question of why we can train deep neural networks.
A picture is worth a thousand words. Each animated plot here is 50 pictures, and we have 21 of them in this article. Comparing the number of epochs, the values of loss, and accuracy between different configurations, you can get a “feel” of the learning process of basic neural networks. The same method can be used to visualize larger and more complex architectures.
Despite the hype around deep learning today, artificial neural nets are still in their early days. Many techniques are empirical and not yet backed by theory. The best way to gain experience is through trial and error. Once a student of deep learning develops some intuition on the basics, I recommend the top-down learning approach advocated by fast.ai. A lot of breakthroughs in science and technology happened as accidents during experiments and not as natural results of nice theoretical derivations. People usually understand why something works after applying it for years. So don’t be afraid of the unknown, just get your hands dirty and train more models!
If you like this article, you can follow me on Medium and Twitter for more content in the future. Thank you!

References

  • Visualizing the Loss Landscape of Neural Nets, Li et al.
  • https://github.com/madewithml/basics