Optimizing the Learning Rate of Your Neural Networks
The learning rate is one of those pesky hyperparameters that it's our job as data scientists to fine-tune. It controls how much the weights in a model are updated in response to the current error. For our purposes here, assume the loss function is convex and continuous (if it isn't, you can run into local minima that cause all sorts of problems, but that's a story for another day). In other words, if we tried every possible value for a weight and plotted the loss at each point, it would look something like this:
Our goal is to minimize this loss function, aka get to the bottom of the curve, and we want to do so with as little computing power as possible. A proper learning rate is a balancing act: too big and you can overshoot the optimal point; too small and your algorithm could take forever. As Google's machine learning course describes it, you are looking for the "Goldilocks learning rate."
For quite a while, achieving the proper learning rate was one of those more-art-than-science, comes-with-experience data science skills, where you would basically make your best guess based on intuition and personal belief. Recently, however, a very intuitive technique has emerged: start with a very small learning rate, then increase it across mini-batches (not epochs) until an optimal learning rate is found. Surprisingly, this idea was only described in 2015, by Leslie Smith at the U.S. Naval Research Laboratory.
As a sidenote, Jeremy Howard takes a moment in the fastai course lectures to lament that this line of work was ignored by many in the community, and in doing so points out two important lessons: often much can be accomplished by simplifying problems, not making them more complex, and the machine learning community tends to prefer automated solutions over ones that require human input. This opens the world up to you, yes you: there is plenty of room for people with different backgrounds not only to learn AI but to offer invaluable perspectives on how to make it better.
Anyways, back to the topic at hand. In fastai, all you need to do is use something aptly titled the learning rate finder, which you can call on your already-defined model with learn.lr_find(). Doing so will produce a graph something like this:
As you can see, the loss steadily decreases as the learning rate increases, until it begins to increase sharply. Fastai recommends using a point a little before the loss begins this sharp increase.
The learning rate finder is not the only modern technique for choosing learning rates; a lot of research has gone into finding the optimal learning rate automatically. One approach much of the community has coalesced behind is Adagrad (link), which divides each parameter's learning rate by the square root of the sum of its past squared gradients, taking smaller steps where the loss curve is steep and larger ones where it is shallow. It is far less simple, and from what I have found the learning rate finder achieves similar rates, but it would not be surprising to see an automatic solution become the standard as computing power steadily increases and new insights are gained.
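To give a flavor of how Adagrad adapts the step size, here is a toy, single-parameter sketch of the update rule (the loss, starting point, and base rate are made up for the example; this is not a production implementation):

```python
import math

# Sketch of the Adagrad update on the toy loss L(w) = w**2 (gradient 2*w).
# Each step is scaled down by the running sum of squared gradients, so the
# effective learning rate automatically shrinks where gradients are steep.
def adagrad(lr=1.0, steps=50, w=5.0, eps=1e-8):
    accum = 0.0
    for _ in range(steps):
        g = 2 * w                              # gradient of w**2
        accum += g * g                         # accumulate squared gradient
        w -= lr * g / (math.sqrt(accum) + eps) # adapted step
    return w

print(adagrad())  # steadily approaches the minimum at 0
```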
This blog post points out that the learning rate finder can generate fairly different graphs depending on the initial weights of the model. The author suggests that current state-of-the-art networks generate more reproducible results.
I like the learning rate finder for beginners. It is easy and cheap to implement and is based on very intuitive concepts. The alternative techniques may offer very small or no gains in accuracy, and Maggio does not report whether the different learning rates the LRF found actually affect accuracy, which is ultimately the goal of any model. Furthermore, simple techniques may have value of their own even when not optimal, since there are many non-academic applications for neural nets that call for less resource-intensive solutions.
Just because a learning rate is good for this epoch does not mean it will work for subsequent epochs. Using fit_one_cycle, you can specify both how many epochs you want to run and what your maximum learning rate should be. I say maximum because fastai automatically scales your learning rate up to this maximum and then tails it off. This schedule is based on another insight from Leslie Smith, who showed it to be an effective technique for neural network training. After those epochs you can call lr_find again. This graph will look a little different from the first one:
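The shape of such a one-cycle schedule can be sketched as follows. The warm-up fraction and divisor here are illustrative choices; fastai's actual defaults differ in detail:

```python
import math

# One-cycle sketch: the rate climbs linearly from lr_max/div to lr_max over
# the first quarter of training, then cosine-anneals back toward zero.
def one_cycle(step, total_steps, lr_max, pct_start=0.25, div=25.0):
    lr_start = lr_max / div
    warm = int(total_steps * pct_start)
    if step < warm:                                  # warm-up phase
        t = step / max(1, warm)
        return lr_start + t * (lr_max - lr_start)
    t = (step - warm) / max(1, total_steps - warm)   # annealing phase
    return lr_max * (1 + math.cos(math.pi * t)) / 2

schedule = [one_cycle(s, 100, lr_max=0.01) for s in range(100)]
```

Plotting `schedule` shows the rise to the peak rate followed by the long tail-off the text describes.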
Another advantage of an automatic learning rate finder is that it could make adjustments like this in real time.
The curve is fairly flat, then the loss starts to increase sharply. This is because at this point your model is already somewhat optimized, and large gains will not come simply from manipulating the learning rate. Remember, we want the learning rate that trains the model the quickest, all else being equal, so you should choose a point shortly before this decline and divide it by ten just to be safe.
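That rule of thumb is easy to express in code. The history here is a made-up list of (learning rate, loss) pairs such as a range test might record, and the heuristic (lowest-loss rate divided by ten) is one simple reading of the advice above:

```python
# Given (learning rate, loss) pairs from a range test, pick the rate where
# loss was lowest, then divide by ten for safety.
def suggest_lr(history):
    best_lr, _ = min(history, key=lambda pair: pair[1])
    return best_lr / 10

history = [(1e-4, 2.0), (1e-3, 1.2), (1e-2, 0.4), (1e-1, 0.9), (1.0, 50.0)]
print(suggest_lr(history))  # ≈ 1e-3
```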
Discriminative Learning Rates
A discriminative learning rate is when you train a neural net with different learning rates for different layers. As far as I can tell, the main application is fine-tuning a pre-trained model for your own classifier. A common example is taking something like ResNet34, which was trained on a huge dataset, and adapting it to your own needs. In doing so, rather than keeping the last layer's classification weights, that layer is replaced with a randomly initialized one whose weights are then learned over time. For a task like this, different layers will need different learning rates: the pretrained early layers already encode useful features and need only small adjustments, while the new final layer has to be learned from scratch.
By default, fastai actually uses discriminative learning rates. Using lr_max=slice(), you can specify, while training your model for one cycle, which learning rate you want the lowest layers to use and what the other layers should scale up to.
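One simple way to realize the idea, sketched here in plain Python: spread the rates across layer groups from a low rate for the early, pretrained layers up to a high rate for the new head. The geometric spacing is an illustrative choice in the spirit of lr_max=slice(lo, hi), not necessarily exactly what fastai does internally:

```python
# Hypothetical discriminative-rate helper: return one learning rate per
# layer group, geometrically spaced from lo (earliest layers) to hi (head).
def discriminative_lrs(lo, hi, n_groups):
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1 / (n_groups - 1))  # constant multiplier per group
    return [lo * ratio ** i for i in range(n_groups)]

lrs = discriminative_lrs(1e-6, 1e-3, 3)
print(lrs)  # pretrained layers get the smallest rate, the fresh head the largest
```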
Learning rates are an important part of optimizing a neural net efficiently. Recently, very effective methods have been developed for tuning them, some simpler and requiring more intuition, others automatic but complicated to implement. Neural networks also benefit from different learning rates for different layers and across epochs.