Some Common Problems
- Underflow or overflow. This often shows up as a NaN produced by a matrix operation, which then spreads through the rest of the network. A common fix is to reduce the learning rate; see the guard sketched after this list.
- The output is always the same, for example a classifier that always returns the first or last category.
- The average loss does not decrease.
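The NaN symptom in particular is worth catching as early as possible. Below is a minimal sketch of a guard inside a single training step. The original doesn't name a framework; PyTorch, and the names `model`, `optimizer`, `loss_fn`, `x`, and `y`, are my own illustrative assumptions.

```python
import torch

def training_step(model, optimizer, loss_fn, x, y):
    """One optimization step that halts as soon as the loss overflows to NaN/Inf."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)

    # Catch underflow/overflow early: a NaN or Inf loss will poison every later
    # weight update, so stop now and consider lowering the learning rate.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Loss became {loss.item()}; try reducing the learning rate.")

    loss.backward()
    optimizer.step()
    return loss.item()
```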
Monitor the Calculation
- Track the average loss at the end of each epoch, but also at milestones within the epoch (for example, every few hundred batches) so you can spot trouble before the epoch finishes; a sketch follows this list.
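Here is a rough sketch of that bookkeeping, again assuming a PyTorch-style loop. The names (`model`, `loader`, `optimizer`, `loss_fn`) and the every-100-batches milestone are illustrative choices, not something from the original notes.

```python
def train_one_epoch(model, loader, optimizer, loss_fn, epoch, log_every=100):
    """Train for one epoch, reporting the running average loss at milestones and at the end."""
    running, seen = 0.0, 0
    for i, (x, y) in enumerate(loader, start=1):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

        running += loss.item() * x.size(0)
        seen += x.size(0)

        # Milestone report: average loss so far within this epoch.
        if i % log_every == 0:
            print(f"epoch {epoch} batch {i}: avg loss {running / seen:.6f}")

    # End-of-epoch report: average loss over the whole epoch.
    print(f"epoch {epoch} done: avg loss {running / seen:.6f}")
    return running / seen
```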
Tune Hyper-Parameters
- Decrease the learning rate. Don't be afraid to go to 0.0001 or lower. A smaller learning rate should yield slower (but steadier) progress, with less chance of ping-ponging between overestimates and underestimates.
- Try different batch sizes. I have not found this to be useful, but it is common advice. In theory, a larger batch size averages the updates to the weight matrices over more samples, so the effect of a few outlier samples is diluted and the weight matrices see a less volatile series of changes.
- Increase the size of hidden layers. I have not found this to be useful, but it is common advice.
- Use non-linearities whose output range matches the ground truth. Tanh outputs -1 to 1; ReLU outputs 0 to infinity and discards negative numbers. I have not found this to be useful, but it is common advice; a sketch of matching the final activation to the target range follows this list.
- Zero out the bias vector of all linear units. The theory is that the bias (perhaps only in the last layer) ends up responsible for most of the prediction, so the weights and the input have less impact. I tried this, but it also didn't help; a sketch follows this list.
- Normalize inputs correctly. This is what worked for me when I had a classifier that always returned the same class. In my case I used: data = (data - mean) / (max - min). A sketch of this normalization follows this list.
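For the non-linearity advice, here is what "match the output range to the ground truth" can look like in practice. This is a sketch only: PyTorch, the layer sizes (16 inputs, 64 hidden units), and targets scaled to [-1, 1] are all assumptions of mine.

```python
import torch.nn as nn

# Targets in [-1, 1]  ->  end with Tanh, whose range is (-1, 1).
regressor = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),   # hidden layers can still use ReLU
    nn.Linear(64, 1),
    nn.Tanh(),   # final activation matches the target range
)
```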
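For the bias-zeroing experiment, a sketch of one way to do it, assuming PyTorch `nn.Linear` layers; zeroing the existing parameters in place is my reading of the advice, not necessarily how it was originally tried.

```python
import torch
import torch.nn as nn

def zero_linear_biases(model: nn.Module) -> None:
    """Set the bias vector of every Linear layer in the model to zero."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.zero_()
```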
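And for the normalization that fixed my stuck classifier, the formula above written out. The NumPy usage and the per-feature statistics (data as a 2-D array of shape samples x features) are assumptions; the formula itself is the one I used.

```python
import numpy as np

def normalize(data: np.ndarray) -> np.ndarray:
    """Center on the mean and scale by the range: (data - mean) / (max - min), per feature."""
    mean = data.mean(axis=0)
    spread = data.max(axis=0) - data.min(axis=0)
    spread = np.where(spread == 0, 1.0, spread)  # guard against constant features (divide by zero)
    return (data - mean) / spread
```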