
Neural network backprop not fully training


Your network is learning, as can be seen from the loss chart, so the backprop implementation is correct (congrats!). The main problem with this particular architecture is the choice of activation function: sigmoid. I replaced sigmoid with tanh and it instantly worked much better.

From this discussion on CV.SE:

There are two reasons for that choice (assuming you have normalized your data, and this is very important):

  • Having stronger gradients: since the data is centered around 0, the derivatives are higher. To see this, calculate the derivative of the tanh function and notice that its range (output values) is [0,1]. The range of the tanh function itself is [-1,1], while that of the sigmoid function is [0,1]

  • Avoiding bias in the gradients. This is explained very well in the paper, and it is worth reading it to understand these issues.

Though I'm sure a sigmoid-based NN can be trained as well, it looks like it's much more sensitive to the input values (note that they are not zero-centered), because the activation itself is not zero-centered. tanh is better than sigmoid by all means here, so the simpler approach is to just use that activation function.
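To make the "stronger gradients" point concrete, here is a quick numpy comparison (standalone helper functions and arbitrary sample inputs, not part of your class) of the two derivatives over zero-centered pre-activations:

    import numpy as np

    # Compare derivative magnitudes of sigmoid vs tanh on zero-centered inputs.
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    z = np.linspace(-2, 2, 5)                  # zero-centered pre-activations
    sig_grad = sigmoid(z) * (1 - sigmoid(z))   # sigmoid' never exceeds 0.25
    tanh_grad = 1 - np.tanh(z) ** 2            # tanh' peaks at 1.0
    print(np.round(sig_grad, 3))   # roughly [0.105 0.197 0.25  0.197 0.105]
    print(np.round(tanh_grad, 3))  # roughly [0.071 0.42  1.    0.42  0.071]

Around zero the backpropagated signal through tanh is up to four times larger (1.0 vs 0.25), which is consistent with why the tanh version trains so much faster here.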

The key change is this:

    def __tanh(self, z):
        return np.tanh(z)

    def __tanhPrime(self, a):
        return 1 - self.__tanh(a) ** 2

... instead of __sigmoid and __sigmoidPrime.
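As a quick sanity check that the derivative is right, a central finite difference should match 1 - tanh(z)**2 (a standalone sketch using plain functions that mirror __tanh/__tanhPrime, not the class itself):

    import numpy as np

    def tanh(z):
        return np.tanh(z)

    def tanh_prime(z):
        return 1 - tanh(z) ** 2

    z = np.linspace(-3, 3, 7)
    h = 1e-6
    numeric = (tanh(z + h) - tanh(z - h)) / (2 * h)  # central finite difference
    print(np.max(np.abs(numeric - tanh_prime(z))))   # should be ~1e-10 or smaller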

I have also tuned the hyperparameters a little, so that the network now learns in 100k epochs instead of 5 million:

    prior to training:
    [[ 0.        ]
     [-0.00056925]
     [-0.00044885]
     [-0.00101794]]
    post training:
    [[0.        ]
     [0.97335842]
     [0.97340917]
     [0.98332273]]

[loss plot]

The complete code is in this gist.


Well I'm an idiot. I was right about being wrong but I was wrong about how wrong I was. Let me explain.

Within the backwards training method I got the last layer trained correctly, but all the layers after that weren't trained correctly, hence why the network above was coming up with a result: it was indeed training, but only one layer.

So what did I do wrong? Well, I was only multiplying by the local gradient of the weights with respect to the output, so the chain rule was only partially applied.

Let's say the loss function was this:

    t = Y - X2
    loss = 1/2 * t^2

    a2 = X1*W2 + b
    X2 = activation(a2)

    a1 = X0*W1 + b
    X1 = activation(a1)

We know that the derivative of the loss with respect to W2 is -(Y - X2) * activation'(a2) * X1. This was done in the first part of my training function:

    def train(self,X,Y,loss,epoch=5000000):
        for i in range(epoch):
            #First part
            YHat = self.forward(X)
            delta = -(Y-YHat)
            loss.append(sum(Y-YHat))
            err = np.sum(np.dot(self.__layers[-1].localGrad,delta.T), axis=1)
            err.shape = (self.__hiddenDimensions[-1][0],1)
            self.__layers[-1].adjustWeights(err)
            i=0
            #Second part
            for l in reversed(self.__layers[:-1]):
                err = np.dot(l.localGrad, err)
                l.adjustWeights(err)
                i += 1

However, the second part is where I screwed up. To calculate the derivative of the loss with respect to W1, I must first multiply the original error -(Y - X2) by the last layer's local X gradient (its weights W2, together with the activation derivative activation'(a2)); because of the chain rule, this has to happen first. Only then can I multiply by the first layer's local W gradient (activation'(a1) times its input X0) to get the derivative of the loss with respect to W1. I failed to do the multiplication by the local X gradient first, so the last layer was indeed training, but the earlier layers were fed an error that got more and more wrong with every additional layer.
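To convince myself of the ordering, here is a minimal scalar sketch of the equations above (tanh as the activation, a separate bias per layer, and the particular numbers are all assumptions for the sketch; the names follow the equations, not my actual layer class) that checks both derivatives against finite differences:

    import numpy as np

    # Scalar version of: a1 = X0*W1 + b1, X1 = tanh(a1), a2 = X1*W2 + b2, X2 = tanh(a2)
    X0, Y = 0.5, 0.8
    W1, b1 = 0.3, 0.1
    W2, b2 = -0.2, 0.05

    def forward(w1, w2):
        a1 = X0 * w1 + b1
        X1 = np.tanh(a1)
        a2 = X1 * w2 + b2
        X2 = np.tanh(a2)
        return X1, X2

    X1, X2 = forward(W1, W2)

    # dloss/dW2: error times the last layer's local gradients only
    dW2 = -(Y - X2) * (1 - X2 ** 2) * X1

    # dloss/dW1: the error must first pass through the last layer's local X
    # gradient (activation'(a2) * W2) before the first layer's local gradients
    dW1 = -(Y - X2) * (1 - X2 ** 2) * W2 * (1 - X1 ** 2) * X0

    # finite-difference checks: each analytic value should match its estimate
    h = 1e-6
    loss = lambda w1, w2: 0.5 * (Y - forward(w1, w2)[1]) ** 2
    print(dW2, (loss(W1, W2 + h) - loss(W1, W2 - h)) / (2 * h))
    print(dW1, (loss(W1 + h, W2) - loss(W1 - h, W2)) / (2 * h))

Dropping the (1 - X2**2) * W2 factor from dW1 gives a value that no longer matches the finite difference, which is exactly the bug in my original second loop.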

To solve this I updated the train method:

    def train(self,X,Y,loss,epoch=10000):
        for i in range(epoch):
            YHat = self.forward(X)
            err = -(Y-YHat)
            loss.append(sum(Y-YHat))
            werr = np.sum(np.dot(self.__layers[-1].localWGrad,err.T), axis=1)
            werr.shape = (self.__hiddenDimensions[-1][0],1)
            self.__layers[-1].adjustWeights(werr)
            for l in reversed(self.__layers[:-1]):
                err = np.multiply(err, l.localXGrad)
                werr = np.sum(np.dot(l.weights,err.T),axis=1)
                l.adjustWeights(werr)

Now the loss graph I got looks like this:

[loss graph]
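For reference, here is the same ordering in a minimal standalone loop with plain numpy arrays (the OR-style toy data, layer sizes, learning rate, and variable names are illustrative assumptions, not my actual class): the error is pushed through the last layer's weights and activation derivative before each earlier layer's weight gradient is taken.

    import numpy as np

    rng = np.random.default_rng(0)
    X0 = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # toy inputs (assumption)
    Y  = np.array([[0.], [1.], [1.], [1.]])                  # OR-style targets (assumption)

    W1 = rng.normal(scale=0.5, size=(2, 3)); b1 = np.zeros((1, 3))
    W2 = rng.normal(scale=0.5, size=(3, 1)); b2 = np.zeros((1, 1))
    lr = 0.1

    for _ in range(10000):
        # forward pass, matching the equations above
        a1 = X0 @ W1 + b1; X1 = np.tanh(a1)
        a2 = X1 @ W2 + b2; X2 = np.tanh(a2)

        # backward pass for loss = 1/2 * (Y - X2)^2
        delta2 = -(Y - X2) * (1 - X2 ** 2)        # error folded through activation'(a2)
        dW2 = X1.T @ delta2                       # last layer's local W gradient is X1
        delta1 = (delta2 @ W2.T) * (1 - X1 ** 2)  # local X gradient (W2) first, then activation'(a1)
        dW1 = X0.T @ delta1                       # first layer's local W gradient is X0

        W2 -= lr * dW2; b2 -= lr * delta2.sum(axis=0, keepdims=True)
        W1 -= lr * dW1; b1 -= lr * delta1.sum(axis=0, keepdims=True)

    print(np.round(X2, 3))  # should end up close to Y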