
How to implement the Softmax function in Python


They're both correct, but yours is preferred from the point of view of numerical stability.

You start with

e^(x - max(x)) / sum(e^(x - max(x)))

By using the fact that a^(b - c) = (a^b)/(a^c) we have

= e^x / (e^max(x) * sum(e^x / e^max(x)))
= e^x / sum(e^x)

Which is what the other answer says. You could replace max(x) with any variable and it would cancel out.
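For instance, a minimal sketch showing that the subtracted constant cancels out numerically as well (the helper name shifted_softmax is just illustrative, not from the question):

import numpy as np

def shifted_softmax(x, c):
    e = np.exp(np.asarray(x) - c)  # shift by an arbitrary constant c
    return e / e.sum()

x = [3.0, 1.0, 0.2]
print(shifted_softmax(x, 0.0))     # no shift
print(shifted_softmax(x, max(x)))  # shift by max(x)
print(shifted_softmax(x, 100.0))   # shift by any other constant
# all three print the same probabilities, e.g. [ 0.8360188   0.11314284  0.05083836]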


(Well... much confusion here, both in the question and in the answers...)

To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered this if you had also tried the 2-D score array provided as an example in the Udacity quiz.

Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let's try your solution (your_softmax) and one where the only difference is the axis argument:

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)  # only difference

As I said, for a 1-D score array, the results are indeed identical:

scores = [3.0, 1.0, 0.2]

print(your_softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
print(softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]

your_softmax(scores) == softmax(scores)
# array([ True,  True,  True], dtype=bool)

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

print(your_softmax(scores2D))
# [[  4.89907947e-04   1.33170787e-03   3.61995731e-03   7.27087861e-02]
#  [  1.33170787e-03   9.84006416e-03   2.67480676e-02   7.27087861e-02]
#  [  3.61995731e-03   5.37249300e-01   1.97642972e-01   7.27087861e-02]]

print(softmax(scores2D))
# [[ 0.09003057  0.00242826  0.01587624  0.33333333]
#  [ 0.24472847  0.01794253  0.11731043  0.33333333]
#  [ 0.66524096  0.97962921  0.86681333  0.33333333]]

The results are different: the second one is identical to the one expected in the Udacity quiz, where all columns sum to 1, which is not the case with the first (wrong) result.
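A quick sanity check with the two functions defined above makes this concrete:

softmax(scores2D).sum(axis=0)
# array([ 1.,  1.,  1.,  1.])   -- each column sums to 1

your_softmax(scores2D).sum()
# ~1.0                          -- here the whole 2-D array sums to 1 instead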

So, all the fuss was actually for an implementation detail - the axis argument. According to the numpy.sum documentation:

The default, axis=None, will sum all of the elements of the input array

while here we want to sum along axis 0, i.e. down each column (the quiz arranges the samples column-wise), so that each column sums to 1. For a 1-D array, summing along the (only) axis and summing all the elements happen to be identical, hence your identical results in that case...
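For reference, here is how the axis argument behaves on a tiny 2-D array:

import numpy as np

a = np.array([[1, 2],
              [3, 4]])
a.sum()        # 10             -- axis=None: sums all elements
a.sum(axis=0)  # array([4, 6])  -- sums down each column
a.sum(axis=1)  # array([3, 7])  -- sums across each row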

The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function - see here for the justification (numeric stability, also pointed out by some other answers here).
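To make the numerical-stability point concrete, here is a small sketch (not part of the quiz) with large scores, where the naive formula overflows while the max-subtracted one does not:

scores_big = np.array([1000.0, 1001.0, 1002.0])

np.exp(scores_big) / np.exp(scores_big).sum()
# overflow in exp -> array([ nan,  nan,  nan]) plus RuntimeWarnings

e_x = np.exp(scores_big - np.max(scores_big))
e_x / e_x.sum()
# array([ 0.09003057,  0.24472847,  0.66524096])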


So, this is really a comment to desertnaut's answer but I can't comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong. The problem is that he first uses a 1-dimensional input and then a 2-dimensional one, and his axis=0 normalization breaks as soon as the samples are arranged along the rows. Let me show this to you.

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# desertnaut's solution (copied from his answer):
def desertnaut_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)  # only difference

# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div

Let's take desertnaut's example:

x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)

This is the output:

your_softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

desertnaut_softmax(x1)
array([[ 1.,  1.,  1.,  1.]])

softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

You can see that desertnaut's version would fail in this situation. (It would not if the input were just one-dimensional, like np.array([1, 2, 3, 6]).) With a single row, each column contains only one element, so dividing each entry by its own column sum yields 1 everywhere.

Let's now use 3 samples, since that's the reason why we use a 2-dimensional input in the first place. The following x2 is not the same as the one from desertnaut's example.

x2 = np.array([[1, 2, 3, 6],  # sample 1
               [2, 4, 5, 6],  # sample 2
               [1, 2, 3, 6]]) # sample 1 again(!)

This input consists of a batch with 3 samples, but samples one and three are essentially the same. We therefore expect 3 rows of softmax activations, where the first should be the same as the third and also the same as our activation for x1!

your_softmax(x2)
array([[ 0.00183535,  0.00498899,  0.01356148,  0.27238963],
       [ 0.00498899,  0.03686393,  0.10020655,  0.27238963],
       [ 0.00183535,  0.00498899,  0.01356148,  0.27238963]])

desertnaut_softmax(x2)
array([[ 0.21194156,  0.10650698,  0.10650698,  0.33333333],
       [ 0.57611688,  0.78698604,  0.78698604,  0.33333333],
       [ 0.21194156,  0.10650698,  0.10650698,  0.33333333]])

softmax(x2)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047],
       [ 0.01203764,  0.08894682,  0.24178252,  0.65723302],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

I hope you can see that this is only the case with my solution.

softmax(x1) == softmax(x2)[0]
array([[ True,  True,  True,  True]], dtype=bool)

softmax(x1) == softmax(x2)[2]
array([[ True,  True,  True,  True]], dtype=bool)
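For what it's worth, here is a sketch of a more general variant (softmax_general is my own name, not from any of the answers above) that uses keepdims so the same code handles a 1-D vector and a 2-D batch of row-wise samples alike:

def softmax_general(x, axis=-1):
    x = np.asarray(x, dtype=float)
    x_max = np.max(x, axis=axis, keepdims=True)     # per-sample max, for stability
    e_x = np.exp(x - x_max)
    return e_x / e_x.sum(axis=axis, keepdims=True)  # per-sample normalization

softmax_general([3.0, 1.0, 0.2])
# array([ 0.8360188 ,  0.11314284,  0.05083836])
softmax_general(x2)
# same rows as softmax(x2) above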

Additionally, here is the result of TensorFlow's softmax implementation:

import tensorflow as tf
import numpy as np

batch = np.asarray([[1, 2, 3, 6], [2, 4, 5, 6], [1, 2, 3, 6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})

And the result:

array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037045],
       [ 0.01203764,  0.08894681,  0.24178252,  0.657233  ],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037045]], dtype=float32)
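(Note that placeholder, Session and initialize_all_variables are TensorFlow 1.x APIs; on TensorFlow 2.x the rough eager-mode equivalent would presumably be the short snippet below.)

import tensorflow as tf
import numpy as np

batch = np.asarray([[1, 2, 3, 6], [2, 4, 5, 6], [1, 2, 3, 6]], dtype=np.float32)
tf.nn.softmax(batch, axis=-1).numpy()
# same values as the array above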