# How to implement the Softmax function in Python

They're both correct, but yours is preferred from the point of view of numerical stability.

You start with

`e ^ (x - max(x)) / sum(e ^ (x - max(x)))`

By using the fact that `a^(b - c) = a^b / a^c` we have

`= e ^ x / (e ^ max(x) * sum(e ^ x / e ^ max(x)))`

`= e ^ x / sum(e ^ x)`

Which is what the other answer says. You could replace `max(x)` with any constant and it would cancel out.
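For instance, here is a quick numerical check of this cancellation (the test values are arbitrary):

```python
import numpy as np

x = np.array([3.0, 1.0, 0.2])

naive = np.exp(x) / np.sum(np.exp(x))

# subtracting any constant c from x leaves the result unchanged
for c in (np.max(x), 100.0, -5.0):
    shifted = np.exp(x - c) / np.sum(np.exp(x - c))
    assert np.allclose(naive, shifted)

print(naive)  # [ 0.8360188   0.11314284  0.05083836]
```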

(Well... much confusion here, both in the question and in the answers...)

To start with, the two solutions (i.e. yours and the suggested one) are **not** equivalent; they **happen** to be equivalent only for the special case of 1-D score arrays. You would have discovered this if you had also tried the 2-D score array from the Udacity quiz example.

Results-wise, the only actual difference between the two solutions is the `axis=0` argument. To see that this is the case, let's try your solution (`your_softmax`) and one where the only difference is the `axis` argument:

```python
import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)  # only difference
```

As I said, for a 1-D score array, the results are indeed identical:

```python
scores = [3.0, 1.0, 0.2]

print(your_softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]

print(softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]

your_softmax(scores) == softmax(scores)
# array([ True,  True,  True], dtype=bool)
```

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

```python
scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

print(your_softmax(scores2D))
# [[  4.89907947e-04   1.33170787e-03   3.61995731e-03   7.27087861e-02]
#  [  1.33170787e-03   9.84006416e-03   2.67480676e-02   7.27087861e-02]
#  [  3.61995731e-03   5.37249300e-01   1.97642972e-01   7.27087861e-02]]

print(softmax(scores2D))
# [[ 0.09003057  0.00242826  0.01587624  0.33333333]
#  [ 0.24472847  0.01794253  0.11731043  0.33333333]
#  [ 0.66524096  0.97962921  0.86681333  0.33333333]]
```

The results differ: only the second one is identical to the output expected in the Udacity quiz, where all columns sum to 1, which is not the case with the first (wrong) result.
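A quick sanity check (using `softmax`, `your_softmax`, and `scores2D` as defined above) confirms this:

```python
print(softmax(scores2D).sum(axis=0))
# [ 1.  1.  1.  1.]  -- each column sums to 1, as the quiz expects

print(your_softmax(scores2D).sum())
# 1.0  -- the naive version normalises over the whole array instead
```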

So, all the fuss was actually for an implementation detail - the `axis` argument. According to the numpy.sum documentation:

> The default, axis=None, will sum all of the elements of the input array

while here we want to sum over the rows, i.e. down each column (so that each column sums to 1), hence `axis=0`. For a 1-D array, the sum of the (only) row and the sum of all the elements happen to be identical, hence your identical results in that case...
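To illustrate the `axis` behaviour (a toy array, not from the quiz):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

print(a.sum())        # 10     -- axis=None: one total over all elements
print(a.sum(axis=0))  # [4 6]  -- summed over rows: one total per column

print(np.array([1, 2]).sum(axis=0))  # 3 -- for a 1-D array, same as sum()
```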

The `axis` issue aside, your implementation (i.e. your choice to subtract the max first) is actually **better** than the suggested solution! In fact, it is the recommended way of implementing the softmax function - see here for the justification (numerical stability, also pointed out by some other answers here).
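For example, with large scores the naive `e ^ x / sum(e ^ x)` overflows, while the max-subtracted version does not (a minimal sketch):

```python
import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])

# naive: np.exp(1000) overflows to inf, so the ratio becomes nan
print(np.exp(x) / np.sum(np.exp(x)))
# [ nan  nan  nan]  (plus a RuntimeWarning about overflow)

# stable: the largest exponent becomes 0, so nothing overflows
e_x = np.exp(x - np.max(x))
print(e_x / e_x.sum())
# [ 0.09003057  0.24472847  0.66524096]
```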

So, this is really a comment on desertnaut's answer, but I can't comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. **However, desertnaut's solution is also wrong.** The problem is that he first takes a 1-dimensional input and then takes a 2-dimensional input, and the two cases behave differently. Let me show this to you.

```python
import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# desertnaut's solution (copied from his answer):
def desertnaut_softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)  # only difference

# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis]  # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis]  # ditto
    return e_x / div
```
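As an aside (not part of the original answers), the same row-wise computation can be written a bit more compactly with `keepdims`, which keeps the reduced axis so the broadcasting works without the explicit `np.newaxis` steps:

```python
def softmax_keepdims(z):
    """Row-wise softmax for a 2-D score array; equivalent to softmax above."""
    e_x = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e_x / e_x.sum(axis=1, keepdims=True)
```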

Let's take desertnaut's example:

```python
x1 = np.array([[1, 2, 3, 6]])  # notice that we put the data into 2 dimensions(!)
```

This is the output:

```python
your_softmax(x1)
# array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

desertnaut_softmax(x1)
# array([[ 1.,  1.,  1.,  1.]])

softmax(x1)
# array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])
```

You can see that desertnaut's version fails in this situation: with a single row, `e_x.sum(axis=0)` simply returns the elements of that row, so each entry is divided by itself and the output is all ones. (It would not fail if the input were just one-dimensional, like `np.array([1, 2, 3, 6])`.)

Let's now use 3 samples, since that is the reason we use a 2-dimensional input in the first place. The following `x2` is not the same as the one from desertnaut's example.

```python
x2 = np.array([[1, 2, 3, 6],   # sample 1
               [2, 4, 5, 6],   # sample 2
               [1, 2, 3, 6]])  # sample 1 again(!)
```

This input is a batch of 3 samples, but samples one and three are essentially the same. We now expect 3 rows of softmax activations, where the first should be the same as the third, and both the same as our activation of `x1`!

```python
your_softmax(x2)
# array([[ 0.00183535,  0.00498899,  0.01356148,  0.27238963],
#        [ 0.00498899,  0.03686393,  0.10020655,  0.27238963],
#        [ 0.00183535,  0.00498899,  0.01356148,  0.27238963]])

desertnaut_softmax(x2)
# array([[ 0.21194156,  0.10650698,  0.10650698,  0.33333333],
#        [ 0.57611688,  0.78698604,  0.78698604,  0.33333333],
#        [ 0.21194156,  0.10650698,  0.10650698,  0.33333333]])

softmax(x2)
# array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047],
#        [ 0.01203764,  0.08894682,  0.24178252,  0.65723302],
#        [ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])
```

I hope you can see that this is only the case with my solution.

```python
softmax(x1) == softmax(x2)[0]
# array([[ True,  True,  True,  True]], dtype=bool)

softmax(x1) == softmax(x2)[2]
# array([[ True,  True,  True,  True]], dtype=bool)
```

Additionally, here are the results of TensorFlow's softmax implementation:

```python
import tensorflow as tf
import numpy as np

batch = np.asarray([[1, 2, 3, 6],
                    [2, 4, 5, 6],
                    [1, 2, 3, 6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)

sess = tf.Session()
sess.run(y, feed_dict={x: batch})
```

And the result:

```python
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037045],
       [ 0.01203764,  0.08894681,  0.24178252,  0.657233  ],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037045]], dtype=float32)
```
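(The snippet above uses the old TensorFlow 1.x session API. Assuming TensorFlow 2.x, where eager execution is the default, the same check reduces to:)

```python
import tensorflow as tf
import numpy as np

batch = np.asarray([[1, 2, 3, 6],
                    [2, 4, 5, 6],
                    [1, 2, 3, 6]], dtype=np.float32)
print(tf.nn.softmax(batch).numpy())  # row-wise softmax, same values as above
```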