Building SVM with TensorFlow's LinearClassifier and pandas DataFrames
Linear SVM
SVM is a max-margin classifier, i.e. it maximizes the width of the margin separating the positive class from the negative class. The loss function of linear SVM in the case of binary classification is given below.
It can be derived from the more general multi-class linear SVM loss (also called hinge loss) shown below (with Δ = 1).
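For reference, one common way to write these two losses is sketched here. The notation is an assumption, not taken from the post: $x_i$ is a data point with a constant 1 appended for the bias, $y_i$ is its label, and $w_j$ is the weight vector of class $j$. Only the data-loss terms are shown.

```latex
% Binary linear SVM, labels y_i in {-1, +1}: per-point hinge loss
L_i = \max\left(0,\; 1 - y_i\, w^T x_i\right)

% Multi-class linear SVM loss with margin \Delta
L_i = \sum_{j \neq y_i} \max\left(0,\; w_j^T x_i - w_{y_i}^T x_i + \Delta\right)
```

With two classes and Δ = 1, the multi-class sum has a single term, which is how the binary form falls out of it.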
Note: In all the above equations, the weight vector w includes the bias b.
How on earth did someone come up with this loss? Let's dig in.
The image above shows the data points belonging to the positive class separated from the data points belonging to the negative class by a separating hyperplane (shown as a solid line). However, there can be many such separating hyperplanes. SVM finds the separating hyperplane such that its distance to the nearest positive data point and to the nearest negative data point is maximum (shown as dotted lines).
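Mathematically, SVM finds the weight vector w (bias included) such that $w^T x \ge 1$ for every data point of the +ve class and $w^T x \le -1$ for every data point of the -ve class.
If the labels (y) of the +ve class and the -ve class are +1 and -1 respectively, then SVM finds w such that $y\,(w^T x) \ge 1$ for every data point.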
• If a data point is on the correct side of the hyperplane, on or beyond the margin (correctly classified), then $1 - y\,(w^T x) \le 0$
• If a data point is on the wrong side (misclassified) then $1 - y\,(w^T x) > 0$
So the loss for a data point, which is a measure of misclassification, can be written as $\max(0,\; 1 - y\,(w^T x))$.
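To make the per-point loss concrete, here is a minimal NumPy sketch. The toy points, the weight vector and the helper name `hinge_loss` are made up for illustration, and the bias is folded into w through a constant last column.

```python
import numpy as np

def hinge_loss(w, X, y):
    """Per-point hinge loss max(0, 1 - y * (w . x)); the bias is folded into w."""
    scores = X @ w
    return np.maximum(0.0, 1.0 - y * scores)

# Hypothetical toy data: the last column is a constant 1 that carries the bias.
X = np.array([[ 2.0, 1.0],   # positive point, beyond the margin
              [ 0.5, 1.0],   # positive point, correct side but inside the margin
              [-2.0, 1.0],   # negative point, beyond the margin
              [ 1.0, 1.0]])  # negative point on the wrong side
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])     # hyperplane x1 = 0 with zero bias

losses = hinge_loss(w, X, y)
print(losses)  # [0.  0.5 0.  2. ]
```

Points beyond the margin contribute nothing, a point inside the margin contributes a small loss even though it is on the correct side, and the misclassified point contributes the most.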
Regularization
If a weight vector w correctly classifies the data (X), then any multiple λw of this weight vector with λ > 1 will also correctly classify the data (zero loss). This is because the transformation λw stretches all score magnitudes and hence also their absolute differences. L2 regularization penalizes large weights by adding a regularization loss to the hinge loss.
For example, take x=[1,1,1,1] and two weight vectors w1=[1,0,0,0] and w2=[0.25,0.25,0.25,0.25]. Then dot(w1,x) = dot(w2,x) = 1, i.e. both weight vectors lead to the same dot product and hence the same hinge loss. But the L2 penalty of w1 is 1.0 while the L2 penalty of w2 is only 0.25. Hence L2 regularization prefers w2 over w1. The classifier is encouraged to take all input dimensions into account in small amounts rather than a few input dimensions very strongly. This improves the generalization of the model and leads to less overfitting.
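The numbers in this example can be checked with a few lines of NumPy. The label `y = 1.0` and the scale factor 10 are assumptions added to illustrate the scaling argument; the vectors are the ones from the text.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

# Same score, hence same hinge loss
score1, score2 = np.dot(w1, x), np.dot(w2, x)

# Different L2 penalties (sum of squared weights)
penalty1, penalty2 = np.sum(w1 ** 2), np.sum(w2 ** 2)

# Scaling a zero-loss weight vector keeps the hinge loss at zero
# but inflates the penalty (assumed label y = +1, assumed factor 10).
y = 1.0
loss_scaled = max(0.0, 1.0 - y * np.dot(10 * w2, x))
penalty_scaled = np.sum((10 * w2) ** 2)

print(score1, score2)               # 1.0 1.0
print(penalty1, penalty2)           # 1.0 0.25
print(loss_scaled, penalty_scaled)  # 0.0 25.0
```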
The L2 penalty leads to the max-margin property in SVMs. If the SVM is expressed as an optimization problem, then the generalized Lagrangian form of the constrained quadratic optimization problem is as below.
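As a sketch of that formulation, the textbook hard-margin version can be written as follows. Two assumptions differ from the rest of the post: the bias b is written out separately instead of being folded into w, and the $\alpha_i \ge 0$ are the Lagrange multipliers, one per training point.

```latex
% Primal problem (hard margin)
\min_{w,\,b} \; \frac{1}{2}\,\lVert w \rVert^2
\quad \text{subject to} \quad y_i\,(w^T x_i + b) \ge 1 \;\; \text{for all } i

% Generalized Lagrangian
\mathcal{L}(w, b, \alpha) = \frac{1}{2}\,\lVert w \rVert^2
  - \sum_i \alpha_i \left[\, y_i\,(w^T x_i + b) - 1 \,\right],
\qquad \alpha_i \ge 0
```

The $\frac{1}{2}\lVert w \rVert^2$ term is the same L2 penalty as above. Minimizing it under the margin constraints is what maximizes the margin width.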
Now that we know the loss function of linear SVM, we can use gradient descent (or other optimizers) to find the weight vector which minimizes the loss.
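As a rough illustration of that idea, the sketch below runs plain subgradient descent on the mean hinge loss plus an L2 term using only NumPy. The function name `train_linear_svm`, the hyperparameters `lam`, `lr` and `epochs`, and the toy data are all assumptions made for this example. Because the bias is folded into w, it is regularized too, which is a simplification.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on mean hinge loss + lam * ||w||^2 (bias folded into w)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # constant 1 carries the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        margins = y * (Xb @ w)
        active = margins < 1  # points that currently have non-zero hinge loss
        # Subgradient: -y_i * x_i averaged over all points (zero for inactive
        # ones), plus the gradient 2 * lam * w of the L2 term.
        grad = -(y[active][:, None] * Xb[active]).sum(axis=0) / len(y) + 2 * lam * w
        w -= lr * grad
    return w

# Hypothetical, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-2.0, -1.0], [-3.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w = train_linear_svm(X, y)
pred = np.sign(np.hstack([X, np.ones((len(X), 1))]) @ w)
print(w, pred)
```

The Keras code in the next section does the same job, with Adam as the optimizer instead of a fixed step size.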
Code
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

# Load Data
iris = datasets.load_iris()
X = iris.data[:, :2][iris.target != 2]
y = iris.target[iris.target != 2]

# Change labels to +1 and -1
y = np.where(y==1, y, -1)
# Same dtype as X (see Update 1)
y = y.astype(np.float64)

# Linear Model with L2 regularization
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(1, activation='linear', kernel_regularizer=tf.keras.regularizers.l2()))

# Hinge loss
def hinge_loss(y_true, y_pred):
    return tf.maximum(0., 1- y_true*y_pred)

# Train the model
model.compile(optimizer='adam', loss=hinge_loss)
model.fit(X, y, epochs=50000, verbose=False)

# Plot the learned decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.show()
SVM can also be expressed as a constrained quadratic optimization problem. The advantage of this formulation is that we can use the kernel trick to classify non-linearly separable data (using different kernels). LIBSVM implements the sequential minimal optimization (SMO) algorithm for kernelized support vector machines (SVMs).
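The kernel trick itself can be illustrated without any SVM library. The sketch below assumes a homogeneous degree-2 polynomial kernel, chosen only because its feature map is easy to write down, and uses made-up helper names `phi` and `poly_kernel`. It shows that the kernel computed in the input space equals the dot product in the mapped feature space.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input: [x1^2, sqrt(2)*x1*x2, x2^2]."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return np.dot(x, z) ** 2

# Hypothetical input points
x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # dot product in the mapped feature space
implicit = poly_kernel(x, z)       # same value, computed in the input space
print(explicit, implicit)
```

A kernelized SVM only ever needs these pairwise kernel values, so it never has to build the mapped features explicitly.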
Code
from sklearn.svm import SVC

# Reuses X, y, np and plt from the previous code block

# SVM with linear kernel
clf = SVC(kernel='linear')
clf.fit(X, y)

# Plot the learned decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
# Predict with the fitted SVC (clf), not the Keras model
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1)
plt.show()
Finally
The linear SVM model in tf that you can use for your problem statement is:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

# Prepare Data
# 10 Binary features
df = pd.DataFrame(np.random.randint(0,2,size=(1000, 10)))
# 1 floating value feature
df[11] = np.random.uniform(0,100000, size=(1000))
# True Label
df[12] = np.random.randint(0, 2, size=(1000))

# Convert data to zero mean unit variance
scaler = StandardScaler().fit(df[df.columns.drop(12)])
X = scaler.transform(df[df.columns.drop(12)])
y = np.array(df[12])

# Convert label to +1 and -1. Needed for hinge loss
y = np.where(y==1, +1, -1)
# Same dtype as X (see Update 1)
y = y.astype(np.float64)

# Model
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(1, activation='linear', kernel_regularizer=tf.keras.regularizers.l2()))

# Hinge Loss
def my_loss(y_true, y_pred):
    return tf.maximum(0., 1- y_true*y_pred)

# Train model
model.compile(optimizer='adam', loss=my_loss)
model.fit(X, y, epochs=100, verbose=True)
K-Fold cross validation and making predictions
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Load Data
iris = datasets.load_iris()
X = iris.data[:, :2][iris.target != 2]
y_ = iris.target[iris.target != 2]

# Change labels to +1 and -1 (float, same dtype as X; see Update 1)
y = np.where(y_==1, +1, -1).astype(np.float64)

# Hinge loss
def hinge_loss(y_true, y_pred):
    return tf.maximum(0., 1- y_true*y_pred)

def get_model():
    # Linear Model with L2 regularization
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(1, activation='linear', kernel_regularizer=tf.keras.regularizers.l2()))
    model.compile(optimizer='adam', loss=hinge_loss)
    return model

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

predict = lambda model, x : sigmoid(model.predict(x).reshape(-1))
predict_class = lambda model, x : np.where(predict(model, x)>0.5, 1, 0)

kf = KFold(n_splits=2, shuffle=True)

# K Fold cross validation
best = (None, -1)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = get_model()
    model.fit(X_train, y_train, epochs=5000, verbose=False, batch_size=128)
    # predict_classes() was removed from recent tf.keras; use the helper above
    y_pred = predict_class(model, X_test)
    val = roc_auc_score(y_test, y_pred)
    print("CV Fold {0}: AUC: {1}".format(i+1, val))
    if best[1] < val:
        best = (model, val)

# ROC Curve using the best model
y_score = predict(best[0], X)
fpr, tpr, _ = roc_curve(y_, y_score)
roc_auc = auc(fpr, tpr)
print(roc_auc)

# Plot ROC
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()

# Make predictions
y_score = predict_class(best[0], X)
Making predictions
Since the output of the model is linear, we have to normalize it to probabilities to make predictions. If it is a binary classification we can use sigmoid, or if it is a multiclass classification we can use softmax. The code below is for binary classification.
predict = lambda model, x : sigmoid(model.predict(x).reshape(-1))
predict_class = lambda model, x : np.where(predict(model, x)>0.5, 1, 0)
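For a version that runs without a trained model, the sketch below applies both normalizations to made-up raw scores, which stand in for `model.predict(x)`. The `softmax` helper and all score values are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw linear scores for a binary model
binary_scores = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])
binary_classes = np.where(sigmoid(binary_scores) > 0.5, 1, 0)

# Hypothetical raw scores for a 3-class model, one row per sample
multi_scores = np.array([[2.0, 1.0, 0.1],
                         [0.5, 2.5, 0.0]])
multi_probs = softmax(multi_scores)
multi_classes = multi_probs.argmax(axis=1)

print(binary_classes)  # [0 0 0 1 1]
print(multi_classes)   # [0 1]
```

Thresholding the sigmoid at 0.5 is the same as checking whether the raw score is positive, so the sigmoid here squashes the score into (0, 1) without changing the predicted class.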
Update 1:
To make the code compatible with tf 2.0, the datatype of y should be the same as that of X. To do this, after the line y = np.where(..... add the line y = y.astype(np.float64).
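A small sketch of that cast on its own, with made-up arrays standing in for the iris features and labels:

```python
import numpy as np

X = np.array([[5.1, 3.5], [7.0, 3.2], [4.9, 3.0]])  # hypothetical features (float64)
y = np.array([0, 1, 0])                              # hypothetical integer labels

y = np.where(y == 1, +1, -1)  # integer labels in {-1, +1}
y = y.astype(np.float64)      # cast so that y has the same dtype as X
print(X.dtype, y.dtype)       # float64 float64
```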
Since all of your features are already numerical you can use them as they are.
import numpy as np
import pandas as pd
import tensorflow as tf  # TensorFlow 1.x API; in 2.x pandas_input_fn lives under tf.compat.v1

# 12 random binary columns; 'K' is replaced by a float feature, 'L' is the label
df = pd.DataFrame(np.random.randint(0,2,size=(100, 12)), columns=list('ABCDEFGHIJKL'))
df['K'] = np.random.random(100)

# All 11 feature columns are numeric, so use them as they are
numeric_features = [tf.feature_column.numeric_column(column) for column in df.columns[:11]]
model = tf.estimator.LinearClassifier(feature_columns=numeric_features)
tf_val = tf.estimator.inputs.pandas_input_fn(df.iloc[:,:11], df.iloc[:,11], shuffle=True)
model.train(input_fn=tf_val, steps=1000)
print(list(model.predict(input_fn=tf_val))[0])

# Output:
{'logits': array([-1.7512109], dtype=float32),
 'logistic': array([0.14789453], dtype=float32),
 'probabilities': array([0.8521055 , 0.14789453], dtype=float32),
 'class_ids': array([0]),
 'classes': array([b'0'], dtype=object)}
The probabilities entry of the prediction output is most likely what you are interested in. You have two probabilities, one for the target being False and one for True.
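The entries of that output are related in a simple way, which the following sketch reproduces from the printed `logits` value using only the standard library. The variable names are chosen for illustration.

```python
import math

logit = -1.7512109  # 'logits' value from the prediction output above

p_true = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the logit, the 'logistic' entry
p_false = 1.0 - p_true                   # first entry of 'probabilities'
class_id = int(p_true > 0.5)             # the 'class_ids' entry

print(p_false, p_true, class_id)
```

So `probabilities` is just `[1 - logistic, logistic]`, and the predicted class is whichever of the two is larger.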
If you want to have more details look at this nice blog-post about binary classification with TensorFlow.