Unable to run logistic regression due to "perfect separation error"
You get PerfectSeparationError because your loansData['IR_TF'] only has a single value, True (or 1). You first converted the interest rate from % to decimal, so every rate is below 12 and the comparison is always True. You should define IR_TF as

    loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12 #not 12, and you don't need .map

Then your regression will run successfully:

    Optimization terminated successfully.
             Current function value: 0.319503
             Iterations 8
    FICO.Score           0.087423
    Amount.Requested    -0.000174
    Intercept          -60.125045
    dtype: float64

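To make the cause concrete, here is a minimal sketch using a handful of made-up rates (not the real loan data) that are already in decimal form. It only illustrates how the two thresholds behave; the variable names are hypothetical.

```python
import pandas as pd

# Made-up rates, already converted from percent to decimal (assumption for illustration).
rates = pd.Series([0.0890, 0.1212, 0.2198, 0.0999, 0.1171])

wrong = rates < 12      # threshold still in percent units: every row is True
right = rates < 0.12    # threshold in decimal units: a mix of True and False

print(wrong.nunique())  # number of distinct outcome values with the old threshold
print(right.nunique())  # number of distinct outcome values with the corrected one
print(right.tolist())
```

With the old threshold the outcome column is constant, which is the situation described above; with 0.12 it contains both classes, so there is something for the model to fit.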
Also, I noticed several places where the code could be made easier to read or possibly faster. In particular, .map might not be as fast as vectorized calculations. Here are my suggestions:

    from scipy import stats
    import numpy as np
    import pandas as pd
    import collections
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

    ## cleaning the file
    loansData['Interest.Rate'] = loansData['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0
    loanlength = loansData['Loan.Length'].str.strip('months')#.astype(int) --> loanlength not used below
    loansData['FICO.Score'] = loansData['FICO.Range'].str.split('-', expand=True)[0].astype(int)

    #add interest rate less than column and populate
    ## we only care about interest rates less than 12%
    loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12

    #create intercept column
    loansData['Intercept'] = 1.0

    # create list of ind var col names
    ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept']

    #define logistic regression
    logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])

    #fit the model
    result = logit.fit()

    #get fitted coef
    coeff = result.params

    #print coeff
    print(result.summary()) #result has more information

                               Logit Regression Results
    ==============================================================================
    Dep. Variable:                  IR_TF   No. Observations:                 2500
    Model:                          Logit   Df Residuals:                     2497
    Method:                           MLE   Df Model:                            2
    Date:                Thu, 07 Jan 2016   Pseudo R-squ.:                  0.5243
    Time:                        23:15:54   Log-Likelihood:                -798.76
    converged:                       True   LL-Null:                       -1679.2
                                            LLR p-value:                     0.000
    ====================================================================================
                           coef    std err          z      P>|z|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------------
    FICO.Score           0.0874      0.004     24.779      0.000         0.081     0.094
    Amount.Requested    -0.0002    1.1e-05    -15.815      0.000        -0.000    -0.000
    Intercept          -60.1250      2.420    -24.840      0.000       -64.869   -55.381
    ====================================================================================

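For comparison, here is a small sketch of the two cleaning styles side by side, run on three made-up rows that mimic the raw column formats (the values and the variable names are assumptions, not taken from the real file). It shows that the .str accessor version gives the same result as an element-wise .map with a lambda.

```python
import pandas as pd

# Made-up rows mimicking the raw column formats (assumption for illustration).
raw = pd.DataFrame({
    'Interest.Rate': ['8.90%', '12.12%', '21.98%'],
    'FICO.Range': ['735-739', '715-719', '690-694'],
})

# Element-wise version: .map calls a Python lambda once per row.
rate_map = raw['Interest.Rate'].map(lambda x: float(x.rstrip('%')) / 100)
fico_map = raw['FICO.Range'].map(lambda x: int(x.split('-')[0]))

# Column-wise version: .str accessor plus astype, no lambda needed.
rate_vec = raw['Interest.Rate'].str.rstrip('%').astype(float) / 100
fico_vec = raw['FICO.Range'].str.split('-', expand=True)[0].astype(int)

print(fico_map.tolist())
print(fico_vec.tolist())
print(rate_vec.tolist())
```

Whether the column-wise version is actually faster depends on the data and the pandas version, so it is worth timing both on the full file; the readability gain holds either way.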
By the way -- is this P2P lending data?