Unable to run logistic regression due to "perfect separation error"


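You get PerfectSeparationError because loansData['IR_TF'] contains only a single value, True (or 1). You had already converted the interest rate from a percentage to a decimal, so every rate is below 12 and the comparison is always true. Define IR_TF against the decimal threshold instead: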

loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12 #not 12, and you don't need .map 
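To see why the threshold matters, here is a minimal sketch using a few made-up rates (the values are illustrative assumptions, not rows from loansData). Once the rates are stored as decimals, comparing them against 12 gives a constant column, while comparing against 0.12 gives both classes:

```python
import pandas as pd

# Hypothetical interest rates, already converted from percent to decimal
rates = pd.Series([0.0890, 0.1212, 0.2198, 0.0999, 0.1171])

wrong = rates < 12     # decimal data compared against a percent threshold
right = rates < 0.12   # threshold expressed in the same units as the data

print(wrong.nunique())  # number of distinct values when the threshold is 12
print(right.nunique())  # number of distinct values when the threshold is 0.12

# A quick sanity check that could be run before fitting a logistic regression
assert right.nunique() == 2, "dependent variable must contain both classes"
```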

Then your regression will run successfully:

Optimization terminated successfully.
         Current function value: 0.319503
         Iterations 8
FICO.Score           0.087423
Amount.Requested    -0.000174
Intercept          -60.125045
dtype: float64
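As a rough sketch of how these coefficients can be read: a logit model gives the probability of the outcome (here, an interest rate below 12%) as 1 / (1 + exp(-(Intercept + b1 * FICO.Score + b2 * Amount.Requested))). The helper name and the example FICO score and loan amount below are assumptions chosen only for illustration.

```python
import math

# Coefficients copied from the fitted output above
INTERCEPT = -60.125045
B_FICO = 0.087423
B_AMOUNT = -0.000174

def prob_rate_below_12(fico_score, amount_requested):
    """Hypothetical helper: logistic function applied to the linear predictor."""
    linear = INTERCEPT + B_FICO * fico_score + B_AMOUNT * amount_requested
    return 1.0 / (1.0 + math.exp(-linear))

# Illustrative inputs (not taken from the data set)
p = prob_rate_below_12(720, 10000)
print(round(p, 2))
```

The signs match intuition: a higher FICO score raises the probability, and a larger requested amount lowers it.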

I also noticed several places where the code could be easier to read or run faster. In particular, .map may not be as fast as vectorized calculations. Here are my suggestions:

from scipy import stats
import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm

loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0
loanlength = loansData['Loan.Length'].str.strip('months')  #.astype(int)  --> loanlength not used below
loansData['FICO.Score'] = loansData['FICO.Range'].str.split('-', expand=True)[0].astype(int)

# add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12

# create intercept column
loansData['Intercept'] = 1.0

# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept']

# define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])

# fit the model
result = logit.fit()

# get fitted coef
coeff = result.params

# print coeff
print(result.summary())  # result has more information

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  IR_TF   No. Observations:                 2500
Model:                          Logit   Df Residuals:                     2497
Method:                           MLE   Df Model:                            2
Date:                Thu, 07 Jan 2016   Pseudo R-squ.:                  0.5243
Time:                        23:15:54   Log-Likelihood:                -798.76
converged:                       True   LL-Null:                       -1679.2
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
FICO.Score           0.0874      0.004     24.779      0.000         0.081     0.094
Amount.Requested    -0.0002    1.1e-05    -15.815      0.000        -0.000    -0.000
Intercept          -60.1250      2.420    -24.840      0.000       -64.869   -55.381
====================================================================================
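To illustrate the point about .map, here is a small sketch comparing an element-wise .map call with the vectorized string methods used in the cleaning step above. The sample strings are made up, and the lambda is only an assumption about the general shape of a .map-based cleanup, not the code from the question.

```python
import pandas as pd

# Made-up sample in the same format as the 'Interest.Rate' column
raw = pd.Series(['8.90%', '12.12%', '21.98%'])

# Element-wise: a Python function is called once per row
via_map = raw.map(lambda s: round(float(s.rstrip('%')), 2) / 100.0)

# Vectorized: pandas string methods and arithmetic act on the whole column
vectorized = raw.str.rstrip('%').astype(float).round(2) / 100.0

# The two results should agree up to floating-point rounding
max_diff = (via_map - vectorized).abs().max()
print(max_diff < 1e-12)
```

On a 2500-row file the speed difference is unlikely to matter much; the vectorized form is mainly shorter and easier to read.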

By the way -- is this P2P lending data?