Unable to run logistic regression due to "perfect separation error"

python numpy pandas matplotlib logistic-regression

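You get PerfectSeparationError because loansData['IR_TF'] contains only a single value, True (or 1). You had already converted the interest rate from a percentage to a decimal, so every rate is below 12 and the comparison is always True. Define IR_TF against the decimal threshold instead: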

loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12 #not 12, and you don't need .map
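To see why the threshold matters, you can count the distinct values of the outcome column before fitting. This is a minimal sketch on a few made-up rates (not the real file), assuming the same '%'-suffixed string format as the raw data:

```python
import pandas as pd

# Made-up rates in the same '%'-suffixed string format as the raw file (illustrative only)
toy = pd.DataFrame({'Interest.Rate': ['8.90%', '12.12%', '15.31%', '7.90%']})

# Same cleaning idea as in the answer: percent string -> decimal fraction
toy['Interest.Rate'] = toy['Interest.Rate'].str.rstrip('%').astype(float) / 100.0

# Threshold on the wrong scale: every decimal rate is below 12
wrong = toy['Interest.Rate'] < 12
# Threshold on the decimal scale: both classes appear
right = toy['Interest.Rate'] < 0.12

# A usable outcome column needs both True and False
print(wrong.value_counts())
print(right.value_counts())
```

If the outcome column shows only one distinct value, the model has nothing to separate, which is the situation described above.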

Then your regression will run successfully:

Optimization terminated successfully.
         Current function value: 0.319503
         Iterations 8
FICO.Score           0.087423
Amount.Requested    -0.000174
Intercept          -60.125045
dtype: float64
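As a rough illustration of how to read these coefficients, the linear score can be passed through the logistic function to get an estimated probability that the rate is below 12%. The helper name prob_rate_below_12 and the applicant values (FICO 720, 10,000 requested) are hypothetical and chosen only for the example:

```python
import math

# Coefficients copied from the fitted output above
coef = {'FICO.Score': 0.087423, 'Amount.Requested': -0.000174, 'Intercept': -60.125045}

def prob_rate_below_12(fico_score, amount_requested):
    """Estimated probability that the interest rate is below 12%."""
    linear = (coef['Intercept']
              + coef['FICO.Score'] * fico_score
              + coef['Amount.Requested'] * amount_requested)
    # Logistic (sigmoid) function maps the linear score to a probability
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical applicant, chosen only for illustration
p = prob_rate_below_12(720, 10000)
print(round(p, 3))
```

A higher FICO score raises the probability and a larger requested amount lowers it, which matches the signs of the coefficients.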

Also, I noticed several places where the code could be made easier to read or faster. In particular, .map may not be as fast as vectorized calculations. Here are my suggestions:

from scipy import stats
import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt
import statsmodels.api as sm

loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

## cleaning the file
loansData['Interest.Rate'] = loansData['Interest.Rate'].str.rstrip('%').astype(float).round(2) / 100.0
loanlength = loansData['Loan.Length'].str.strip('months')#.astype(int)  --> loanlength not used below
loansData['FICO.Score'] = loansData['FICO.Range'].str.split('-', expand=True)[0].astype(int)

#add interest rate less than column and populate
## we only care about interest rates less than 12%
loansData['IR_TF'] = loansData['Interest.Rate'] < 0.12

#create intercept column
loansData['Intercept'] = 1.0

# create list of ind var col names
ind_vars = ['FICO.Score', 'Amount.Requested', 'Intercept']

#define logistic regression
logit = sm.Logit(loansData['IR_TF'], loansData[ind_vars])

#fit the model
result = logit.fit()

#get fitted coef
coeff = result.params

#print coeff
print(result.summary()) #result has more information

                           Logit Regression Results
==============================================================================
Dep. Variable:                  IR_TF   No. Observations:                 2500
Model:                          Logit   Df Residuals:                     2497
Method:                           MLE   Df Model:                            2
Date:                Thu, 07 Jan 2016   Pseudo R-squ.:                  0.5243
Time:                        23:15:54   Log-Likelihood:                -798.76
converged:                       True   LL-Null:                       -1679.2
                                        LLR p-value:                     0.000
====================================================================================
                       coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------
FICO.Score           0.0874      0.004     24.779      0.000         0.081     0.094
Amount.Requested    -0.0002    1.1e-05    -15.815      0.000        -0.000    -0.000
Intercept          -60.1250      2.420    -24.840      0.000       -64.869   -55.381
====================================================================================
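The point about .map versus vectorized calculations can be checked with a small comparison. This sketch uses synthetic rates (not the loan data) and an arbitrary size of 100,000 rows; the timings it prints are only indicative and will vary by machine and pandas version:

```python
import timeit
import numpy as np
import pandas as pd

# Synthetic decimal interest rates (not the loan data), just to compare the two styles
rng = np.random.default_rng(0)
rates = pd.Series(rng.uniform(0.05, 0.25, size=100_000))

# Element-wise style: a Python function is called once per row
via_map = rates.map(lambda r: r < 0.12)
# Vectorized style: one comparison over the whole column
vectorized = rates < 0.12

# Check that both approaches agree on every row
print((via_map == vectorized).all())

# Rough timing of each style over 5 repetitions
print(timeit.timeit(lambda: rates.map(lambda r: r < 0.12), number=5))
print(timeit.timeit(lambda: rates < 0.12, number=5))
```

Since both styles produce the same column, the vectorized form is usually the simpler choice to read as well.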

By the way -- is this P2P lending data?
