Linear regression analysis with string/categorical features (variables)?
Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.
Usually there are three possibilities:
- One-Hot encoding for categorical data
- Arbitrary numbers for ordinal data
- Use something like group means for categorical data (e.g. mean prices for city districts).
You have to be careful not to infuse information that you would not have in the application case.
One hot encoding
If you have categorical data, you can create dummy variables with 0/1 values for each possible value.
E.g.

idx  color
0    blue
1    green
2    green
3    red

to

idx  blue  green  red
0    1     0      0
1    0     1      0
2    0     1      0
3    0     0      1
This can easily be done with pandas:
import pandas as pd
data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
print(pd.get_dummies(data))
will result in:
   color_blue  color_green  color_red
0           1            0          0
1           0            1          0
2           0            1          0
3           0            0          1
Numbers for ordinal data
Create a mapping of your sortable categories, e.g. old < renovated < new → 0, 1, 2
This is also possible with pandas:
data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
data['q'] = data['q'].astype('category')
data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
data['q'] = data['q'].cat.codes
print(data['q'])
Result:
0    0
1    2
2    2
3    1
Name: q, dtype: int8
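The same ordinal encoding can be expressed more explicitly with a plain dict and `Series.map`; a minimal sketch using the same column `q` as above:

```python
import pandas as pd

# Explicit ordinal mapping: old < ren < new
data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
mapping = {'old': 0, 'ren': 1, 'new': 2}
data['q'] = data['q'].map(mapping)
print(data['q'].tolist())  # [0, 2, 2, 1]
```

The dict form makes the assumed ordering visible in the code, which can be easier to review than `reorder_categories`.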
Using categorical data for groupby operations
You could use the mean of each category over past (known) events.
Say you have a DataFrame with the last known mean prices for cities:
prices = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})
print(data.merge(mean_price, on='city', how='left'))
Result:
  city  price
0    A      1
1    B      2
2    C      3
3    A      1
4    B      2
5    A      1
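One practical detail with this approach: because of `how='left'`, any city that never appeared in the historical prices ends up with a NaN mean. A sketch of one possible fallback, filling unseen cities with the global mean (the fallback choice is an assumption, not part of the original answer):

```python
import pandas as pd

prices = pd.DataFrame({
    'city': ['A', 'A', 'B'],
    'price': [1, 1, 2],
})
mean_price = prices.groupby('city', as_index=False)['price'].mean()

data = pd.DataFrame({'city': ['A', 'B', 'D']})  # 'D' has no known price
merged = data.merge(mean_price, on='city', how='left')

# Unseen categories get NaN; fall back to the global mean
merged['price'] = merged['price'].fillna(prices['price'].mean())
print(merged)
```

Whether the global mean is a sensible fallback depends on your data; the important part is deciding explicitly what happens for unseen categories.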
You can use "Dummy Coding" in this case. There are Python libraries to do dummy coding; you have a few options:
- You may use the scikit-learn library. Take a look here.
- Or, if you are working with pandas, it has a built-in function to create dummy variables.
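For the scikit-learn option, a minimal sketch with `OneHotEncoder` (the column values here are invented; note the encoder returns a sparse matrix by default, hence `.toarray()`):

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Hypothetical single categorical column
X = np.array([['a'], ['b'], ['c'], ['b']])
enc = OneHotEncoder()
onehot = enc.fit_transform(X).toarray()  # one column per category, sorted: a, b, c
print(onehot)
```

Unlike `pd.get_dummies`, a fitted encoder can be reused to transform new data with the same columns, which matters when you encode a test set separately.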
An example with pandas is below:
import pandas as pd
sample_data = [[1, 2, 'a'], [3, 4, 'b'], [5, 6, 'c'], [7, 8, 'b']]
df = pd.DataFrame(sample_data, columns=['numeric1', 'numeric2', 'categorical'])
dummies = pd.get_dummies(df.categorical)
df.join(dummies)
In linear regression with categorical variables you should be careful of the Dummy Variable Trap: a scenario in which the independent variables are multicollinear, i.e. two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. This can make the model matrix singular, meaning your model just won't work. Read about it here
The idea is to use dummy variable encoding with drop_first=True, which omits one column for each category after converting the categorical variables into dummy/indicator variables. You will not lose any relevant information by doing that, simply because every point in your dataset can fully be explained by the rest of the features.
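To see what `drop_first=True` does, compare the two encodings on a toy column (the values are illustrative):

```python
import pandas as pd

data = pd.DataFrame({'color': ['blue', 'green', 'red']})
full = pd.get_dummies(data)                      # 3 columns: one per category
reduced = pd.get_dummies(data, drop_first=True)  # 2 columns: 'blue' becomes the baseline
print(list(full.columns))     # ['color_blue', 'color_green', 'color_red']
print(list(reduced.columns))  # ['color_green', 'color_red']
```

A row with zeros in both remaining columns is unambiguously 'blue', so no information is lost; the dropped category is absorbed into the model's intercept.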
Here is the complete code on how you can do it for your housing dataset.
So you have categorical features:
District, Condition, Material, Security, Type
And one numerical feature that you are trying to predict:
Price
First you need to split your initial dataset into input variables and the prediction; assuming it is a pandas DataFrame, it would look like this:
Input variables:
X = housing[['District','Condition','Material','Security','Type']]
Prediction:
Y = housing['Price']
Convert categorical variable into dummy/indicator variables and drop one in each category:
X = pd.get_dummies(data=X, drop_first=True)
So if you now check the shape of X with drop_first=True, you will see that it has 5 fewer columns - one for each of your categorical variables.
You can now continue to use them in your linear model. For scikit-learn implementation it could look like this:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.20, random_state=40)
regr = linear_model.LinearRegression()  # Do not use fit_intercept=False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
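A self-contained sketch of the whole pipeline on made-up data (the column names mirror the housing example, but the values are invented, so the fitted coefficients mean nothing):

```python
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Invented toy data standing in for the housing dataset
housing = pd.DataFrame({
    'District': ['N', 'S', 'N', 'E', 'S', 'E', 'N', 'S'],
    'Condition': ['old', 'new', 'new', 'old', 'old', 'new', 'new', 'old'],
    'Price': [100, 200, 180, 90, 110, 210, 190, 95],
})

# Dummy-encode with one baseline dropped per categorical variable
X = pd.get_dummies(housing[['District', 'Condition']], drop_first=True)
Y = housing['Price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.25, random_state=40)
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
print(regr.score(X_test, Y_test))  # R^2 on the held-out rows
```

Note that District (3 categories) and Condition (2 categories) yield 2 + 1 = 3 dummy columns after dropping one baseline each.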