How to add another feature (length of text) to current bag of words classification? Scikit-learn

python machine-learning scikit-learn classification text-classification

As shown in the comments, this is a combination of a FunctionTransformer, a Pipeline and a FeatureUnion.

    import numpy as np
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import FunctionTransformer

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                        "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                        "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                        "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",
                        ])
    y_train = np.array([[0], [0], [0], [0], [1], [1], [1], [1]])

    X_test = np.array(["it's a nice day in nyc",
                       'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                       ])
    target_names = ['Class 1', 'Class 2']


    def get_text_length(x):
        # Return the character count of each document as a single column.
        return np.array([len(t) for t in x]).reshape(-1, 1)


    classifier = Pipeline([
        # FeatureUnion places the tf-idf features and the length column side by side.
        ('features', FeatureUnion([
            ('text', Pipeline([
                ('vectorizer', CountVectorizer(min_df=1, max_df=2)),
                ('tfidf', TfidfTransformer()),
            ])),
            ('length', Pipeline([
                ('count', FunctionTransformer(get_text_length, validate=False)),
            ]))
        ])),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    classifier.fit(X_train, y_train)
    predicted = classifier.predict(X_test)
    print(predicted)

This will add the length of the text to the features used by the classifier.
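To check that the length column really ends up in the feature matrix, you can fit a FeatureUnion on its own and inspect its output. The following is only a minimal sketch: the two-document corpus `docs` is made up for illustration, and it uses a plain `CountVectorizer` with default settings rather than the exact pipeline above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def get_text_length(x):
    # One column holding the character count of each document.
    return np.array([len(t) for t in x]).reshape(-1, 1)


# Hypothetical toy corpus, for illustration only.
docs = np.array(["nyc is nice", "london is in the uk"])

features = FeatureUnion([
    ("text", CountVectorizer()),
    ("length", FunctionTransformer(get_text_length, validate=False)),
])

# The bag-of-words block is sparse, so the combined result is sparse too.
combined = features.fit_transform(docs)

# Look up the fitted vectorizer to count the bag-of-words columns.
fitted_vectorizer = dict(features.transformer_list)["text"]
n_terms = len(fitted_vectorizer.vocabulary_)

print(combined.shape)               # (n_documents, n_terms + 1)
print(combined.toarray()[:, -1])    # last column: character counts per document
```

The last column holds raw character counts, which are on a much larger scale than tf-idf weights. Depending on the classifier, it may be worth scaling that column before fitting.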


I assume that the new feature you want to add is numeric. Here is my logic: first, transform the text into a sparse matrix using TfidfVectorizer or something similar. Then convert the sparse representation to a pandas DataFrame and add your new numeric column. In the end, you may want to convert the DataFrame back to a sparse matrix using scipy or any other module you are comfortable with. I assume that your data is in a pandas DataFrame called dataset, containing a 'Text Column' and a 'Numeric Column'. Here is some code.

    import pandas as pd
    from scipy import sparse
    from sklearn.feature_extraction.text import TfidfVectorizer

    dataset = pd.DataFrame({'Numeric Column': [2, 1],
                            'Text Column': ['Sample Text1', 'Sample Text2']})
    dataset.head()
    #    Numeric Column   Text Column
    # 0               2  Sample Text1
    # 1               1  Sample Text2

    tv = TfidfVectorizer(min_df=0.05, max_df=0.5, stop_words='english')
    X = tv.fit_transform(dataset['Text Column'])
    # On scikit-learn older than 1.0, use tv.get_feature_names() instead.
    vocab = tv.get_feature_names_out()
    # Dense copy of the tf-idf matrix, with one DataFrame column per term.
    X1 = pd.DataFrame(X.toarray(), columns=vocab)
    X1['Numeric Column'] = dataset['Numeric Column']
    X_sparse = sparse.csr_matrix(X1.values)

Finally, you may want to run

    print(X_sparse.shape)
    print(X.shape)

to ensure that the new column was successfully added. I hope this helps.
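If the vocabulary is large, converting the tf-idf matrix to a dense DataFrame can use a lot of memory. One possible alternative, not part of the answer above, is to append the numeric column directly with `scipy.sparse.hstack`. This is a sketch under the same assumptions about `dataset`, using a default `TfidfVectorizer` for brevity:

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Same hypothetical toy data as in the answer above.
dataset = pd.DataFrame({
    "Text Column": ["Sample Text1", "Sample Text2"],
    "Numeric Column": [2, 1],
})

tv = TfidfVectorizer()
X = tv.fit_transform(dataset["Text Column"])

# Turn the numeric column into an (n_samples, 1) sparse column.
numeric = sparse.csr_matrix(dataset[["Numeric Column"]].to_numpy(dtype=float))

# Stack the tf-idf features and the numeric column side by side,
# without ever building a dense copy of the tf-idf matrix.
X_sparse = sparse.hstack([X, numeric], format="csr")

print(X.shape)
print(X_sparse.shape)  # one more column than X
```

As before, comparing the two shapes confirms that exactly one column was added.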
