Using counts and TF-IDF as features with scikit-learn
The error didn't come from the FeatureUnion; it came from the TfidfTransformer. You should use TfidfVectorizer instead of TfidfTransformer: the transformer expects a matrix of term counts (such as the output of CountVectorizer) as input, not plain text, hence the TypeError.
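To make the difference concrete, here is a minimal sketch on a made-up three-sentence corpus (the `docs` list is purely illustrative, not from the question). It assumes default settings for all three classes and shows that TfidfTransformer works once it is fed counts, while TfidfVectorizer takes the raw text directly:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Two-step route: raw text -> term counts -> TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: raw text -> TF-IDF weights.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

# With default settings both routes should give the same matrix.
same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(same)
```

Inside a FeatureUnion every branch receives the same raw input, so each branch has to be able to consume text by itself, which is why the vectorizer is the right choice there.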
Also, your test sentence is too small to test TF-IDF meaningfully, so try a bigger corpus. Here's an example:
from nltk.corpus import brown
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB

# Let's get more text from NLTK
text = [" ".join(i) for i in brown.sents()[:100]]

# I'm just gonna assign random tags.
labels = ['yes']*50 + ['no']*50

count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
tf_transformer = TfidfVectorizer(use_idf=True)
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tf_transformer)]).fit_transform(text)

classifier = MultinomialNB()
classifier.fit(combined_features, labels)
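If you later want to classify new text, one possible variation (not part of the code above) is to wrap the FeatureUnion and the classifier in a Pipeline, so the fitted vectorizers are reused at prediction time. The sketch below uses a tiny invented dataset and default vectorizer settings instead of the Brown corpus; the texts, labels, and step names are assumptions chosen for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Hypothetical toy data, for illustration only.
train_text = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
    "the market rallied after the report",
]
train_labels = ["pets", "pets", "finance", "finance"]

# Both vectorizers see the raw text; their outputs are stacked side by side.
model = Pipeline([
    ("features", FeatureUnion([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("classifier", MultinomialNB()),
])
model.fit(train_text, train_labels)

# The pipeline applies the already-fitted vectorizers to unseen text.
predictions = model.predict(["the cat chased the dog"])
print(predictions)
```

Calling fit_transform directly on the FeatureUnion, as in the example above, returns only the feature matrix, so keeping the union inside a pipeline is one way to hold on to the fitted object for later calls to predict.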