plot a document tfidf 2D graph

python numpy scipy scikit-learn k-means

When you use Bag of Words, each of your sentences gets represented in a high dimensional space of length equal to the vocabulary. If you want to represent this in 2D you need to reduce the dimension, for example using PCA with two components:

from sklearn.datasets import fetch_20newsgroupsfrom sklearn.feature_extraction.text import CountVectorizer, TfidfTransformerfrom sklearn.decomposition import PCAfrom sklearn.pipeline import Pipelineimport matplotlib.pyplot as pltnewsgroups_train = fetch_20newsgroups(subset='train',                                       categories=['alt.atheism', 'sci.space'])pipeline = Pipeline([    ('vect', CountVectorizer()),    ('tfidf', TfidfTransformer()),])        X = pipeline.fit_transform(newsgroups_train.data).todense()pca = PCA(n_components=2).fit(X)data2D = pca.transform(X)plt.scatter(data2D[:,0], data2D[:,1], c=data.target)plt.show()              #not required if using ipython notebook

data2d

Now you can for example calculate and plot the cluster enters on this data:

from sklearn.cluster import KMeanskmeans = KMeans(n_clusters=2).fit(X)centers2D = pca.transform(kmeans.cluster_centers_)plt.hold(True)plt.scatter(centers2D[:,0], centers2D[:,1],             marker='x', s=200, linewidths=3, c='r')plt.show()              #not required if using ipython notebook

enter image description here

python numpy scipy scikit-learn k-means

Just assign a variable to the labels and use that to denote color. ex km = Kmeans().fit(X) clusters = km.labels_.tolist()then c=clusters

CodeHunter

plot a document tfidf 2D graph

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last