Extracting one-hot vector from text

python numpy pandas vector nlp

Several packages do all of these steps in a single call, for example scikit-learn's OneHotEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
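As a rough sketch of the library route, the following one-hot encodes the tokens of a single sentence with OneHotEncoder. The sentence is a made-up example, and the encoder is left at its default settings (sparse output), so the result is converted with toarray().

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical input sentence, used only for illustration
sentence = "the cat sat on the mat"

# OneHotEncoder expects a 2D array: one row per token, one column
tokens = np.array(sentence.split()).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
one_hot = encoder.fit_transform(tokens).toarray()

# The learned vocabulary, in sorted order; column j corresponds to learned_vocab[j]
learned_vocab = encoder.categories_[0]

print(learned_vocab)
print(one_hot)
```

Here each row is one token of the sentence and each column is one vocabulary word, which is the transpose of the layout used in the NumPy approach.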

Alternatively, if you already have your vocabulary and the word indices for each sentence, you can build the one-hot encoding yourself by preallocating an array of zeros and using integer array indexing. In the following, text_idx is a list of integers (one vocabulary index per word in the text) and vocab is a list mapping integer indices to words.

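    import numpy as np

    vocab_size = len(vocab)
    text_length = len(text_idx)
    # One row per vocabulary word, one column per word position in the text
    one_hot = np.zeros((vocab_size, text_length))
    # Set a single 1 in each column, at the row given by that word's index
    one_hot[text_idx, np.arange(text_length)] = 1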
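If vocab and text_idx are not available yet, one way to derive them from raw text is sketched below. The sentence, the whitespace tokenization, and the helper name word_to_idx are assumptions made for illustration, not part of the original answer.

```python
import numpy as np

# Hypothetical input text, used only for illustration
text = "the dog chased the cat"
words = text.split()

# Sorted list of unique words; position in the list is the word's index
vocab = sorted(set(words))
word_to_idx = {word: i for i, word in enumerate(vocab)}

# One vocabulary index per word in the text
text_idx = [word_to_idx[word] for word in words]

# Same preallocate-and-index step: rows are vocabulary words, columns are positions
one_hot = np.zeros((len(vocab), len(text_idx)))
one_hot[text_idx, np.arange(len(text_idx))] = 1

print(vocab)
print(text_idx)
print(one_hot)
```

Each column of one_hot then contains exactly one 1, in the row of the word found at that position.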


To create a one-hot vector, you first need to build a vocabulary of the unique words in the text.

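    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # Keep the unique words as a sorted list; a set has no .index() method
    vocab = sorted(set(vocab))
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(vocab)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    # The encoder returns a sparse matrix by default; toarray() makes it dense
    one_hot_encoder = OneHotEncoder()
    one_hot = one_hot_encoder.fit_transform(integer_encoded).toarray()
    # Look up the one-hot vector for a single word
    doc = "dog"
    index = vocab.index(doc)
    one_hot_vector = one_hot[index]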


The 7th value is the "." (dot) in your sentence. It is separated from the preceding word by a " " (space), so split() counts it as a word of its own.
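The effect can be reproduced with a small sketch. The sentence below is a stand-in, since the original sentences are not shown here, and stripping punctuation before building the vocabulary is only one possible fix.

```python
import string

# Hypothetical sentence with a space before the final dot
sentence = "the quick brown fox jumps high ."

# split() breaks on whitespace, so the lone "." becomes a token
raw_tokens = sentence.split()
print(len(raw_tokens), raw_tokens[-1])

# Strip punctuation from each token and drop tokens that end up empty
stripped = [token.strip(string.punctuation) for token in raw_tokens]
words = [token for token in stripped if token]
print(len(words), words)
```

With the dot removed, the vocabulary size matches the number of actual words.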
