
Extracting one-hot vector from text


Several packages perform all of these steps in a single call, for example scikit-learn's OneHotEncoder: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
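As a rough illustration of the single-call approach, here is a minimal sketch using OneHotEncoder on whitespace-split tokens. The sentence is a made-up example, and the encoder is left at its default settings, which return a sparse matrix that is then converted with toarray().

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example sentence (not from the original question)
sentence = "the dog chased the cat"

# OneHotEncoder expects a 2-D array: one row per sample, one column per feature
tokens = np.array(sentence.split()).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it a dense NumPy array
one_hot = encoder.fit_transform(tokens).toarray()

# The learned vocabulary, in the column order used by one_hot
vocab = list(encoder.categories_[0])

print(vocab)
print(one_hot)
```

Note that this layout has one row per token and one column per vocabulary word, which is the transpose of the preallocated NumPy layout.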

Alternatively, if you already have your vocabulary and the word indices for each sentence, you can build the one-hot encoding by preallocating an array of zeros and using integer array indexing. In the following, text_idx is a list of integers (one vocabulary index per token) and vocab is a list that maps integer indices to words.

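    import numpy as np

    vocab_size = len(vocab)
    text_length = len(text_idx)

    # One row per vocabulary entry, one column per token
    one_hot = np.zeros((vocab_size, text_length))
    # For every position j, set row text_idx[j] of column j to 1
    one_hot[text_idx, np.arange(text_length)] = 1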
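To make the inputs concrete, here is a small end-to-end sketch of the same indexing trick. The sentence and the helper name word_to_idx are assumptions for illustration; the vocabulary is simply the sorted set of tokens.

```python
import numpy as np

# Hypothetical example sentence (not from the original question)
tokens = "the dog chased the cat".split()

# Build a sorted vocabulary and a word -> index lookup (illustrative names)
vocab = sorted(set(tokens))
word_to_idx = {word: i for i, word in enumerate(vocab)}
text_idx = [word_to_idx[word] for word in tokens]

# Preallocate, then set one entry per column with integer array indexing
one_hot = np.zeros((len(vocab), len(text_idx)))
one_hot[text_idx, np.arange(len(text_idx))] = 1

# Decoding: the row index of the 1 in each column recovers the word
decoded = [vocab[i] for i in one_hot.argmax(axis=0)]
print(decoded)
```

Each column of one_hot is the one-hot vector of one token, so taking argmax along axis 0 gives back text_idx.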


To create a one-hot vector, you first need to build a unique vocabulary from the text.

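    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # Sorted list of unique words (a set has no .index() and no stable order)
    vocab = sorted(set(vocab))
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(vocab)
    one_hot_encoder = OneHotEncoder()
    doc = "dog"
    index = vocab.index(doc)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    # fit_transform returns a sparse matrix by default; densify it, then take the row for doc
    one_hot_vector = one_hot_encoder.fit_transform(integer_encoded).toarray()[index]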


The 7th value is the "." (dot) in your sentences. It is separated from the preceding word by a " " (space), so split() counts it as a word.
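The effect is easy to reproduce. The sentence below is a made-up stand-in for the one in the question, and dropping tokens with no letters or digits is just one possible way to deal with it.

```python
# Hypothetical sentence with the final dot separated by a space
sentence = "the cat sat on the mat ."

# split() with no arguments splits on whitespace, so "." becomes its own token
tokens = sentence.split()
print(len(tokens), tokens)

# One possible fix: keep only tokens that contain at least one letter or digit
words = [t for t in tokens if any(ch.isalnum() for ch in t)]
print(len(words), words)
```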