How to convert a list of strings into a tensor in pytorch?

python numpy pytorch

Unfortunately, you can't right now: PyTorch tensors only hold numeric and boolean types, not strings. I don't think string support would be a good idea either, since it would make PyTorch clumsy. A popular workaround is to convert the strings into numeric labels using sklearn.

Here is a short example:

    from sklearn import preprocessing
    import torch

    labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']

    le = preprocessing.LabelEncoder()
    targets = le.fit_transform(labels)
    # targets: array([0, 1, 3, 2, 4])  (classes are numbered in sorted order)

    targets = torch.as_tensor(targets)
    # targets: tensor([0, 1, 3, 2, 4])

Since you may need to convert between the original labels and the encoded ones later, keep the fitted encoder le around; le.inverse_transform maps the integer codes back to the strings.
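The round trip back to strings could look like the sketch below. The `predicted` tensor is made up for illustration and stands in for class indices produced by a model; everything else follows the example above.

```python
import torch
from sklearn import preprocessing

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']

le = preprocessing.LabelEncoder()
targets = torch.as_tensor(le.fit_transform(labels))

# Hypothetical class indices, e.g. the argmax of a model's output.
predicted = torch.tensor([3, 0, 4])

# inverse_transform works on array-likes of integer codes, so hand it
# the tensor as a NumPy array and turn the result into plain strings.
decoded = le.inverse_transform(predicted.numpy()).tolist()
print(decoded)  # ['mouse', 'cat', 'pandas']
```

The codes follow the sorted order of the distinct labels, which is why 3 maps to 'mouse' rather than 'elephant'.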


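The trick is to first find the maximum encoded length of the words in the list, and then, in a second loop, fill a zero-initialized tensor so that shorter words end up padded with zeros. Note that UTF-8 is a variable-width encoding: ASCII characters take one byte each, while the Hebrew letters in the example take two bytes each, so a word's byte length can exceed its character count.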

    In[]
    import torch

    words = ['שלום', 'beautiful', 'world']
    max_l = 0
    ts_list = []
    for w in words:
        ts_list.append(torch.ByteTensor(list(bytes(w, 'utf8'))))
        max_l = max(ts_list[-1].size()[0], max_l)

    w_t = torch.zeros((len(ts_list), max_l), dtype=torch.uint8)
    for i, ts in enumerate(ts_list):
        w_t[i, 0:ts.size()[0]] = ts
    w_t

    Out[]
    tensor([[215, 169, 215, 156, 215, 149, 215, 157,   0],
            [ 98, 101,  97, 117, 116, 105, 102, 117, 108],
            [119, 111, 114, 108, 100,   0,   0,   0,   0]], dtype=torch.uint8)
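Going the other way, from the padded byte tensor back to strings, might be sketched as follows. The helper names `encode_words` and `decode_words` are hypothetical, not part of PyTorch, and the sketch assumes a non-empty list whose strings do not end in a NUL character, since trailing zero bytes are treated as padding.

```python
import torch


def encode_words(words):
    """Encode strings as a zero-padded uint8 tensor of UTF-8 bytes."""
    encoded = [w.encode('utf8') for w in words]
    max_len = max(len(b) for b in encoded)
    out = torch.zeros((len(encoded), max_len), dtype=torch.uint8)
    for i, b in enumerate(encoded):
        out[i, :len(b)] = torch.tensor(list(b), dtype=torch.uint8)
    return out


def decode_words(tensor):
    """Invert encode_words, treating trailing zero bytes as padding."""
    words = []
    for row in tensor.tolist():
        raw = bytes(row).rstrip(b'\x00')
        words.append(raw.decode('utf8'))
    return words


words = ['שלום', 'beautiful', 'world']
w_t = encode_words(words)
restored = decode_words(w_t)
print(restored == words)  # True
```

Stripping only trailing zeros keeps the decoding unambiguous for ordinary text, because UTF-8 never uses a zero byte inside a multi-byte character.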


If you don't want to use sklearn, another solution is to keep your original list and create an extra tensor of indices, which you can use to look up the original values afterwards. I needed exactly this when I had to keep track of the original string while batching the tokenized strings.

Example below:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    labels = ['cat', 'dog', 'mouse']
    # one integer index per label: tensor([0, 1, 2])
    torch_idx = torch.arange(len(labels))

    # do whatever you would like with torch, e.g. pass it to a DataLoader
    dataset = TensorDataset(torch_idx)
    loader = DataLoader(dataset, batch_size=1, shuffle=True)

    for batch in loader:
        print(batch[0])
        print(labels[batch[0].item()])

    # output (order varies between runs because shuffle=True):
    # tensor([0])
    # cat
    # tensor([1])
    # dog
    # tensor([2])
    # mouse
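Calling .item() only works on a one-element tensor, so the lookup above is tied to batch_size=1. For larger batches, a minimal sketch (assuming integer indices built with torch.arange, and shuffle=False so the result is reproducible) could convert each index batch to a Python list first:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

labels = ['cat', 'dog', 'mouse', 'elephant', 'pandas']
torch_idx = torch.arange(len(labels))

loader = DataLoader(TensorDataset(torch_idx), batch_size=2, shuffle=False)

batches = []
for (idx_batch,) in loader:
    # idx_batch is a 1-D integer tensor; tolist() yields plain ints
    # that can index the original Python list.
    batches.append([labels[i] for i in idx_batch.tolist()])

print(batches)  # [['cat', 'dog'], ['mouse', 'elephant'], ['pandas']]
```

The last batch is shorter because five items do not divide evenly into batches of two.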

For my specific use case, the code looked like this:

    # tokenize_sentences, tokenizer, model, sentences, labels and max_length
    # come from the surrounding project and are not shown here
    input_ids, attention_masks, labels = tokenize_sentences(tokenizer, sentences, labels, max_length)

    # create an index tensor to keep track of the original sentence index
    torch_idx = torch.arange(len(sentences))

    dataset = TensorDataset(input_ids, attention_masks, labels, torch_idx)
    loader = DataLoader(dataset, batch_size=1, shuffle=True)

    for batch in loader:
        _, logit = model(batch[0],
                         token_type_ids=None,
                         attention_mask=batch[1],
                         labels=batch[2])
        pred_flat = logit.detach().argmax(dim=1).flatten()
        print(pred_flat)
        print(batch[2])
        if pred_flat == batch[2]:
            print("\nThe following sentence was predicted correctly:")
            print(sentences[int(batch[3].item())])
