
How to compare sentence similarities using embeddings from BERT


In addition to an already great accepted answer, I want to point you to sentence-BERT, which discusses the similarity aspect and the implications of specific metrics (like cosine similarity) in greater detail. They also have a very convenient implementation online. The main advantage here is that they seemingly gain a lot of processing speed compared to a "naive" sentence embedding comparison, but I am not familiar enough with the implementation itself to say exactly how.
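As a rough sketch, this is what a comparison with their sentence-transformers package can look like. Note that the model name ("all-MiniLM-L6-v2") is just one of their pretrained checkpoints and the sentences are placeholders, and this assumes a reasonably recent version of the package:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# One of the pretrained SBERT checkpoints (my choice, not from the question)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The cat sits on the mat.", "A cat is sitting on a mat."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarity between the two sentence embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity.item())
```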

Importantly, there is generally also a more fine-grained distinction in what kind of similarity you want to look at. Specifically for that, there is a great discussion in one of the task papers from SemEval 2014 (the SICK dataset), which goes into more detail about this. From your task description, I am assuming that you are already using data from one of the later SemEval tasks, which also extended this to multilingual similarity.


You can use the [CLS] token as a representation for the entire sequence. This token is prepended to your sentence during the preprocessing step and is typically used for classification tasks (see figure 2 and paragraph 3.2 in the BERT paper).

It is the very first token of the encoded sequence, so its embedding is the first vector in the model output.
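As a sketch with the huggingface transformers library (the checkpoint name and the example sentence are my own choices for illustration):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sits on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden); the [CLS] token
# sits at index 0 along the sequence axis
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```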

Alternatively, you can take the average vector over the sequence (like you say, over the first(?) axis), which can yield better results according to the huggingface documentation (3rd tip).
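A sketch of that mean pooling, masking out padding tokens so they do not skew the average (again, the checkpoint and sentences are placeholders):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(["The cat sits on the mat.", "Dogs bark."],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq, hidden)

# Zero out padding positions via the attention mask, then divide the sum
# by the number of real tokens in each sentence
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq, 1)
mean_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_embeddings.shape)  # torch.Size([2, 768])
```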

Note that BERT was not designed for sentence similarity using the cosine distance, though in my experience it does yield decent results.
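For completeness, a minimal way to compare two such vectors with cosine similarity in plain PyTorch; `emb_a` and `emb_b` are placeholders for sentence embeddings obtained with either strategy above:

```python
import torch
import torch.nn.functional as F

emb_a = torch.randn(768)  # placeholder for a real sentence embedding
emb_b = torch.randn(768)

# Cosine similarity lies in [-1, 1]; higher means more similar
similarity = F.cosine_similarity(emb_a, emb_b, dim=0)
print(similarity.item())
```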