
How to compare sentence similarities using embeddings from BERT


In addition to an already great accepted answer, I want to point you to sentence-BERT, which discusses the similarity aspect and the implications of specific metrics (like cosine similarity) in greater detail. They also have a very convenient implementation online. The main advantage here is that they seemingly gain a lot of processing speed compared to a "naive" sentence embedding comparison, but I am not familiar enough with the implementation itself to say exactly how.
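As a rough sketch, this is what a comparison with their sentence-transformers package can look like. Note that the model name ("all-MiniLM-L6-v2") is just one of their pretrained checkpoints and the sentences are placeholders, and this assumes a reasonably recent version of the package:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# One of the pretrained SBERT checkpoints (my choice, not from the question)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The cat sits on the mat.", "A cat is sitting on a mat."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarity between the two sentence embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity.item())
```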

Importantly, there is generally also a more fine-grained distinction in what kind of similarity you want to look at. Specifically for that, there is a great discussion in one of the task papers from SemEval 2014 (the SICK dataset), which goes into more detail about this. From your task description, I am assuming that you are already using data from one of the later SemEval tasks, which also extended this to multilingual similarity.


You can use the [CLS] token as a representation for the entire sequence. This token is prepended to your sentence during the preprocessing step and is typically used for classification tasks (see figure 2 and paragraph 3.2 in the BERT paper).

It is the very first token of the encoded sequence, so its embedding is the first vector in the model output.
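As a sketch with the huggingface transformers library (the checkpoint name and the example sentence are my own choices for illustration):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sits on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden); the [CLS] token
# sits at index 0 along the sequence axis
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # torch.Size([1, 768])
```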

Alternatively, you can take the average vector over the sequence (like you say, over the first(?) axis), which can yield better results according to the huggingface documentation (3rd tip).
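A sketch of that mean pooling, masking out padding tokens so they do not skew the average (again, the checkpoint and sentences are placeholders):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(["The cat sits on the mat.", "Dogs bark."],
                   return_tensors="pt", padding=True)
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (batch, seq, hidden)

# Zero out padding positions via the attention mask, then divide the sum
# by the number of real tokens in each sentence
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq, 1)
mean_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_embeddings.shape)  # torch.Size([2, 768])
```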

Note that BERT was not designed for sentence similarity using the cosine distance, though in my experience it does yield decent results.
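For completeness, a minimal way to compare two such vectors with cosine similarity in plain PyTorch; `emb_a` and `emb_b` are placeholders for sentence embeddings obtained with either strategy above:

```python
import torch
import torch.nn.functional as F

emb_a = torch.randn(768)  # placeholder for a real sentence embedding
emb_b = torch.randn(768)

# Cosine similarity lies in [-1, 1]; higher means more similar
similarity = F.cosine_similarity(emb_a, emb_b, dim=0)
print(similarity.item())
```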