What are some good ways of estimating 'approximate' semantic similarity between sentences? What are some good ways of estimating 'approximate' semantic similarity between sentences? python python

What are some good ways of estimating 'approximate' semantic similarity between sentences?


Latent Semantic Modeling could be useful. It's basically just yet another application of the Singular Value Decomposition. The SVDLIBC is a pretty nice C implementation of this approach, which is an oldie but a goodie, and there are even python binding in the form of sparsesvd.


I suggest you try a topic modelling framework such as Latent Dirichlet Allocation (LDA). The idea there is that documents (in your case sentences, which might prove to be a problem) are generated from a set of latent (hidden) topics; LDA retrieves those topics, representing them by word clusters.

An implementation of LDA in Python is available as part of the free Gensim package. You could try to apply it to your sentences, then run k-means on its output.