How Do I Apply TF-IDF When I Only Have a Subset of the Total Documents?


As you have noticed, Elasticsearch is not built to run in memory-constrained environments. If you want to use Elasticsearch but can't set up a dedicated machine, you might consider a hosted search solution (e.g. AWS Elasticsearch, Elastic Cloud, or Algolia). Those solutions still cost money, though!

There are two great alternatives that require a bit more work (but not as much as writing your own search solution). The first is Lucene, the actual search engine that Elasticsearch is built on top of. It still loads quite a bit of its underlying data structures into memory, so, depending on how much data you want to index, it could still run out of memory. But given the same amount of memory, you should be able to fit quite a bit more data in a single Lucene index than in an entire Elasticsearch instance.

The other alternative, which I know slightly less about, is Sphinx. It is also a search engine, and it lets you specify how much memory it is allowed to use; the rest of the data is stored on disk.
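
For reference, here is a minimal sketch of what such a memory cap might look like in sphinx.conf. I'm assuming the setting meant above is the indexer's mem_limit directive, which caps the RAM used while building an index. The 128M value is only an illustrative choice, not a recommendation, and the fragment assumes a source and an index are defined elsewhere in the file.

```text
# sphinx.conf (fragment; source and index sections are assumed to exist elsewhere)
indexer
{
    # Upper bound on the RAM the indexer may use while building an index.
    # 128M is an arbitrary example value; tune it to the machine.
    mem_limit = 128M
}
```

Note that this limits memory during indexing. How much memory the search daemon needs at query time depends on other settings and on the index itself, so check the Sphinx documentation for your version before relying on it.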