How to get term vector info for the whole index in Elasticsearch, not at the document level?


Term vectors expose several statistics that are computed across all documents in a shard (why a shard and not the index? Keep reading):

  • total term frequency (how often a term occurs in all documents)
  • document frequency (the number of documents containing the current term)
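To make the two definitions concrete, here is a minimal Python sketch over a made-up three-document corpus. The corpus and the plain lowercase whitespace split are assumptions for illustration only; an Elasticsearch analyzer will tokenize differently.

```python
from collections import Counter

# Hypothetical toy corpus; tokenization is a plain lowercase split.
docs = [
    "twitter test test test",
    "another twitter test",
    "something else",
]

ttf = Counter()       # total term frequency: occurrences across all documents
doc_freq = Counter()  # document frequency: number of documents containing the term

for doc in docs:
    tokens = doc.lower().split()
    ttf.update(tokens)            # count every occurrence
    doc_freq.update(set(tokens))  # count each document at most once per term

print(ttf["test"], doc_freq["test"])  # 4 2
```

The term "test" occurs four times in total but in only two documents, which is exactly the difference between the two statistics.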

To get this to work efficiently, enable term vectors for the field you want to analyze. This is best done by adding the term_vector parameter to the field definition when setting up the mapping: the term vectors are then computed and stored at index time, which speeds up retrieval. (If they are not stored, the term vectors API can still compute them on the fly, at a cost per request.)
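As a rough sketch of what such a mapping might look like, the snippet below builds the request body in Python. The field name "text" is taken from the example request in this answer; the helper name build_mapping is hypothetical, and the body assumes typeless mappings (7.x and later), so older versions would nest "properties" under a type name. Check the accepted term_vector values against the documentation for your version.

```python
import json

def build_mapping(field: str, term_vector: str = "with_positions_offsets") -> dict:
    """Return a mapping body that stores term vectors for one text field.

    Hypothetical helper: it only assembles a dict, it does not talk to a cluster.
    """
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "term_vector": term_vector,  # store term vectors at index time
                }
            }
        }
    }

body = build_mapping("text")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of the index-creation request (for example PUT /twitter).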

Then, when retrieving term vectors, set the "term_statistics" parameter to true and the total term frequency (ttf) and document frequency are included in the output. See this example:

GET /twitter/_doc/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
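Once the response comes back, the per-term numbers can be pulled out with a few lines of Python. This is a minimal sketch: the helper name extract_term_stats is hypothetical, the sample response is made up and trimmed, and the key layout ("term_vectors", then the field, then "terms") is my understanding of the documented response format, so verify it against a real response from your cluster.

```python
def extract_term_stats(response: dict, field: str) -> dict:
    """Map each term to its (ttf, doc_freq) pair for one field.

    Terms without statistics (e.g. term_statistics was not requested)
    map to (None, None).
    """
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {
        term: (info.get("ttf"), info.get("doc_freq"))
        for term, info in terms.items()
    }

# Made-up sample response, trimmed to the keys used here.
sample = {
    "_index": "twitter",
    "_id": "1",
    "found": True,
    "term_vectors": {
        "text": {
            "field_statistics": {"sum_doc_freq": 6, "doc_count": 2, "sum_ttf": 8},
            "terms": {
                "test": {"doc_freq": 2, "ttf": 4, "term_freq": 3},
                "twitter": {"doc_freq": 2, "ttf": 2, "term_freq": 1},
            },
        }
    },
}

stats = extract_term_stats(sample, "text")
print(stats["test"])  # (4, 2)
```

Note that this only covers the terms of the one requested document; it does not enumerate every term in the index.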

Be aware, however, that term vector statistics, and the "more like this" query that relies on them, are not accurate if the index uses multiple shards. Say it ain't so! The documentation is explicit:

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context.
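To see why the numbers are only relative, consider this toy simulation. It is not how Elasticsearch actually routes documents: the modulo routing on integer ids and the corpus are assumptions made up for illustration. It only shows that a document frequency computed on the shard holding one document can differ from the index-wide figure.

```python
from collections import Counter

def doc_freq(docs):
    """Count, for each term, how many of the given documents contain it."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return df

# Hypothetical corpus keyed by integer document id.
corpus = {
    1: "twitter test",
    2: "test again",
    3: "twitter feed",
    4: "test test test",
}

NUM_SHARDS = 2
shards = {n: [] for n in range(NUM_SHARDS)}
for doc_id, text in corpus.items():
    shards[doc_id % NUM_SHARDS].append(text)  # simplistic stand-in for routing

global_df = doc_freq(corpus.values())               # what you would like to get
shard_of_doc_1 = doc_freq(shards[1 % NUM_SHARDS])   # what document 1's shard sees

print(global_df["test"])       # 3
print(shard_of_doc_1["test"])  # 1
```

Asking for the statistics via document 1 reports that "test" appears in one document, while three documents in the whole corpus contain it.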

If you want accurate statistics you must set up your index with a single shard, which largely defeats the purpose of using Elasticsearch, since a single primary shard cannot be split across nodes. Another Stack Overflow poster fell into this trap. If anybody knows of a solution, please post.