Elasticsearch zScore Calculation (merging 2 document collections with different score distributions)

sorting elasticsearch merge statistics scoring

As I always tell my students, the answer to "which method is best" almost always begins with "it depends". I'll leave the mechanics of computing Z-scores aside; they are easy enough to look up here, in the ES documentation, or elsewhere online.

The best method of normalization depends on the original distribution and on which of its properties you need to preserve. Z-scores fit a Gaussian distribution best: they presuppose that the distribution is symmetric and that the s.d. describes the spread of a "relatively smooth" distribution.

Also, Z-scores are effective in that you can do the computation on any well-ordered metric. The transformation preserves ordering, continuity, and a variety of other topological and mathematical properties.
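As a rough illustration of the order-preserving property, here is a minimal Python sketch. The score lists and the helper names (`z_scores`, `rank_order`) are made up for the example; nothing here is an Elasticsearch API, and it assumes you have already pulled the raw scores out of each result set.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Map each value x to (x - mean) / sd, using the population sd."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("z-scores are undefined when every value is identical")
    return [(x - mu) / sd for x in values]

def rank_order(values):
    """Indices of values, from highest to lowest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

# Hypothetical relevance scores from two collections on different scales.
scores_a = [12.0, 9.5, 7.0, 3.5]
scores_b = [0.9, 0.4, 0.35, 0.1]

z_a = z_scores(scores_a)
z_b = z_scores(scores_b)

# Merge on the common z scale, best first.
merged = sorted(
    [("a", z) for z in z_a] + [("b", z) for z in z_b],
    key=lambda pair: pair[1],
    reverse=True,
)
print(merged)
```

Within each collection the ranking is the same before and after the transformation, because subtracting the mean and dividing by a positive s.d. cannot swap two values. Only the interleaving between collections is decided by the Z-scores.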

On the other hand ...

Consider for a moment a Poisson distribution with mu = sd = 1. You can have positive Z-scores without limit; those in the range of +1 to +3 are common enough. On the other side, a Z-score below -1 is impossible, because the smallest possible count is 0, although the values at -1 and 0 are well populated. If this isn't what you intend to represent, consider another method.
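To make the asymmetry concrete, the following sketch tabulates the Z-score and probability of each count for a Poisson distribution with rate 1, using the theoretical mean and s.d. rather than sample estimates. The function names are illustrative only.

```python
import math

LAM = 1.0  # Poisson rate: mean = LAM, sd = sqrt(LAM)

def poisson_pmf(k, lam=LAM):
    """Probability of observing exactly k events."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def z_of(k, lam=LAM):
    """Z-score of the count k under the theoretical mean and sd."""
    return (k - lam) / math.sqrt(lam)

# The lowest possible count, 0, gives the floor of the Z-scores.
for k in range(6):
    print(f"k={k}  z={z_of(k):+.1f}  P={poisson_pmf(k):.4f}")
```

The count 0 maps to a Z-score of -1, so nothing can fall below that, while the upper tail keeps going: counts of 2, 3 and 4 land at +1, +2 and +3 with steadily shrinking probability.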

Similarly, consider a bi-normal distribution with modes at +1 and -1, mu=0, sd = 2. There will be clusters of Z-scores around -0.5 and +0.5, relatively few at 0.

That said, an important consideration is whether the distributions you're merging are of similar shape. If so, then your choice of scaling transformation matters little, so long as you can either work with the merged Z-scores directly or invert the transformation, that is, "unpack" the resulting Z-scores to recover the original distribution shape.

If you merge a collection of Poisson distributions using Z-scores, you'll have little trouble unpacking them into one combined Poisson. If you try this with Gaussians, you'll also get good results. However, if you merge a collection of bi-normal distributions with wildly differing textures (focus on the valley depth around Z=0), you can wind up smearing your merge too widely; you'd want to pay attention to the modes as much as the mean, perhaps adjusting the Z-scores such that the modes fall at -1 and +1 in each transformation.

If you have differing distributions, also consider the number of observations in each. If you have 10,000 observations from a Poisson and 100 from a textbook normal distribution, the resulting merge will erase the normal.
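A small simulation can show how lopsided such a merge becomes. This is only a sketch under assumed parameters: a Poisson with rate 1, a standard normal, a fixed random seed, and a hand-rolled Poisson sampler (Knuth's multiplication method), since Python's standard library has none.

```python
import math
import random
from statistics import mean, pstdev

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def z_scores(values):
    mu, sd = mean(values), pstdev(values)
    return [(x - mu) / sd for x in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
big = [poisson_sample(1.0, rng) for _ in range(10_000)]
small = [rng.gauss(0.0, 1.0) for _ in range(100)]

# Z-score each sample separately, then merge and rank, best first.
merged = [("poisson", z) for z in z_scores(big)]
merged += [("normal", z) for z in z_scores(small)]
merged.sort(key=lambda pair: pair[1], reverse=True)

normal_share = len(small) / len(merged)
normal_in_top = sum(1 for source, _ in merged[:100] if source == "normal")
print(f"normal share of merge: {normal_share:.2%}")
print(f"normal items in the top 100: {normal_in_top}")
```

The normal sample makes up under 1% of the merged set, and the long upper tail of the Poisson tends to fill the top of the ranking, so the normal sample's shape is barely visible in the result.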

These problems with distributions of different shapes, but merged into the same space, should really be the only problem with using Z-scores. If you are merging such distributions, then please give us more details, as the merging method will depend on some of the considerations I've mentioned here.

These are not normal distributions at all; they appear to be something in the exponential-geometric family. Being in the same family makes them good candidates for merging.

However, the difference in shape makes them poor candidates for merging via z-score: the mean is far too sensitive to the largest handful of elements. Instead, I suggest that you take the logarithm of each number (any base), and then turn those values into z-scores. To restore the combined shape, raise a chosen base (2, 10, e) to the z-score power. If you don't like the tiny values, simply multiply everything by a chosen scale factor -- perhaps enough to restore the actual values of one original distribution or the other.
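The log-then-z-score suggestion might look roughly like this in Python. The score lists are invented, the base is e, and the scale factor (the geometric mean of the first collection) is just one possible choice, not a recommendation. It also assumes every score is strictly positive, since the logarithm is undefined otherwise.

```python
import math
from statistics import mean, pstdev

def log_z_scores(scores):
    """Z-scores of the natural logs; every score must be > 0."""
    logs = [math.log(s) for s in scores]
    mu, sd = mean(logs), pstdev(logs)
    return [(v - mu) / sd for v in logs]

# Hypothetical long-tailed scores from two collections.
scores_a = [120.0, 35.0, 14.0, 9.0, 6.5, 5.0]
scores_b = [3.2, 1.1, 0.6, 0.45, 0.4, 0.3]

# Merge on the log z scale, best first.
merged = sorted(
    [("a", s, z) for s, z in zip(scores_a, log_z_scores(scores_a))]
    + [("b", s, z) for s, z in zip(scores_b, log_z_scores(scores_b))],
    key=lambda row: row[2],
    reverse=True,
)

# Restore a long-tailed shape: base ** z, times an optional scale factor.
# Assumption: the scale is the geometric mean of collection "a".
scale = math.exp(mean([math.log(s) for s in scores_a]))
restored = [(source, scale * math.exp(z)) for source, _, z in merged]

for (source, original, z), (_, value) in zip(merged, restored):
    print(f"{source}  original={original:>7.2f}  z={z:+.2f}  restored={value:.2f}")
```

Because the logarithm is monotonic, the ranking inside each collection is unchanged; taking logs first just stops the few largest scores from dominating the mean and s.d.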


This is an old post, but I had a similar requirement and we went with the same approach. Elasticsearch didn't have this feature, so I created a small plugin that normalizes the scores returned from Elasticsearch using a min-max or z-score normalizer.

https://github.com/bkatwal/elasticsearch-score-normalizer
