ElasticSearch - Searching For Human Names


First, I recreated your current configuration in Play: https://www.found.no/play/gist/867785a709b4869c5543

If you go there, switch to the "Analysis" tab to see how the text is transformed.

Note, for example, that Heaney ends up tokenized as [hn, heanei] with the search_analyzer and as [HN, heanei] with the index_analyzer. The metaphone term differs in case (hn vs. HN), so that term does not match.
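If you would rather inspect the tokens on your own cluster than in Play, the analyze API can be called directly. This is only a sketch: it assumes the play index and the metaphone analyzer from the example further down, and the older query-string form of _analyze that fits the Elasticsearch versions this answer targets (newer versions expect a JSON body instead). analyze_url is a hypothetical helper, not part of any tool.

```bash
#!/bin/bash
# Assumption: a local node; override ELASTICSEARCH_ENDPOINT to point elsewhere.
ELASTICSEARCH_ENDPOINT="${ELASTICSEARCH_ENDPOINT:-http://localhost:9200}"

# Hypothetical helper: build the _analyze URL for a given index and analyzer.
analyze_url() {
    local index=$1 analyzer=$2
    echo "$ELASTICSEARCH_ENDPOINT/$index/_analyze?analyzer=$analyzer&pretty"
}

# Print the URL that would be requested.
analyze_url "play" "metaphone"

# Against a running cluster, send the text to analyze as the request body:
# curl -XGET "$(analyze_url play metaphone)" -d 'Heaney'
```

Running the same text through both the index-time and the search-time analyzer makes mismatches like the hn/HN one easy to spot.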

The fuzzy query does not do query-time text analysis. Thus, you end up comparing Heavey with heanei. That is a Damerau-Levenshtein distance of 3 (H/h, v/n, y/i), which is more than your parameters allow.
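To see where that distance comes from, here is a small bash sketch of the optimal string alignment variant of Damerau-Levenshtein (insert, delete, substitute, and transpose adjacent characters). It is an illustration only, not Lucene's implementation, and osa_distance is a name made up for this example.

```bash
#!/bin/bash
# Edit distance with adjacent transpositions (optimal string alignment).
# The matrix is stored in a flat indexed array: cell (i, j) lives at i*w + j.
osa_distance() {
    local a=$1 b=$2
    local la=${#a} lb=${#b}
    local w=$((lb + 1))
    local -a d=()
    local i j cost best cand

    # Distance from or to the empty prefix is the prefix length.
    for ((i = 0; i <= la; i++)); do d[i*w]=$i; done
    for ((j = 0; j <= lb; j++)); do d[j]=$j; done

    for ((i = 1; i <= la; i++)); do
        for ((j = 1; j <= lb; j++)); do
            cost=1
            [[ "${a:i-1:1}" == "${b:j-1:1}" ]] && cost=0

            best=$(( d[(i-1)*w+j] + 1 ))          # deletion
            cand=$(( d[i*w+j-1] + 1 ))            # insertion
            (( cand < best )) && best=$cand
            cand=$(( d[(i-1)*w+j-1] + cost ))     # substitution or match
            (( cand < best )) && best=$cand

            # Transposition of two adjacent characters.
            if (( i > 1 && j > 1 )) &&
               [[ "${a:i-1:1}" == "${b:j-2:1}" && "${a:i-2:1}" == "${b:j-1:1}" ]]; then
                cand=$(( d[(i-2)*w+j-2] + 1 ))
                (( cand < best )) && best=$cand
            fi

            d[i*w+j]=$best
        done
    done
    echo "${d[la*w+lb]}"
}

osa_distance "Heavey" "heanei"   # prints 3: the unanalyzed query term vs. the indexed term
osa_distance "heavey" "heanei"   # prints 2: lowercasing alone already removes one edit
```

The comparison is case-sensitive on purpose, because that is what happens when the query term skips analysis.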

What you really want is the fuzzy functionality of match. Match does do query-time text analysis, and it has a fuzziness parameter.

As for the fuzziness, this changed a bit in Lucene 4. Before, it was typically specified as a float. Now it should be specified as the allowed edit distance (for example 1 or 2). There's an outstanding pull request to clarify that: https://github.com/elasticsearch/elasticsearch/pull/4332/files
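As a rough sketch of what that looks like, the helper below builds a match query with fuzziness given as an edit distance. fuzzy_match_body is a hypothetical helper, and the field name and query text are just the ones from this thread; the text is pasted into the JSON unescaped, so keep it to plain names.

```bash
#!/bin/bash
# Assumption: a local node; override ELASTICSEARCH_ENDPOINT to point elsewhere.
ELASTICSEARCH_ENDPOINT="${ELASTICSEARCH_ENDPOINT:-http://localhost:9200}"

# Hypothetical helper: print a match query body with fuzziness as a number of edits.
fuzzy_match_body() {
    local field=$1 text=$2 max_edits=$3
    printf '{"query":{"match":{"%s":{"query":"%s","fuzziness":%d}}}}\n' \
        "$field" "$text" "$max_edits"
}

# Print the body that would be sent.
fuzzy_match_body "pty_surname" "Heavey" 1

# Against a running cluster (newer versions require the Content-Type header):
# fuzzy_match_body "pty_surname" "Heavey" 1 |
#     curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" \
#          -H 'Content-Type: application/json' -d @-
```

Because match analyzes the query text first, "Heavey" is compared in its analyzed form, unlike with the fuzzy query.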

The reason why you are getting people without the forename Michael is that you are doing a bool.should. This has OR semantics: it's sufficient that one clause matches, but scoring-wise, the more clauses that match, the better. If the forename is required, put that clause in bool.must instead.
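A sketch of the difference, assuming a forename field called pty_forename (a guess, since your forename mapping isn't shown here) and a hypothetical helper name_query_body. The forename goes into must, and the surname variants sit in a nested bool that has only should clauses, so at least one of them has to match.

```bash
#!/bin/bash
# Hypothetical helper: require the forename, accept any one surname variant.
# Assumption: "pty_forename" is a made-up field name; use the one from your mapping.
# Values are pasted into the JSON unescaped, so keep them to plain names.
name_query_body() {
    local forename=$1 surname=$2
    cat <<EOF
{
    "query": {
        "bool": {
            "must": [
                { "match": { "pty_forename": "$forename" } },
                {
                    "bool": {
                        "should": [
                            { "match": { "pty_surname": { "query": "$surname" } } },
                            { "match": { "pty_surname": { "query": "$surname", "fuzziness": 1 } } }
                        ]
                    }
                }
            ]
        }
    }
}
EOF
}

# Print the body; pipe it to curl with -d @- to run it against a cluster.
name_query_body "Michael" "Heavey"
```

Documents that match both the exact and the fuzzy surname clause still score higher, so exact spellings float to the top.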

Lastly, combining all that filtering into the same term is not necessarily the best approach. For example, you cannot detect and boost exact spellings. What you should consider is using a multi_field to process the field in several ways.

Here's an example you can play with, with the curl commands to recreate it below. I'd skip using the "porter" stemmer entirely for this, however. I kept it just to show how multi_field works. Using a combination of match, match with fuzziness and phonetic matching should get you far. (Make sure you don't allow fuzziness when you do phonetic matching - or you'll get uselessly fuzzy matching. :-)

```bash
#!/bin/bash

export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Create indexes
# The field name (pty_surname) must be spelled the same way in the mapping,
# the documents and the queries, or the sub-fields will never be populated.

curl -XPUT "$ELASTICSEARCH_ENDPOINT/play" -d '{
    "settings": {
        "analysis": {
            "text": [
                "Michael",
                "Heaney",
                "Heavey"
            ],
            "analyzer": {
                "metaphone": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "my_metaphone"
                    ]
                },
                "porter": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "porter_stem"
                    ]
                }
            },
            "filter": {
                "my_metaphone": {
                    "encoder": "metaphone",
                    "replace": false,
                    "type": "phonetic"
                }
            }
        }
    },
    "mappings": {
        "jr": {
            "properties": {
                "pty_surname": {
                    "type": "multi_field",
                    "fields": {
                        "pty_surname": {
                            "type": "string",
                            "analyzer": "simple"
                        },
                        "metaphone": {
                            "type": "string",
                            "analyzer": "metaphone"
                        },
                        "porter": {
                            "type": "string",
                            "analyzer": "porter"
                        }
                    }
                }
            }
        }
    }
}'

# Index documents
# The bulk API expects one JSON object per line, with a trailing newline.

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"play","_type":"jr"}}
{"pty_surname":"Heaney"}
{"index":{"_index":"play","_type":"jr"}}
{"pty_surname":"Heavey"}
'

# Do searches
# Exact match, fuzzy match, phonetic match and stemmed match, OR-ed together.

curl -XPOST "$ELASTICSEARCH_ENDPOINT/_search?pretty" -d '{
    "query": {
        "bool": {
            "should": [
                {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "pty_surname": {
                                        "query": "heavey"
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname": {
                                        "query": "heavey",
                                        "fuzziness": 1
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname.metaphone": {
                                        "query": "heavey"
                                    }
                                }
                            },
                            {
                                "match": {
                                    "pty_surname.porter": {
                                        "query": "heavey"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}'
```
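If you want exact spellings to rank above fuzzy and phonetic hits, one option is to put a boost on the individual match clauses. The sketch below assumes the multi_field layout from the example, with a metaphone sub-field; boosted_surname_body is a hypothetical helper, and the boost values 3 and 2 are arbitrary starting points rather than tuned numbers.

```bash
#!/bin/bash
# Set this to the field name used in your own mapping.
FIELD="pty_surname"

# Hypothetical helper: weight exact > fuzzy > phonetic matches on one field.
# The surname is pasted into the JSON unescaped, so keep it to plain names.
boosted_surname_body() {
    local field=$1 surname=$2
    cat <<EOF
{
    "query": {
        "bool": {
            "should": [
                { "match": { "$field": { "query": "$surname", "boost": 3 } } },
                { "match": { "$field": { "query": "$surname", "fuzziness": 1, "boost": 2 } } },
                { "match": { "$field.metaphone": { "query": "$surname" } } }
            ]
        }
    }
}
EOF
}

# Print the body; pipe it to curl with -d @- to run it against a cluster.
boosted_surname_body "$FIELD" "heavey"
```

Note that the phonetic clause carries no fuzziness, in line with the warning above.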