Filename search with ElasticSearch


You have various problems with what you pasted:

1) Incorrect mapping

When creating the index, you specify:

    "mappings": {
        "files": {

But your type is actually file, not files. If you checked the mapping, you would see that immediately:

    curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'

    # {
    #    "files" : {
    #       "files" : {
    #          "properties" : {
    #             "filename" : {
    #                "type" : "string",
    #                "analyzer" : "filename_analyzer"
    #             }
    #          }
    #       },
    #       "file" : {
    #          "properties" : {
    #             "filename" : {
    #                "type" : "string"
    #             }
    #          }
    #       }
    #    }
    # }

2) Incorrect analyzer definition

You have specified the lowercase tokenizer, but that tokenizer splits on, and discards, anything that isn't a letter (see the docs), so your numbers are being removed completely.

You can check this with the analyze API:

    curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&text=My_file_2012.01.13.doc&tokenizer=lowercase'

    # {
    #    "tokens" : [
    #       {
    #          "end_offset" : 2,
    #          "position" : 1,
    #          "start_offset" : 0,
    #          "type" : "word",
    #          "token" : "my"
    #       },
    #       {
    #          "end_offset" : 7,
    #          "position" : 2,
    #          "start_offset" : 3,
    #          "type" : "word",
    #          "token" : "file"
    #       },
    #       {
    #          "end_offset" : 22,
    #          "position" : 3,
    #          "start_offset" : 19,
    #          "type" : "word",
    #          "token" : "doc"
    #       }
    #    ]
    # }
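If you don't have a cluster handy, the effect can be approximated offline. This is only a rough Python stand-in for the lowercase tokenizer (the function name `lowercase_tokenizer` is made up for illustration, and the regex is my approximation of "runs of letters"), but it shows the digits vanishing in the same way:

```python
import re


def lowercase_tokenizer(text: str) -> list[str]:
    """Rough stand-in for a letter-only tokenizer followed by lowercasing."""
    # [^\W\d_] matches a word character that is neither a digit nor an
    # underscore, i.e. a letter. Everything else acts as a separator.
    return [run.lower() for run in re.findall(r"[^\W\d_]+", text)]


# The date parts are dropped; only the letter runs survive.
print(lowercase_tokenizer("My_file_2012.01.13.doc"))  # ['my', 'file', 'doc']
```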

3) Ngrams on search

You include your ngram token filter in both the index analyzer and the search analyzer. That's fine for the index analyzer, because you want the ngrams to be indexed. But when you search, you want to search on the full string, not on each ngram.

For instance, if you index "abcd" with ngrams of length 1 to 4, you will end up with these tokens:

    a b c d ab bc cd abc bcd abcd

But if you search on "dcba" (which shouldn't match) and you also analyze your search terms with ngrams, then you are actually searching on:

    d c b a dc cb ba dcb cba dcba

So a, b, c and d will match!
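The overlap is easy to check with a few lines of Python. This is just a sketch of plain character ngrams (the helper `ngrams` is hypothetical, not anything ElasticSearch exposes), comparing what gets indexed for "abcd" with what an ngram-analyzed search for "dcba" would look for:

```python
def ngrams(text: str, min_gram: int = 1, max_gram: int = 4) -> set[str]:
    """All substrings of text whose length is between min_gram and max_gram."""
    return {
        text[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(text) - size + 1)
    }


indexed = ngrams("abcd")   # tokens stored in the index
searched = ngrams("dcba")  # tokens produced if the search is ngrammed too

# The single-character ngrams are shared, so the "wrong" query still matches.
print(sorted(indexed & searched))  # ['a', 'b', 'c', 'd']

# Searching on the full string instead finds nothing, as it should.
print("dcba" in indexed)  # False
```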

Solution

First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect ile to match file. Instead, it will probably be more useful to use edge ngrams, which will anchor the ngram to the start (or end) of each word.

Also, why exclude docx etc? Surely a user may well want to search on the file type?

So let's break up each filename into smaller tokens by removing anything that isn't a letter or a number (using the pattern tokenizer):

    My_first_file_2012.01.13.doc
    => my first file 2012 01 13 doc

Then for the index analyzer, we'll also use edge ngrams on each of those tokens:

    my     => m my
    first  => f fi fir firs first
    file   => f fi fil file
    2012   => 2 20 201 2012
    01     => 0 01
    13     => 1 13
    doc    => d do doc
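The two analysis chains described above can be sketched in Python to see what each one should produce. Treat this as an approximation rather than what ElasticSearch actually runs: the function names are made up, and the separator regex `[\W_]+` is my stand-in for "anything that isn't a letter or a number", since Python's `re` module has no `\p{L}`.

```python
import re


def filename_tokens(filename: str) -> list[str]:
    """Split on runs of non-letter, non-digit characters, then lowercase."""
    # [\W_]+ is an assumed approximation of the separator pattern.
    return [tok.lower() for tok in re.split(r"[\W_]+", filename) if tok]


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Prefixes of token, anchored at the front, from min_gram to max_gram chars."""
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]


def search_analyzer(text: str) -> list[str]:
    """Search side: whole tokens only."""
    return filename_tokens(text)


def index_analyzer(text: str) -> list[str]:
    """Index side: every front-anchored prefix of every token."""
    return [gram for tok in filename_tokens(text) for gram in edge_ngrams(tok)]


print(search_analyzer("My_first_file_2012.01.13.doc"))
print(edge_ngrams("first"))
```

With this model, every token produced on the search side for a query like "2012.01" is also present among the indexed prefixes of the filename, which is the property the two analyzers are meant to have.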

We create the index as follows:

    curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1'  -d '
    {
       "settings" : {
          "analysis" : {
             "analyzer" : {
                "filename_search" : {
                   "tokenizer" : "filename",
                   "filter" : ["lowercase"]
                },
                "filename_index" : {
                   "tokenizer" : "filename",
                   "filter" : ["lowercase","edge_ngram"]
                }
             },
             "tokenizer" : {
                "filename" : {
                   "pattern" : "[^\\p{L}\\d]+",
                   "type" : "pattern"
                }
             },
             "filter" : {
                "edge_ngram" : {
                   "side" : "front",
                   "max_gram" : 20,
                   "min_gram" : 1,
                   "type" : "edgeNGram"
                }
             }
          }
       },
       "mappings" : {
          "file" : {
             "properties" : {
                "filename" : {
                   "type" : "string",
                   "search_analyzer" : "filename_search",
                   "index_analyzer" : "filename_index"
                }
             }
          }
       }
    }
    '

Now, test that our analyzers are working correctly:

filename_search:

    curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_search'

    [results snipped]
    "token" : "my"
    "token" : "first"
    "token" : "file"
    "token" : "2012"
    "token" : "01"
    "token" : "13"
    "token" : "doc"

filename_index:

    curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_index'

    "token" : "m"
    "token" : "my"
    "token" : "f"
    "token" : "fi"
    "token" : "fir"
    "token" : "firs"
    "token" : "first"
    "token" : "f"
    "token" : "fi"
    "token" : "fil"
    "token" : "file"
    "token" : "2"
    "token" : "20"
    "token" : "201"
    "token" : "2012"
    "token" : "0"
    "token" : "01"
    "token" : "1"
    "token" : "13"
    "token" : "d"
    "token" : "do"
    "token" : "doc"

OK - seems to be working correctly. So let's add some docs:

    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
    curl -X POST "http://localhost:9200/files/_refresh"

And try a search:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
    {
       "query" : {
          "text" : {
             "filename" : "2012.01"
          }
       }
    }
    '

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.06780553,
    #             "_index" : "files",
    #             "_id" : "PsDvfFCkT4yvJnlguxJrrQ",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.06780553,
    #             "_index" : "files",
    #             "_id" : "ER5RmyhATg-Eu92XNGRu-w",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.06780553,
    #       "total" : 2
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 4
    # }

Success!

#### UPDATE ####

I realised that a search for 2012.01 would match both 2012.01.12 and 2012.12.01 so I tried changing the query to use a text phrase query instead. However, this didn't work. It turns out that the edge ngram filter increments the position count for each ngram (while I would have thought that the position of each ngram would be the same as for the start of the word).

The issue mentioned in point (3) above is only a problem when using a query_string, field, or text query which tries to match ANY token. However, for a text_phrase query, it tries to match ALL of the tokens, and in the correct order.
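The difference between the two matching modes can be sketched in Python. This is a deliberately simplified model: it works on whole tokens only, ignores ngrams and the position quirk described above, and the helper names are invented for illustration. It is only meant to show why "any token" lets 2012.12.01 through while "all tokens, in order" does not.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercased runs of letters/digits (assumed approximation of the tokenizer)."""
    return [tok.lower() for tok in re.split(r"[\W_]+", text) if tok]


def matches_any(query: str, filename: str) -> bool:
    """Loose matching: at least one query token occurs somewhere in the filename."""
    doc = tokens(filename)
    return any(tok in doc for tok in tokens(query))


def matches_phrase(query: str, filename: str) -> bool:
    """Phrase matching: the query tokens occur adjacent and in the same order."""
    wanted, doc = tokens(query), tokens(filename)
    return any(
        doc[start:start + len(wanted)] == wanted
        for start in range(len(doc) - len(wanted) + 1)
    )


first = "My_first_file_created_at_2012.01.13.doc"
third = "My_third_file_created_at_2012.12.01.doc"

print(matches_any("2012.01", first), matches_any("2012.01", third))        # True True
print(matches_phrase("2012.01", first), matches_phrase("2012.01", third))  # True False
```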

To demonstrate the issue, index another doc with a different date:

    curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'
    curl -X POST "http://localhost:9200/files/_refresh"

And do the same search as above:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
    {
       "query" : {
          "text" : {
             "filename" : {
                "query" : "2012.01"
             }
          }
       }
    }
    '

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_third_file_created_at_2012.12.01.doc"
    #             },
    #             "_score" : 0.22097087,
    #             "_index" : "files",
    #             "_id" : "xmC51lIhTnWplOHADWJzaQ",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.13137488,
    #             "_index" : "files",
    #             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.13137488,
    #             "_index" : "files",
    #             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.22097087,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 5
    # }

The first result has a date 2012.12.01 which isn't the best match for 2012.01. So to match only that exact phrase, we can do:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
    {
       "query" : {
          "text_phrase" : {
             "filename" : {
                "query" : "2012.01",
                "analyzer" : "filename_index"
             }
          }
       }
    }
    '

    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.55737644,
    #             "_index" : "files",
    #             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.55737644,
    #             "_index" : "files",
    #             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.55737644,
    #       "total" : 2
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 7
    # }

Or, if you still want to match all 3 files (because the user might remember some of the words in the filename, but in the wrong order), you can run both queries but increase the importance of the filename which is in the correct order:

    curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
    {
       "query" : {
          "bool" : {
             "should" : [
                {
                   "text_phrase" : {
                      "filename" : {
                         "boost" : 2,
                         "query" : "2012.01",
                         "analyzer" : "filename_index"
                      }
                   }
                },
                {
                   "text" : {
                      "filename" : "2012.01"
                   }
                }
             ]
          }
       }
    }
    '

    # [Fri Feb 24 16:31:02 2012] Response:
    # {
    #    "hits" : {
    #       "hits" : [
    #          {
    #             "_source" : {
    #                "filename" : "My_first_file_created_at_2012.01.13.doc"
    #             },
    #             "_score" : 0.56892186,
    #             "_index" : "files",
    #             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_second_file_created_at_2012.01.13.pdf"
    #             },
    #             "_score" : 0.56892186,
    #             "_index" : "files",
    #             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
    #             "_type" : "file"
    #          },
    #          {
    #             "_source" : {
    #                "filename" : "My_third_file_created_at_2012.12.01.doc"
    #             },
    #             "_score" : 0.012931341,
    #             "_index" : "files",
    #             "_id" : "xmC51lIhTnWplOHADWJzaQ",
    #             "_type" : "file"
    #          }
    #       ],
    #       "max_score" : 0.56892186,
    #       "total" : 3
    #    },
    #    "timed_out" : false,
    #    "_shards" : {
    #       "failed" : 0,
    #       "successful" : 5,
    #       "total" : 5
    #    },
    #    "took" : 4
    # }


I believe this is because of the tokenizer being used.

http://www.elasticsearch.org/guide/reference/index-modules/analysis/lowercase-tokenizer.html

The lowercase tokenizer splits on word boundaries, so 2012.01.13 will be indexed as "2012", "01" and "13". Searching for the string "2012.01.13" will obviously not match.

One option would be to add the tokenisation on search as well. Therefore, searching for "2012.01.13" will be tokenised down to the same tokens as in the index and it will match. This is also handy as you then don't need to always lowercase your searches in code.

The second option would be to use an n-gram tokenizer instead of the filter. This will mean that it will ignore word boundaries (and you will get the "_"'s as well), however you may have issues with case mismatches, which is presumably the reason you added the lowercase tokenizer in the first place.


I have no experience with ES, but in Solr you would need to specify the field type as text. Your field is of type string instead of text. String fields are not analyzed, but stored and indexed verbatim. Give that a shot and see if it works.

    "properties": {
        "filename": {
            "type": "string",
            "analyzer": "filename_analyzer"
        }
    }