Matching by part url

I think the better solution is not to use _id for this problem.

Instead, index a field called url (or whatever name you want) and look at using the Path Hierarchy Tokenizer with some creative token filters.

This way you can use Elasticsearch/Lucene to tokenize the URLs.

For example: https://site/folder gets tokenized as two tokens:

  • site
  • site/folder

Then, you could find any file or folder contained in the site folder by searching for the right token: site.

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "http_dropper": {
          "type": "pattern_replace",
          "pattern": "^https?:/{0,}(.*)",
          "replacement": "$1"
        },
        "empty_dropper": {
          "type": "length",
          "min": 1
        },
        "qs_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)[?].*",
          "replacement": "$1"
        },
        "trailing_slash_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)/+$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "url": {
          "tokenizer": "path_hierarchy",
          "filter": [
            "http_dropper",
            "qs_dropper",
            "trailing_slash_dropper",
            "empty_dropper",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "url": {
          "type": "string",
          "analyzer": "url"
        },
        "type": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

You may or may not want the trailing_slash_dropper that I added. It may also be worthwhile to have the lowercase token filter in there, but that actually could make some URL tokens fundamentally incorrect (e.g., mysite.com/bucket/AaDsaAe31AcxX may really care about the case of those characters). You can take the analyzer for a test drive with the _analyze endpoint:

GET /test/_analyze?analyzer=url&text=http://test.com/text/a/?value=xyz&abc=value

Note: I'm using Sense, so it does the URL encoding for me. This will produce three tokens:

{
  "tokens": [
    {
      "token": "test.com",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text/a",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
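
If it helps to reason about what the analyzer is doing, here is a rough Python sketch of the same pipeline. It is a simplification (Elasticsearch actually runs the token filters on each path_hierarchy token rather than on the whole string first), but it produces the same tokens for these inputs:

```python
import re

def url_tokens(url):
    """Rough emulation of the custom "url" analyzer:
    strip the scheme, drop the query string and trailing slashes,
    then emit path_hierarchy-style prefix tokens."""
    url = re.sub(r'^https?:/{0,}', '', url)   # http_dropper
    url = re.sub(r'(.*)[?].*', r'\1', url)    # qs_dropper
    url = re.sub(r'/+$', '', url)             # trailing_slash_dropper
    parts = [p for p in url.split('/') if p]  # empty_dropper
    tokens = ['/'.join(parts[:i]) for i in range(1, len(parts) + 1)]
    # "unique" filter: drop duplicate tokens, preserving order
    seen, out = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

print(url_tokens('http://test.com/text/a/?value=xyz'))
# ['test.com', 'test.com/text', 'test.com/text/a']
```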

Tying it all together:

POST /test/type
{
  "type": "dir",
  "url": "https://site"
}

POST /test/type
{
  "type": "dir",
  "url": "https://site/folder"
}

POST /test/type
{
  "type": "file",
  "url": "http://site/folder/document_name.doc"
}

POST /test/type
{
  "type": "file",
  "url": "http://other/site/folder/document_name.doc"
}

POST /test/type
{
  "type": "file",
  "url": "http://other_site/folder/document_name.doc"
}

POST /test/type
{
  "type": "file",
  "url": "http://site/mirror/document_name.doc"
}

GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "url": "http://site/folder"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "type": "file"
          }
        }
      ]
    }
  }
}

It's important to test this so that you can see what matches, and the order of those matches. Naturally this finds the document that you expect it to find (and puts it at the top!), but it also finds some other documents that you might not expect, like http://site/mirror/document_name.doc because it shares the base token: site. There are a bunch of strategies that you can use to exclude those documents if it's important to exclude them.
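
One such strategy is to exclude whole subtrees by their exact path token with a must_not clause. Here is a hypothetical helper (url_query is just a sketch, not a real API; the field names follow the mapping above, and the exclude parameter is my addition) that builds such a search body:

```python
def url_query(wanted, exclude=None, doc_type='file'):
    """Build a bool query body: match the wanted URL prefix,
    filter on type, and optionally exclude whole subtrees by
    their exact path token (e.g. 'site/mirror')."""
    body = {
        'query': {
            'bool': {
                'must': [{'match': {'url': wanted}}],
                'filter': [{'term': {'type': doc_type}}],
            }
        }
    }
    if exclude:
        # term queries are not analyzed, so each entry must be an
        # exact token as produced by the "url" analyzer
        body['query']['bool']['must_not'] = [
            {'term': {'url': tok}} for tok in exclude
        ]
    return body

body = url_query('http://site/folder', exclude=['site/mirror'])
```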

You can take advantage of your tokenization to perform Google-like result filtering, similar to how you can restrict a Google search to a specific domain:

match query site:elastic.co

You could then manually parse out the site:elastic.co and use elastic.co as a bounding url:

{
  "term": {
    "url": "elastic.co"
  }
}

Note that this is different from searching for the URL. You're explicitly saying "only include documents that contain this exact token in their url". You can go further with site:elastic.co/blog and so on because that exact token exists. However, it's important to note that if you were to try site:elastic.co/blog/, then that would find no documents because that token cannot exist given the token filters.
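
The manual parsing step could be sketched like this (parse_site_query is a hypothetical helper; it also strips trailing slashes so that the bounding value can actually be a token that exists, per the note above):

```python
import re

def parse_site_query(text):
    """Split a Google-style 'site:' operator out of a query string,
    returning (bounding_url, remaining_text)."""
    m = re.search(r'site:(\S+)', text)
    if not m:
        return None, text.strip()
    # trailing slashes can never appear in a token, so drop them
    site = m.group(1).rstrip('/')
    rest = (text[:m.start()] + text[m.end():]).strip()
    return site, rest

print(parse_site_query('match query site:elastic.co'))
# ('elastic.co', 'match query')
```

The remaining text would then go into the match clause on url, while the bounding url goes into the term filter shown above.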