Matching by partial URL
I think the better solution is to not use _id
for this problem.
Instead, index a field called path
(or whatever name you want) and look at using the Path Hierarchy Tokenizer with some creative token filters.
This way you can use Elasticsearch/Lucene to tokenize the URLs.
For example, https://site/folder gets tokenized as two tokens:

site
site/folder

Then you could search for any file or folder contained in the site folder by searching for the right token: site.
PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "http_dropper": {
          "type": "pattern_replace",
          "pattern": "^https?:/{0,}(.*)",
          "replacement": "$1"
        },
        "empty_dropper": {
          "type": "length",
          "min": 1
        },
        "qs_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)[?].*",
          "replacement": "$1"
        },
        "trailing_slash_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)/+$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "url": {
          "tokenizer": "path_hierarchy",
          "filter": [
            "http_dropper",
            "qs_dropper",
            "trailing_slash_dropper",
            "empty_dropper",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "url": {
          "type": "string",
          "analyzer": "url"
        },
        "type": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
You may or may not want the trailing_slash_dropper
that I added. It may also be worthwhile to add the lowercase
token filter, but that could make some URL tokens fundamentally incorrect (e.g., for mysite.com/bucket/AaDsaAe31AcxX,
the case of those characters may really matter). You can take the analyzer for a test drive with the _analyze
endpoint:
GET /test/_analyze?analyzer=url&text=http://test.com/text/a/?value=xyz&abc=value
Note: I'm using Sense, so it does the URL encoding for me. This will produce three tokens:
{
  "tokens": [
    {
      "token": "test.com",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text/a",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
Tying it all together:
POST /test/type
{
  "type": "dir",
  "url": "https://site"
}
POST /test/type
{
  "type": "dir",
  "url": "https://site/folder"
}
POST /test/type
{
  "type": "file",
  "url": "http://site/folder/document_name.doc"
}
POST /test/type
{
  "type": "file",
  "url": "http://other/site/folder/document_name.doc"
}
POST /test/type
{
  "type": "file",
  "url": "http://other_site/folder/document_name.doc"
}
POST /test/type
{
  "type": "file",
  "url": "http://site/mirror/document_name.doc"
}
GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "url": "http://site/folder"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "type": "file"
          }
        }
      ]
    }
  }
}
It's important to test this so that you can see what matches, and the order of those matches. Naturally this finds the document that you expect it to find (and puts it at the top!), but it also finds some other documents that you might not expect, like http://site/mirror/document_name.doc
because it shares the base token: site
. There are a bunch of strategies that you can use to exclude those documents if it's important to exclude them.
You can take advantage of your tokenization to perform Google-like results filtering, like how you can search specific domains via Google:

match query site:elastic.co

You could then parse out the site:elastic.co
part (manually) and take the elastic.co
as a bounding URL:

{
  "term": {
    "url": "elastic.co"
  }
}
Note that this is different from searching for the URL. You're explicitly saying "only include documents that contain this exact token in their url". You can go further with site:elastic.co/blog
and so on because that exact token exists. However, it's important to note that if you were to try site:elastic.co/blog/
, then that would find no documents because that token cannot exist given the token filters.
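Putting that together, a sketch of such a Google-style search (the text field name, body, is hypothetical here and not part of the mapping above): after stripping site:elastic.co out of the user's input, run the remaining words as a match query and apply the term filter as the bound on url:

GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "body": "match query"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "url": "elastic.co"
          }
        }
      ]
    }
  }
}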