Matching by part url

I think the better solution is not to use _id for this problem.

Instead, index a field called url (or whatever name you want) and look at using the Path Hierarchy Tokenizer with some creative token filters.

This way you can use Elasticsearch/Lucene to tokenize the URLs.

For example: https://site/folder gets tokenized as two tokens:

  • site
  • site/folder

Then, you could find any file or folder contained in the site folder by searching for the right token: site.

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "http_dropper": {
          "type": "pattern_replace",
          "pattern": "^https?:/{0,}(.*)",
          "replacement": "$1"
        },
        "empty_dropper": {
          "type": "length",
          "min": 1
        },
        "qs_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)[?].*",
          "replacement": "$1"
        },
        "trailing_slash_dropper": {
          "type": "pattern_replace",
          "pattern": "(.*)/+$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "url": {
          "tokenizer": "path_hierarchy",
          "filter": [
            "http_dropper",
            "qs_dropper",
            "trailing_slash_dropper",
            "empty_dropper",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "url": {
          "type": "string",
          "analyzer": "url"
        },
        "type": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}

You may or may not want the trailing_slash_dropper that I added. It may also be worthwhile to have the lowercase token filter in there, but that actually could make some URL tokens fundamentally incorrect (e.g., mysite.com/bucket/AaDsaAe31AcxX may really care about the case of those characters). You can take the analyzer for a test drive with the _analyze endpoint:

GET /test/_analyze?analyzer=url&text=http://test.com/text/a/?value=xyz&abc=value

Note: I'm using Sense, so it does the URL encoding for me. This will produce three tokens:

{
  "tokens": [
    {
      "token": "test.com",
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text",
      "start_offset": 0,
      "end_offset": 20,
      "type": "word",
      "position": 0
    },
    {
      "token": "test.com/text/a",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
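
If it helps to reason about what the analyzer is doing, here is a rough Python sketch of the same pipeline. It is a simplification (Elasticsearch actually runs the token filters on each path_hierarchy token rather than on the whole string first), but it produces the same tokens for these inputs:

```python
import re

def url_tokens(url):
    """Rough emulation of the custom "url" analyzer:
    strip the scheme, drop the query string and trailing slashes,
    then emit path_hierarchy-style prefix tokens."""
    url = re.sub(r'^https?:/{0,}', '', url)   # http_dropper
    url = re.sub(r'(.*)[?].*', r'\1', url)    # qs_dropper
    url = re.sub(r'/+$', '', url)             # trailing_slash_dropper
    parts = [p for p in url.split('/') if p]  # empty_dropper
    tokens = ['/'.join(parts[:i]) for i in range(1, len(parts) + 1)]
    # "unique" filter: drop duplicate tokens, preserving order
    seen, out = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

print(url_tokens('http://test.com/text/a/?value=xyz'))
# ['test.com', 'test.com/text', 'test.com/text/a']
```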

Tying it all together:

POST /test/type
{
  "type": "dir",
  "url": "https://site"
}

POST /test/type
{
  "type": "dir",
  "url": "https://site/folder"
}

POST /test/type
{
  "type": "file",
  "url": "http://site/folder/document_name.doc"
}

POST /test/type
{
  "type": "file",
  "url": "http://other/site/folder/document_name.doc"
}

POST /test/type
{
  "type": "file",
  "url": "http://other_site/folder/document_name.doc"
}

POST /test/type
{
  "type": "file",
  "url": "http://site/mirror/document_name.doc"
}

GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "url": "http://site/folder"
          }
        }
      ],
      "filter": [
        {
          "term": {
            "type": "file"
          }
        }
      ]
    }
  }
}

It's important to test this so that you can see what matches, and the order of those matches. Naturally this finds the document that you expect it to find (and puts it at the top!), but it also finds some other documents that you might not expect, like http://site/mirror/document_name.doc because it shares the base token: site. There are a bunch of strategies that you can use to exclude those documents if it's important to exclude them.
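
One such strategy is to exclude whole subtrees by their exact path token with a must_not clause. Here is a hypothetical helper (url_query is just a sketch, not a real API; the field names follow the mapping above, and the exclude parameter is my addition) that builds such a search body:

```python
def url_query(wanted, exclude=None, doc_type='file'):
    """Build a bool query body: match the wanted URL prefix,
    filter on type, and optionally exclude whole subtrees by
    their exact path token (e.g. 'site/mirror')."""
    body = {
        'query': {
            'bool': {
                'must': [{'match': {'url': wanted}}],
                'filter': [{'term': {'type': doc_type}}],
            }
        }
    }
    if exclude:
        # term queries are not analyzed, so each entry must be an
        # exact token as produced by the "url" analyzer
        body['query']['bool']['must_not'] = [
            {'term': {'url': tok}} for tok in exclude
        ]
    return body

body = url_query('http://site/folder', exclude=['site/mirror'])
```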

You can take advantage of your tokenization to perform Google-like result filtering, similar to how you can restrict a Google search to a specific domain:

match query site:elastic.co

You could then manually parse out the site:elastic.co and use elastic.co as a bounding url:

{
  "term": {
    "url": "elastic.co"
  }
}

Note that this is different from searching for the URL. You're explicitly saying "only include documents that contain this exact token in their url". You can go further with site:elastic.co/blog and so on because that exact token exists. However, it's important to note that if you were to try site:elastic.co/blog/, then that would find no documents because that token cannot exist given the token filters.
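
The manual parsing step could be sketched like this (parse_site_query is a hypothetical helper; it also strips trailing slashes so that the bounding value can actually be a token that exists, per the note above):

```python
import re

def parse_site_query(text):
    """Split a Google-style 'site:' operator out of a query string,
    returning (bounding_url, remaining_text)."""
    m = re.search(r'site:(\S+)', text)
    if not m:
        return None, text.strip()
    # trailing slashes can never appear in a token, so drop them
    site = m.group(1).rstrip('/')
    rest = (text[:m.start()] + text[m.end():]).strip()
    return site, rest

print(parse_site_query('match query site:elastic.co'))
# ('elastic.co', 'match query')
```

The remaining text would then go into the match clause on url, while the bounding url goes into the term filter shown above.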