Searching subtitle data in Elasticsearch
Interesting question. Here's my take on it.
In essence, the subtitles "don't know" about each other, so it's best to store the previous and subsequent subtitle text alongside each doc's own text (n - 1, n, n + 1) whenever applicable.
As such, you'd be gunning for a doc structure similar to:
```json
{
  "sub_id": 0,
  "start": "00:02:17,440",
  "end": "00:02:20,375",
  "text": "Senator, we're making our final",
  "overlapping_text": "Senator, we're making our final approach into Coruscant."
}
```
To arrive at such a doc structure I used the following (inspired by this excellent answer):
```python
from itertools import groupby
from collections import namedtuple

def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')
    subs = []

    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')

            # ints only
            sub_id = int(sub_id)

            # join multi-line text
            text = ', '.join(content)

            subs.append(Subtitle(sub_id, start, end, text))

    es_ready_subs = []
    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''

        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '
        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text

        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs
```
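For reference, `parse_subs` assumes the usual SRT layout: blank-line-delimited chunks of an index, a timing line, and one or more text lines. So `subs.txt` would look something like:

```
0
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

1
00:02:20,476 --> 00:02:22,501
approach into Coruscant.
```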
Once the subtitles are parsed, they can be ingested into ES. Before that's done, set up the following mapping so that your timestamps are properly searchable and sortable:
```
PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}
```
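Since the `HH:mm:ss,SSS` format has no date portion, Elasticsearch internally resolves these values against 1970-01-01, so the numeric `sort` values you'll see in search responses are simply milliseconds since midnight. You can sanity-check one with a quick sketch (the helper `ms_since_midnight` is my own, not part of the setup):

```python
from datetime import datetime

def ms_since_midnight(ts: str) -> int:
    # the mapping's "HH:mm:ss,SSS" corresponds to "%H:%M:%S,%f" in Python
    t = datetime.strptime(ts, '%H:%M:%S,%f')
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1000 + t.microsecond // 1000

print(ms_since_midnight('00:02:17,440'))  # 137440 -- matches the doc's sort value
```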
Once that's done, proceed to ingest:
```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

es = Elasticsearch()
es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    }
    for sub_group in es_ready_subs
]

bulk(es, actions)
```
Once ingested, you can target the original subtitle `text` field and boost it if it directly matches your phrase. Additionally, add a fallback on the `overlapping_text` field, which ensures that both "overlapping" subtitles are returned.

Before returning, you can make sure the hits are ordered by `start`, ascending. Sorting somewhat defeats the purpose of boosting, but if you do sort, you can specify `track_scores=true` in the URI to make sure the originally calculated scores are returned too.
Putting it all together:
```
POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}
```
yields:
```json
{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [ 137440 ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [ 140476 ]
      }
    ]
  }
}
```