Searching subtitle data in Elasticsearch
Interesting question. Here's my take on it.
In essence, the subtitles "don't know" about each other, so it's best to store the previous and subsequent subtitle text alongside each doc's own text (n - 1, n, n + 1) whenever applicable.
As such, you'd be gunning for a doc structure similar to:
```json
{
  "sub_id": 0,
  "start": "00:02:17,440",
  "end": "00:02:20,375",
  "text": "Senator, we're making our final",
  "overlapping_text": "Senator, we're making our final approach into Coruscant."
}
```
To arrive at such a doc structure I used the following (inspired by this excellent answer):
```python
from itertools import groupby
from collections import namedtuple

def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')
    subs = []

    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')

            # ints only
            sub_id = int(sub_id)

            # join multi-line text
            text = ', '.join(content)

            subs.append(Subtitle(sub_id, start, end, text))

    es_ready_subs = []
    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''

        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '
        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text

        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs
```
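For reference, `parse_subs` assumes the usual SRT layout: blank-line-delimited chunks of an index, a timing line, and one or more text lines. So `subs.txt` would look something like:

```
0
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

1
00:02:20,476 --> 00:02:22,501
approach into Coruscant.
```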
Once the subtitles are parsed, they can be ingested into ES. Before that's done, set up the following mapping so that your timestamps are properly searchable and sortable:
```
PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}
```
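Since the `HH:mm:ss,SSS` format has no date portion, Elasticsearch internally resolves these values against 1970-01-01, so the numeric `sort` values you'll see in search responses are simply milliseconds since midnight. You can sanity-check one with a quick sketch (the helper `ms_since_midnight` is my own, not part of the setup):

```python
from datetime import datetime

def ms_since_midnight(ts: str) -> int:
    # the mapping's "HH:mm:ss,SSS" corresponds to "%H:%M:%S,%f" in Python
    t = datetime.strptime(ts, '%H:%M:%S,%f')
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1000 + t.microsecond // 1000

print(ms_since_midnight('00:02:17,440'))  # 137440 -- matches the doc's sort value
```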
Once that's done, proceed to ingest:
```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

es = Elasticsearch()
es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    }
    for sub_group in es_ready_subs
]

bulk(es, actions)
```
Once ingested, you can target the original subtitle `text` field and boost it if it directly matches your phrase. Additionally, add a fallback on the `overlapping_text` field, which ensures that both "overlapping" subtitles are returned.

Before returning, you can make sure the hits are ordered by `start`, ascending. Sorting somewhat defeats the purpose of boosting, but if you do sort, you can specify `track_scores=true` in the URI to make sure the originally calculated scores are returned too.
Putting it all together:
```
POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}
```
yields:
```json
{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [ 137440 ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [ 140476 ]
      }
    ]
  }
}
```