
Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)


The spellfix1 documentation actually tells you how to do this. From the Overview section:

If you intend to use this virtual table in cooperation with an FTS4 table (for spelling correction of search terms) then you might extract the vocabulary using an fts4aux table:

INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';

The SELECT term FROM search_aux WHERE col='*' statement extracts all the indexed tokens.

Connecting this with your examples, where mytable2 is your fts4 virtual table, you can create an fts4aux table and insert those tokens into your mytable3 spellfix1 table with:

CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';

You probably want to further qualify that query to skip any terms already inserted into spellfix1, otherwise you end up with double entries:

INSERT INTO mytable3(word)
    SELECT term FROM mytable2_terms
    WHERE col='*' AND
        term NOT IN (SELECT word FROM mytable3_vocab);

Now you can use mytable3 to map misspelled words to corrected tokens, then use those corrected tokens in a MATCH query against mytable2.

Depending on your needs, this may mean you need to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string would need to be split, each token run through the spellfix1 table to map to existing tokens, and then those tokens fed to the fts4 query.
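Stripped of the SQL, the flow described above is just split, correct, rejoin. A minimal sketch, with a plain dict standing in for the spellfix1 lookup (in the real version each term is resolved with a SELECT ... WHERE word MATCH ? AND top=1 query):

```python
def correct_query(query, correction_map):
    """Split a query into terms, map each through the correction
    lookup, and rebuild the query string. Unknown terms are kept
    as spelled (the FTS search will then simply find nothing)."""
    corrected = [correction_map.get(t, t) for t in query.split()]
    return " ".join(corrected)

# Hypothetical corrections, as spellfix1 might return them:
corrections = {"NUMMBER": "numbers", "carmickaeel": "carmichael"}
print(correct_query("NUMMBER carmickaeel", corrections))
# -> numbers carmichael
```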

Ignoring SQL syntax to handle this, using Python to do the splitting is easy enough:

def spellcheck_terms(conn, terms):
    cursor = conn.cursor()
    base_spellfix = """
        SELECT :term{0} as term, word FROM spellfix1data
        WHERE word MATCH :term{0} and top=1
    """
    terms = terms.split()
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    query = " UNION ".join([
        base_spellfix.format(i + 1) for i in range(len(params))])
    cursor.execute(query, params)
    correction_map = dict(cursor)
    return " ".join([correction_map.get(t, t) for t in terms])

def spellchecked_search(conn, terms):
    corrected_terms = spellcheck_terms(conn, terms)
    cursor = conn.cursor()
    fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
    cursor.execute(fts_query, (corrected_terms,))
    return cursor.fetchall()

This then returns [('All the Carmichael numbers',)] for spellchecked_search(db, "NUMMBER carmickaeel").

Keeping the spellcheck handling in Python then allows you to support more complex FTS queries as needed; you may have to reimplement the expression parser to do so, but at least Python gives you the tools to do just that.

A complete example, packaging up the above approach in a class, which simply extracts terms as alphanumeric character sequences (which, by my reading of the expression syntax specs, suffices):

import re
import sqlite3
import sys


class FTS4SpellfixSearch(object):
    def __init__(self, conn, spellfix1_path):
        self.conn = conn
        self.conn.enable_load_extension(True)
        self.conn.load_extension(spellfix1_path)

    def create_schema(self):
        self.conn.executescript(
            """
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
                USING fts4(description text);
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
                USING fts4aux(fts4data);
            CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
                USING spellfix1;
            """
        )

    def index_text(self, *text):
        cursor = self.conn.cursor()
        with self.conn:
            params = ((t,) for t in text)
            cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
            cursor.execute(
                """
                INSERT INTO spellfix1data(word)
                SELECT term FROM fts4data_terms
                WHERE col='*' AND
                    term not in (SELECT word from spellfix1data_vocab)
                """
            )

    # fts3/4 search expression tokenizer
    # no attempt is made to validate the expression, only
    # to identify valid search terms and extract them.
    # the fts3/4 tokenizer considers any alphanumeric ASCII character
    # and character in the range U+0080 and over to be terms.
    if sys.maxunicode == 0xFFFF:
        # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
    else:
        # UCS4
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")

    def _terms_from_query(self, search_query):
        """Extract search terms from a fts3/4 query

        Returns a list of terms and a template such that
        template.format(*terms) reconstructs the original query.

        terms using partial* syntax are ignored, as you can't distinguish
        between a misspelled prefix search that happens to match existing
        tokens and a valid spelling that happens to have 'near' tokens in
        the spellfix1 database that would not otherwise be matched by fts4
        """
        template, terms, lastpos = [], [], 0
        for match in self._fts4_expr_terms.finditer(search_query):
            token, (start, end) = match.group(), match.span()
            # skip columnname: and partial* terms by checking next character
            ismeta = search_query[end:end + 1] in {":", "*"}
            # skip digits if preceded by "NEAR/"
            ismeta = ismeta or (
                token.isdigit() and template and template[-1] == "NEAR"
                and "/" in search_query[lastpos:start])
            if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
                # full search term, not a keyword, column name or partial*
                terms.append(token)
                token = "{}"
            template += search_query[lastpos:start], token
            lastpos = end
        template.append(search_query[lastpos:])
        return terms, "".join(template)

    def spellcheck_terms(self, search_query):
        cursor = self.conn.cursor()
        base_spellfix = """
            SELECT :term{0} as term, word FROM spellfix1data
            WHERE word MATCH :term{0} and top=1
        """
        terms, template = self._terms_from_query(search_query)
        params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
        query = " UNION ".join(
            [base_spellfix.format(i + 1) for i in range(len(params))]
        )
        cursor.execute(query, params)
        correction_map = dict(cursor)
        return template.format(*(correction_map.get(t, t) for t in terms))

    def search(self, search_query):
        corrected_query = self.spellcheck_terms(search_query)
        cursor = self.conn.cursor()
        fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
        cursor.execute(fts_query, (corrected_query,))
        return {
            "terms": search_query,
            "corrected": corrected_query,
            "results": cursor.fetchall(),
        }

and an interactive demo using the class:

>>> db = sqlite3.connect(":memory:")
>>> fts = FTS4SpellfixSearch(db, './spellfix')
>>> fts.create_schema()
>>> fts.index_text("All the Carmichael numbers")  # your example
>>> from pprint import pprint
>>> pprint(fts.search('NUMMBER carmickaeel'))
{'corrected': 'numbers carmichael',
 'results': [('All the Carmichael numbers',)],
 'terms': 'NUMMBER carmickaeel'}
>>> fts.index_text(
...     "They are great",
...     "Here some other numbers",
... )
>>> pprint(fts.search('here some'))  # edgecase, multiple spellfix matches
{'corrected': 'here some',
 'results': [('Here some other numbers',)],
 'terms': 'here some'}
>>> pprint(fts.search('NUMMBER NOT carmickaeel'))  # using fts4 query syntax
{'corrected': 'numbers NOT carmichael',
 'results': [('Here some other numbers',)],
 'terms': 'NUMMBER NOT carmickaeel'}


The accepted answer is good (full credit to its author); here is a slight variation that, although not as complete for complex cases, is helpful to grasp the idea:

import sqlite3

db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')
c = db.cursor()
c.execute("CREATE VIRTUAL TABLE mytable2 USING fts4(description text)")
c.execute("CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2)")
c.execute("CREATE VIRTUAL TABLE mytable3 USING spellfix1")
c.execute("INSERT INTO mytable2 VALUES ('All the Carmichael numbers')")   # populate the table
c.execute("INSERT INTO mytable2 VALUES ('They are great')")
c.execute("INSERT INTO mytable2 VALUES ('Here some other numbers')")
c.execute("INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*'")

def search(query):
    # Correcting each query term with spellfix table
    correctedquery = []
    for t in query.split():
        spellfix_query = "SELECT word FROM mytable3 WHERE word MATCH ? and top=1"
        c.execute(spellfix_query, (t,))
        r = c.fetchone()
        # correct the word if there is a match in the spellfix table; if there is
        # no match, keep the word spelled as it is (then the search will give no result!)
        correctedquery.append(r[0] if r is not None else t)
    correctedquery = ' '.join(correctedquery)
    # Now do the FTS
    fts_query = 'SELECT * FROM mytable2 WHERE description MATCH ?'
    c.execute(fts_query, (correctedquery,))
    return {'result': c.fetchall(), 'correctedquery': correctedquery, 'query': query}

print(search('NUMBBERS carmickaeel'))
print(search('some HERE'))
print(search('some qsdhiuhsd'))

Here is the result:

{'query': 'NUMBBERS carmickaeel', 'correctedquery': u'numbers carmichael', 'result': [(u'All the Carmichael numbers',)]}
{'query': 'some HERE', 'correctedquery': u'some here', 'result': [(u'Here some other numbers',)]}
{'query': 'some qsdhiuhsd', 'correctedquery': u'some qsdhiuhsd', 'result': []}

Remark: note that the "Correcting each query term with spellfix table" part is done with one SQL query per term. The performance of this versus a single UNION SQL query is studied here.
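For reference, batching the per-term lookups into that single UNION query only requires building the SQL string once per search; a minimal sketch of the query builder (table and parameter names are illustrative, matching the accepted answer's style):

```python
def build_union_query(n_terms):
    """Build one UNION query resolving n_terms spellfix1 lookups in a
    single round-trip, with named parameters :term1 .. :termN."""
    base = ("SELECT :term{0} AS term, word FROM mytable3 "
            "WHERE word MATCH :term{0} AND top=1")
    return " UNION ".join(base.format(i) for i in range(1, n_terms + 1))

sql = build_union_query(2)
print(sql)
```

Executing it with a dict such as {"term1": "NUMBBERS", "term2": "carmickaeel"} then yields all corrections in one fetch.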