Optimising LIKE expressions that start with wildcards
Here is one (not really recommended) solution.
Create a table AddressSubstrings. This table would have multiple rows per address, keyed by the primary key of Table. When you insert an address into Table, insert substrings starting from each position. So, if you want to insert 'abcd', then you would insert 'abcd', 'bcd', 'cd', and 'd', along with the unique id of the row in Table. (This can all be done using a trigger.) Create an index on AddressSubstrings(AddressSubstring).
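A minimal T-SQL sketch of that setup, assuming the base table (written here as [Table], to match the query below) has columns table_id and address; the 255-character cap and all object names are illustrative assumptions:

CREATE TABLE AddressSubstrings (
    table_id         INT          NOT NULL,  -- id of the row in [Table]
    start_pos        INT          NOT NULL,  -- position the substring starts at
    AddressSubstring VARCHAR(255) NOT NULL,
    PRIMARY KEY (table_id, start_pos)
);

CREATE INDEX IX_AddressSubstrings
    ON AddressSubstrings (AddressSubstring);
GO

CREATE TRIGGER trg_Table_Substrings
ON [Table] AFTER INSERT AS
BEGIN
    -- One row per starting position: 'abcd' yields 'abcd', 'bcd', 'cd', 'd'.
    WITH tally(n) AS (
        SELECT TOP (255) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM sys.all_objects
    )
    INSERT INTO AddressSubstrings (table_id, start_pos, AddressSubstring)
    SELECT i.table_id, t.n, SUBSTRING(i.address, t.n, 255)
    FROM inserted i
    JOIN tally t ON t.n <= LEN(i.address);
END;
GO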
Then you can phrase your query as:
SELECT *
FROM Table t JOIN
     AddressSubstrings ads
     ON t.table_id = ads.table_id
WHERE ads.AddressSubstring LIKE 'nham%';
Now there will be a matching row starting with 'nham', so the LIKE should make use of an index (and a full text index also works).
If you are interested in the right way to handle this problem, a reasonable place to start is the Postgres documentation. This uses a method similar to the above, but based on n-grams. The only problem with n-grams for your particular problem is that they require re-writing the comparison as well as changing the storage.
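For completeness, a minimal sketch of the Postgres route using the pg_trgm (trigram) extension; the table and column names are assumptions:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX ix_addresses_address_trgm
    ON addresses USING gin (address gin_trgm_ops);

-- In recent Postgres versions the planner can use the trigram
-- index even with a leading wildcard:
SELECT * FROM addresses WHERE address LIKE '%nham%';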
I can't offer a complete solution to this difficult problem. But if you're looking to create a suffix search capability, in which, for example, a search for 'ilson' would find a row whose text ends in 'ilson' and a search for '654' would find a row whose text ends in '654', here's a suggestion.
WHERE REVERSE(textcolumn) LIKE REVERSE('ilson') + '%'
Of course this isn't sargable the way I wrote it here. But many modern DBMSs, including recent versions of SQL Server, allow the definition, and indexing, of computed or virtual columns.
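A minimal SQL Server sketch of that idea, assuming a table Addresses with the textcolumn to be searched (both names are illustrative):

ALTER TABLE Addresses
    ADD textcolumn_reversed AS REVERSE(textcolumn) PERSISTED;

CREATE INDEX IX_Addresses_Reversed
    ON Addresses (textcolumn_reversed);

-- The suffix search becomes a sargable prefix search:
SELECT *
FROM Addresses
WHERE textcolumn_reversed LIKE REVERSE('ilson') + '%';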
I've deployed this technique, to the delight of end users, in a health-care system with lots of long record IDs.
Not without a serious preparation effort, hwilson1.
At the risk of repeating the obvious: any search-path optimisation - that is, the decision whether an index is used, which type of join operator to use, etc. (independently of which DBMS we're talking about) - works on equality (equal-to) or range checks (greater-than and less-than).
With leading wildcards, you're out of luck.
The workaround is a serious preparation effort, as stated up front:
It would boil down to Vertica's text search feature, where that problem is solved out of the box; see the Vertica documentation on text search. For any other database platform, including MS SQL, you'll have to build the equivalent manually.
In a nutshell: It relies on a primary key or unique identifier of the table whose text search you want to optimise.
You create an auxiliary table, whose primary key is the primary key of your base table, plus a sequence number, and a VARCHAR column that will contain a series of substrings of the base table's string you initially searched using wildcards. In an over-simplified way:
If your input table (just showing the columns that matter) is this:
id |the_search_col                            |other_col
42 |The Restaurant at the End of the Universe |Arthur Dent
43 |The Hitch-Hiker's Guide to the Galaxy     |Ford Prefect
Your auxiliary search table could contain:
id |seq|search_token
42 |  1|Restaurant
42 |  2|End
42 |  3|Universe
43 |  1|Hitch-Hiker
43 |  2|Guide
43 |  3|Galaxy
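A hypothetical DDL sketch for that auxiliary table, with names matching the example above:

CREATE TABLE search_tokens (
    id           INT          NOT NULL,  -- primary key of the base table
    seq          INT          NOT NULL,  -- sequence number within that id
    search_token VARCHAR(100) NOT NULL,  -- one token per row
    PRIMARY KEY (id, seq)
);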
Normally, you suppress typical "fillers" like articles, prepositions, and apostrophe-s, and split into tokens separated by punctuation and white space. For your '%nham%' example, however, you'd probably need to talk to a linguist who has specialised in English morphology to find candidate token boundaries .... :-]
You could start with the same technique that I use when I un-pivot a horizontal series of measures without the PIVOT clause: a CROSS JOIN with a series of index integers. Then, use a combination of (probably nested) CHARINDEX() and SUBSTRING() calls driven by those integers to cut out the tokens, and use that very integer as the sequence number for the auxiliary search table. A sketch of that population step follows.
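A minimal T-SQL sketch, splitting on blanks only and leaving out the filler-word suppression; base_table, the_search_col, and search_tokens are assumed names:

WITH tally(n) AS (
    -- a series of index integers, 1 .. 1000
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_objects
)
INSERT INTO search_tokens (id, seq, search_token)
SELECT b.id,
       ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY t.n) AS seq,
       SUBSTRING(b.the_search_col, t.n,
                 CHARINDEX(' ', b.the_search_col + ' ', t.n) - t.n)
FROM base_table b
CROSS JOIN tally t
WHERE t.n <= LEN(b.the_search_col)
  AND SUBSTRING(' ' + b.the_search_col, t.n, 1) = ' ';  -- a token starts at position n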
Lay an index on search_token and you'll have a very fast access path to a big table.
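For instance, with the illustrative names from above:

CREATE INDEX ix_search_tokens_token
    ON search_tokens (search_token);

-- Sargable now: the wildcard no longer leads.
SELECT b.*
FROM base_table b
JOIN search_tokens st ON st.id = b.id
WHERE st.search_token LIKE 'Gala%';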
Not a stroll in the park, I agree, but promising ...
Happy playing -
Marco the Sane