How to ignore html tags in Sql Server 2008 Full Text Search


There is a filter for .htm and .html files.

To see if you have the filter installed, run this SQL:

SELECT * FROM sys.fulltext_document_types

You should see:

.htm   E0CA5340-4534-11CF-B952-00AA0051FE20   C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll   12.0.6828.0   Microsoft Corporation
.html  E0CA5340-4534-11CF-B952-00AA0051FE20   C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll   12.0.6828.0   Microsoft Corporation

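So, if you can convert your articles column to varbinary(max), you can add a full-text index on it and specify a doc type of '.html'. The doc type is supplied through a type column on the same table, which tells the indexer which filter to use for each row.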
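
The conversion described above could look roughly like the sketch below. It is not a tested migration script: the table dbo.Articles, its columns ArticleId, Body, BodyHtml and BodyType, the key index PK_Articles and the catalog ArticlesCatalog are all made-up names for illustration, and Body is assumed to be varchar(max). If your text is nvarchar, the filter has to be able to detect the encoding of the bytes, which this sketch does not address.

```sql
-- Sketch only. Hypothetical names: dbo.Articles, ArticleId, Body, BodyHtml,
-- BodyType, PK_Articles (a unique, single-column, non-nullable index),
-- ArticlesCatalog. GO is the batch separator used by SSMS and sqlcmd.

-- 1. Add a binary copy of the article text plus a type column for the filter.
ALTER TABLE dbo.Articles ADD
    BodyHtml varbinary(max) NULL,
    BodyType varchar(8) NOT NULL
        CONSTRAINT DF_Articles_BodyType DEFAULT '.html';
GO

-- 2. Copy the existing text (assumed to be varchar(max)) into the binary column.
UPDATE dbo.Articles
SET BodyHtml = CAST(Body AS varbinary(max));
GO

-- 3. Create a catalog and a full-text index that reads the doc type
--    from BodyType, so rows marked '.html' go through the HTML filter.
CREATE FULLTEXT CATALOG ArticlesCatalog;
GO

CREATE FULLTEXT INDEX ON dbo.Articles
    (BodyHtml TYPE COLUMN BodyType LANGUAGE 1033)  -- 1033 = English (US)
    KEY INDEX PK_Articles
    ON ArticlesCatalog
    WITH CHANGE_TRACKING AUTO;
GO

-- 4. Once the index has populated, search the binary column as usual.
SELECT ArticleId
FROM dbo.Articles
WHERE CONTAINS(BodyHtml, N'"search term"');
```

Keeping the original Body column and adding a second binary column means the application can keep reading the text as before; only the full-text queries need to point at the new column.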

Once the index has populated, you can verify the keywords using this SQL:

SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords(DB_ID('your_db'), OBJECT_ID('your_table'))
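
As a quick spot check, the same function can be filtered for a few tag names. This is only a rough sketch: the list of terms is an arbitrary sample, and a hit does not prove that markup was indexed, because a word such as "span" can also appear in the visible text of an article.

```sql
-- Look for a handful of common tag and attribute names among the indexed keywords.
-- 'your_db' and 'your_table' are placeholders, as in the query above.
SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords(DB_ID('your_db'), OBJECT_ID('your_table'))
WHERE display_term IN (N'div', N'span', N'href', N'td')
ORDER BY document_count DESC;
```

If the HTML filter is doing its job, these terms should show up in far fewer documents than they would if the raw markup had been indexed as text.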


Please check for these:

1) In SQL Server full-text search we can define noise words (stopwords). You can edit the noise word file and then rebuild the catalog (in SQL Server 2008 the stopwords are kept in a stoplist rather than in a file). That way you can treat all the HTML tags as noise. Please check:

http://msdn.microsoft.com/en-us/library/ms142551.aspx

2) With change tracking, changes are automatically included in the current full-text index, but newly added articles are ranked differently from the earlier ones. So until your master index is synced, the ranking will go up and down.

3) As far as I know, we can implement custom filters, stemmers and word breakers and plug them into SQL Server full-text search. I don't know the complete list of what is supported by default, but it does handle doc and pdf.
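
For point 1, a minimal sketch of the stopword approach in SQL Server 2008 syntax might look like this. The stoplist name HtmlTagStoplist, the table dbo.Articles and the handful of tag names are assumptions chosen for illustration. Note that this only drops the listed words as tokens, so it also hides them when they occur in real article text.

```sql
-- Sketch only. Hypothetical names: HtmlTagStoplist, dbo.Articles.
-- Stoplists require database compatibility level 100 (SQL Server 2008).

-- Start from the system stoplist so the usual noise words are kept.
CREATE FULLTEXT STOPLIST HtmlTagStoplist FROM SYSTEM STOPLIST;
GO

-- Add tag names as stopwords for the language the index uses (1033 = English).
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD N'div'  LANGUAGE 1033;
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD N'span' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD N'href' LANGUAGE 1033;
GO

-- Attach the stoplist to an existing full-text index on the table.
ALTER FULLTEXT INDEX ON dbo.Articles SET STOPLIST HtmlTagStoplist;
GO

-- Review which stopwords the stoplist now contains.
SELECT sw.stopword, sw.language_id
FROM sys.fulltext_stopwords AS sw
JOIN sys.fulltext_stoplists AS sl ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = N'HtmlTagStoplist';
```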

For more information on SQL Server 2008 full-text search, please check:

http://technet.microsoft.com/en-us/library/cc721269.aspx