SQL - Remove all HTML tags in a string
DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'SELECT t.c.value('.', 'NVARCHAR(MAX)')FROM @x.nodes('*') t(c)
Update - For strings with unclosed tags:
DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'SELECT x.value('.', 'NVARCHAR(MAX)')FROM ( SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)) r
If the HTML is well formed then there's no need to use replace to parse XML.
Just cast or convert it to an XML type and get the value(s).
Here's an example to output the text from all tags:
declare @htmlData nvarchar(100) = '<html><head></head><body> <p>My text.</p> <p>My other text.</p></body></html>';select convert(XML,@htmlData,1).value('.', 'nvarchar(max)');select cast(@htmlData as XML).value('.', 'nvarchar(max)');
Note that there's a difference in the output of whitespace between cast and convert.
To only get content from a specific node, the XQuery syntax is used. (XQuery is based on the XPath syntax)
For example:
select cast(@htmlData as XML).value('(//body/p/node())[1]', 'nvarchar(max)');select convert(XML,@htmlData,1).value('(//body/p/node())[1]', 'nvarchar(max)');
Result : My text.
Of course, this still assumes a valid XML.
If for example, a closing tag is missing then this would raise an XML parsing
error.
If the HTML isn't well formed as an XML, then one could use PATINDEX & SUBSTRING to get the first p tag. And then cast that to an XML type to get the value.
select cast(SUBSTRING(@htmlData,patindex('%<p>%',@htmlData),patindex('%</p>%',@htmlData) - patindex('%<p>%',@htmlData)+4) as xml).value('.','nvarchar(max)');
or via a funky recursive way:
declare @xmlData nvarchar(100);WITH Lines(n, x, y) AS ( SELECT 1, 1, CHARINDEX(char(13), @htmlData) UNION ALL SELECT n+1, y+1, CHARINDEX(char(13), @htmlData, y+1) FROM Lines WHERE y > 0)SELECT @xmlData = concat(@xmlData,SUBSTRING(@htmlData,x,IIF(y>0,y-x,8)))FROM Lineswhere PATINDEX('%<p>%</p>%', SUBSTRING(@htmlData,x,IIF(y>0,y-x,10))) > 0order by n;select @xmlData as xmlData, convert(XML,@xmlData,1).value('(/p/node())[1]', 'nvarchar(max)') as FirstP;
Firstly create a user defined function that strips the HTML out like so:
CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))RETURNS VARCHAR(MAX)AS BEGIN DECLARE @Start INT; DECLARE @End INT; DECLARE @Length INT; SET @Start = CHARINDEX('<', @HTMLText); SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)); SET @Length = (@End - @Start) + 1; WHILE @Start > 0 AND @End > 0 AND @Length > 0 BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ''); SET @Start = CHARINDEX('<', @HTMLText); SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)); SET @Length = (@End - @Start) + 1; END; RETURN LTRIM(RTRIM(@HTMLText)); END;GO
When you're trying to select it:
SELECT dbo.udf_StripHTML([column]) FROM SOMETABLE
This should lead to you avoiding to have to use several nested replace statements.
Credit and further info: http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/