SQL - Remove all HTML tags in a string SQL - Remove all HTML tags in a string xml xml

SQL - Remove all HTML tags in a string


DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'SELECT t.c.value('.', 'NVARCHAR(MAX)')FROM @x.nodes('*') t(c)

Update - For strings with unclosed tags:

DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'SELECT x.value('.', 'NVARCHAR(MAX)')FROM (    SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)) r


If the HTML is well formed then there's no need to use replace to parse XML.
Just cast or convert it to an XML type and get the value(s).

Here's an example to output the text from all tags:

declare @htmlData nvarchar(100) = '<html><head></head><body>   <p>My text.</p>   <p>My other text.</p></body></html>';select convert(XML,@htmlData,1).value('.', 'nvarchar(max)');select cast(@htmlData as XML).value('.', 'nvarchar(max)');

Note that there's a difference in the output of whitespace between cast and convert.

To only get content from a specific node, the XQuery syntax is used. (XQuery is based on the XPath syntax)

For example:

select cast(@htmlData as XML).value('(//body/p/node())[1]', 'nvarchar(max)');select convert(XML,@htmlData,1).value('(//body/p/node())[1]', 'nvarchar(max)');

Result : My text.

Of course, this still assumes a valid XML.
If for example, a closing tag is missing then this would raise an XML parsing error.

If the HTML isn't well formed as an XML, then one could use PATINDEX & SUBSTRING to get the first p tag. And then cast that to an XML type to get the value.

select cast(SUBSTRING(@htmlData,patindex('%<p>%',@htmlData),patindex('%</p>%',@htmlData) - patindex('%<p>%',@htmlData)+4) as xml).value('.','nvarchar(max)');

or via a funky recursive way:

declare @xmlData nvarchar(100);WITH Lines(n, x, y) AS (  SELECT 1, 1, CHARINDEX(char(13), @htmlData)  UNION ALL  SELECT n+1, y+1, CHARINDEX(char(13), @htmlData, y+1) FROM Lines  WHERE y > 0)SELECT @xmlData = concat(@xmlData,SUBSTRING(@htmlData,x,IIF(y>0,y-x,8)))FROM Lineswhere PATINDEX('%<p>%</p>%', SUBSTRING(@htmlData,x,IIF(y>0,y-x,10))) > 0order by n;select @xmlData as xmlData, convert(XML,@xmlData,1).value('(/p/node())[1]', 'nvarchar(max)') as FirstP;


Firstly create a user defined function that strips the HTML out like so:

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))RETURNS VARCHAR(MAX)AS     BEGIN         DECLARE @Start INT;         DECLARE @End INT;         DECLARE @Length INT;         SET @Start = CHARINDEX('<', @HTMLText);         SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));         SET @Length = (@End - @Start) + 1;         WHILE @Start > 0               AND @End > 0               AND @Length > 0             BEGIN                 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '');                 SET @Start = CHARINDEX('<', @HTMLText);                 SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));                 SET @Length = (@End - @Start) + 1;             END;         RETURN LTRIM(RTRIM(@HTMLText));     END;GO

When you're trying to select it:

SELECT dbo.udf_StripHTML([column]) FROM SOMETABLE

This should lead to you avoiding to have to use several nested replace statements.

Credit and further info: http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/