SQL - Remove all HTML tags in a string

html sql sql-server xml tsql

DECLARE @x XML = '<html><head></head><body><p>My text.</p></body></html>'SELECT t.c.value('.', 'NVARCHAR(MAX)')FROM @x.nodes('*') t(c)

Update - For strings with unclosed tags:

DECLARE @x NVARCHAR(MAX) = '<html><head></head><body><p>My text.<br>More text.</p></body></html>'SELECT x.value('.', 'NVARCHAR(MAX)')FROM (    SELECT x = CAST(REPLACE(REPLACE(@x, '>', '/>'), '</', '<') AS XML)) r

html sql sql-server xml tsql

If the HTML is well formed then there's no need to use replace to parse XML.
Just cast or convert it to an XML type and get the value(s).

Here's an example to output the text from all tags:

declare @htmlData nvarchar(100) = '<html><head></head><body>   <p>My text.</p>   <p>My other text.</p></body></html>';select convert(XML,@htmlData,1).value('.', 'nvarchar(max)');select cast(@htmlData as XML).value('.', 'nvarchar(max)');

Note that there's a difference in the output of whitespace between cast and convert.

To only get content from a specific node, the XQuery syntax is used. (XQuery is based on the XPath syntax)

For example:

select cast(@htmlData as XML).value('(//body/p/node())[1]', 'nvarchar(max)');select convert(XML,@htmlData,1).value('(//body/p/node())[1]', 'nvarchar(max)');

Result : My text.

Of course, this still assumes a valid XML.
If for example, a closing tag is missing then this would raise an XML parsing error.

If the HTML isn't well formed as an XML, then one could use PATINDEX & SUBSTRING to get the first p tag. And then cast that to an XML type to get the value.

select cast(SUBSTRING(@htmlData,patindex('%<p>%',@htmlData),patindex('%</p>%',@htmlData) - patindex('%<p>%',@htmlData)+4) as xml).value('.','nvarchar(max)');

or via a funky recursive way:

declare @xmlData nvarchar(100);WITH Lines(n, x, y) AS (  SELECT 1, 1, CHARINDEX(char(13), @htmlData)  UNION ALL  SELECT n+1, y+1, CHARINDEX(char(13), @htmlData, y+1) FROM Lines  WHERE y > 0)SELECT @xmlData = concat(@xmlData,SUBSTRING(@htmlData,x,IIF(y>0,y-x,8)))FROM Lineswhere PATINDEX('%<p>%</p>%', SUBSTRING(@htmlData,x,IIF(y>0,y-x,10))) > 0order by n;select @xmlData as xmlData, convert(XML,@xmlData,1).value('(/p/node())[1]', 'nvarchar(max)') as FirstP;

html sql sql-server xml tsql

Firstly create a user defined function that strips the HTML out like so:

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))RETURNS VARCHAR(MAX)AS     BEGIN         DECLARE @Start INT;         DECLARE @End INT;         DECLARE @Length INT;         SET @Start = CHARINDEX('<', @HTMLText);         SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));         SET @Length = (@End - @Start) + 1;         WHILE @Start > 0               AND @End > 0               AND @Length > 0             BEGIN                 SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '');                 SET @Start = CHARINDEX('<', @HTMLText);                 SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText));                 SET @Length = (@End - @Start) + 1;             END;         RETURN LTRIM(RTRIM(@HTMLText));     END;GO

When you're trying to select it:

SELECT dbo.udf_StripHTML([column]) FROM SOMETABLE

This should lead to you avoiding to have to use several nested replace statements.

Credit and further info: http://blog.sqlauthority.com/2007/06/16/sql-server-udf-user-defined-function-to-strip-html-parse-html-no-regular-expression/

CodeHunter

SQL - Remove all HTML tags in a string

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last