Cleaning HTML by removing extra/redundant formatting tags

Introduction

The best solution I have seen so far is HTML Tidy: http://tidy.sourceforge.net/

Beyond converting the format of a document, Tidy is also able to convert deprecated HTML tags into their cascading style sheet (CSS) counterparts automatically through the use of the clean option. The generated output contains an inline style declaration.

It can also make the HTML document XHTML-compatible (the output-xhtml option used in the function below).
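As a minimal sketch of the clean option acting on a deprecated tag, the snippet below feeds a <font> element through Tidy. The helper name convertLegacyTags() and the sample markup are made up for illustration, and the code assumes PHP's tidy extension may or may not be installed, so it returns null when it is missing. The exact markup Tidy emits depends on the library version, so no output is shown.

```php
<?php
// Hypothetical helper (not from the original answer): run Tidy with "clean"
// enabled so that presentational tags such as <font> become CSS rules.
function convertLegacyTags(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null; // the tidy extension is not available on this PHP build
    }
    $config = [
        'clean'        => true, // replace legacy presentational tags with CSS
        'output-xhtml' => true, // emit XHTML
        'indent'       => true,
    ];
    $tidy = new tidy();
    $out = $tidy->repairString($html, $config, 'utf8');
    return $out === false ? null : $out;
}

// Sample input using the deprecated <font> tag (illustrative only).
$legacy = '<p><font color="red">Warning</font></p>';
$result = convertLegacyTags($legacy);
echo $result ?? "The tidy extension is not loaded.\n";
```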

Example

$code ='<p> <strong>  <span style="font-size: 14px">   <span style="color: #006400">     <span style="font-size: 14px">      <span style="font-size: 16px">       <span style="color: #006400">        <span style="font-size: 14px">         <span style="font-size: 16px">          <span style="color: #006400">This is a </span>         </span>        </span>       </span>      </span>     </span>    </span>    <span style="color: #006400">     <span style="font-size: 16px">      <span style="color: #b22222">Test</span>     </span>    </span>   </span>  </span> </strong></p>';

If you run

$clean = cleaning($code);
print($clean['body']);

Output

<p>
    <strong>
        <span class="c3">
            <span class="c1">This is a</span>
            <span class="c2">Test</span>
        </span>
    </strong>
</p>

You can get the CSS

$clean = cleaning($code);
print($clean['style']);

Output

<style type="text/css">
    span.c3 {
        font-size: 14px
    }
    span.c2 {
        color: #006400;
        font-size: 16px
    }
    span.c1 {
        color: #006400;
        font-size: 14px
    }
</style>

Or the full HTML

$clean = cleaning($code);
print($clean['full']);

Output

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title></title>
    <style type="text/css">
/*<![CDATA[*/
    span.c3 {font-size: 14px}
    span.c2 {color: #006400; font-size: 16px}
    span.c1 {color: #006400; font-size: 14px}
    /*]]>*/
    </style>
  </head>
  <body>
    <p>
      <strong><span class="c3"><span class="c1">This is a</span>
      <span class="c2">Test</span></span></strong>
    </p>
  </body>
</html>

Function Used

function cleaning($string, $tidyConfig = null) {
    $out = array ();
    $config = array (
            'indent' => true,
            'show-body-only' => false,
            'clean' => true,
            'output-xhtml' => true,
            'preserve-entities' => true
    );
    if ($tidyConfig == null) {
        $tidyConfig = &$config;
    }
    $tidy = new tidy ();
    $out ['full'] = $tidy->repairString ( $string, $tidyConfig, 'UTF8' );
    unset ( $tidy );
    unset ( $tidyConfig );
    $out ['body'] = preg_replace ( "/.*<body[^>]*>|<\/body>.*/si", "", $out ['full'] );
    $out ['style'] = '<style type="text/css">' . preg_replace ( "/.*<style[^>]*>|<\/style>.*/si", "", $out ['full'] ) . '</style>';
    return ($out);
}

================================================

Edit 1 : Dirty Hack (Not Recommended)

================================================

Based on your last comment, it seems you want to retain the deprecated style. HTML Tidy might not allow you to do that, since it is deprecated, but you can do this:

$out = cleaning ( $code );
$getStyle = new css2string ();
$getStyle->parseStr ( $out ['style'] );
$body = $out ['body'];
$search = array ();
$replace = array ();
// Build one search/replace pair per generated rule, e.g. "span.c1"
foreach ( $getStyle->css as $key => $value ) {
    list ( $selector, $name ) = explode ( ".", $key );
    $search [] = "<$selector class=\"$name\">";
    $style = array ();
    foreach ( $value as $type => $att ) {
        $style [] = "$type:$att";
    }
    $replace [] = "<$selector style=\"" . implode ( ";", $style ) . ";\">";
}
// Swap each class attribute for the equivalent style attribute and print the result
echo str_replace ( $search, $replace, $body );

Output

<p>
  <strong>
      <span style="font-size:14px;">
        <span style="color:#006400;font-size:14px;">This is a</span>
        <span style="color:#006400;font-size:16px;">Test</span>
      </span>
  </strong>
</p>

Class Used

//Credit : http://stackoverflow.com/a/8511837/1226894
class css2string {
    var $css;

    // Parse a CSS string into $this->css as selector => (property => value)
    function parseStr($string) {
        preg_match_all ( '/(?ims)([a-z0-9, \s\.\:#_\-@]+)\{([^\}]*)\}/', $string, $arr );
        $this->css = array ();
        foreach ( $arr [0] as $i => $x ) {
            $selector = trim ( $arr [1] [$i] );
            $rules = explode ( ';', trim ( $arr [2] [$i] ) );
            $this->css [$selector] = array ();
            foreach ( $rules as $strRule ) {
                $strRule = trim ( $strRule );
                // Skip blank fragments and anything without a "property: value" shape
                if ($strRule !== '' && strpos ( $strRule, ':' ) !== false) {
                    // Limit of 2 keeps values that themselves contain ":" intact
                    $rule = explode ( ":", $strRule, 2 );
                    $this->css [$selector] [trim ( $rule [0] )] = trim ( $rule [1] );
                }
            }
        }
    }

    function arrayImplode($glue, $separator, $array) {
        if (! is_array ( $array ))
            return $array;
        $styleString = array ();
        foreach ( $array as $key => $val ) {
            if (is_array ( $val ))
                $val = implode ( ',', $val );
            $styleString [] = "{$key}{$glue}{$val}";
        }
        return implode ( $separator, $styleString );
    }

    // Return the rules of one selector as a "property:value;property:value" string
    function getSelector($selectorName) {
        return $this->arrayImplode ( ":", ";", $this->css [$selectorName] );
    }
}
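To try the class-to-style substitution without Tidy installed, here is a minimal standalone sketch. The function name inlineClasses() and the hard-coded $css map are illustrative assumptions; the map has the same shape that css2string::parseStr() stores in $css. Like the hack above, it only matches tags written exactly as <tag class="name">.

```php
<?php
// Hypothetical helper: swap class="..." for style="..." using a map of
// "tag.class" => [property => value], the shape produced by css2string.
function inlineClasses(string $body, array $css): string
{
    $search = [];
    $replace = [];
    foreach ($css as $key => $rules) {
        if (strpos($key, '.') === false) {
            continue; // only "tag.class" selectors can be inlined this way
        }
        [$tag, $class] = explode('.', $key, 2);
        $pairs = [];
        foreach ($rules as $prop => $value) {
            $pairs[] = "$prop:$value";
        }
        $search[]  = "<$tag class=\"$class\">";
        $replace[] = "<$tag style=\"" . implode(';', $pairs) . ";\">";
    }
    return str_replace($search, $replace, $body);
}

// Hand-written sample data mirroring the rules Tidy generated earlier.
$css = [
    'span.c1' => ['color' => '#006400', 'font-size' => '14px'],
    'span.c2' => ['color' => '#006400', 'font-size' => '16px'],
];
$body = '<span class="c1">This is a</span> <span class="c2">Test</span>';
$inlined = inlineClasses($body, $css);
echo $inlined, "\n";
```

Plain string replacement is fragile: a tag with extra attributes or different quoting will not match, which is one reason the approach is labelled a hack.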


Here is a solution that uses the browser to read the nested elements' properties. There is no need to cascade the properties up yourself, since the computed CSS styles can be read directly from the browser.

Here is an example: http://jsfiddle.net/mmeah/fUpe8/3/

var fixedCode = readNestProp($("#redo"));
$("#simp").html( fixedCode );

function readNestProp(el){
    var output = "";
    $(el).children().each( function(){
        if($(this).children().length==0){
            var _that=this;
            var _cssAttributeNames = ["font-size","color"];
            var _tag = $(_that).prop("nodeName").toLowerCase();
            var _text = $(_that).text();
            var _style = "";
            $.each(_cssAttributeNames, function(_index,_value){
                var css_value = $(_that).css(_value);
                if(typeof css_value!= "undefined"){
                    _style += _value + ":";
                    _style += css_value + ";";
                }
            });
            output += "<"+_tag+" style='"+_style+"'>"+_text+"</"+_tag+">";
        }else if(
            $(this).prop("nodeName").toLowerCase() !=
            $(this).find(">:first-child").prop("nodeName").toLowerCase()
        ){
            var _tag = $(this).prop("nodeName").toLowerCase();
            output += "<"+_tag+">" + readNestProp(this) + "</"+_tag+">";
        }else{
            output += readNestProp(this);
        };
    });
    return output;
}

Rather than typing in every possible CSS attribute, as in
var _cssAttributeNames = ["font-size","color"];
a better approach is the one described in "Can jQuery get all CSS styles associated with an element?"


You should look into HTMLPurifier. It's a great tool for parsing HTML and removing unnecessary and unsafe content from it. Look into its configuration options for removing empty spans. It can be a bit of a beast to configure, I admit, but that's only because it's so versatile.

It's also quite heavy, so you'd want to save its output to the database (as opposed to reading the raw HTML from the database and then parsing it with the purifier every time).
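A rough sketch of that "clean once, store the result" idea follows. The function cleanOnce() and the in-memory $store array are hypothetical stand-ins for a real database table. The HTML Purifier part assumes the library is installed and autoloadable; AutoFormat.RemoveEmpty and AutoFormat.RemoveSpansWithoutAttributes are its directives for dropping empty elements and attribute-less spans. When the library is absent, a trivial placeholder cleaner is used so the snippet still runs.

```php
<?php
// Hypothetical cache wrapper: $cleaner is any callable mapping raw HTML to
// clean HTML; $store stands in for a database table keyed by content hash.
function cleanOnce(string $raw, callable $cleaner, array &$store): string
{
    $key = sha1($raw);                 // identify the raw content
    if (!isset($store[$key])) {
        $store[$key] = $cleaner($raw); // the heavy step runs only on a miss
    }
    return $store[$key];
}

if (class_exists('HTMLPurifier')) {
    // Assumes HTML Purifier is installed and autoloadable.
    $config = HTMLPurifier_Config::createDefault();
    $config->set('AutoFormat.RemoveEmpty', true);
    $config->set('AutoFormat.RemoveSpansWithoutAttributes', true);
    $purifier = new HTMLPurifier($config);
    $cleaner = function (string $html) use ($purifier): string {
        return $purifier->purify($html);
    };
} else {
    // Placeholder so the sketch runs without the library; NOT a sanitizer.
    $cleaner = function (string $html): string {
        return trim($html);
    };
}

$store = [];
echo cleanOnce('<p>Hello</p>', $cleaner, $store), "\n";
```

In a real application the stored value would be written when the content is saved, so page views only read the already-cleaned column.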
