Cleaning HTML by removing extra/redundant formatting tags


Introduction

The best solution I have seen so far is HTML Tidy: http://tidy.sourceforge.net/

Beyond converting the format of a document, Tidy can also convert deprecated HTML tags into their cascading style sheet (CSS) counterparts automatically through the clean option. The generated output contains an embedded style declaration: a <style> block of generated classes such as c1, c2 and c3.

It can also make the HTML document XHTML-compatible (the output-xhtml option used below).

Example

    $code = '<p>
     <strong>
      <span style="font-size: 14px">
       <span style="color: #006400">
         <span style="font-size: 14px">
          <span style="font-size: 16px">
           <span style="color: #006400">
            <span style="font-size: 14px">
             <span style="font-size: 16px">
              <span style="color: #006400">This is a </span>
             </span>
            </span>
           </span>
          </span>
         </span>
        </span>
        <span style="color: #006400">
         <span style="font-size: 16px">
          <span style="color: #b22222">Test</span>
         </span>
        </span>
       </span>
      </span>
     </strong>
    </p>';

If you run

    $clean = cleaning($code);
    print($clean['body']);

Output

    <p>
        <strong>
            <span class="c3">
                <span class="c1">This is a</span>
                <span class="c2">Test</span>
            </span>
        </strong>
    </p>

You can also get the CSS

    $clean = cleaning($code);
    print($clean['style']);

Output

    <style type="text/css">
        span.c3 {
            font-size: 14px
        }
        span.c2 {
            color: #006400;
            font-size: 16px
        }
        span.c1 {
            color: #006400;
            font-size: 14px
        }
    </style>

Or the full HTML

    $clean = cleaning($code);
    print($clean['full']);

Output

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title></title>
        <style type="text/css">
    /*<![CDATA[*/
        span.c3 {font-size: 14px}
        span.c2 {color: #006400; font-size: 16px}
        span.c1 {color: #006400; font-size: 14px}
    /*]]>*/
        </style>
      </head>
      <body>
        <p>
          <strong><span class="c3"><span class="c1">This is a</span>
          <span class="c2">Test</span></span></strong>
        </p>
      </body>
    </html>

Function Used

    // Requires PHP's tidy extension.
    function cleaning($string, $tidyConfig = null) {
        // Default Tidy options, used when the caller passes no configuration.
        $config = array(
            'indent'            => true,
            'show-body-only'    => false, // keep <head> so the generated <style> block survives
            'clean'             => true,  // replace presentational markup with CSS classes
            'output-xhtml'      => true,
            'preserve-entities' => true
        );
        if ($tidyConfig === null) {
            $tidyConfig = $config;
        }

        $out  = array();
        $tidy = new tidy();
        $out['full'] = $tidy->repairString($string, $tidyConfig, 'UTF8');

        // Everything between <body> and </body>.
        $out['body'] = preg_replace("/.*<body[^>]*>|<\/body>.*/si", "", $out['full']);
        // Everything between <style> and </style>, wrapped in a fresh <style> tag.
        $out['style'] = '<style type="text/css">'
            . preg_replace("/.*<style[^>]*>|<\/style>.*/si", "", $out['full'])
            . '</style>';
        return $out;
    }

================================================

Edit 1 : Dirty Hack (Not Recommended)

================================================

Based on your last comment, it seems you want to keep the inline style attributes. HTML Tidy's clean option might not let you do that, since it replaces them with CSS classes, but you can turn the generated classes back into inline styles like this:

    $out = cleaning($code);
    $getStyle = new css2string();
    $getStyle->parseStr($out['style']);
    $body = $out['body'];
    $search = array();
    $replace = array();
    foreach ($getStyle->css as $key => $value) {
        // $key looks like "span.c1": split it into tag name and class name.
        list($selector, $name) = explode(".", $key);
        $search[] = "<$selector class=\"$name\">";
        $style = array();
        foreach ($value as $type => $att) {
            $style[] = "$type:$att";
        }
        $replace[] = "<$selector style=\"" . implode(";", $style) . ";\">";
    }
    // Swap each class attribute for the equivalent inline style.
    $body = str_replace($search, $replace, $body);
    print($body);

Output

    <p>
      <strong>
        <span style="font-size:14px;">
          <span style="color:#006400;font-size:14px;">This is a</span>
          <span style="color:#006400;font-size:16px;">Test</span>
        </span>
      </strong>
    </p>

Class Used

    // Credit: http://stackoverflow.com/a/8511837/1226894
    class css2string {
        public $css = array();

        // Parse a CSS string into $this->css[selector][property] = value.
        function parseStr($string) {
            preg_match_all('/(?ims)([a-z0-9, \s\.\:#_\-@]+)\{([^\}]*)\}/', $string, $arr);
            $this->css = array();
            foreach ($arr[0] as $i => $x) {
                $selector = trim($arr[1][$i]);
                $rules = explode(';', trim($arr[2][$i]));
                $this->css[$selector] = array();
                foreach ($rules as $strRule) {
                    // Skip blank fragments; split on the first colon only.
                    if (strpos($strRule, ':') !== false) {
                        $rule = explode(":", $strRule, 2);
                        $this->css[$selector][trim($rule[0])] = trim($rule[1]);
                    }
                }
            }
        }

        function arrayImplode($glue, $separator, $array) {
            if (!is_array($array))
                return $array;
            $styleString = array();
            foreach ($array as $key => $val) {
                if (is_array($val))
                    $val = implode(',', $val);
                $styleString[] = "{$key}{$glue}{$val}";
            }
            return implode($separator, $styleString);
        }

        // Return one selector's rules as a "property:value;property:value" string.
        function getSelector($selectorName) {
            return $this->arrayImplode(":", ";", $this->css[$selectorName]);
        }
    }


Here is a solution that uses the browser to read each nested element's properties. There is no need to cascade the properties up by hand, since the computed CSS styles are ready to read from the browser.
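For contrast, cascading the properties by hand would mean merging the style attributes of every wrapper from the outermost element inwards, with inner declarations overriding outer ones. A minimal sketch of that bookkeeping follows; `cascadeStyles` is a hypothetical helper, not part of the answer's code, and it assumes every property involved is inherited (true for font-size and color, not for CSS in general).

```javascript
// Hypothetical helper: merge inline style strings, outermost element first.
// A later (inner) declaration overrides an earlier (outer) one.
// Assumes all properties are inherited, e.g. font-size and color.
function cascadeStyles(styleStrings) {
  const merged = new Map();
  for (const styleString of styleStrings) {
    for (const declaration of styleString.split(";")) {
      const colon = declaration.indexOf(":");
      if (colon === -1) continue; // blank or malformed fragment
      const name = declaration.slice(0, colon).trim().toLowerCase();
      const value = declaration.slice(colon + 1).trim();
      if (name && value) merged.set(name, value);
    }
  }
  return [...merged].map(([name, value]) => name + ":" + value).join(";");
}

// Styles of three nested spans, outermost first.
console.log(cascadeStyles(["font-size: 14px", "color: #006400", "font-size: 16px"]));
```

Reading the computed style from the browser sidesteps this bookkeeping, and also covers properties that do not simply inherit.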

Here is an example: http://jsfiddle.net/mmeah/fUpe8/3/

    var fixedCode = readNestProp($("#redo"));
    $("#simp").html(fixedCode);

    function readNestProp(el) {
        var output = "";
        $(el).children().each(function () {
            if ($(this).children().length == 0) {
                // Leaf element: copy its computed styles into an inline style attribute.
                var _that = this;
                var _cssAttributeNames = ["font-size", "color"];
                var _tag = $(_that).prop("nodeName").toLowerCase();
                var _text = $(_that).text();
                var _style = "";
                $.each(_cssAttributeNames, function (_index, _value) {
                    var css_value = $(_that).css(_value);
                    if (typeof css_value != "undefined") {
                        _style += _value + ":";
                        _style += css_value + ";";
                    }
                });
                output += "<" + _tag + " style='" + _style + "'>" + _text + "</" + _tag + ">";
            } else if (
                $(this).prop("nodeName").toLowerCase() !=
                $(this).find(">:first-child").prop("nodeName").toLowerCase()
            ) {
                // Wrapper whose first child is a different tag: keep the wrapper.
                var _tag = $(this).prop("nodeName").toLowerCase();
                output += "<" + _tag + ">" + readNestProp(this) + "</" + _tag + ">";
            } else {
                // Redundant wrapper of the same tag (e.g. span inside span): drop it.
                output += readNestProp(this);
            }
        });
        return output;
    }

Rather than typing in every CSS attribute by hand, as in:

    var _cssAttributeNames = ["font-size","color"];

a better approach is the one described in the question "Can jQuery get all CSS styles associated with an element?"
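One way to avoid a hard-coded attribute list is to walk the computed style object itself. The sketch below is an illustration rather than the code from the linked question: `computedStyleToInline` is a hypothetical name, and it only assumes its argument behaves like the CSSStyleDeclaration returned by window.getComputedStyle(element), that is, it has a length, numeric indexes holding property names, and getPropertyValue(name).

```javascript
// Hypothetical helper: build an inline style string from every property
// listed in a computed style object.
// `computed` is assumed to look like window.getComputedStyle(element).
// `skipValues` optionally lists values to leave out.
function computedStyleToInline(computed, skipValues = []) {
  const parts = [];
  for (let i = 0; i < computed.length; i++) {
    const name = computed[i];
    const value = computed.getPropertyValue(name);
    if (value === "" || skipValues.includes(value)) continue;
    parts.push(name + ":" + value + ";");
  }
  return parts.join("");
}

// Browser usage (not executed here):
//   var style = computedStyleToInline(window.getComputedStyle(element));
```

In a real browser the computed style lists every supported property, so the result is very long; in practice you would probably still filter it, for example by comparing against the computed style of a plain reference element.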


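You should look into HTMLPurifier. It's a great tool for parsing HTML and removing unnecessary and unsafe content from it. Look into its configuration options for removing empty elements and attribute-less spans (for example, the AutoFormat.RemoveEmpty and AutoFormat.RemoveSpansWithoutAttributes directives). It can be a bit of a beast to configure, I admit, but that's only because it's so versatile.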

It's also quite heavy, so you'd want to save its output to the database (as opposed to reading the raw HTML from the database and then parsing it with the purifier every time).