Cleaning HTML by removing extra/redundant formatting tags
Introduction
The best solution I have seen so far is HTML Tidy:
http://tidy.sourceforge.net/
Beyond converting the format of a document, Tidy is also able to convert deprecated HTML tags into their cascading style sheet (CSS) counterparts automatically through the use of the clean option. The generated output contains an inline style declaration.
It also ensures that the HTML document is XHTML compatible.
Example
$code ='<p> <strong> <span style="font-size: 14px"> <span style="color: #006400"> <span style="font-size: 14px"> <span style="font-size: 16px"> <span style="color: #006400"> <span style="font-size: 14px"> <span style="font-size: 16px"> <span style="color: #006400">This is a </span> </span> </span> </span> </span> </span> </span> <span style="color: #006400"> <span style="font-size: 16px"> <span style="color: #b22222">Test</span> </span> </span> </span> </span> </strong></p>';
If you run
$clean = cleaning($code);
print($clean['body']);
Output
<p> <strong> <span class="c3"> <span class="c1">This is a</span> <span class="c2">Test</span> </span> </strong></p>
You can get the CSS
$clean = cleaning($code);
print($clean['style']);
Output
<style type="text/css"> span.c3 { font-size: 14px } span.c2 { color: #006400; font-size: 16px } span.c1 { color: #006400; font-size: 14px }</style>
Or the full HTML
$clean = cleaning($code);
print($clean['full']);
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title></title>
    <style type="text/css">
/*<![CDATA[*/
 span.c3 {font-size: 14px}
 span.c2 {color: #006400; font-size: 16px}
 span.c1 {color: #006400; font-size: 14px}
/*]]>*/
    </style>
  </head>
  <body>
    <p>
      <strong><span class="c3"><span class="c1">This is a</span> <span class="c2">Test</span></span></strong>
    </p>
  </body>
</html>
Function Used
// Requires the PHP tidy extension
function cleaning($string, $tidyConfig = null) {
    $out = array();
    // Default options: 'clean' replaces presentational markup with CSS classes
    $config = array(
        'indent' => true,
        'show-body-only' => false,
        'clean' => true,
        'output-xhtml' => true,
        'preserve-entities' => true
    );
    if ($tidyConfig == null) {
        $tidyConfig = $config;
    }
    $tidy = new tidy();
    $out['full'] = $tidy->repairString($string, $tidyConfig, 'UTF8');
    // Everything between <body> and </body>
    $out['body'] = preg_replace("/.*<body[^>]*>|<\/body>.*/si", "", $out['full']);
    // Contents of the generated <style> block, wrapped in a fresh <style> tag
    $out['style'] = '<style type="text/css">' . preg_replace("/.*<style[^>]*>|<\/style>.*/si", "", $out['full']) . '</style>';
    return $out;
}
================================================
Edit 1 : Dirty Hack (Not Recommended)
================================================
Based on your last comment, it sounds like you want to retain the deprecated inline styles. HTML Tidy might not allow you to do that, since they are deprecated, but you can do this:
$out = cleaning($code);
$getStyle = new css2string();
$getStyle->parseStr($out['style']);
$body = $out['body'];
$search = array();
$replace = array();
foreach ($getStyle->css as $key => $value) {
    // A key such as "span.c3" splits into the tag name and the class name
    list($selector, $name) = explode(".", $key);
    $search[] = "<$selector class=\"$name\">";
    $style = array();
    foreach ($value as $type => $att) {
        $style[] = "$type:$att";
    }
    $replace[] = "<$selector style=\"" . implode(";", $style) . ";\">";
}
// Swap each class attribute for the equivalent inline style
$body = str_replace($search, $replace, $body);
print($body);
Output
<p> <strong> <span style="font-size:14px;"> <span style="color:#006400;font-size:14px;">This is a</span> <span style="color:#006400;font-size:16px;">Test</span> </span> </strong></p>
Class Used
// Credit: http://stackoverflow.com/a/8511837/1226894
class css2string {
    public $css;

    // Parse a CSS string into $this->css[selector][property] = value
    function parseStr($string) {
        preg_match_all('/(?ims)([a-z0-9, \s\.\:#_\-@]+)\{([^\}]*)\}/', $string, $arr);
        $this->css = array();
        foreach ($arr[0] as $i => $x) {
            $selector = trim($arr[1][$i]);
            $rules = explode(';', trim($arr[2][$i]));
            $this->css[$selector] = array();
            foreach ($rules as $strRule) {
                // Skip empty fragments and anything without a "property: value" pair
                if (strpos($strRule, ':') === false) {
                    continue;
                }
                // Limit to 2 parts so values containing ":" (e.g. URLs) stay intact
                $rule = explode(':', $strRule, 2);
                $this->css[$selector][trim($rule[0])] = trim($rule[1]);
            }
        }
    }

    // Join an associative array into "key{glue}value" pairs separated by $separator
    function arrayImplode($glue, $separator, $array) {
        if (!is_array($array)) {
            return $array;
        }
        $styleString = array();
        foreach ($array as $key => $val) {
            if (is_array($val)) {
                $val = implode(',', $val);
            }
            $styleString[] = "{$key}{$glue}{$val}";
        }
        return implode($separator, $styleString);
    }

    // Return the rules of one selector as an inline style string
    function getSelector($selectorName) {
        return $this->arrayImplode(":", ";", $this->css[$selectorName]);
    }
}
Here is a solution that uses the browser to get the nested elements' properties. There is no need to cascade the properties up yourself, since the computed CSS styles can be read directly from the browser.
Here is an example: http://jsfiddle.net/mmeah/fUpe8/3/
var fixedCode = readNestProp($("#redo"));
$("#simp").html(fixedCode);

function readNestProp(el) {
    var output = "";
    $(el).children().each(function () {
        if ($(this).children().length == 0) {
            // Leaf element: emit it with its computed styles inlined
            var _that = this;
            var _cssAttributeNames = ["font-size", "color"];
            var _tag = $(_that).prop("nodeName").toLowerCase();
            var _text = $(_that).text();
            var _style = "";
            $.each(_cssAttributeNames, function (_index, _value) {
                var css_value = $(_that).css(_value);
                if (typeof css_value != "undefined") {
                    _style += _value + ":";
                    _style += css_value + ";";
                }
            });
            output += "<" + _tag + " style='" + _style + "'>" + _text + "</" + _tag + ">";
        } else if (
            $(this).prop("nodeName").toLowerCase() !=
            $(this).find(">:first-child").prop("nodeName").toLowerCase()
        ) {
            // Wrapper whose first child is a different tag: keep it and recurse
            var _tag = $(this).prop("nodeName").toLowerCase();
            output += "<" + _tag + ">" + readNestProp(this) + "</" + _tag + ">";
        } else {
            // Same tag nested in itself (e.g. span in span): drop the wrapper
            output += readNestProp(this);
        }
    });
    return output;
}
Rather than typing in every CSS attribute you care about, as in:
var _cssAttributeNames = ["font-size","color"];
a better approach is to use a solution like the one described in "Can jQuery get all CSS styles associated with an element?"
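As a rough sketch of that idea (not the code from the linked answer), the object returned by the browser's window.getComputedStyle() can be walked by index to collect every property name, so no hard-coded list is needed. The helper name inlineStyleFrom is made up for illustration, and the function only assumes its argument exposes length, numeric indexes and getPropertyValue(), as a CSSStyleDeclaration does.

```javascript
// Hypothetical helper: build an inline style string from a
// CSSStyleDeclaration-like object (for example, the return value of
// window.getComputedStyle(element) in a browser).
function inlineStyleFrom(declaration) {
    var parts = [];
    for (var i = 0; i < declaration.length; i++) {
        var name = declaration[i];                      // property name at index i
        var value = declaration.getPropertyValue(name); // its computed value
        if (value !== "") {
            // Skip properties that have no value
            parts.push(name + ":" + value + ";");
        }
    }
    return parts.join("");
}

// Browser usage, assuming the #redo element from the example above:
// var style = inlineStyleFrom(window.getComputedStyle(document.getElementById("redo")));
```

Keep in mind that a computed style lists every property the browser knows about, not only the ones set in the markup, so in practice you would probably still filter the result down to the properties that differ from the defaults.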
You should look into HTMLPurifier. It's a great tool for parsing HTML and removing unnecessary and unsafe content from it. Look into the configuration options for removing empty spans. It can be a bit of a beast to configure, I admit, but that's only because it's so versatile.
It's also quite heavy, so you'd want to save its output to the database (as opposed to reading the raw HTML from the database and then parsing it with the purifier every time).