HTML compact

HTML compacting is one piece of functionality that is missing in PHP. What does it do? Go to any google.com page and inspect the page source. You’d see that all the unnecessary whitespace is removed. You would think it’s a small change, but look at generated pages on other sites – especially ones produced by CMSes using designer templates, which are formatted to be human-readable and contain a bunch of comments, etc. – and you’d see that whitespace can easily double your page size.

It is not really possible, however, to make your HTML designers abandon the use of whitespace or comments. In fact, you wouldn’t want to anyway – such code would be impossible to maintain. But an extension which can produce “compacted” HTML would be very nice. Something like the tidy extension, but tidy doesn’t really know how to do whitespace compacting, AFAIK. Maybe it can be modified to do that.

Sure, it’s not a simple matter, given the amount of broken and outright crappy HTML out there. People actually can write something like <a href"url"> (yes, no = sign!) and it would work! And be sure, they do. And then there’s CSS and JavaScript.
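
To make the idea concrete, here is a minimal sketch of what such a compacting pass might look like in plain PHP. The function name `compact_html_sketch` is hypothetical (it is not part of PHP or of any extension), and the approach assumes reasonably well-formed markup: it sets aside `<pre>`, `<textarea>`, `<script>` and `<style>` elements, collapses whitespace everywhere else, and then puts the protected elements back. It does not handle broken HTML of the kind described above, and dropping the space between two adjacent tags can change how inline elements render.

```php
<?php
// Hypothetical helper for illustration only: a naive whitespace compactor
// that leaves <pre>, <textarea>, <script> and <style> elements untouched.
function compact_html_sketch(string $html): string
{
    $protected = [];

    // Swap each protected element for a tag-shaped placeholder so the
    // whitespace pass below cannot touch its contents.
    $work = preg_replace_callback(
        '~<(pre|textarea|script|style)\b[^>]*>.*?</\1\s*>~is',
        function (array $m) use (&$protected): string {
            $key = "<\x01" . count($protected) . "\x01>";
            $protected[$key] = $m[0];
            return $key;
        },
        $html
    );
    if ($work === null) {
        // The regex engine gave up (e.g. backtrack limit); return the input as-is.
        return $html;
    }

    // Collapse every run of whitespace to one space, then drop the
    // space when it sits directly between two tags.
    $work = preg_replace('~\s+~', ' ', $work);
    $work = str_replace('> <', '><', $work);

    // Put the protected elements back, byte for byte.
    return str_replace(array_keys($protected), array_values($protected), trim($work));
}

echo compact_html_sketch("<div>\n  <p>Hello   world</p>\n  <pre>a\n  b</pre>\n</div>\n");
```

The placeholder is shaped like a tag on purpose, so that the whitespace around a protected element is removed the same way as the whitespace around any other tag.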


22 thoughts on “HTML compact”

  1. The htmLawed PHP script can compact HTML. htmLawed is also an HTML filter and an alternative to using HTML Tidy; no need for external library or extension to PHP.

  2. You may want to check out this set of functions I’ve created for whitespace compression. It’s not always easy to get tidy installed on some servers. Also note that the JavaScript compression used does have some limitations.


    <?php

    /**
    * whitespace
    *
    * Created by Oliver Lillie on 2007-08-25.
    * Copyright (c) 2007 Buggedcom. All rights reserved.
    */

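    /**
    * Strips out whitespace from the buffer to the required compression level.
    *
    * @param string $buffer
    * @param int $compression_level 1 = only compress javascript and css tags, 2 = same as 1 but also compress horizontal whitespace, 3 = same as 2 but also compress vertical whitespace whilst preserving textarea and pre tags
    * @return string
    */
    function stripHTMLWhiteSpace($buffer, $compression_level=3)
    {
        switch($compression_level)
        {
            case 3 :
                // compress scripts and styles first, while line breaks still
                // terminate their single line comments
                $buffer = compressScriptAndStyleTags($buffer);
                $result = compressHorizontally($buffer);
                $result = compressVertically($result[0], $result[1]);
                $buffer = $result[0];
                break;
            case 2 :
                $result = compressHorizontally($buffer);
                $buffer = $result[0];
                // compressHorizontally() leaves its markers in place when it
                // extracted the blocks itself, so put the blocks back here
                foreach($result[1] as $curr_block)
                {
                    $buffer = preg_replace('!@@@HTMLCOMPRESSION@@@!', addcslashes($curr_block, '\\$'), $buffer, 1);
                }
                $buffer = compressScriptAndStyleTags($buffer);
                break;
            case 1 :
                $buffer = compressScriptAndStyleTags($buffer);
                break;
        }
        return $buffer;
    }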

    /**
    * Compresses white space horizontally (ie spaces, tabs etc) whilst preserving
    * textarea and pre content.
    * Idea and partial code borrowed from smarty.
    * http://smarty.php.net/contribs/plugins/view.php/outputfilter.trimwhitespace.php
    *
    * @param string $data
    * @param mixed $preserved_blocks false if no textarea or pre blocks have already been taken out, otherwise an array.
    * @return array the compressed data and the preserved blocks
    */
    function compressHorizontally($data, $preserved_blocks=false)
    {
        $reinstate = true;
        if(!$preserved_blocks)
        {
            $reinstate = false;
            // get the textarea and pre matches
            preg_match_all('!<(textarea|pre)\b[^>]*>.*?</\1>!is', $data, $preserved_area_match);
            $preserved_blocks = $preserved_area_match[0];
            // replace the textarea and pre innards with markers
            $data = preg_replace('!<(textarea|pre)\b[^>]*>.*?</\1>!is', '@@@HTMLCOMPRESSION@@@', $data);
        }
        // remove the white space at the start of each line
        $data = preg_replace('/((?<!\?>)\n)[\s]+/m', '\1', $data);
        // reinsert the textarea and pre innards
        if($reinstate)
        {
            foreach($preserved_blocks as $curr_block)
            {
                // escape \ and $ so the block is not read as a backreference
                $data = preg_replace('!@@@HTMLCOMPRESSION@@@!', addcslashes($curr_block, '\\$'), $data, 1);
            }
        }
        return array($data, $preserved_blocks);
    }

    /**
    * Compresses white space vertically (ie line breaks) whilst preserving
    * textarea and pre content.
    *
    * @param string $data
    * @param mixed $preserved_blocks false if no textarea or pre blocks have already been taken out, otherwise an array.
    * @return array the compressed data and the preserved blocks
    */
    function compressVertically($data, $preserved_blocks=false)
    {
        $reinstate = true;
        if(!$preserved_blocks)
        {
            $reinstate = false;
            // get the textarea and pre matches
            preg_match_all('!<(textarea|pre)\b[^>]*>.*?</\1>!is', $data, $preserved_area_match);
            $preserved_blocks = $preserved_area_match[0];
            // replace the textarea and pre innards with markers
            $data = preg_replace('!<(textarea|pre)\b[^>]*>.*?</\1>!is', '@@@HTMLCOMPRESSION@@@', $data);
        }
        // remove the line breaks
        $data = str_replace("\n", '', $data);
        // reinsert the textarea and pre innards
        if($reinstate)
        {
            foreach($preserved_blocks as $curr_block)
            {
                // escape \ and $ so the block is not read as a backreference
                $data = preg_replace('!@@@HTMLCOMPRESSION@@@!', addcslashes($curr_block, '\\$'), $data, 1);
            }
        }
        return array($data, $preserved_blocks);
    }

    /**
    * Compresses code (ie javascript and css) whitespace.
    *
    * @param string $code
    * @return string
    */
    function compressCode($code)
    {
        // Remove multiline comments (those starting with /*- are kept)
        $mlcomment = '/\/\*(?!-)[\x00-\xff]*?\*\//';
        $code = preg_replace($mlcomment, "", $code);
        // Remove single line comments; the lookbehind skips the // in urls
        // such as http:// without eating the character before the comment
        $slcomment = '/(?<!:)\/\/.*/';
        $code = preg_replace($slcomment, "", $code);
        // Remove extra spaces
        $extra_space = '/\s+/';
        $code = preg_replace($extra_space, " ", $code);
        // Remove spaces that can be removed
        $removable_space = '/\s?([\{\};\=\(\)\/\+\*-])\s?/';
        $code = preg_replace($removable_space, "\\1", $code);
        return $code;
    }

    /**
    * Compresses the white space within script and style tags.
    *
    * @param string $data
    * @return string
    */
    function compressScriptAndStyleTags($data)
    {
        // match each whole script or style element
        preg_match_all('!<(script|style)\b[^>]*>.*?</\1>!is', $data, $scriptparts);
        // collect and compress the parts
        $compressed = array();
        $parts = array();
        foreach($scriptparts[0] as $part)
        {
            $parts[] = $part;
            $compressed[] = compressCode($part);
        }
        // do the replacements and return
        return str_replace($parts, $compressed, $data);
    }

    <?php

    /**
    * comments
    *
    * Created by Oliver Lillie on 2007-08-25.
    * Copyright (c) 2007 Buggedcom. All rights reserved.
    */

    /**
    * Strips HTML comments from the buffer whilst making a check to see if
    * Internet Explorer conditional comments should be stripped or not.
    *
    * @param string $buffer
    * @return string
    */
    function stripHTMLComments($buffer)
    {
        // check that the requesting browser is internet explorer
        $msie = '/msie\s(.*).*(win)/i';
        $keep_conditionals = (isset($_SERVER['HTTP_USER_AGENT']) && preg_match($msie, $_SERVER['HTTP_USER_AGENT']));
        // $keep_doctype = false;
        // if(strpos($buffer, '<!DOCTYPE'))
        // {
        // $buffer = str_replace('<!DOCTYPE', '--**@@DOCTYPE@@**--', $buffer);
        // $keep_doctype = true;
        // }
        // ie conditionals are to be kept so substitute
        if($keep_conditionals)
        {
            $buffer = str_replace(array('<!--[if', '<![endif]-->'), array('--**@@IECOND-OPEN@@**--', '--**@@IECOND-CLOSE@@**--'), $buffer);
        }
        // remove comments
        $buffer = preg_replace('/<!--.*?-->/s', '', $buffer);
        // re sub-in the conditionals if required.
        if($keep_conditionals)
        {
            $buffer = str_replace(array('--**@@IECOND-OPEN@@**--', '--**@@IECOND-CLOSE@@**--'), array('<!--[if', '<![endif]-->'), $buffer);
        }
        // if($keep_doctype)
        // {
        // $buffer = str_replace('--**@@DOCTYPE@@**--', '<!DOCTYPE', $buffer);
        // }
        // return the buffer
        return $buffer;
    }

    Usage:

    $text = 'etc etc html content ';
    $text = stripHTMLComments($text);
    $text = stripHTMLWhiteSpace($text, 3);
    echo $text;

  3. Fixed a problem in my previous code, and turned it into a function.

    function compact_html($page)
    {
        $tidy_config = array(
            'clean' => 1,
            'bare' => 1,
            'hide-comments' => 1,
            'indent-spaces' => 0,
            'tab-size' => 1,
            'wrap' => 0,
            'preserve-entities' => 1,
            'indent' => 0,
            'break-before-br' => 0,
            'output-xhtml' => 1,
        );

        $tidy = new tidy;
        $tidy->parseString($page, $tidy_config);
        $tidy->cleanRepair();
        // Fold any new lines that are surrounded by a close then an open tag.
        return str_replace(">\n<", '><', (string) $tidy);
    }

  4. This is how I do it, works well and doesn’t ruin the line breaks in pre-formatted blocks etc.

    $config = array(
        'clean' => 1,
        'bare' => 1,
        'hide-comments' => 1,
        'indent-spaces' => 0,
        'tab-size' => 1,
        'wrap' => 0,
        'preserve-entities' => 1,
        'indent' => 0,
        'break-before-br' => 0,
        'output-xhtml' => 1,
    );

    $tidy = new tidy;
    $tidy->parseString($page, $config);
    $tidy->cleanRepair();
    $page = (string) $tidy;
    $page = str_replace(">\n", '>', $page);

  5. Unfortunately, it’s not that easy – the proposed change could kill styles and JavaScript, especially if things like one-line comments are in use. Also, it would not strip HTML comments.

  6. You could use these functions on your raw HTML code:

    (using output buffer as always) then:

    $html = str_replace("\r\n", "", $html);
    $html = str_replace("\n", "", $html);
    $html = str_replace("\t", "", $html);
    $html = str_replace("  ", " ", $html); // collapse double spaces
    $html = preg_replace('/value="([0-9]+)"/', 'value=\1', $html);

    If your HTML is good then this will remove quite a lot of the extraneous stuff and return a Google-like source (and I’m sure it can be tweaked further, this was just off the top of my head).

  7. Well, IE comments aren’t described in any w3c standard, but they are de-facto widely used. So it would be nice to support them.
    Changing escape-cdata doesn’t seem to eliminate CDATA blocks for me.

  8. IE comments are pretty non-standard, so I doubt libtidy plans to add any support for them. The other things seem like library oversights to me. The CDATA thing may be addressed by the escape-cdata option.

  9. I used zend.com homepage :)
    Generally, tidy _almost_ does it, but it does not do this:
    1. Remove linebreaks when possible (it’s not trivial – you can’t touch CSS and Javascript)
    2. Not put CDATA in every place where CSS/JS is (maybe some config option I missed)
    3. Strip comments, but not IE-comments – i.e. leave those <!--[if lte IE 6.0]> alone, and leave CSS/JS alone of course, but remove the rest

    If it learns to do these three – I need nothing else. :)

  10. A combination of the following options should be what you want:

    clean=1
    bare=1
    hide-comments=1
    doctype=omit
    indent-spaces=0
    tab-size=0
    wrap=0
    quote-ampersand=0

  11. Google’s HTML works with any browser known to me, which is by itself a good point in its favor. Also, I know of nothing that would prevent producing HTML with various degrees of “compactness” – e.g. most problems with Google’s HTML arise because it doesn’t wrap attributes in quotes and doesn’t escape &’s. The extension could easily make this optional or even omit it while stripping all extra whitespace. Can tidy do at least full whitespace stripping? I didn’t find even an option for that…

  12. Stas, Google’s HTML is not compliant with any specs; even the very loose 4.01 check shows that Google’s code on the front page alone contains 47 errors of varying complexity. Tidy will only produce valid HTML, so it’d never quite reach the somewhat extreme levels used at Google.
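
One request that comes up repeatedly in the thread is stripping ordinary HTML comments while leaving IE conditional comments alone. A rough sketch of how that could be done with a single regular expression follows. The function name `strip_comments_keep_ie_sketch` is made up for illustration, and the pattern is an assumption about what "IE comments" covers: it skips comments that open with `[if`, plus the `<!-->` and `<!--<![endif]-->` pieces of the downlevel-revealed form. It makes no attempt to leave `<!--` sequences inside script or style blocks alone, so it is not a complete answer to the CSS/JS concern raised above.

```php
<?php
// Hypothetical helper for illustration only: removes HTML comments but
// keeps IE conditional comments such as <!--[if lte IE 6]> ... <![endif]-->.
function strip_comments_keep_ie_sketch(string $html): string
{
    // The negative lookahead rejects any "<!--" that is followed by
    // "[if", by ">" or by "<![endif]"; everything else up to the next
    // "-->" is treated as an ordinary comment and removed.
    $pattern = '~<!--(?!\[if\b|>|<!\[endif\]).*?-->~s';

    // preg_replace() returns null on a regex engine error; fall back to
    // the unmodified input in that case.
    return preg_replace($pattern, '', $html) ?? $html;
}

echo strip_comments_keep_ie_sketch(
    "<p>a</p><!-- note --><!--[if lte IE 6]><b>x</b><![endif]-->"
);
```

Unlike the user-agent check in comment 2, this keeps the conditional comments for every visitor, which makes the output safe to cache at the cost of a few extra bytes for non-IE browsers.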
