PHP 10.0 Blog

What if…

HTML compact

Posted by Stas on October 30, 2006

HTML compacting is one piece of functionality that is missing in PHP. What does it do? Go to any google.com page and inspect the page source. You’d see that all the unnecessary whitespace has been removed. You might think it’s a small change, but look at some generated pages on other sites – especially ones produced by CMSes using designer templates, which are formatted to be human-readable and contain a bunch of comments, etc. – and you’d see that whitespace can easily double your page size.

It is not really possible, however, to make your HTML designers abandon the use of whitespace or comments. In fact, you wouldn’t want to anyway – such code would be impossible to maintain. But an extension which can produce “compacted” HTML would be very nice. Something like the tidy extension, except that tidy doesn’t really know how to do whitespace compacting, AFAIK. Maybe it can be modified to do that.

Sure, it’s not a simple matter, given the amount of broken and outright crappy HTML out there. People actually can write something like <a href”url”> (yes, no = sign!) and it would work! And be sure, they do. And then there’s CSS and Javascript to deal with.


22 Responses to “HTML compact”

  1. ilia said

    Tidy can do HTML compacting quite well given a detailed config telling it what to do. However, it will always aim to produce standards-compliant HTML, so tricks that, iirc, are not standards compliant will not be done.

  2. Stas said

    I could not find how to make tidy produce HTML like Google has.

  3. ilia said

    Stas, Google’s HTML is not compliant with any spec. Even the very loose 4.01 check shows that Google’s code on the front page alone contains 47 errors of varying complexity. Tidy will only produce valid HTML, so it’d never quite reach the somewhat extreme levels used at Google.

  4. Stas said

    Google’s HTML works with every browser known to me, which is by itself a good point for it. Also, I know of nothing that would prevent producing HTML with various degrees of “compactness” – e.g. most problems with Google’s HTML are because it doesn’t wrap attributes in “”s and doesn’t escape &’s. The extension could easily make this optional, or even skip it entirely while still stripping all extra whitespace. Can tidy at least do full whitespace stripping? I didn’t even find an option for that…

  5. ilia said

    A combination of the following options should be what you want:

    clean=1
    bare=1
    hide-comments=1
    doctype=omit
    indent-spaces=0
    tab-size=0
    wrap=0
    quote-ampersand=0

  6. Stas said

    Tried it, it removed some whitespace but definitely not all of it.

  7. ilia said

    There are additional space-removing options available for libtidy; you may want to review the option dictionary on their site.

  8. Stas said

    Well, I did, but couldn’t find an option that strips all unnecessary whitespace…

  9. ilia said

    Do you have a sample HTML file you are using as a test case? If so, is it possible to get it?

  10. Stas said

    I used the zend.com homepage :)
    Generally, tidy _almost_ does it, but it does not do this:
    1. Remove line breaks when possible (it’s not trivial – you can’t touch CSS and Javascript)
    2. Avoid putting CDATA in every place where there is CSS/JS (maybe there’s some config option I missed)
    3. Strip comments, but not IE comments – i.e. leave those <!--[if lte IE 6.0]> alone, and leave CSS/JS alone of course, but remove the rest

    If it learns to do these three – I need nothing else. :)
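
Point 3 in the list above is the easiest of the three to sketch outside of tidy. What follows is a minimal illustration, not something tidy or libtidy provides: the function name strip_plain_comments is made up for this example, and it assumes conditional comments always open with <!--[if. It does not protect CSS/JS blocks, and the downlevel-revealed form (<!--[if !IE]><!-->) would need extra handling.

```php
<?php
// Hypothetical helper: removes ordinary HTML comments and keeps
// IE conditional comments in place.
function strip_plain_comments(string $html): string
{
    // (?!\[if\b) skips any comment that opens with "[if";
    // the s flag lets the dot span line breaks in multi-line comments.
    return preg_replace('/<!--(?!\[if\b).*?-->/s', '', $html);
}

$html = "<p>a</p><!-- note --><!--[if lte IE 6]><p>old</p><![endif]--><p>b</p>";
echo strip_plain_comments($html);
```

The closing <![endif]--> never matches on its own because it does not contain the <!-- opener, so only the lookahead on the opening side is needed.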

  11. ilia said

    IE comments are pretty non-standard, so I doubt libtidy plans to add any support for them. The other things seem like library oversights to me. The CDATA thing may be addressed by the escape-cdata option.

  12. Stas said

    Well, IE comments aren’t described in any W3C standard, but they are de facto widely used. So it would be nice to support them.
    Changing escape-cdata doesn’t seem to eliminate the CDATA blocks for me.

  13. saumendra said

    Ilia was right in pointing out that the Tidy libraries, especially tidy-html, can do the compacting and will also comply with web standards.

  14. Stas said

    As I noted, I wasn’t able to get tidy to perform it as well as I wanted.

  15. Menatas said

    you could use these functions on your raw html code:

    (using output buffer as always) then:

    $html = str_replace("\r\n", "", $html);
    $html = str_replace("\n", "", $html);
    $html = str_replace("\t", "", $html);
    // two spaces: stripping single spaces would glue all the words together
    $html = str_replace("  ", "", $html);
    $html = preg_replace('/value="([0-9]*)"/', 'value=\1', $html);

    if your html is good then this will remove quite a lot of the extraneous stuff and return a google-like source (and I’m sure it can be tweaked further, this was just off the top of my head)

  16. Stas said

    Unfortunately, it’s not that easy – the proposed change could kill styles and Javascript, especially if things like one-line comments are in use. Also, it would not strip HTML comments.
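
One way around the problem described above is to match the sensitive elements first and hand them back untouched. This is a rough sketch, assuming script, style, pre and textarea are the only elements that need protecting. The name collapse_outside_protected is hypothetical, and the function collapses whitespace runs to one space rather than removing them, which is more conservative than the snippet in comment 15. It still does not strip HTML comments.

```php
<?php
// Hypothetical helper: collapses whitespace everywhere except inside
// script, style, pre and textarea elements.
function collapse_outside_protected(string $html): string
{
    // First alternative: a whole protected element, opening tag to closing tag.
    // Second alternative: any run of whitespace elsewhere.
    $pattern = '#<(script|style|pre|textarea)\b[^>]*>.*?</\1>|\s+#is';

    return preg_replace_callback($pattern, function (array $m): string {
        // A match that starts with "<" is a protected element: return it as is.
        return $m[0][0] === '<' ? $m[0] : ' ';
    }, $html);
}

$html = "<div>\n    <p>Hello   world</p>\n    <script>\n    // keep me\n    go();\n    </script>\n</div>";
echo collapse_outside_protected($html);
```

Because the regex engine scans left to right and tries the protected-element alternative first, whitespace inside those elements is consumed as part of the element and never reaches the whitespace branch, so one-line comments in scripts keep their line breaks.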

  17. Chris said

    This is how I do it, works well and doesn’t ruin the line breaks in pre-formatted blocks etc.

    $config = array(
        'clean' => 1,
        'bare' => 1,
        'hide-comments' => 1,
        'indent-spaces' => 0,
        'tab-size' => 1,
        'wrap' => 0,
        'preserve-entities' => 1,
        'indent' => 0,
        'break-before-br' => 0,
        'output-xhtml' => 1,
    );

    $tidy = new tidy;
    $tidy->parseString($page, $config);
    $tidy->cleanRepair();
    $page = (string) $tidy;
    $page = str_replace(">\n", '>', $page);

  18. Chris said

    Fixed a problem in my previous code, and turned it into a function.

    function compact_html($page)
    {
        $tidy_config = array(
            'clean' => 1,
            'bare' => 1,
            'hide-comments' => 1,
            'indent-spaces' => 0,
            'tab-size' => 1,
            'wrap' => 0,
            'preserve-entities' => 1,
            'indent' => 0,
            'break-before-br' => 0,
            'output-xhtml' => 1,
        );

        $tidy = new tidy;
        $tidy->parseString($page, $tidy_config);
        $tidy->cleanRepair();
        // Fold any new lines that are surrounded by a close then open tag.
        return str_replace(">\n<", '><', (string) $tidy);
    }

  19. buggedcom said

    you may want to check out this set of functions I’ve created for whitespace compression. It’s not always easy to get tidy installed on some servers. Also note that the javascript compression used does have some limitations.


    <?php

    /**
    * whitespace
    *
    * Created by Oliver Lillie on 2007-08-25.
    * Copyright (c) 2007 Buggedcom. All rights reserved.
    */

    /**
    * Strips out whitespace from the buffer to the required compression level.
    *
    * @param string $buffer
    * @param int $compression_level 1 = only compress javascript and css tags, 2 same as 1 but also compresses horizontal whitespace, 3 same as 2 but also compresses vertical whitespace whilst preserving textarea and pre tags
    * @return string
    */
    function stripHTMLWhiteSpace($buffer, $compression_level=3)
    {
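    switch($compression_level)
    {
    case 3 :
    // compress scripts and styles first, while their line breaks still
    // mark the end of single line comments
    $buffer = compressScriptAndStyleTags($buffer);
    $buffer = compressHorizontally($buffer);
    $buffer = compressVertically($buffer[0], $buffer[1]);
    $buffer = $buffer[0];
    break;
    case 2 :
    $buffer = compressScriptAndStyleTags($buffer);
    $buffer = compressHorizontally($buffer);
    // the second pass reinstates the preserved textarea and pre blocks
    $buffer = compressHorizontally($buffer[0], $buffer[1]);
    $buffer = $buffer[0];
    break;
    case 1 :
    $buffer = compressScriptAndStyleTags($buffer);
    break;
    }
    return $buffer;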
    }

    /**
    * Compresses white space horizontally (ie spaces, tabs etc) whilst preserving
    * textarea and pre content.
    * Idea and partial code borrowed from smarty.
    * http://smarty.php.net/contribs/plugins/view.php/outputfilter.trimwhitespace.php
    *
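    * @param string $data
    * @param mixed $preserved_blocks false if no textarea or pre blocks have already been taken out, otherwise an array.
    * @return array the compressed data, then the preserved blocks
    */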
    function compressHorizontally($data, $preserved_blocks=false)
    {
    $reinstate = true;
    if(!$preserved_blocks)
    {
    $reinstate = false;
    // get the textarea and pre matches
    preg_match_all("!<(textarea|pre)\\b[^>]*>.*?</\\1>!is", $data, $preserved_area_match);
    $preserved_blocks = $preserved_area_match[0];
    // replace the textarea and pre innards with markers
    $data = preg_replace("!<(textarea|pre)\\b[^>]*>.*?</\\1>!is", '@@@HTMLCOMPRESSION@@@', $data);
    }
    // remove the white space
    $data = preg_replace('/((?<!\?>)\n)[\s]+/m', '\1', $data);
    // reinsert the textareas inners
    if($reinstate)
    {
    foreach($preserved_blocks as $curr_block)
    {
    $data = preg_replace("!@@@HTMLCOMPRESSION@@@!", $curr_block, $data, 1);
    }
    }
    return array($data, $preserved_blocks);
    }

    /**
    * Compresses white space vertically (ie line breaks) whilst preserving
    * textarea and pre content.
    *
    * @param string $data
    * @param mixed $preserved_blocks false if no textarea or pre blocks have already been taken out, otherwise an array.
    * @return array the compressed data, then the preserved blocks
    */
    function compressVertically($data, $preserved_blocks=false)
    {
    $reinstate = true;
    if(!$preserved_blocks)
    {
    $reinstate = false;
    // get the textarea and pre matches
    preg_match_all("!<(textarea|pre)\\b[^>]*>.*?</\\1>!is", $data, $preserved_area_match);
    $preserved_blocks = $preserved_area_match[0];
    // replace the textarea and pre innards with markers
    $data = preg_replace("!<(textarea|pre)\\b[^>]*>.*?</\\1>!is", '@@@HTMLCOMPRESSION@@@', $data);
    }
    $data = str_replace("\n", '', $data);
    // reinsert the textarea and pre innards
    if($reinstate)
    {
    foreach($preserved_blocks as $curr_block)
    {
    $data = preg_replace("!@@@HTMLCOMPRESSION@@@!", $curr_block, $data, 1);
    }
    }
    return array($data, $preserved_blocks);
    }

    /**
    * Compresses code (ie javascript and css) whitespace.
    *
    * @param string $code
    * @return string
    */
    function compressCode($code)
    {
    // Remove multiline comment
    $mlcomment = '/\/\*(?!-)[\x00-\xff]*?\*\//';
    $code = preg_replace($mlcomment,"",$code);
    // Remove single line comment
    $slcomment = '/[^:]\/\/.*/';
    $code = preg_replace($slcomment,"",$code);
    // Remove extra spaces
    $extra_space = '/\s+/';
    $code = preg_replace($extra_space," ",$code);
    // Remove spaces that can be removed
    $removable_space = '/\s?([\{\};\=\(\)\/\+\*-])\s?/';
    $code = preg_replace($removable_space,"\\1",$code);
    return $code;
    }

    /**
    * Compresses the white space within script and style tags.
    *
    * @param string $data
    * @return string
    */
    function compressScriptAndStyleTags($data)
    {
    // match all the script and style tags together with their content
    $scripts = preg_match_all("!(<(script|style)\\b[^>]*>.*?</\\2>)!is", $data, $scriptparts);
    // collect and compress the parts
    $compressed = array();
    $parts = array();
    for($i=0; $i<count($scriptparts[0]); $i++)
    {
    array_push($parts, $scriptparts[0][$i]);
    array_push($compressed, compressCode($scriptparts[0][$i]));
    }
    // do the replacements and return
    return str_replace($parts, $compressed, $data);
    }

    <?php

    /**
    * comments
    *
    * Created by Oliver Lillie on 2007-08-25.
    * Copyright (c) 2007 Buggedcom. All rights reserved.
    */

    /**
    * Strips HTML Comments from the buffer whilst making a check to see if
    * Internet Explorer conditional comments should be stripped or not.
    *
    * @param string $buffer
    * @return string
    */
    function stripHTMLComments($buffer)
    {
    // check that the opening browser is internet explorer
    $msie = '/msie\s(.*).*(win)/i';
    $keep_conditionals = (isset($_SERVER['HTTP_USER_AGENT']) && preg_match($msie, $_SERVER['HTTP_USER_AGENT']));
    // $keep_doctype = false;
    // if(strpos($buffer, '<!DOCTYPE'))
    // {
    // $buffer = str_replace('<!DOCTYPE', '--**@@DOCTYPE@@**--', $buffer);
    // $keep_doctype = true;
    // }
    // ie conditionals are to be kept so substitute
    if($keep_conditionals)
    {
    $buffer = str_replace(array('<!--[if', '<![endif]-->'), array('--**@@IECOND-OPEN@@**--', '--**@@IECOND-CLOSE@@**--'), $buffer);
    }
    // remove comments
    $buffer = preg_replace('/<!--.*?-->/s', '', $buffer);
    // $buffer = preg_replace ('@@', '', $buffer);
    // re sub-in the conditionals if required.
    if($keep_conditionals)
    {
    $buffer = str_replace(array('--**@@IECOND-OPEN@@**--', '--**@@IECOND-CLOSE@@**--'), array('<!--[if', '<![endif]-->'), $buffer);
    }
    // if($keep_doctype)
    // {
    // $buffer = str_replace('--**@@DOCTYPE@@**--', '<!DOCTYPE', $buffer);
    // }
    // return the buffer
    return $buffer;
    }

    usage

    $text = 'etc etc html content ';
    $text = stripHTMLComments($text);
    $text = stripHTMLWhiteSpace($text, 3);
    echo $text;

  20. buggedcom said

    there are some miscellaneous php tags in there that need removing.

  21. Deepak said

    buggedcom

    How do I configure your script for output buffering etc. on the PHP server side?
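
For reference, the usual way to hook any of the routines in this thread into a page is PHP's output buffering: ob_start() accepts a callback that receives the buffered page and returns what should be sent to the browser. A minimal sketch follows. The name compact_output is hypothetical, and its body is a deliberately tame stand-in, so swap in calls such as stripHTMLComments() and stripHTMLWhiteSpace() from comment 19 if those are what you use.

```php
<?php
// Hypothetical callback: receives the whole buffered page as a string
// and returns the version that is actually sent.
function compact_output(string $buffer): string
{
    // Stand-in compaction: squeeze whitespace between adjacent tags
    // down to a single space. Replace with your own routine.
    return preg_replace('/>\s+</', '> <', $buffer);
}

// Start buffering before any output is produced.
ob_start('compact_output');

echo "<ul>\n  <li>a</li>\n</ul>";

// Flush at the end of the script; the callback runs on the buffer here.
ob_end_flush();
```

The ob_start() call has to come before the first byte of output, so it normally sits at the very top of the script or in a shared include.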

  22. Srijas said

    The htmLawed PHP script can compact HTML. htmLawed is also an HTML filter and an alternative to using HTML Tidy; no need for external library or extension to PHP.
