
Compressing HTML with PHP: Safe Compression Using Regular Expressions

PHP中文网
Release: 2016-05-25 17:11:07

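Analyzing HTML code

HTML generally consists of the following parts:

Tags, wrapped in <>

Document declarations, comments, and similar constructs

Self-closing tags <..../>

Closing tags

Content, which is not wrapped in <>; runs of extra whitespace in this part have no effect on rendering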
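The categories above can be told apart with a few anchored patterns. The following is a minimal sketch, not code from the article: the function name segment_kind and the returned labels are hypothetical, and it assumes each segment is either one complete tag or a run of text.

```php
<?php
// Hypothetical helper: classify one segment of an HTML document.
// Assumes $seg is either a single complete tag or plain text.
function segment_kind(string $seg): string
{
    // Declarations and comments start with "<!"
    if (preg_match('~^<![^>]*>$~', $seg)) {
        return 'declaration-or-comment';
    }
    // Self-closing tags end with "/>"
    if (preg_match('~^<[a-z0-9]+[^>]*/>$~i', $seg)) {
        return 'self-closing';
    }
    // Closing tags start with "</"
    if (preg_match('~^</[a-z0-9]+[^>]*>$~i', $seg)) {
        return 'closing';
    }
    // Any other tag is treated as an opening tag
    if (preg_match('~^<[a-z0-9]+[^>]*>$~i', $seg)) {
        return 'opening';
    }
    // Everything else is content
    return 'content';
}

echo segment_kind('<meta charset="utf-8">'), "\n";
echo segment_kind('</div>'), "\n";
```

Note that a tag such as <meta charset="utf-8"> is classified as an opening tag here, which is why tags written without the trailing slash need separate handling later.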

Splitting HTML with a PHP regular expression

HTML is delimited by tags, so it is enough to split the document on its tags. In PHP, use preg_split:

$segments = preg_split("/(<[^>]+?>)/si", $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
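To see what the two flags contribute, here is a small sketch on a made-up one-line input (the sample string is an assumption, not from the article). Without flags the tags are consumed as delimiters; PREG_SPLIT_DELIM_CAPTURE returns them as array items, and PREG_SPLIT_NO_EMPTY drops the empty strings that appear between adjacent tags.

```php
<?php
$html = "<p>Hi <b>there</b></p>";
$pattern = "/(<[^>]+?>)/si";

// No flags: tags vanish and empty strings appear at the edges
// and between adjacent tags.
$plain = preg_split($pattern, $html);

// Tags kept, empty strings still present.
$with_tags = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

// Tags kept, empty strings removed: the form used in this article.
$both = preg_split($pattern, $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($both);
```

Because nothing is discarded in the last form, joining the segments back together reproduces the input exactly. Whitespace-only segments are not empty strings, so they survive the split and have to be filtered separately.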

For example, the following code

$html = <<<HTML
<!doctype html>
<html>
<head>
    <title>狼魂博客</title>
    <meta charset="utf-8">
    <meta name="description" content="关注WEB,体悟生活;珍惜生命,远离代码。">
</head>
<body>
<div class="wrap">
    <a class="rss" href="http://pjiaxu.com/rss.xml">文章RSS</a>
    <a class="rss" href="http://pjiaxu.com/map.html">网站地图</a>
    <a class="rss" href="http://pjiaxu.com/archives.html">日期归档</a>
    <a class="rss" href="http://pjiaxu.com/tags.html">标签归档</a>
    <h2 id="logo"><a href="http://pjiaxu.com/" title="狼魂博客">狼魂博客</a></h2>
    <p id="webdesc">关注WEB,体悟生活;珍惜生命,远离代码。</p>
    <div class="clear"></div>
</div>
</body>
</html>
HTML;
print_r(preg_split("/(<[^>]+?>)/si",$html, -1,PREG_SPLIT_NO_EMPTY| PREG_SPLIT_DELIM_CAPTURE));

produces the following output:

Array
(
    [0] => <!doctype html>
    [1] => 
    [2] => <html>
    [3] => 
    [4] => <head>
    [5] => 
    [6] => <title>
    [7] => 狼魂博客
    [8] => </title>
    [9] => 
    [10] => <meta charset="utf-8">
    [11] => 
    [12] => <meta name="description" content="关注WEB,体悟生活;珍惜生命,远离代码。">
    [13] => 
    [14] => </head>
    [15] =>
    [16] => <body>
    [17] => 
    [18] => <div class="wrap">
    [19] => 
    [20] => <a class="rss" href="http://pjiaxu.com/rss.xml">
    [21] => 文章RSS
    [22] => </a>
    [23] => 
    [24] => <a class="rss" href="http://pjiaxu.com/map.html">
    [25] => 网站地图
    [26] => </a>
    [27] => 
    [28] => <a class="rss" href="http://pjiaxu.com/archives.html">
    [29] => 日期归档
    [30] => </a>
    [31] => 
    [32] => <a class="rss" href="http://pjiaxu.com/tags.html">
    [33] => 标签归档
    [34] => </a>
    [35] => 
    [36] => <h2 id="logo">
    [37] => <a href="http://pjiaxu.com/" title="狼魂博客">
    [38] => 狼魂博客
    [39] => </a>
    [40] => </h2>
    [41] => 
    [42] => <p id="webdesc">
    [43] => 关注WEB,体悟生活;珍惜生命,远离代码。
    [44] => </p>
    [45] => 
    [46] => <div class="clear">
    [47] => </div>
    [48] => 
    [49] => </div>
    [50] => 
    [51] => </body>
    [52] => 
    [53] => </html>
)

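The simplest, but lossy, PHP compression

The simplest compression is to concatenate all non-empty segments, stripping every whitespace character from the non-tag segments:

$compressed = array();
foreach ($segments as $seg)
{
    $seg = trim($seg);
    // Compare against '' so that a segment such as "0" is not dropped
    if ($seg !== '')
    {
        // Whitespace in non-tag segments is treated as insignificant
        $compressed[] = $seg[0] === '<' ? $seg : preg_replace('!\s!', '', $seg);
    }
}
return join('', $compressed);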

Compressing the HTML above this way yields the following (line breaks added by hand for readability):

<!doctype html><html><head><title>狼魂博客</title><meta charset="utf-8">
<meta name="description" content="关注WEB,体悟生活;珍惜生命,远离代码。"></head>
<body><div class="wrap"><a class="rss" href="http://pjiaxu.com/rss.xml">文章RSS</a>
<a class="rss" href="http://pjiaxu.com/map.html">网站地图</a>
<a class="rss" href="http://pjiaxu.com/archives.html">日期归档</a>
<a class="rss" href="http://pjiaxu.com/tags.html">标签归档</a><h2 id="logo">
<a href="http://pjiaxu.com/" title="狼魂博客">狼魂博客</a></h2>
<p id="webdesc">关注WEB,体悟生活;珍惜生命,远离代码。</p><div class="clear">
</div></div></body></html>

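Under normal circumstances this works, but there are "abnormal" cases: content whose whitespace must not be removed. The contents of script, code, pre, and style are whitespace-sensitive and must be left intact. In those cases, compression needs a stack.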
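The problem is easy to reproduce. The sketch below wraps the simple approach in a function (the name compress_simple is hypothetical) and runs it on a small pre block, assuming the same split pattern as above:

```php
<?php
// Hypothetical wrapper around the simple, lossy compression.
function compress_simple(string $html): string
{
    $segments = preg_split(
        "/(<[^>]+?>)/si",
        $html,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
    );
    $compressed = array();
    foreach ($segments as $seg) {
        $seg = trim($seg);
        if ($seg !== '') {
            // Tags are kept as-is; text loses all of its whitespace
            $compressed[] = $seg[0] === '<' ? $seg : preg_replace('!\s!', '', $seg);
        }
    }
    return join('', $compressed);
}

$html = "<pre>\n    var say = \"Hello world!\";\n    print say;\n</pre>";
echo compress_simple($html);
// prints: <pre>varsay="Helloworld!";printsay;</pre>
```

The code inside pre is no longer valid, and even the string literal has changed, which is why whitespace-sensitive elements need to be tracked while compressing.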

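Safely compressing HTML with a stack

The stack rule is: an opening tag <..> is pushed, a closing tag </..> is popped, and a self-closing tag <../> is ignored. One complication is that a <../> tag does not necessarily carry the trailing slash. For example: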

<meta>
<meta/>

Both forms are valid, so this case needs special handling:

<?php
$html = <<<HTML
<body>
<div class="wrap">
    <a class="rss" href="http://pjiaxu.com/rss.xml">文章RSS</a>
    <a class="rss" href="http://pjiaxu.com/map.html">网站地图</a>
    <a class="rss" href="http://pjiaxu.com/archives.html">日期归档</a>
    <a class="rss" href="http://pjiaxu.com/tags.html">标签归档</a>
    <h2 id="logo"><a href="http://pjiaxu.com/" title="狼魂博客">狼魂博客</a></h2>
    <p id="webdesc">关注WEB,体悟生活;珍惜生命,远离代码。</p>
    <div class="clear"></div>
</div>
<pre class="brush:php;toolbar:false">
    var say = "Hello world!";
    print say;
HTML;

// Normalize a tag name so comparisons are case-insensitive
function format_tag($tag)
{
    return trim(strtolower($tag));
}

$segments = preg_split("/(<[^>]+?>)/si", $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
$compressed = array();
$stack = array();
// Tags that have no closing tag and may be written without the trailing slash
$half_open = array('meta', 'input', 'link', 'img', 'br');
// Tags whose content must keep its whitespace
$cannot_compress = array('pre', 'code', 'script', 'style');

foreach ($segments as $seg)
{
    // True while any tag still open on the stack is whitespace-sensitive
    $preserve = count(array_intersect($stack, $cannot_compress)) > 0;
    if (!$preserve && trim($seg) === '')
    {
        continue;
    }
    if (preg_match("!<([a-z0-9]+)[^>]*?/>!si", $seg)) // <.../>
    {
        // Self-closing tag: emit it, never push it
        $compressed[] = $seg;
    }
    else if (preg_match("!</([a-z0-9]+)[^>]*?>!si", $seg, $match)) // </...>
    {
        $tag = format_tag($match[1]);
        if (count($stack) > 0 && $stack[count($stack) - 1] == $tag)
        {
            array_pop($stack);
            $compressed[] = $seg;
        }
        // Note: a closing tag that does not match the top of the stack is dropped.
        // Ideally, add a check here that repairs malformed HTML.
        // ...
    }
    else if (preg_match("!<([a-z0-9]+)[^>]*?>!si", $seg, $match)) // <...>
    {
        $tag = format_tag($match[1]);
        // Tags listed in $half_open are not pushed, even without the trailing slash
        if (!in_array($tag, $half_open))
        {
            array_push($stack, $tag);
        }
        $compressed[] = $seg;
    }
    else if (preg_match("~<![^>]*>~", $seg))
    {
        // Document declarations and comments; comments must not be deleted either
        $compressed[] = $seg;
    }
    else
    {
        // Text: keep it verbatim inside whitespace-sensitive tags, strip whitespace elsewhere
        $compressed[] = $preserve ? $seg : preg_replace('!\s!', '', $seg);
    }
}
echo join('', $compressed);

The code above produces the following output (line breaks added by hand for readability):

<body><div class="wrap"><a class="rss" href="http://pjiaxu.com/rss.xml">文章RSS</a>
<a class="rss" href="http://pjiaxu.com/map.html">网站地图</a>
<a class="rss" href="http://pjiaxu.com/archives.html">日期归档</a>
<a class="rss" href="http://pjiaxu.com/tags.html">标签归档</a><h2 id="logo">
<a href="http://pjiaxu.com/" title="狼魂博客">狼魂博客</a></h2><p id="webdesc">
关注WEB,体悟生活;珍惜生命,远离代码。</p><div class="clear"></div></div><pre class="brush:php;toolbar:false">
    var say = "Hello world!";
    print say;

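The safest HTML compression

Of course, the code above only runs correctly on the assumption that the HTML is valid and well-formed and contains no malformed markup. For ordinary applications that is not a problem. If you need to compress HTML safely, use an HTML parsing library to repair the markup first and then compress it.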
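As one possible illustration of the parser-based route, the sketch below uses PHP's bundled DOM extension. It is an assumption about how such a step could look, not the article's method: the function name compress_with_dom is hypothetical, and instead of deleting whitespace it collapses each run to a single space, which is a more conservative choice.

```php
<?php
// Hypothetical helper: repair markup with DOMDocument, then collapse
// whitespace in every text node outside whitespace-sensitive elements.
function compress_with_dom(string $html): string
{
    $doc = new DOMDocument();

    // Malformed markup triggers libxml warnings; collect them silently
    // and let the parser repair what it can.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Text nodes that are not inside pre, code, script, or style
    $query = '//text()[not(ancestor::pre or ancestor::code'
           . ' or ancestor::script or ancestor::style)]';

    foreach ($xpath->query($query) as $node) {
        // Collapse each run of whitespace to a single space
        $node->data = preg_replace('/\s+/', ' ', $node->data);
    }

    return $doc->saveHTML();
}

// The <p> is never closed; the parser is expected to repair that.
$out = compress_with_dom("<div><p>hello    world</div><pre>a   b</pre>");
echo $out;
```

saveHTML() serializes a full document, so a doctype and html/body wrappers are added around fragments. loadHTML() also makes its own assumptions about character encoding, so non-ASCII input may need an explicit charset declaration in the markup.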

source:php.cn