Detailed explanation of data collection in PHP

Two good tools for data collection (web scraping) in PHP are Snoopy and simple_html_dom. There are many ways to collect data (in essence only two or three; the rest are derivatives), and PHP ships with several built-in functions that can fetch pages directly. Still, in the spirit of being thoroughly lazy, we can use these two tools to make collection easier.
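
As a rough illustration of the built-in route, the sketch below fetches a page with file_get_contents() and a stream context. The helper names build_http_context() and fetch_page(), the user agent string and the example.com referer are assumptions made for this example. The demo call reads a data: URI so that it runs without network access.

```php
<?php
// Sketch only: build_http_context() and fetch_page() are hypothetical helpers.

// Build an HTTP stream context carrying a user agent, a referer and a timeout.
function build_http_context($userAgent, $referer, $timeoutSeconds)
{
    return stream_context_create(array(
        'http' => array(
            'method'        => 'GET',
            'user_agent'    => $userAgent,
            'header'        => "Referer: $referer\r\n",
            'timeout'       => $timeoutSeconds,
            'max_redirects' => 3,
        ),
    ));
}

// Fetch a URL and return its body, or null when the request fails.
function fetch_page($url, $context)
{
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

$context = build_http_context(
    'Mozilla/5.0 (compatible; ExampleBot/1.0)', // assumed user agent
    'http://www.example.com/',                  // assumed referer
    10
);

// Offline demonstration: a data: URI stands in for a real http:// URL.
$url  = 'data://text/plain;base64,' . base64_encode('<p>Hello!</p>');
$html = fetch_page($url, $context);

echo $html === null ? "error fetching document\n" : htmlspecialchars($html) . "\n";
```

In practice the data: URI would be replaced by the address of the page to collect. The 'http' options in the context only take effect for http:// and https:// URLs.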

There are many introductions to Snoopy on the Internet. The following is a translation of Snoopy's documentation made by others:
//////////////////////////////////////////////////////////////
Snoopy is a PHP class that simulates a web browser: it can fetch web content and submit forms.
Some features of Snoopy:
1 fetches the content of a web page (fetch)
2 fetches the text of a web page with HTML tags removed (fetchtext)
3 fetches the links and forms of a web page (fetchlinks, fetchform)
4 supports proxy hosts
5 supports basic username/password authentication
6 supports setting the user agent, referer, cookies and header content
7 supports browser redirects and can limit the redirect depth
8 expands the links it fetches into fully qualified URLs (default)
9 submits data and retrieves the returned results
10 supports following HTML frames
11 supports passing cookies on redirects
PHP 4 or above is required. Because Snoopy is a plain PHP class, it needs no extension, which makes it a good choice when the server does not support curl.
Class methods:
fetch($URI)
-----------
This method fetches the content of a web page.
The $URI parameter is the URL of the page to fetch.
The fetched result is stored in $this->results.
If the page is a frameset, Snoopy follows each frame, collects the frames in an array, and stores that array in $this->results.
fetchtext($URI)
---------------
This method is similar to fetch(). The only difference is that it strips HTML tags and other irrelevant data and returns only the text content of the page.
fetchform($URI)
---------------
This method is similar to fetch(). The only difference is that it returns only the form content (form) found on the page.
fetchlinks($URI)
----------------
This method is similar to fetch(). The only difference is that it returns only the links (link) found on the page.
By default, relative links are automatically expanded into full URLs.
submit($URI,$formvars)
----------------------
This method submits a form to the address specified by $URI. $formvars is an array that holds the form parameters.
submittext($URI,$formvars)
--------------------------
This method is similar to submit(). The only difference is that it strips HTML tags and other irrelevant data and returns only the text content of the page that comes back after the form is submitted.
submitlinks($URI,$formvars)
---------------------------
This method is similar to submit(). The only difference is that it returns only the links (link) found on the returned page.
By default, relative links are automatically expanded into full URLs.
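To make "expanded into full URLs" concrete, here is a minimal sketch of how a relative link can be resolved against the address of the page it was found on. The function expand_link() and the example.com addresses are hypothetical; this is not Snoopy's internal code, and links that consist only of a query string or a fragment are out of scope.

```php
<?php
// Sketch only: expand_link() is a hypothetical helper, not part of Snoopy.

function expand_link($base, $href)
{
    // Already absolute: the link starts with a scheme such as http: or mailto:.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link (//host/path): reuse the scheme of the base page.
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }

    // Root-relative link (/path): append it to the scheme and host.
    if (substr($href, 0, 1) === '/') {
        return $origin . $href;
    }

    // Path-relative link: resolve it against the directory of the base path.
    $basePath = isset($parts['path']) ? $parts['path'] : '/';
    $dir      = substr($basePath, 0, strrpos($basePath, '/') + 1);

    $segments = array();
    foreach (explode('/', $dir . $href) as $segment) {
        if ($segment === '..') {
            array_pop($segments);          // step up one directory
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;        // keep ordinary path segments
        }
    }

    $path = '/' . implode('/', $segments);
    if (substr($href, -1) === '/' && $path !== '/') {
        $path .= '/';                      // preserve a trailing slash
    }
    return $origin . $path;
}

$page = 'http://www.example.com/news/index.html';
foreach (array('item.html', '../img/logo.png', '/about', '//cdn.example.com/a.js') as $href) {
    echo $href, ' => ', expand_link($page, $href), "\n";
}
```

Setting $expandlinks to false, as the demo further down does, leaves the links exactly as they appear in the page source.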
Class attributes (default values are in parentheses):
$host           the host to connect to
$port           the port to connect to
$proxy_host     the proxy host to use, if any
$proxy_port     the proxy port to use, if any
$agent          the user agent to masquerade as (Snoopy v0.1)
$referer        referer information, if any
$cookies        cookies, if any
$rawheaders     other header information, if any
$maxredirs      maximum number of redirects, 0 = not allowed (5)
$offsiteok      whether to allow redirects to other sites (true)
$expandlinks    whether to expand all links into full URLs (true)
$user           authentication user name, if any
$pass           authentication password, if any
$accept         HTTP accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error          where errors are reported, if any
$response_code  response code returned by the server
$headers        header information returned by the server
$maxlength      maximum length of the returned data
$read_timeout   timeout for read operations (requires PHP 4 Beta 4+); set to 0 for no timeout
$timed_out      true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes      maximum number of frames that will be followed
$status         the HTTP status of the fetch
$temp_dir       temporary directory that the web server can write to (/tmp)
$curl_path      path to the cURL binary; set to false if there is no cURL binary
The following is a demo:

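// Load the Snoopy class (Snoopy.class.php is assumed to be on the include path)
include "Snoopy.class.php";
$snoopy = new Snoopy;

// Route the request through a proxy
$snoopy->proxy_host = "www.7767.cn";
$snoopy->proxy_port = "8080";

// Browser identity, referer, cookies and extra headers sent with the request
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.7767.cn/";
$snoopy->cookies["SessionID"] = "238472834723489";
$snoopy->cookies["favoriteColor"] = "RED";
$snoopy->rawheaders["Pragma"] = "no-cache";

// Follow at most 2 redirects, stay on the same site, leave links unexpanded
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;

// Credentials for HTTP basic authentication
$snoopy->user = "joe";
$snoopy->pass = "bloe";

// Fetch the page as plain text and print it, escaped, inside a <pre> block
if ($snoopy->fetchtext("http://www.7767.cn")) {
    echo "<pre>" . htmlspecialchars($snoopy->results) . "</pre>\n";
} else {
    echo "error fetching document: " . $snoopy->error . "\n";
}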

//////////////////////////////////////////////////////////////
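Snoopy's strength is that it is "big and complete": a single fetch retrieves everything, which makes it a good first step in collection. The next step is to use simple_html_dom to carefully pick out the parts you want. Of course, if you are very good with regular expressions and fond of them, you can also use them to match and extract the content.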
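
For readers who prefer the regular expression route, a minimal sketch with preg_match_all() might look like this. The sample markup is made up for illustration, and the pattern only covers quoted href values on a tags; real pages often need a more forgiving approach.

```php
<?php
// Sketch only: the markup below is invented sample data.
$page = '<p><a href="/news/1.html">First</a> and '
      . '<a class="x" href=\'http://www.example.com/2.html\'>Second</a></p>';

// Capture the href value of every <a> tag, accepting single or double quotes.
// Group 1 is the opening quote, group 2 the URL, \1 requires the same closing quote.
preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $page, $matches);

$links = $matches[2];
foreach ($links as $link) {
    echo $link, "\n";
}
```

The same idea works on the string Snoopy leaves in its results attribute after a fetch.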

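simple_html_dom is essentially a DOM parser. PHP also provides some built-in parsing methods, but simple_html_dom does the job quite professionally: a single class covers much of the functionality you are likely to need.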
////////////////////////////////////////////////////////////////
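// Load the library (simple_html_dom.php is assumed to be on the include path)
include 'simple_html_dom.php';
// Create a document object for the target page from a URL or a file name
$html = file_get_html('http://www.7767.cn/');
//$html = file_get_html('test.htm');
// Use a string as the target page. You can fetch a page with Snoopy and then process it here
$myhtml = str_get_html('<html><body>Hello!</body></html>');
// Find all images; find() returns an array
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';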

The find method is very handy. It usually returns an array of element objects. You can look up target elements by class, by id, or by other attributes.

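// Find a div by its class attribute. The second argument of find() is the index into the array of matches; 0 is the first match
$target_div = $html->find('div.targetclass', 0);
// To check whether the result is what you want, just echo it
echo $target_div;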
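
For comparison, the same kind of lookup can be sketched with PHP's built-in DOM extension (DOMDocument and DOMXPath) instead of simple_html_dom. The markup below is invented sample data, and the XPath expression is one common way to match a class token; treat it as an illustration rather than an exact equivalent of find().

```php
<?php
// Sketch only: sample markup; "targetclass" mirrors the class name used above.
$markup = '<html><body>'
        . '<div class="intro">Intro</div>'
        . '<div class="box targetclass"><img src="a.png"><a href="/one">One</a></div>'
        . '<div class="targetclass">Second match</div>'
        . '</body></html>';

$doc = new DOMDocument();
// Real-world pages are rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match every div whose class list contains the token "targetclass".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " targetclass ")]';
$nodes = $xpath->query($query);

// item(0) plays the role of the index 0 passed to find(): the first match.
$first = $nodes->item(0);
if ($first !== null) {
    echo $doc->saveHTML($first), "\n";

    // Images and links inside the matched div only.
    foreach ($xpath->query('.//img', $first) as $img) {
        echo $img->getAttribute('src'), "\n";
    }
    foreach ($xpath->query('.//a', $first) as $a) {
        echo $a->getAttribute('href'), "\n";
    }
}
```

The built-in route needs more setup than a single find() call, which is the convenience the article credits simple_html_dom with.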

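// One key point: once you are done with the parsed document object, be sure to destroy it. Otherwise the PHP page may hang for around 30 seconds, depending on your server's time limit. To destroy it:
$html->clear();
unset($html);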
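In my opinion, the best thing about simple_html_dom is that it makes collection as easy to control as working with JS. The download package includes an English manual at:
simplehtmldom_1_11/simplehtmldom/manual/manual.htm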

The manual also lists DOM-style method names and the simple_html_dom calls they map to:

Method                                              Mapping
array $e->getAllAttributes()                        array $e->attr
string $e->getAttribute($name)                      string $e->attribute
void $e->setAttribute($name, $value)                void $e->attribute = $value
bool $e->hasAttribute($name)                        bool isset($e->attribute)
void $e->removeAttribute($name)                     void $e->attribute = null
element $e->getElementById($id)                     mixed $e->find("#$id", 0)
mixed $e->getElementsById($id [, $index])           mixed $e->find("#$id" [, int $index])
element $e->getElementByTagName($name)              mixed $e->find($name, 0)
mixed $e->getElementsByTagName($name [, $index])    mixed $e->find($name [, int $index])
element $e->parentNode()                            element $e->parent()
mixed $e->childNodes([$index])                      mixed $e->children([int $index])
element $e->firstChild()                            element $e->first_child()
element $e->lastChild()                             element $e->last_child()
element $e->nextSibling()                           element $e->next_sibling()
element $e->previousSibling()                       element $e->prev_sibling()