A few days ago, a friend asked me to help create a program for collecting news information. I took some time to write a PHP version and recorded it in my notes.
Collection boils down to a simple pipeline: fetch the remote page -> extract the required content -> store it by category -> read it back -> display it.
It can be regarded as an enhanced version of a simple "thief program" (a page scraper).
The following is the corresponding core code (don’t use it to do bad things^_^)
The content to be collected is an announcement on a game website, as shown below:
You can first use file_get_contents and simple regular expressions to obtain basic page information
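As a rough illustration of the extraction step, here is a minimal sketch that runs the same kind of pattern against a hard-coded string. The sample markup, the titles, the href values and the helper name extractLinks are assumptions made up for this example; on the real site the string would come from file_get_contents.

```php
<?php
// Hypothetical sample markup, standing in for what file_get_contents() would return.
$html = '<ul>'
      . '<li><a title="Server maintenance" target="_blank" href="news_1.html">Server maintenance</a></li>'
      . '<li><a title="New event" target="_blank" href="news_2.html">New event</a></li>'
      . '</ul>';

/**
 * Extract title/href pairs from the list markup (hypothetical helper).
 */
function extractLinks(string $html): array
{
    // i: case-insensitive, U: ungreedy quantifiers, s: dot also matches newlines
    $pattern = '/<li><a title="(.*)" target="_blank" href="(.*)">/iUs';
    preg_match_all($pattern, $html, $matches);

    $links = [];
    // $matches[1] holds the titles and $matches[2] the hrefs, under the same keys
    foreach ($matches[1] as $key => $title) {
        $links[] = ['title' => $title, 'href' => $matches[2][$key]];
    }
    return $links;
}

$links = extractLinks($html);
print_r($links);
```

The U modifier matters here: without it, the first (.*) would greedily run to the last matching link on the page instead of stopping at the first closing quote.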
Organize the basic information and collect it into the database:
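<?php
include_once("conn.php"); // database connection (legacy mysql_* API, PHP 5.x only)

// Current list page number, passed in the query string (pages 1 to 8)
$id = isset($_GET['id']) ? (int)$_GET['id'] : 0;
if ($id >= 1 && $id <= 8) {
    // Fetch the page content
    $conn = file_get_contents("http://www.93moli.com/news_list_4_$id.html");
    // Regular expression: capture each link's title and href
    $pattern = "/<li><a title=\"(.*)\" target=\"_blank\" href=\"(.*)\">/iUs";
    preg_match_all($pattern, $conn, $arr); // matches go into the $arr array
    //print_r($arr);die;
    // $arr[1] (titles) and $arr[2] (hrefs) share the same keys, so reuse $key
    foreach ($arr[1] as $key => $value) {
        $url = "http://www.93moli.com/" . $arr[2][$key];
        // Escape both values so quotes in a title cannot break the SQL statement
        $title = mysql_real_escape_string($value);
        $url = mysql_real_escape_string($url);
        $sql = "insert into list(title,url) values ('$title', '$url')";
        mysql_query($sql);
        //echo "<a href='content.php?url=$url'>$value</a>"."<br/>";
    }
    $id++;
    echo "Collecting URL list page $id... please wait...";
    echo "<script>window.location='list.php?id=$id'</script>";
} else {
    echo "Data collection finished.";
}
?>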
conn.php is the database connection file.
list.php is this page itself, i.e. the script above.
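For the storage step, a prepared statement is a safer alternative to building the SQL string by hand. The sketch below is not the original conn.php: it assumes PDO with an in-memory SQLite database as a stand-in for the real connection, and the helper name saveLink and the sample row are made up for illustration. Only the table name list and its title and url columns come from the script above.

```php
<?php
// Assumption: an in-memory SQLite database stands in for the real connection in conn.php.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE list (title TEXT, url TEXT)');

/**
 * Insert one collected link (hypothetical helper).
 * Bound parameters keep quotes in a title from breaking the SQL.
 */
function saveLink(PDO $pdo, string $title, string $url): void
{
    $stmt = $pdo->prepare('INSERT INTO list (title, url) VALUES (:title, :url)');
    $stmt->execute([':title' => $title, ':url' => $url]);
}

// Hypothetical sample row; the title deliberately contains a single quote.
saveLink($pdo, "Player's guide", 'http://www.93moli.com/news_1.html');

$count = (int) $pdo->query('SELECT COUNT(*) FROM list')->fetchColumn();
echo $count, "\n";
```

With a MySQL database, only the DSN passed to the PDO constructor would change; the insert itself stays the same.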
Since the data to be collected is displayed across several pages, and the page addresses increase in a regular pattern, I use a JavaScript redirect and control which page is collected by passing the id value. Each request handles a single page, which also avoids running a for loop with too many iterations in one request.
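The paging scheme can be sketched without any network access. The function names listPageUrl and nextId are hypothetical, and the while loop only simulates the chain of redirects that the browser would follow. One small difference from the script above: the sketch stops right after page 8 instead of making one extra request with id=9.

```php
<?php
/**
 * Build the address of list page $id, following the regular URL pattern.
 */
function listPageUrl(int $id): string
{
    return "http://www.93moli.com/news_list_4_{$id}.html";
}

/**
 * Return the id to pass in the next redirect, or null once the last page is done.
 * The default of 8 mirrors the page limit used in the script above.
 */
function nextId(int $id, int $lastPage = 8): ?int
{
    return ($id >= 1 && $id < $lastPage) ? $id + 1 : null;
}

// Simulate the redirect chain: each pass stands for one list.php?id=N request.
$visited = [];
$id = 1;
while ($id !== null) {
    $visited[] = listPageUrl($id);
    $id = nextId($id);
}

echo implode("\n", $visited), "\n";
```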
With that, the list data is easily stored in the database. The next article will cover collecting the content from each specific URL.