The previous article, PHP-based data collection and warehousing program (2), covered collecting the list-page data for news items. This article covers collecting the content of each news item.
This is a screenshot of the final data table from the previous post:
The next step is to read each URL to be collected from the database and fetch the page it points to.
Create a new content table to hold the collected articles.
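A minimal sketch of what the content table might look like. The `title` and `content` columns come from the insert statement used later in this article. The `id` column, the column types, the `utf8mb4` charset, and the helper name `contentTableDdl()` are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns MySQL DDL for the `content` table.
// Only `title` and `content` appear in the article; everything else is assumed.
function contentTableDdl(): string
{
    return <<<SQL
CREATE TABLE IF NOT EXISTS content (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    title   VARCHAR(255) NOT NULL,
    content MEDIUMTEXT   NOT NULL,
    PRIMARY KEY (id)
) DEFAULT CHARSET=utf8mb4
SQL;
}

echo contentTableDdl(), PHP_EOL;
```

The statement can be run once from a MySQL client, or from PHP through whichever connection `conn.php` sets up.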
Note that we can no longer collect by simply incrementing the id, because the ids in the table may not be contiguous (for example, id=9 followed by id=11). When the script reaches id=10, the query returns no row, the URL is empty, and empty fields may end up being stored.
The technique used here is a database query: after collecting one record, we check whether the table contains a row with an id greater than the current one. If it does, we read that row and repeat the steps above.
The specific code is as follows:
<?php
include_once("conn.php");

$id = (int)$_GET['id'];
$sql = "select * from list where id=$id";
$result = mysql_query($sql);
$row = mysql_fetch_array($result); // get the corresponding URL

$content = file_get_contents($row['url']);
$pattern = "/<dd class=\"dataWrap\">(.*)<\/dd>/iUs";
preg_match($pattern, $content, $info); // store the matched content in $info

echo $title = $row[1]."<br/>";
echo $content = $info[0]."<hr/>";

// insert into the database; both values are escaped so that quotes in the HTML do not break the query
$add = "insert into content(title,content) value('".mysql_real_escape_string($title)."','".mysql_real_escape_string($content)."')";
mysql_query($add);

// look up the next existing id, skipping any gaps
$sql2 = "select * from list where id>$id order by id asc limit 1";
$result2 = mysql_query($sql2);
$row2 = mysql_fetch_array($result2); // get the next row, if any
if ($row2['id']) {
    echo "<script>window.location='content.php?id=$row2[0]'</script>";
}
?>
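The listing relies on the `mysql_*` functions, which were removed in PHP 7. As a rough sketch only, the same "next existing id" lookup could be written with PDO and a prepared statement. The function name `fetchNextRow()`, the `title` column name, the example.com URLs, and the in-memory SQLite database (used so the sketch runs without a MySQL server) are all assumptions for illustration.

```php
<?php
// Returns the first row of `list` whose id is greater than $afterId,
// or null when no such row exists. Gaps in the id sequence are skipped
// because the query asks for "the next id", not "id + 1".
function fetchNextRow(PDO $pdo, int $afterId): ?array
{
    $stmt = $pdo->prepare(
        'SELECT id, title, url FROM list WHERE id > ? ORDER BY id ASC LIMIT 1'
    );
    $stmt->bindValue(1, $afterId, PDO::PARAM_INT);
    $stmt->execute();
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}

// Demo data (assumed): ids 9 and 11 exist, id 10 is missing.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE list (id INTEGER PRIMARY KEY, title TEXT, url TEXT)');
$pdo->exec("INSERT INTO list (id, title, url) VALUES
    (9, 'News A', 'http://example.com/a'),
    (11, 'News B', 'http://example.com/b')");

$visited = [];
$row = fetchNextRow($pdo, 0);
while ($row !== null) {
    $visited[] = (int) $row['id'];   // a real collector would fetch $row['url'] here
    $row = fetchNextRow($pdo, (int) $row['id']);
}
echo implode(',', $visited), PHP_EOL;
```

Looping inside one script, rather than redirecting the browser after every record, is a design choice of this sketch and not part of the original approach. For a long list the redirect version has the advantage that each request stays short.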
With that, the news content we wanted has been collected and stored in the database. All that remains is to tidy up the presentation of the data.
Common technical essentials for PHP data collection:
1. Regular-expression data extraction: the key step in pulling content out of a page
2. Character-encoding conversion and analysis: compatibility handling and data validity control
3. Data warehousing and storage: storing and managing collected content, including databases, files, and progress tracking
4. Data mining and site crawling: analyzing site structure and simplifying the crawl logic to improve efficiency
5. Counter-anti-collection techniques: handling target sites that deploy anti-collection measures
6. Multi-server concurrent collection management: working methods that improve throughput
7. Data cleaning and analysis: checking for gaps and verifying the correctness and validity of the data
8. Self-identity protection: protecting one's own information
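As an illustration of the first item, here is a small sketch of regular-expression extraction that reuses the pattern from the listing in this article. The function name `extractBody()` and the sample HTML are made up for the demo. Unlike the listing, it returns the captured group `$info[1]` rather than the whole match `$info[0]`, so the surrounding `<dd>` tags are dropped.

```php
<?php
// Extracts the inner HTML of the first <dd class="dataWrap"> element,
// or returns null when the page does not contain one.
function extractBody(string $html): ?string
{
    // i: case-insensitive, U: quantifiers are lazy by default, s: dot matches newlines
    $pattern = '/<dd class="dataWrap">(.*)<\/dd>/iUs';
    if (preg_match($pattern, $html, $info) !== 1) {
        return null;
    }
    return trim($info[1]);   // $info[0] would include the <dd> tags themselves
}

// Sample page fragment, invented for the demo.
$html = "<dl><dd class=\"dataWrap\">\n<p>Hello</p>\n</dd><dd>other</dd></dl>";
var_dump(extractBody($html));
var_dump(extractBody('<p>no match</p>'));
```

Checking the return value of `preg_match` before reading `$info` avoids storing an empty record when a page does not have the expected structure.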
PHP has implode for this: $nr = implode('#',$arr) is all you need.
Note that the result is "Content 1#Content 2", without a trailing #. If you need one,
use $nr = implode('#',$arr).'#'
The clumsy way is to use a loop:
$nr = '';
foreach( $arr as $vl){
$nr .=$vl."#";
}
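A short sketch comparing the two approaches side by side. The function names `joinWithTrailing()` and `joinWithLoop()` are hypothetical and exist only for this comparison.

```php
<?php
// Joins $arr with '#' and appends a trailing '#', as in the one-liner.
function joinWithTrailing(array $arr): string
{
    return implode('#', $arr) . '#';
}

// Loop version: appends '#' after every element.
function joinWithLoop(array $arr): string
{
    $nr = '';
    foreach ($arr as $vl) {
        $nr .= $vl . '#';
    }
    return $nr;
}

$arr = ['Content 1', 'Content 2'];
echo implode('#', $arr), PHP_EOL;      // Content 1#Content 2
echo joinWithTrailing($arr), PHP_EOL;  // Content 1#Content 2#
echo joinWithLoop($arr), PHP_EOL;      // Content 1#Content 2#
```

The two versions differ only for an empty array: the one-liner returns "#", while the loop returns an empty string.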