Do programmers still read novels with advertisements?

Some people like to read novels and dip into a few chapters now and then. Most of the sites found through Baidu, though, are riddled with annoying advertisements. Some wrap the whole page div in a link, so an accidental tap jumps to another website or even into an endless redirect loop. Many mobile apps are full of ads too. With some spare time on my hands, I wrote a small program to get rid of them.

This article uses PHP curl to fetch the page and simple_html_dom to parse it, so the ads are removed for real.

Pick a book on any novel website. The site used here is especially troublesome on mobile phones because of the problems described above.

We'll use this novel as the test subject. (Disclaimer: this is not a promotion; if it infringes anyone's rights, it will be deleted.)

1. Understand the GET method of curl

curl is a command-line tool that uploads or downloads data through a specified URL and displays it. The c in curl stands for client, and URL is the address being requested.

With cURL in PHP, you can send both GET and POST requests.

Simply crawling a novel only requires GET.

The following sample code fetches the HTML of the novel's first chapter with a GET request. To fetch a different page, you only need to change the URL.

The steps are: initialization, setting options, certificate verification, execution, and closing.
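
As a minimal sketch of those five steps, a GET request could look like the code below. The helper name fetch_page, the timeout value, and the decision to throw on failure are illustrative assumptions; only the order of the steps comes from the text.

```php
<?php
// Hypothetical helper (the name is an assumption): fetch a page with a GET request.
function fetch_page(string $url, array $headers = []): string
{
    // 1. Initialization
    $ch = curl_init();

    // 2. Setting options
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers); // e.g. a browser-like User-Agent
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // assumed value, tune as needed

    // 3. Certificate verification (kept on; disabling it is only for local testing)
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);

    // 4. Execution
    $body = curl_exec($ch);
    if ($body === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('cURL request failed: ' . $error);
    }

    // 5. Closing (a no-op since PHP 8.0, kept to mirror the steps)
    curl_close($ch);

    return $body;
}
```

Calling fetch_page('https://www.7kzw.com/85/85445/27248636.html', ['User-Agent: Mozilla/5.0 ...']) and echoing the result would print the raw chapter page. The User-Agent value is shortened here; any browser-like string should do.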

The comments are detailed: follow the steps to send a curl GET request. For a POST request, you need one additional option to enable POST and another to pass the parameters. Finally, the fetched content is output. The result is the raw page, shown without any CSS styling.
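
For the POST case mentioned above, the two extra settings are CURLOPT_POST and CURLOPT_POSTFIELDS. The sketch below only builds the option array so the difference from GET is easy to see; the helper name build_post_options and the example.com endpoint are assumptions for illustration.

```php
<?php
// Hypothetical helper: the option set for a form-encoded POST request.
function build_post_options(string $url, array $fields): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,                      // switch the request to POST
        CURLOPT_POSTFIELDS     => http_build_query($fields), // the parameters to send
    ];
}

// Usage (performs a real request, so it is left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, build_post_options('https://example.com/search', ['q' => 'novel']));
// $response = curl_exec($ch);
// curl_close($ch);
```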

2. Parse the page

The fetched page contains a lot of content we don't need. To extract what we do need, such as the title and the text of each chapter, we have to parse the page.

There are many ways to parse a page; simple_html_dom is used here. Download simple_html_dom.php, include it, instantiate an object, and call its methods. For details on specific methods, see the official website or the documentation on the PHP Chinese website.

First, analyze the source of the novel page and find the elements that hold the chapter title and content.

First the title: it is in the h1 inside the element with class bookname.

Then the content: it is in the div with the id content.

simple_html_dom provides a find method that, like jQuery, uses a selector to locate elements. For example:

find('.bookname h1'); // Find the h1 title element under class bookname

find('#content'); // Find the chapter content in the element with id content
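
For comparison, the same two lookups can be written without simple_html_dom, using PHP's bundled DOMDocument and DOMXPath classes. This is a swapped-in technique, not the approach the article takes, and the function name extract_chapter is an assumption. Note that textContent returns plain text, whereas simple_html_dom's innertext returns the inner HTML.

```php
<?php
// Hypothetical alternative using the bundled DOM extension instead of simple_html_dom.
function extract_chapter(string $pageHtml): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($pageHtml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // XPath counterpart of find('.bookname h1')
    $titleNode = $xpath->query(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' bookname ')]//h1"
    )->item(0);

    // XPath counterpart of find('#content')
    $contentNode = $xpath->query("//*[@id='content']")->item(0);

    return [
        'title'   => $titleNode ? trim($titleNode->textContent) : '',
        'content' => $contentNode ? trim($contentNode->textContent) : '',
    ];
}
```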

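Adding the parsing step to the code above:

include "simple_html_dom.php";

$html = new simple_html_dom();
@$html->load($res);   // $res is the HTML string returned by the curl request

$artic = ['title' => '', 'content' => ''];

// Find the chapter title
$h1 = $html->find('.bookname h1');
foreach ($h1 as $k => $v) {
    $artic['title'] = $v->innertext;
}

// Find the novel's actual content
$content = '';
$divs = $html->find('#content');
foreach ($divs as $k => $v) {
    $content = $v->innertext;
}

// Use a regex replacement to remove the superfluous parts.
// The opening tags of this pattern were lost in the source; <p> and <div are assumed.
$pattern = "/(<p>.*?<\/p>)|(<div.*?<\/div>)/";
$artic['content'] = preg_replace($pattern, '', $content);

echo $artic['title'] . '<br/>';
echo $artic['content'];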

The find method returns an array, so foreach is used to read its elements. A regex replacement strips the text advertisements embedded in the chapter, and the title and content are stored in the $artic array. That completes the simplest version: running it prints the title followed by the cleaned chapter text.
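
The regex step can be tried on its own with a small string. The sketch below assumes the ads sit in p and div elements inside the chapter text, which is a guess about the page. The function name strip_ads and the exact pattern are illustrative, and nested div elements would need a real parser rather than a regex.

```php
<?php
// Hypothetical helper: remove <p>...</p> and <div ...>...</div> blocks from chapter HTML.
function strip_ads(string $html): string
{
    // i = case-insensitive tags, s = let "." span line breaks.
    // \1 makes the closing tag match whichever opening tag was found.
    $cleaned = preg_replace('~<(p|div)\b[^>]*>.*?</\1>~is', '', $html);

    // preg_replace returns null on a PCRE error; fall back to the input then.
    return $cleaned ?? $html;
}

$sample = 'Line one<br /><p>ad text</p>Line two<div class="ad">x</div>';
echo strip_ads($sample); // prints: Line one<br />Line two
```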

Of course, code written this way is awkward to work with; you can wrap it in functions or a class of your own. Below is an example I wrote myself. It certainly has shortcomings, but it can serve as a starting point.

<?php
// ...
echo $artic['title'] . '<br/>';
echo $artic['content'];

/**
 * Fetch the page HTML of one chapter from www.7kzw.com
 * @param int $num chapter number, starting from 1
 * @return string the page HTML
 */
function get_html($num){
    $start = 27248636;   // page ID of chapter 1
    $real_num = $num + $start - 1;
    $url = 'https://www.7kzw.com/85/85445/' . $real_num . '.html';
    $header = [
        'User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
    ];
    return mySpClass()->getCurl($url, $header);
}

/**
 * Extract the title and content of a www.7kzw.com chapter
 * @param string $get_html the chapter's page HTML
 * @return array $artic, ['title' => '', 'content' => '']
 */
function getContent($get_html){
    $html = new simple_html_dom();
    @$html->load($get_html);

    $artic = ['title' => '', 'content' => ''];

    // Find the chapter title
    $h1 = $html->find('.bookname h1');
    foreach ($h1 as $k => $v) {
        $artic['title'] = $v->innertext;
    }

    // Find the novel's actual content
    $content = '';
    $divs = $html->find('#content');
    foreach ($divs as $k => $v) {
        $content = $v->innertext;
    }

    // Use a regex replacement to remove the superfluous parts.
    // The opening tags of this pattern were lost in the source; <p> and <div are assumed.
    $pattern = "/(<p>.*?<\/p>)|(<div.*?<\/div>)/";
    $artic['content'] = preg_replace($pattern, '', $content);
    return $artic;
}
?>

To run the final example, pass the chapter number as a parameter through $_GET['n'].
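
Because the chapter number arrives from the query string, it is worth validating before it goes into a URL. The sketch below shows one way to do that. The function names parse_chapter_number and chapter_url are assumptions, and the URL mapping simply repeats the arithmetic in get_html, which in turn assumes the site's chapter page IDs are consecutive.

```php
<?php
// Hypothetical helper: accept only whole numbers >= 1, otherwise return null.
function parse_chapter_number($raw): ?int
{
    $num = filter_var($raw, FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);
    return $num === false ? null : $num;
}

// Hypothetical helper: same mapping as get_html(), where chapter 1 is page 27248636.
function chapter_url(int $num): string
{
    $start = 27248636;
    return 'https://www.7kzw.com/85/85445/' . ($num + $start - 1) . '.html';
}

// Usage in the page script (commented out so the helpers can be loaded on their own):
// $num = parse_chapter_number($_GET['n'] ?? null);
// if ($num === null) {
//     http_response_code(400);
//     exit('Please pass a chapter number, e.g. ?n=1');
// }
// $url = chapter_url($num);
```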

Summary:

Key points: curl (tip: a curl-based PHP class can collect any web page), regular expressions, and the simple_html_dom parsing tool.

Although the code has been tidied up somewhat, it works best when deployed on your own server. Otherwise you can only read on a computer, which is not very convenient, and you may find yourself preferring to put up with the ads.

That covers using PHP curl to fetch pages and simple_html_dom to parse them. For more, see other related articles on the PHP Chinese website!