To develop a crawler, first you need to know what your crawler is going to be used for. I want to use it to find articles with specific keywords on different websites and get their links so that I can read them quickly.
As a matter of habit, I start by building the interface, which helps me clarify my ideas.
1. Visit different websites, so we need a URL input box.
2. Find articles with specific keywords, so we need an article title input box.
3. Get the article links, so we need a container to display the search results.
[Interface preview; visible labels: "Article URL", "Crawl", "Article URL"]
Go straight to the code, add some styling of your own, and the interface is complete:
The next step is to implement the functionality. I wrote it in PHP. The first step is to fetch the site's HTML. There are many ways to do that and I won't go through them one by one; here I use curl, which returns the HTML when you pass in the site's URL:
    // Fetch the raw HTML of $url with cURL; curl_exec() returns false if the request fails.
    private function get_html($url){
        $ch = curl_init();
        $timeout = 10; // connection timeout in seconds
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_ENCODING, 'gzip'); // accept and decode gzip responses
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36');
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $html = curl_exec($ch);
        return $html;
    }
Once you have the HTML, you will quickly run into a problem: character encoding, which can make the matching in the next step fail. Here we convert whatever HTML we fetched to UTF-8:
    $coding = mb_detect_encoding($html);
    // Convert when the detected encoding is not UTF-8 or the bytes are not valid UTF-8.
    if ($coding != "UTF-8" || !mb_check_encoding($html, "UTF-8"))
        $html = mb_convert_encoding($html, 'utf-8', 'GBK,UTF-8,ASCII');
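To make the encoding step easier to try on its own, here is a minimal sketch of it as a standalone helper. The function name `to_utf8` and the `$fallback` parameter are my own for illustration, and it assumes the mbstring extension is loaded and that any page that is not valid UTF-8 is GBK, which is only a reasonable guess for Chinese sites:

```php
<?php
// Hypothetical helper: return $html as UTF-8.
// Assumption: input that is not valid UTF-8 is encoded as $fallback (GBK by default).
function to_utf8(string $html, string $fallback = 'GBK'): string
{
    // Valid UTF-8 (which includes plain ASCII) is returned untouched.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    // Otherwise reinterpret the bytes as the fallback encoding and convert.
    return mb_convert_encoding($html, 'UTF-8', $fallback);
}

// Usage: build a GBK-encoded sample, then normalize it back to UTF-8.
$gbk = mb_convert_encoding('中文', 'GBK', 'UTF-8');
$utf8 = to_utf8($gbk);
```

Checking validity with `mb_check_encoding` first avoids relying on `mb_detect_encoding`, which can guess wrong on short inputs. The fallback is still a guess, so pages in other encodings would need their own value.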
With the site's HTML in hand, the next step is to extract the article URLs. That means matching every a tag on the page, which calls for a regular expression. After a lot of testing I settled on a fairly reliable one: however complex the structure inside an a tag, as long as it is an a tag it will not be missed (this is the most critical step):
    $pattern = '|<a[^>]*>(.*)</a>|isU';
    preg_match_all($pattern, $html, $matches);
The matches end up in $matches, a multidimensional array that looks roughly like this:
    array(2) {
      [0]=> array(*) {
        [0]=> string(*) "the complete a tag"
        . . .
      }
      [1]=> array(*) {
        [0]=> string(*) "the content inside the a tag at the same index above"
      }
    }
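To see that array shape on real input, here is a small sketch that runs an a-tag pattern against a hand-written snippet. The sample markup is made up, and the pattern assumes the goal is to capture everything between an opening and a closing a tag:

```php
<?php
// Sample markup standing in for a fetched page (hypothetical content).
$html = <<<'HTML'
<ul>
  <li><a href="/a.html"><span>First</span> post</a></li>
  <li><A HREF="/b.html" class="x">Second
post</A></li>
</ul>
HTML;

// i: case-insensitive, s: "." also matches newlines, U: quantifiers are lazy by default.
$pattern = '|<a[^>]*>(.*)</a>|isU';
preg_match_all($pattern, $html, $matches);

foreach ($matches[0] as $i => $tag) {
    // $matches[0][$i] is the whole tag, $matches[1][$i] its inner content.
    // strip_tags() drops nested markup so only the visible link text remains.
    $text = trim(strip_tags($matches[1][$i]));
    echo $i, ': ', $text, PHP_EOL;
}
```

One caveat with a pattern like this: `<a[^>]*>` also matches other tags whose names start with "a", such as `<abbr>`, so the results may need filtering on messy pages.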
Once you have this data, everything else is straightforward: loop over the array, find the a tags you want, and read whatever attributes you need from them. Here is a class I recommend that makes working with a tags easier:
    $dom = new DOMDocument();
    @$dom->loadHTML($a); // $a holds some of the a tags obtained above
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate('//a');
    for ($i = 0; $i < $hrefs->length; $i++) {
        $href = $hrefs->item($i);
        $url = $href->getAttribute('href'); // read the a tag's href attribute
    }
Of course, this is just one approach; you can also use regular expressions to pull out the information you want and do new things with the data.
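Since the goal of this crawler is to find articles by keyword, here is a rough sketch of how the filtering could look with DOMDocument. The function name `find_article_links` and the shape of the returned array are my own assumptions, not part of the code above, and it assumes the HTML has already been converted to UTF-8:

```php
<?php
// Hypothetical helper: return the links whose visible text contains $keyword.
function find_article_links(string $html, string $keyword): array
{
    $dom = new DOMDocument();
    // The XML prolog hints that the input is UTF-8;
    // @ silences warnings caused by sloppy real-world markup.
    @$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    $xpath = new DOMXPath($dom);

    $results = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        // textContent flattens nested tags into plain text.
        $title = trim($a->textContent);
        // Case-insensitive, multibyte-safe keyword match.
        if ($title !== '' && mb_stripos($title, $keyword) !== false) {
            $results[] = ['title' => $title, 'url' => $a->getAttribute('href')];
        }
    }
    return $results;
}

// Usage with made-up markup:
$sample = '<div><a href="/1"><b>Learning</b> PHP crawlers</a>'
        . '<a href="/2">Cooking tips</a><a href="/3"></a></div>';
$links = find_article_links($sample, 'php');
```

The href values come back exactly as written in the page, so relative links such as `/1` still need to be joined with the site's base URL before they can be opened.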
Once you have the matches you want, the next step is of course to send them back to the front end for display. Write the back-end endpoint, then fetch the data with JavaScript on the front end and use jQuery to add the content to the page dynamically:
    var website_url = 'your API endpoint URL';
    $.getJSON(website_url, function(data){
        if(data){
            if(data.text == ''){
                $('#article_url').html('<p>No link found for this article</p>');
                return;
            }
            var string = '';
            var list = data.text;
            for (var j in list) {
                var content = list[j].url_content;
                for (var i in content) {
                    if (content[i].title != '') {
                        // The original HTML tags were lost; content[i].url is an assumed field name.
                        string += '<p>' + '[' + list[j].website.web_name + ']' +
                            '<a href="' + content[i].url + '">' + content[i].title + '</a>' + '</p>';
                    }
                }
            }
            $('#article_url').html(string);
        }
    });
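The front-end code above reads `data.text`, then `website.web_name` and `url_content` for each site, and `title` for each link. A back-end response in that shape could be sketched roughly like this. The function name `build_response`, the input keys `name` and `links`, and the `url` field are assumptions of mine for illustration:

```php
<?php
// Hypothetical helper: turn crawl results into the JSON the front end expects.
// Assumed input: a list of ['name' => site name, 'links' => [['title' => ..., 'url' => ...], ...]].
function build_response(array $sites): string
{
    $text = [];
    foreach ($sites as $site) {
        $text[] = [
            'website'     => ['web_name' => $site['name']],
            'url_content' => $site['links'],
        ];
    }
    // An empty string signals "nothing found", matching the data.text == '' check.
    $payload = ['text' => $text === [] ? '' : $text];
    // JSON_UNESCAPED_UNICODE keeps Chinese titles readable in the output.
    return json_encode($payload, JSON_UNESCAPED_UNICODE);
}

// Usage with made-up data:
$json = build_response([
    [
        'name'  => 'Example Blog',
        'links' => [['title' => 'PHP crawler basics', 'url' => 'https://example.com/1']],
    ],
]);
```

In a real endpoint you would send a `Content-Type: application/json` header and echo the string.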
Final rendering: