Use nodejs to implement a simple web crawler function (code attached)

青灯夜游 (reposted)
2021-02-19 17:41:38

This article shows, through examples, how to implement a simple web crawler with Node.js.

Webpage source code

Use the http.get() method to fetch the source code of a web page. The hot-rank page of the hao123 website serves as the example:

http://tuijian.hao123.com/hotrank

var http = require('http');

http.get('http://tuijian.hao123.com/hotrank', function (res) {
    var data = '';
    // Decode as UTF-8 so characters split across chunks stay intact
    res.setEncoding('utf8');
    res.on('data', function (chunk) {
        data += chunk;
    });
    res.on('end', function () {
        console.log(data);
    });
});

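The result is the HTML source of the page. Its title is:

热点排行榜-头条新闻-hao123新闻导航_hao123上网导航

The source contains ranking tables such as 电视剧 (TV series) and 综艺 (variety shows), each with the columns 排名 (rank), 关键词 (keyword), and 搜索指数 (search index).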
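A note on the `data += chunk` step in the request code above: if the chunks arrive as Buffers, each one is decoded separately, so a multi-byte character that straddles two chunks can be corrupted. Calling `res.setEncoding('utf8')` on the response, or decoding with Node's `string_decoder` module, avoids this. The sketch below simulates the problem with a hand-made split; the two-chunk split is an assumption chosen for illustration, not something observed on the hao123 page.

```javascript
const { StringDecoder } = require('string_decoder');

// Simulate a response body that is split in the middle of a multi-byte character.
const body = Buffer.from('综艺', 'utf8'); // 6 bytes, 3 per character
const chunks = [body.subarray(0, 2), body.subarray(2)];

// Naive concatenation: each Buffer is decoded on its own, so the split character is lost.
let naive = '';
for (const chunk of chunks) naive += chunk;

// StringDecoder holds back incomplete byte sequences until the rest arrives.
const decoder = new StringDecoder('utf8');
let decoded = '';
for (const chunk of chunks) decoded += decoder.write(chunk);
decoded += decoder.end();

console.log(naive === '综艺');   // false
console.log(decoded === '综艺'); // true
```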

Filter data

Take the variety show (综艺) hot list on the page as an example.

Analysis of the page source, consistent with the selectors used in the code in this article, shows that the variety show module, like every other module, sits in an element with class top-wrap. The inner div of the variety show module carries the attribute monkey="zy". The information for the module's 10 shows sits in elements with class point-bd, and each show's name sits in an element with class point-title.

cheerio

How do we extract the useful data from the source code? Node.js has no document object, so the crude approach is to process the HTML with regular expressions.
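For comparison, here is a minimal sketch of that regular-expression approach. The `sampleHtml` fragment and the `extractTitles` helper are hypothetical, made up for illustration; the real hao123 markup may differ, and a pattern like this breaks easily when attributes or nesting change.

```javascript
// Hypothetical fragment; the real hao123 markup may differ.
const sampleHtml = [
  '<div monkey="zy">',
  '  <div class="point-bd"><a class="point-title" href="#">变形计</a></div>',
  '  <div class="point-bd"><a class="point-title" href="#">来吧冠军</a></div>',
  '</div>'
].join('\n');

// Collect the text of every element whose class attribute is exactly "point-title".
function extractTitles(html) {
  const pattern = /<(\w+)[^>]*\bclass="point-title"[^>]*>([^<]*)<\/\1>/g;
  const titles = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    titles.push(match[2].trim());
  }
  return titles;
}

console.log(extractTitles(sampleHtml)); // [ '变形计', '来吧冠军' ]
```

The pattern only handles flat text inside the matched element, which is one reason a real parser is usually the better choice.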

cheerio is an implementation of the jQuery core built specifically for the server side in Node.js; it is fast and flexible. It works on a DOM model and is efficient at parsing, manipulating, and rendering markup.

【Installation】

npm install cheerio

【Usage】

Its usage closely resembles jQuery's, so it is easy to pick up. Take the names of the top 10 popular variety shows as an example:

var http = require('http');
var cheerio = require('cheerio');

http.get('http://tuijian.hao123.com/hotrank', function (res) {
    var data = '';
    // Decode as UTF-8 so characters split across chunks stay intact
    res.setEncoding('utf8');
    res.on('data', function (chunk) {
        data += chunk;
    });
    res.on('end', function () {
        filter(data);
    });
});

function filter(data) {
    // Titles of the 10 most-searched variety shows
    var result = [];
    // Load the page source into a $ object
    var $ = cheerio.load(data);
    // Find the title element of each variety show
    var temp_arr = $('[monkey = "zy"]').find('.point-bd').find('.point-title');
    // Push each title into the result array in order
    temp_arr.each(function (index, item) {
        result.push($(item).text());
    });
    // [ '变形计','来吧冠军','拜托了冰箱','昆仑决','天生是优我','姐姐好饿','脑力男人时代','奔跑吧兄弟','我想和你唱','玫瑰之旅' ]
    console.log(result);
}

Crawler code

The crawler scrapes the rankings of six sections of the hao123 page, namely 'real-time hot spots', 'today's hot spots', 'people's livelihood hot spots', 'movies', 'TV series', and 'variety shows', and stores each as an array in an object named result. The sections are identified by the codes 'ss', 'jr', 'ms', 'dy', 'dsj', and 'zy' respectively (for example, monkey="zy" for variety shows).
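As a small aside, the relationship between the six codes and the list names can be sketched as a lookup table. The pairing below is an assumption inferred from the order in which the codes and lists are named; `sections`, `sectionSelector`, and `emptyResult` are hypothetical names, not part of the original crawler.

```javascript
// Assumed mapping of section codes to list names, inferred from the order in the text.
const sections = {
  ss: '实时热点',
  jr: '今日热点',
  ms: '民生热点',
  dy: '电影',
  dsj: '电视剧',
  zy: '综艺'
};

// Build the attribute selector for one section, e.g. '[monkey="zy"]'.
function sectionSelector(code) {
  if (!Object.prototype.hasOwnProperty.call(sections, code)) {
    throw new Error('Unknown section code: ' + code);
  }
  return '[monkey="' + code + '"]';
}

// Prepare an empty result object keyed by list name, one array per section.
function emptyResult() {
  const result = {};
  for (const code of Object.keys(sections)) {
    result[sections[code]] = [];
  }
  return result;
}

console.log(sectionSelector('zy'));             // [monkey="zy"]
console.log(Object.keys(emptyResult()).length); // 6
```

A selector built this way could be passed to a cheerio `$` object to scrape one section at a time.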

[The code is as follows]

var http = require('http');
var cheerio = require('cheerio');

http.get('http://tuijian.hao123.com/hotrank', function (res) {
    var data = '';
    // Decode as UTF-8 so characters split across chunks stay intact
    res.setEncoding('utf8');
    res.on('data', function (chunk) {
        data += chunk;
    });
    res.on('end', function () {
        filter(data);
    });
});

function filter(data) {
    // Holds the top 10 names of each list:
    // keys are list names, e.g. '实时热点';
    // values are arrays of 10 titles
    var result = {};
    // Load the page source into a $ object
    var $ = cheerio.load(data);
    // Find the divs holding the six lists:
    // '实时热点', '今日热点', '民生热点', '电影', '电视剧', '综艺'
    var temp_div = $('.top-wrap');
    // List names
    var temp_title = [];
    temp_div.each(function (index, item) {
        // Find the list name and save it in the temp_title array
        temp_title.push($(item).find('h2').text());
        // Find the title elements in this list
        var temp_arr = $(item).find('.point-bd').find('.point-title');
        // Initialize this list's entry in result as an array
        var innerResult = result[temp_title[index]] = [];
        // Push each title into that list's array in order
        temp_arr.each(function (_index, _item) {
            innerResult.push($(_item).text());
        });
    });
    console.log(result);
}

[The results are as follows]

{ '实时热点':
   [ '美国逮捕女斯诺登', '成都隐秘母乳买卖', '曝周杰伦青涩旧照', '老头公交强吻女孩', '王传君恋情曝光',
     '杭州现奇葩窗口', '忘带全班准考证', '未成年持械拍网红', '9秒揍儿子8拳', '戴耳机穿轨道被撞' ],
  '今日热点':
   [ '北京回龙观大火', '选美冠军车祸身亡', '2017高考', '成都老火锅店被查', '陈浩民娇妻秀身材',
     '海边直播发现浮尸', '曝印小天遭妻骗婚', '苹果开发者大会', '6万斤鱼缺氧死亡', '安以轩夏威夷大婚' ],
  '民生热点':
   [ '北京回龙观大火', '2017高考', '成都老火锅店被查', '海边直播发现浮尸', '苹果开发者大会',
     '6万斤鱼缺氧死亡', '北控外援训练猝死', '武汉男子裸体捅人', '多国与卡塔尔断交', '美驻华外交官辞职' ],
  '电影':
   [ '神奇女侠', '异星觉醒', '新木乃伊', '中国推销员', '荡寇风云',
     '异兽来袭', '李雷和韩梅梅', '北极星', '美好的意外', '夏天19岁的肖像' ],
  '电视剧':
   [ '龙珠传奇', '楚乔传', '欢乐颂2', '欢乐颂', '职场是个技术活',
     '择天记', '美食大冒险', '废柴兄弟', '人民的名义', '三生三世十里桃花' ],
  '综艺':
   [ '变形计', '来吧冠军', '拜托了冰箱', '昆仑决', '天生是优我',
     '姐姐好饿', '脑力男人时代', '奔跑吧兄弟', '我想和你唱', '玫瑰之旅' ] }
[Finished in 0.7s]
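The fetch step in the crawler above does not handle request errors or non-200 responses. A minimal sketch of a more defensive version follows, using only Node's built-in http module. `fetchPage` is a hypothetical helper name, and treating every status other than 200 as a failure is an assumption made for simplicity (redirects, for instance, are not followed).

```javascript
const http = require('http');

// Hypothetical helper: resolve with the page body, reject on network errors
// or on any status code other than 200.
function fetchPage(url) {
  return new Promise(function (resolve, reject) {
    const req = http.get(url, function (res) {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status code: ' + res.statusCode));
        return;
      }
      res.setEncoding('utf8');
      let data = '';
      res.on('data', function (chunk) { data += chunk; });
      res.on('end', function () { resolve(data); });
      res.on('error', reject);
    });
    req.on('error', reject);
  });
}
```

With this helper, the crawler's entry point could be written as fetchPage('http://tuijian.hao123.com/hotrank').then(filter).catch(console.error), so a failed request is reported instead of silently producing no output.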

Statement:
This article is reproduced from cnblogs.com. If there is any infringement, please contact admin@php.cn to have it removed.