Garbled text caused by HTML entities when writing a web crawler in Node

WBOY

Release： 2016-06-24 11:28:22

The scraped text comes back looking like this:

&#xFFFD;&#xFFFD;&#x4BB;&#x4B3;

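I tried transcoding it with the decode function of the iconv-lite module, but that failed.

Strings like this are called HTML entities, and there are modules that convert them, for example html-entities (available on GitHub).

For an explanation of what HTML entities are, see:

http://www.w3school.com.cn/html/html_entities.asp

html-entities is used as follows: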

// Class-based API of html-entities, as used in the original post
var Entities = require('html-entities').XmlEntities;
var entities = new Entities();
var str = '&#xFFFD;&#xFFFD;&#x4BB;&#x4B3;';
console.log(entities.decode(str));
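To make clear what decoding a numeric entity involves, here is a minimal dependency-free sketch. The helper name `decodeNumericEntities` is hypothetical and this is not how html-entities is implemented; it only handles numeric references such as `&#x4BB;` or `&#1211;` and leaves named entities like `&amp;` alone.

```javascript
// Hypothetical helper: decodes only numeric character references.
// Named entities (&amp;, &lt;, ...) are deliberately left untouched.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, function (match, body) {
    var isHex = body[0] === 'x' || body[0] === 'X';
    var codePoint = parseInt(isHex ? body.slice(1) : body, isHex ? 16 : 10);
    // Leave out-of-range references as they are instead of throwing.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}

var decoded = decodeNumericEntities('&#xFFFD;&#xFFFD;&#x4BB;&#x4B3;');
console.log(decoded.length); // 4
```

Note that decoding the sample string still yields two U+FFFD replacement characters followed by U+04BB and U+04B3, so the entities are only a representation of text that was already garbled before it was entity-encoded.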


The crawler's request also needs adjusting:

var headers = {
  'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537.36'
}


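This adds a simple disguise: the request presents itself with a browser's User-Agent.

If you crawl with Node you will probably use cheerio. When you receive the page content returned by request, still run it through iconv first and then hand the result to cheerio. iconv.decode works on raw bytes, so body should be a Buffer rather than an already-decoded string (with request, that means setting encoding: null).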

var iconv = require('iconv-lite')
var cheerio = require('cheerio')
var html = iconv.decode(body, 'gbk')
var $ = cheerio.load(html, {decodeEntities: false})
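The following sketch shows one way the garbled string above can arise, using only Node built-ins. It assumes, purely for illustration, that the page contained the GBK-encoded text 下一页 (bytes CF C2 D2 BB D2 B3); other GBK strings can produce the same garbling, so this is a guess at the cause, not a reconstruction of the actual page. The helper `toHexEntities` is hypothetical.

```javascript
// Assumed sample: the GBK bytes of "下一页" (illustrative guess only).
var gbkBytes = Buffer.from([0xCF, 0xC2, 0xD2, 0xBB, 0xD2, 0xB3]);

// Wrong path: treating GBK bytes as UTF-8. Invalid sequences become
// U+FFFD, and the remaining bytes happen to form valid 2-byte characters.
var garbled = gbkBytes.toString('utf8');

// Hypothetical helper: writes every character as a hex numeric entity.
function toHexEntities(text) {
  return Array.from(text, function (ch) {
    return '&#x' + ch.codePointAt(0).toString(16).toUpperCase() + ';';
  }).join('');
}

console.log(toHexEntities(garbled)); // &#xFFFD;&#xFFFD;&#x4BB;&#x4B3;

// Right path (needs iconv-lite, so it is not executed here):
//   iconv.decode(gbkBytes, 'gbk')
```

Once the bytes have been decoded as UTF-8, the original information is gone, which would explain why running iconv on the resulting string does not help: the conversion has to be applied to the raw Buffer.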


If you do not know the encoding of the page you are crawling, check:

res.headers['content-type']

and handle the body according to the charset it reports.
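A minimal sketch of reading the charset out of that header might look like the following. The function name `charsetFromContentType` and the fallback parameter are assumptions made for illustration; servers do not always send a charset, and pages sometimes declare it only in a meta tag, which this sketch does not inspect.

```javascript
// Hypothetical helper: extracts the charset parameter from a Content-Type
// header value, e.g. "text/html; charset=GBK" -> "gbk".
// Returns `fallback` when the header is missing or has no charset.
function charsetFromContentType(contentType, fallback) {
  var match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : fallback;
}

console.log(charsetFromContentType('text/html; charset=GBK', 'utf-8')); // gbk
console.log(charsetFromContentType('text/html', 'utf-8'));              // utf-8
```

The returned name could then be passed as the second argument of iconv.decode in place of the hard-coded 'gbk'.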

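For a deeper analysis of transcoding page content and of garbled text, see the following posts:

http://www.dewen.io/q/13755

http://www.99css.com/nodejs-request-chinese-encoding/

This author's analysis is also interesting:

http://blog.vichamp.com/program/2015/07/04/Common-Messy-Code/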

source：php.cn
