Summary of garbled-data problems when crawling with Node.js

WBOY

Release： 2016-05-16 15:51:42

1. Non-UTF-8 page processing.

1. Background

Windows-1251 encoding

For example, this Russian website: https://vk.com/cciinniikk

Embarrassingly, I only just discovered this encoding.

This article mainly covers converting between Windows-1251 (cp1251) and UTF-8. Other encodings such as GBK are not considered here.

2. Solution

Use native JavaScript encoding conversion

I haven't found a way to do this yet.

Going from UTF-8 to windows-1251 is possible, though: http://stackoverflow.com/questions/2696481/encoding-conversation-utf-8-to-1251-in-javascript

var DMap = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 96: 96, 97: 97, 98: 98, 99: 99, 100: 100, 101: 101, 102: 102, 103: 103, 104: 104, 105: 105, 106: 106, 107: 107, 108: 108, 109: 109, 110: 110, 111: 111, 112: 112, 113: 113, 114: 114, 115: 115, 116: 116, 117: 117, 118: 118, 119: 119, 120: 120, 121: 121, 122: 122, 123: 123, 124: 124, 125: 125, 126: 126, 127: 127, 1027: 129, 8225: 135, 1046: 198, 8222: 132, 1047: 199, 1168: 165, 1048: 200, 1113: 154, 1049: 201, 1045: 197, 1050: 202, 1028: 170, 160: 160, 1040: 192, 1051: 203, 164: 164, 166: 166, 167: 167, 169: 169, 171: 171, 172: 172, 173: 173, 174: 174, 1053: 205, 176: 176, 177: 177, 1114: 156, 181: 181, 182: 182, 183: 183, 8221: 148, 187: 187, 1029: 189, 1056: 208, 1057: 209, 1058: 210, 8364: 136, 1112: 188, 1115: 158, 1059: 211, 1060: 212, 1030: 178, 1061: 213, 1062: 214, 1063: 215, 1116: 157, 1064: 216, 1065: 217, 1031: 175, 1066: 218, 1067: 219, 1068: 220, 1069: 221, 1070: 222, 1032: 163, 8226: 149, 1071: 223, 1072: 224, 8482: 153, 1073: 225, 8240: 137, 1118: 162, 1074: 226, 1110: 179, 8230: 133, 1075: 227, 1033: 138, 1076: 228, 1077: 229, 8211: 150, 1078: 230, 1119: 159, 1079: 231, 1042: 194, 1080: 232, 1034: 140, 1025: 168, 1081: 233, 1082: 234, 8212: 151, 1083: 235, 1169: 180, 1084: 236, 1052: 204, 1085: 237, 1035: 
142, 1086: 238, 1087: 239, 1088: 240, 1089: 241, 1090: 242, 1036: 141, 1041: 193, 1091: 243, 1092: 244, 8224: 134, 1093: 245, 8470: 185, 1094: 246, 1054: 206, 1095: 247, 1096: 248, 8249: 139, 1097: 249, 1098: 250, 1044: 196, 1099: 251, 1111: 191, 1055: 207, 1100: 252, 1038: 161, 8220: 147, 1101: 253, 8250: 155, 1102: 254, 8216: 145, 1103: 255, 1043: 195, 1105: 184, 1039: 143, 1026: 128, 1106: 144, 8218: 130, 1107: 131, 8217: 146, 1108: 186, 1109: 190}

function UnicodeToWin1251(s) {
  var L = []
  for (var i=0; i<s.length; i++) {
    var ord = s.charCodeAt(i)
    if (!(ord in DMap))
      throw "Character "+s.charAt(i)+" isn't supported by win1251!"
    L.push(String.fromCharCode(DMap[ord]))
  }
  return L.join('')
}

This is a good idea. What DMap stores is the mapping from Unicode code points to windows-1251 byte values.

So I planned to do the same thing in reverse.

Going the other way, however, I found that charCodeAt only returns Unicode (UTF-16) code units, so it cannot pull out the byte values of text in another encoding. Since I am using Node.js, I decided to use an existing module instead.
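
For reference, the reverse mapping can be done by hand once the raw bytes are available as a Buffer rather than as a string. The following is only a minimal sketch: win1251ToUnicode is a hypothetical helper, and it handles just ASCII, the letters А to я, and Ё/ё, relying on the fact that windows-1251 bytes 0xC0 to 0xFF map in order onto U+0410 to U+044F. A complete decoder would need the full inverse of DMap.

```javascript
// Hypothetical helper: decode windows-1251 bytes into a JavaScript string.
// Covers only ASCII, the letters А-я, and Ё/ё; other bytes are rejected.
function win1251ToUnicode(bytes) {
  var out = [];
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0x80) {
      out.push(String.fromCharCode(b));                 // ASCII maps to itself
    } else if (b >= 0xC0) {
      out.push(String.fromCharCode(b - 0xC0 + 0x0410)); // А (U+0410) .. я (U+044F)
    } else if (b === 0xA8) {
      out.push('\u0401');                               // Ё
    } else if (b === 0xB8) {
      out.push('\u0451');                               // ё
    } else {
      throw new Error('Byte 0x' + b.toString(16) + ' is not handled by this sketch');
    }
  }
  return out.join('');
}

// Bytes F6 E5 ED ED EE F1 F2 E8 spell "ценности" in windows-1251.
var sample = Buffer.from([0xF6, 0xE5, 0xED, 0xED, 0xEE, 0xF1, 0xF2, 0xE8]);
console.log(win1251ToUnicode(sample));
```

Working on bytes instead of string characters sidesteps the charCodeAt limitation, because a Buffer exposes each byte value directly.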

For instructions on installing and using the Node.js module iconv-lite, see https://www.npmjs.com/package/iconv-lite

According to its documentation, it can be used along these lines:

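var iconv = require('iconv-lite');
var Buffer = require('buffer').Buffer;
// Convert windows-1251 encoded data to a regular JavaScript string.
// str1 stands for the data returned by http.get, request, or a similar call.
// The request must be sent with the right options, otherwise this fails:
// besides the basic options, remember to set encoding: 'binary'.
// For example, the windows-1251 bytes of 'ценности ни в ' as a binary string:
var str1 = '\xf6\xe5\xed\xed\xee\xf1\xf2\xe8 \xed\xe8 \xe2 ';
// Turn the received data into a Buffer, again using the 'binary' encoding,
// which carries every byte through unchanged.
var buf = Buffer.from(str1, 'binary');
var str2 = iconv.decode(buf, 'win1251');
// str2 is now decoded; the result is a Unicode string by default,
// which is presumably what iconv-lite was designed for.
console.log(str2);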
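
As an alternative that avoids the 'binary' string round trip, the response can be collected as Buffer chunks and decoded in one step. The sketch below swaps iconv-lite for the built-in TextDecoder, which is an assumption on my part: it only recognizes 'windows-1251' on Node.js builds that ship full ICU data. decodeChunks is a hypothetical helper name, and the two sample chunks stand in for the 'data' events of a real response.

```javascript
// Hypothetical helper: join raw response chunks and decode them in one step.
// Assumes a Node.js build with full ICU, where TextDecoder knows 'windows-1251'.
function decodeChunks(chunks, charset) {
  var raw = Buffer.concat(chunks);              // keep the bytes untouched
  return new TextDecoder(charset).decode(raw);  // bytes -> JavaScript string
}

// Two chunks that together hold the windows-1251 bytes of "ценности".
var chunks = [
  Buffer.from([0xF6, 0xE5, 0xED, 0xED]),
  Buffer.from([0xEE, 0xF1, 0xF2, 0xE8])
];
console.log(decodeChunks(chunks, 'windows-1251'));
```

Decoding only after all chunks are joined matters for multi-byte encodings, where a character can be split across two chunks; for a single-byte encoding like windows-1251 it is simply the tidier approach.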

For instructions on installing and using the Node.js module iconv, see https://github.com/bnoordhuis/node-iconv

(In practice, the real work is getting node-gyp installed. I had not read the official instructions carefully before.)

With naive usage, the output is usually still garbled. It looks like this: пїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅ пїЅпїЅ

See http://stackoverflow.com/questions/8693400/nodejs-convertinf-from-windows-1251-to-utf-8

The solution is to read the response body with the 'binary' encoding (the default is utf-8), so that the raw bytes survive intact.
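
A likely explanation for this particular pattern, sketched below under the assumption that the body was first decoded as UTF-8: bytes that are not valid UTF-8 are replaced with U+FFFD, and U+FFFD is stored as the three bytes EF BF BD, which read as "пїЅ" in windows-1251. The last step uses the built-in TextDecoder and assumes a Node.js build with full ICU data.

```javascript
// windows-1251 bytes for "ця" (0xF6, 0xFF); neither byte is valid UTF-8.
var raw = Buffer.from([0xF6, 0xFF]);

// Decoding as UTF-8 loses the original bytes: invalid input becomes U+FFFD.
var wrong = raw.toString('utf8');
console.log(wrong.indexOf('\uFFFD') !== -1);

// U+FFFD is stored as EF BF BD in UTF-8; read as windows-1251 that is "пїЅ".
// Assumes a Node.js build with full ICU for TextDecoder('windows-1251').
var utf8Bytes = Buffer.from('\uFFFD', 'utf8');
console.log(new TextDecoder('windows-1251').decode(utf8Bytes));
```

Once the replacement has happened the original bytes are gone, which is why the conversion has to start from the untouched bytes and cannot be repaired afterwards.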

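var request = require('request');
var iconv = require('iconv');

request({
  uri: website_url,
  method: 'GET',
  encoding: 'binary'
}, function (error, response, body) {
  if (error) throw error;
  // Rebuild the raw bytes from the binary string
  body = Buffer.from(body, 'binary');
  // Convert the windows-1251 bytes to UTF-8, then to a string
  var conv = new iconv.Iconv('WINDOWS-1251', 'utf8');
  body = conv.convert(body).toString();
});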

In addition, iconv has some build dependencies. See the official node-gyp instructions: https://github.com/TooTallNate/node-gyp

So:

First, you need a supported Python version (such as 2.7).

Second, you need a compiler toolchain (most errors occur on Windows).

A typical error comes from the toolchain version: if no specific or newer version is configured, Node uses the VS2005 build tools by default, so the error message generally refers to VS2005 and the Framework SDK 2.0.

Solution:

1. Install Visual Studio 2010.

2. Specify the VS build tools version (for VS2012, use 2012).

(This is sometimes detected automatically, so the command is not always needed: npm config set msvs_version 2010 --global)

3. If it still reports that the Framework SDK cannot be found, add the SDK's installation path to the system PATH environment variable.

(VS2010 corresponds to SDK 4.0; similarly, VS2008 to SDK 3.5 and VS2012 to SDK 4.5?)

Also remember that only the first matching entry in the environment variable is used.

For example, if you previously added the SDK 2.0 path to the system environment variable and now add the SDK 4.0 path after it, only the first one takes effect.

So either delete the earlier entry, or put the path you want to add in front of it.

2. Gzip page processing

Sometimes a page loads fine in the browser, but the response to a simulated request comes back garbled. Check the response headers of the browser's request: if they include Content-Encoding: gzip, the page is most likely gzip-compressed, and you need to add the following option when making the request:

gzip:true
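
If the request is made without a library that handles this option, the same result can be approximated by hand with Node's built-in zlib module. This is only a sketch: decodeBody is a hypothetical helper, its arguments are assumed to come from the response's Content-Encoding header and the page's known charset, and the non-UTF-8 decoding relies on a Node.js build with full ICU data.

```javascript
var zlib = require('zlib');

// Hypothetical helper: undo gzip if the response header says so, then decode.
// Assumes a Node.js build with full ICU for non-UTF-8 charsets in TextDecoder.
function decodeBody(raw, contentEncoding, charset) {
  var bytes = contentEncoding === 'gzip' ? zlib.gunzipSync(raw) : raw;
  return new TextDecoder(charset || 'utf-8').decode(bytes);
}

// Simulate a gzip-compressed windows-1251 response body ("ценности").
var compressed = zlib.gzipSync(
  Buffer.from([0xF6, 0xE5, 0xED, 0xED, 0xEE, 0xF1, 0xF2, 0xE8])
);
console.log(decodeBody(compressed, 'gzip', 'windows-1251'));
```

The order matters: the body must be decompressed first, because the charset applies to the decompressed bytes and not to the gzip stream.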

That is all for this article. I hope you find it useful.