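I am crawling data from several fixed websites. For URL deduplication, should I store the URLs using the string type or the set type?
The number of URLs to store is roughly 100k to 1000k in the initial phase.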
Use a Redis set.
A set never stores the same member twice, which is exactly what URL deduplication needs.
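A minimal sketch of the set-based approach, assuming the phpredis extension's `sAdd`, which returns the number of members actually added. The key name `crawled_urls`, the helper `markIfNew`, and the in-memory `FakeSetClient` stand-in (included only so the snippet runs without a server) are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical in-memory stand-in exposing the same sAdd() shape as phpredis,
// so the sketch runs without a Redis server. Swap in a real \Redis instance.
class FakeSetClient
{
    private array $sets = [];

    // Returns 1 if the member was newly added, 0 if it was already present.
    public function sAdd(string $key, string $member): int
    {
        if (isset($this->sets[$key][$member])) {
            return 0;
        }
        $this->sets[$key][$member] = true;
        return 1;
    }
}

// Returns true only the first time a URL is seen.
// md5() keeps every member at a fixed 32 characters regardless of URL length.
function markIfNew($redis, string $url, string $key = 'crawled_urls'): bool
{
    return $redis->sAdd($key, md5($url)) === 1;
}

$redis = new FakeSetClient();
$first = markIfNew($redis, 'https://example.com/a');   // true: newly added
$second = markIfNew($redis, 'https://example.com/a');  // false: already in the set
```

Because the membership check and the insert happen in a single command, there is no gap between "check" and "mark" for another worker to slip into.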
$key = 'URL_HASH';
if (!$redis->hGet($key, md5($url))) {
    // do something ...
    // after crawling $url, mark it as seen
    $redis->hSet($key, md5($url), true);
}
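Note that if the crawler runs in multiple threads or processes, the other workers must be taken into account: two workers can both pass the hGet check before either one calls hSet, so the same URL may be crawled twice. You can change the bool value to an enumeration value that records the URL's state.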
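One possible reading of the enumeration suggestion, sketched under assumptions: each URL's hash field holds a status string, and a worker claims a URL with `hSetNx`, which in phpredis sets the field only if it does not exist yet and reports whether it did. The status names, the helpers `claimUrl` and `finishUrl`, and the in-memory `FakeHashClient` stand-in are hypothetical; the original answer does not specify them.

```php
<?php
// Hypothetical status values; the original answer only says "an enumeration value".
const STATUS_IN_PROGRESS = 'in_progress';
const STATUS_DONE = 'done';

// Hypothetical in-memory stand-in mirroring phpredis hSetNx()/hSet()/hGet(),
// so the sketch runs without a Redis server.
class FakeHashClient
{
    private array $hashes = [];

    // Sets the field only if it is absent; returns true when it was set.
    public function hSetNx(string $key, string $field, string $value): bool
    {
        if (isset($this->hashes[$key][$field])) {
            return false;
        }
        $this->hashes[$key][$field] = $value;
        return true;
    }

    // Sets the field unconditionally; returns 1 for a new field, 0 for an update.
    public function hSet(string $key, string $field, string $value): int
    {
        $isNew = !isset($this->hashes[$key][$field]);
        $this->hashes[$key][$field] = $value;
        return $isNew ? 1 : 0;
    }

    // Returns the stored value, or false when the field does not exist.
    public function hGet(string $key, string $field)
    {
        return $this->hashes[$key][$field] ?? false;
    }
}

// Claim a URL for crawling: only the first caller gets true for a given URL.
function claimUrl($redis, string $url, string $key = 'URL_HASH'): bool
{
    return (bool) $redis->hSetNx($key, md5($url), STATUS_IN_PROGRESS);
}

// Record that crawling this URL has finished.
function finishUrl($redis, string $url, string $key = 'URL_HASH'): void
{
    $redis->hSet($key, md5($url), STATUS_DONE);
}

$redis = new FakeHashClient();
$url = 'https://example.com/a';
$claimedFirst = claimUrl($redis, $url);   // this worker may crawl the URL
$claimedAgain = claimUrl($redis, $url);   // a second claim is refused
finishUrl($redis, $url);
$status = $redis->hGet('URL_HASH', md5($url));
```

With a real Redis server the claim is a single command, so two workers cannot both win it, and the stored status lets other workers tell a URL that is still being crawled from one that is finished.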