Using php crawler to analyze Nanjing housing prices
A few days ago I saw an article on csdn, using Python to write a crawler to analyze Shanghai housing prices. It feels quite interesting. It just so happens that I recently wrote about article collection in the snake backend. I will also use the PHP crawler to analyze Nanjing's housing prices. Let’s get started.
The dependency files of this crawler: The first is the CURL class of the master ares333. I am using an early version. This is the github project address of https://github.com/ares333/php-curlmulti. The curl he wrote is really awesome!
The collection uses phpQuery. If you don’t know this category, you can search it on Baidu yourself.
As for the source of the data, I chose Anjuke. The amount of data is still acceptable. I opened the Anjuke channel to Nanjing. Start analyzing their page structure. As for how to use phpQuery to analyze the page structure collection method, I will not introduce it in detail here. After analyzing the structure, OK, start building the data table. First, create an area table. House transactions are divided into sections. The section table structure is as followsCREATE TABLE `area` (<br> `id` int(11) NOT NULL AUTO_INCREMENT,<br> `name` varchar(155) NOT NULL COMMENT 'Nanjing City District',<br> `url` varchar(155) NOT NULL COMMENT 'House area connection',<br> `pid` int(2) NOT NULL COMMENT 'Category',<br> PRIMARY KEY (`id`)<br>) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I first added some district server data myself. In fact, I can collect these, because there are only a few districts with limited addresses, so I added them directly. 14 pieces of data have been added:
After the initial data is ready, you can start collecting all regional block entrance addresses. Paste the code
area.php
<?php // +---------------------------------------------------------------------- // | 采集区域脚本 // +---------------------------------------------------------------------- // | Author: NickBai <1902822973@qq.com> // +---------------------------------------------------------------------- set_time_limit(0); require 'init.php'; //根据大区信息前往抓取 $sql = "select * from `area`"; $area = $db->query( $sql )->fetchAll( PDO::FETCH_ASSOC ); foreach($area as $key=>$vo){ $url = $vo['url']; $result = $curl->read($url); $charset = preg_match("/<meta.+?charset=[^\w]?([-\w]+)/i", $result['content'], $temp) ? strtolower( $temp[1] ) : ""; phpQuery::$defaultCharset = $charset; //设置默认编码 $html = phpQuery::newDocumentHTML( $result['content'] ); $span = $html['.items .sub-items a']; $st = $db->prepare("insert into area(name,url,pid) values(?,?,?)"); foreach($span as $v){ $v = pq( $v ); //为方便分页抓取,先加入分页规则 $href = trim( $v->attr('href') ) . 'p*/#filtersort'; $st->execute([ trim( $v->text() ), $href, $vo['id']]); } }
The single piece of data collected is as follows: Baijiahu http://nanjing.anjuke.com/sale/baijiahu/p*/#filtersort Data address I have them all, and I added * to the page address, so that it can be replaced. When you open the program, you can start collecting documents on other pages under each module. The most important main program is about to begin;
Create a new hdetail table to record the collected house number information:
CREATE TABLE `hdetail` ( `id` int(11) NOT NULL AUTO_INCREMENT, `pid` int(5) NOT NULL COMMENT '区域id', `square` int(10) DEFAULT NULL COMMENT '面积', `housetype` varchar(55) DEFAULT '' COMMENT '房屋类型', `price` int(10) DEFAULT '0' COMMENT '单价', `allprice` int(10) DEFAULT '0' COMMENT '总价', `name` varchar(155) DEFAULT '' COMMENT '小区名称', `addr` varchar(155) DEFAULT '' COMMENT '小区地址', PRIMARY KEY (`id`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Now that the database is available, the main program is presented.
<?php // +---------------------------------------------------------------------- // | 采集各区具体房源信息 // +---------------------------------------------------------------------- // | Author: NickBai <1902822973@qq.com> // +---------------------------------------------------------------------- set_time_limit(0); require 'init.php'; //查询各板块数据 $sql = "select * from `area` where id > 14"; $allarea = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); ////m.sbmmt.com/页面不存在时,会跳转到首页 foreach($allarea as $key=>$vo){ $url = $vo['url']; $i = 1; while ( true ){ $urls = str_replace( "*" , $i, $url); $result = $curl->read( $urls ); if( "http://nanjing.anjuke.com/sale/" == $result['info']['url'] ){ break; } $charset = preg_match("/<meta.+?charset=[^\w]?([-\w]+)/i", $result['content'], $temp) ? strtolower( $temp[1] ) : ""; phpQuery::$defaultCharset = $charset; //设置默认编码 $html = phpQuery::newDocumentHTML( $result['content'] ); $p = $html['#houselist-mod li .house-details']; $isGet = count( $p->elements ); //未采集到内容跳出,视为结束 if( 0 == $isGet ){ break; } foreach($p as $v){ $sql = "insert into hdetail(pid,square,housetype,price,allprice,name,addr) "; $pid = $vo['id']; $square = rtrim( trim( pq($v)->find("p:eq(1) span:eq(0)")->text() ), "平方米"); $htype = trim( pq($v)->find("p:eq(1) span:eq(1)")->text() ); $price = rtrim ( trim( pq($v)->find("p:eq(1) span:eq(2)")->text() ), "元/m²"); $area = explode(" ", trim( pq($v)->find("p:eq(2) span")->text() ) ); $name = str_replace( chr(194) . chr(160), "", array_shift($area) ); //utf-8中的空格无法用trim去除,所以采用此方法 $addr = rtrim( ltrim (trim( array_pop($area) ) , "["), "]" ); $allprice = trim( pq($v)->siblings(".pro-price")->find("span strong")->text() ); $sql .= " value( ". $pid .",". $square .", '". $htype ."' ,". $price .",". $allprice .", '". $name ."' ,'". $addr ."' )"; $db->query($sql); } echo mb_convert_encoding($vo['name'], "gbk", "utf-8") . " PAGE : ". $i . PHP_EOL; $i++; } }
If you feel it is slow, you can copy a few parts of the house.php file, modify
$sql = "select * from `area` where id > 14";
Now it’s time to wait. I collected it on 8.16 and collected a total of 311,226 pieces of data. Okay, now that we have the numbers, we can start analyzing. The code I analyzed is as follows:
<?php require "init.php"; $data = unserialize( file_get_contents('./data/nj.data') ); if( empty( $data ) ){ //全南京 $sql = "select avg(price) price from hdetail"; $nanjing = intval( $db->query($sql)->fetch( PDO::FETCH_ASSOC )['price'] ); //其余数据 $data = [ $nanjing, getOtherPrice('2,3,4,5,6,7,8,10'), getOtherPrice('1'), getOtherPrice('2'), getOtherPrice('3'), getOtherPrice('4'), getOtherPrice('5'), getOtherPrice('6'), getOtherPrice('7'), getOtherPrice('8'), getOtherPrice('9'), getOtherPrice('10'), getOtherPrice('11'), getOtherPrice('12'), getOtherPrice('13') ]; //添加缓存 file_put_contents('./data/nj.data', serialize( $data )); } //均价最高TOP10 $sql = "select avg(price) price,name from hdetail GROUP BY name ORDER BY price desc limit 10"; $res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); $x = ""; $y = ""; foreach($res as $vo){ $x .= "'" . $vo['name'] . "',"; $y .= intval( $vo['price'] ). ","; } //均价最低TOP10 $sql = "select avg(price) price,name from hdetail GROUP BY name ORDER BY price asc limit 10"; $res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); $xl = ""; $yl = ""; foreach($res as $vo){ $xl .= "'" . $vo['name'] . "',"; $yl .= intval( $vo['price'] ). ","; } //交易房型数据 $sql = "select count(0) allnum, housetype from hdetail GROUP BY housetype order by allnum desc"; $res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); $htype = ""; foreach($res as $vo){ $htype .= "[ '" . $vo['housetype'] . "', " .$vo['allnum']. "],"; } $htype = rtrim($htype, ','); //交易的房屋面积数据 $square = ['50平米以下', '50-70平米', '70-90平米', '90-120平米', '120-150平米', '150-200平米', '200-300平米', '300平米以上']; $sql = "select count(0) allnum, square from hdetail GROUP BY square"; $squ = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); $p50 = 0; $p70 = 0; $p90 = 0; $p120 = 0; $p150 = 0; $p200 = 0; $p250 = 0; $p300 = 0; foreach($squ as $key=>$vo){ if( $vo['square'] < 50 ){ $p50 += $vo['allnum']; } if( $vo['square'] >= 50 && $vo['square'] < 70 ){ $p70 += $vo['allnum']; } if( $vo['square'] >= 70 && $vo['square'] < 90 ){ $p90 += $vo['allnum']; } if( $vo['square'] >= 90 && $vo['square'] < 120 ){ $p120 += $vo['allnum']; } if( $vo['square'] >= 120 && $vo['square'] < 150 ){ $p150 += $vo['allnum']; } if( $vo['square'] >= 150 && $vo['square'] < 200 ){ $p200 += $vo['allnum']; } if( $vo['square'] >= 200 && $vo['square'] < 300 ){ $p250 += $vo['allnum']; } if( $vo['square'] >= 300 ){ $p300 += $vo['allnum']; } } $num = [ $p50, $p70, $p90, $p120, $p150, $p200, $p250, $p300 ]; $sqStr = ""; foreach($square as $key=>$vo){ $sqStr .= "[ '" . $vo . "', " .$num[$key]. "],"; } //根据获取ids字符串获取对应的均价信息 function getOtherPrice($str){ global $db; $sql = "select id from area where pid in(" . $str . ")"; $city = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC ); $ids = ""; foreach($city as $v){ $ids .= $v['id'] . ","; } $sql = "select avg(price) price from hdetail where pid in (".rtrim($ids, ",").")"; $price = intval( $db->query($sql)->fetch( PDO::FETCH_ASSOC )['price'] ); return $price; } ?> <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title>南京房价分析</title> <link rel="shortcut icon" href="favicon.ico"> <link href="css/bootstrap.min.css?v=3.3.6" rel="stylesheet"> <link href="css/font-awesome.min.css?v=4.4.0" rel="stylesheet"> <link href="css/animate.min.css" rel="stylesheet"> <link href="css/style.min.css?v=4.1.0" rel="stylesheet"> </head> <body class="gray-bg"> <p class="wrapper wrapper-content"> <p class="row"> <p class="col-sm-12"> <p class="row"> <p class="col-sm-12"> <p class="ibox float-e-margins"> <p class="ibox-title"> <h5>全南京以及各区二手房均价</h5> <p class="ibox-tools"> <a class="collapse-link"> <i class="fa fa-chevron-up"></i> </a> <a class="close-link"> <i class="fa fa-times"></i> </a> </p> </p> <p class="ibox-content"> <p id="container"></p> </p> </p> </p> </p> </p> </p> <p class="row"> <p class="col-sm-6"> <p class="row"> <p class="col-sm-12"> <p class="ibox float-e-margins"> <p class="ibox-title"> <h5>均价最高的小区TOP10</h5> <p class="ibox-tools"> <a class="collapse-link"> <i class="fa fa-chevron-up"></i> </a> <a class="close-link"> <i class="fa fa-times"></i> </a> </p> </p> <p class="ibox-content"> <p id="avgpriceh"></p> </p> </p> </p> </p> </p> <p class="col-sm-6"> <p class="row"> <p class="col-sm-12"> <p class="ibox float-e-margins"> <p class="ibox-title"> <h5>均价最低的小区TOP10</h5> <p class="ibox-tools"> <a class="collapse-link"> <i class="fa fa-chevron-up"></i> </a> <a class="close-link"> <i class="fa fa-times"></i> </a> </p> </p> <p class="ibox-content"> <p id="avgpricel"></p> </p> </p> </p> </p> </p> </p> <p class="row"> <p class="col-sm-6"> <p class="row"> <p class="col-sm-12"> <p class="ibox float-e-margins"> <p class="ibox-title"> <h5>交易房型比例</h5> <p class="ibox-tools"> <a class="collapse-link"> <i class="fa fa-chevron-up"></i> </a> <a class="close-link"> <i class="fa fa-times"></i> </a> </p> </p> <p class="ibox-content"> <p id="htype"></p> </p> </p> </p> </p> </p> <p class="col-sm-6"> <p class="row"> <p class="col-sm-12"> <p class="ibox float-e-margins"> <p class="ibox-title"> <h5>交易房屋面积比例</h5> <p class="ibox-tools"> <a class="collapse-link"> <i class="fa fa-chevron-up"></i> </a> <a class="close-link"> <i class="fa fa-times"></i> </a> </p> </p> <p class="ibox-content"> <p id="square"></p> </p> </p> </p> </p> </p> </p> </p> <script type="text/javascript" src="js/jquery.min.js?v=2.1.4"></script> <script type="text/javascript" src="js/bootstrap.min.js?v=3.3.6"></script> <script type="text/javascript" src="http://cdn.hcharts.cn/highcharts/highcharts.js"></script> <script type="text/javascript"> $(function () { $('#container').highcharts({ chart: { type: 'column' }, title: { text: '全南京以及各区二手房均价' }, subtitle: { text: '来源于安居客8.16的数据' }, xAxis: { categories: ['全南京','江南八区','江宁区','鼓楼区','白下区','玄武区','建邺区','秦淮区','下关区','雨花台区','浦口区','栖霞区','六合区', '溧水区','高淳区','大厂'], crosshair: true }, yAxis: { min: 0, title: { text: '元/m²' } }, tooltip: { headerFormat: '<span style="font-size:10px">{point.key}</span><table>', pointFormat: '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' + '<td style="padding:0"><b>{point.y:.1f} 元/m²</b></td></tr>', footerFormat: '</table>', shared: true, useHTML: true }, plotOptions: { column: { pointPadding: 0.2, borderWidth: 0, dataLabels:{ enabled:true// dataLabels设为true } } }, series: [{ name: '平均房价', data: [<?php echo implode(',', $data); ?>] }] }); //均价最高top10 $('#avgpriceh').highcharts({ chart: { type: 'column' }, title: { text: '均价最高的小区TOP10' }, subtitle: { text: '来源于安居客8.16的数据' }, xAxis: { categories: [<?=$x; ?>], crosshair: true }, yAxis: { min: 0, title: { text: '元/m²' } }, tooltip: { headerFormat: '<span style="font-size:10px">{point.key}</span><table>', pointFormat: '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' + '<td style="padding:0"><b>{point.y:.1f} 元/m²</b></td></tr>', footerFormat: '</table>', shared: true, useHTML: true }, plotOptions: { column: { pointPadding: 0.2, borderWidth: 0, dataLabels:{ enabled:true// dataLabels设为true } } }, series: [{ name: '平均房价', data: [<?=$y; ?>] }] }); //均价最低top10 $('#avgpricel').highcharts({ chart: { type: 'column' }, title: { text: '均价最低的小区TOP10' }, subtitle: { text: '来源于安居客8.16的数据' }, xAxis: { categories: [<?=$xl; ?>], crosshair: true }, yAxis: { min: 0, title: { text: '元/m²' } }, tooltip: { headerFormat: '<span style="font-size:10px">{point.key}</span><table>', pointFormat: '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' + '<td style="padding:0"><b>{point.y:.1f} 元/m²</b></td></tr>', footerFormat: '</table>', shared: true, useHTML: true }, plotOptions: { column: { pointPadding: 0.2, borderWidth: 0, dataLabels:{ enabled:true// dataLabels设为true } } }, series: [{ name: '平均房价', data: [<?=$yl; ?>] }] }); // Radialize the colors Highcharts.getOptions().colors = Highcharts.map(Highcharts.getOptions().colors, function (color) { return { radialGradient: { cx: 0.5, cy: 0.3, r: 0.7 }, stops: [ [0, color], [1, Highcharts.Color(color).brighten(-0.3).get('rgb')] // darken ] }; }); //房型类型 $('#htype').highcharts({ chart: { plotBackgroundColor: null, plotBorderWidth: null, plotShadow: false }, title: { text: '交易的二手房型比例' }, tooltip: { pointFormat: '{series.name}: <b>{point.percentage:.1f}%</b>' }, plotOptions: { pie: { allowPointSelect: true, cursor: 'pointer', dataLabels: { enabled: true, format: '<b>{point.name}</b>: {point.percentage:.1f} %', style: { color: (Highcharts.theme && Highcharts.theme.contrastTextColor) || 'black' }, connectorColor: 'silver' } } }, series: [{ type: 'pie', name: 'Browser share', data: [ <?=$htype; ?> ] }] }); //房型面积类型 $('#square').highcharts({ chart: { plotBackgroundColor: null, plotBorderWidth: null, plotShadow: false }, title: { text: '交易的二手房面积比例' }, tooltip: { pointFormat: '{series.name}: <b>{point.percentage:.1f}%</b>' }, plotOptions: { pie: { allowPointSelect: true, cursor: 'pointer', dataLabels: { enabled: true, format: '<b>{point.name}</b>: {point.percentage:.1f} %', style: { color: (Highcharts.theme && Highcharts.theme.contrastTextColor) || 'black' }, connectorColor: 'silver' } } }, series: [{ type: 'pie', name: 'Browser share', data: [ <?=$sqStr; ?> ] }] }); }); </script> </body> </html>
The page effect is as follows:
Haha, the housing prices are really scary, second-hand houses are already at this price. If there is any interesting information, you can discover it yourself.