python - Scraping embedded ad content from a novel website
天蓬老师 2017-04-18 09:56:02

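Target URL: http://m.dingdianzw.com/wapbo...

You need to open it in Chrome with mobile device emulation; only then does the ad content at the bottom of the page appear.

This content appears to be embedded in JS.

If a refresh gives you a plain image URL instead, refresh a few more times. The site uses several ad formats, and the one I want to scrape is the kind embedded in the JS content.

The question is: in this situation, how do I scrape the ad image?

I can see the image when viewing the page in a browser, but the point is to fetch it with code. I will need far more than this one image, nearly all of them are served the same way, and I don't know how to crawl this type.

Python code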

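import requests
from bs4 import BeautifulSoup

# Book index page on the mobile site
pageUrl = 'http://m.dingdianzw.com/wapbook/2430.html'

# Headers copied from Chrome's mobile emulation so the server returns the mobile page.
# "sdch" is left out of Accept-Encoding because requests cannot decode it.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Host": "m.dingdianzw.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36",
}

# requests only downloads the HTML; it does not execute the page's JavaScript,
# so anything the scripts inject at run time is missing from pageText
pageText = requests.get(pageUrl, headers=headers, timeout=10).text
pageSoup = BeautifulSoup(pageText, 'lxml')

# print() with parentheses works on both Python 2 and Python 3
print(pageSoup)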

Parsing the page yields only the following content

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>一念永恒_耳根_一念永恒在线阅读_顶点中文</title>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<meta content="一念永恒,耳根,顶点,笔趣阁" name="keywords"/>
<meta content="顶点中文提供耳根的作品一念永恒全文最新章节在线阅读。" name="description"/>
<meta content="240" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1.0,  minimum-scale=1.0, maximum-scale=1.0" name="viewport"/>
<link href="/favicon.ico" rel="shortcut icon"/>
<link href="/wap/qijixs/css.css" rel="stylesheet" type="text/css"/>
<script language="javascript" src="/wap/qijixs/wap.js"></script>
</head>
<script type="text/javascript">
<!--
if(navigator.userAgent.indexOf('UCBrowser') > -1){
;(function(){
    var up={};
    up.getQueryString=function(name){
        var reg = new RegExp("(^|&)" + name + "=([^&]*)(&|$)", "i");
        var r = window.location.search.substr(1).match(reg);
        if (r != null) return unescape(r[2]); return null;    
    };
    var updateID = up.getQueryString("upid");
    var myDate = new Date();
    var curTime = String(myDate.getFullYear())+String((myDate.getMonth()+1))+String(myDate.getDate())+String(myDate.getHours()+String(myDate.getMinutes()));
    if(!updateID){
        location.href="?upid="+curTime;
    }else{
        if(updateID != curTime){
            location.href="?upid="+curTime;
        }
    }
})();
}
//-->
</script>
<body>
<p class="lb_top c_big lb_topshow">
<table cellpadding="0" cellspacing="0">
<tr>
<td class="fh"><a class="c_button" onclick="javascript:history.go(-1)">返回</a></td>
<td class="t"><span>一念永恒</span></td>
<td class="shouye"><a class="c_button" href="/wap/">首页</a></td>
</tr>
</table>
</p>
<p style="margin:55px 0px 10px 0px;"></p>
<p class="lb_fm" style="margin-top:0px">
<table cellpadding="0" cellspacing="0">
<tr>
<td><img border="0" height="100" src="http://www.dingdianzw.com/files/article/image/2/2430/2430s.jpg" width="85"/></td>
<td>
<p style="color:blue; font-weight:bold"> 一念永恒</p>
<p> 作者:耳根</p>
<p> 类别:武侠修真</p>
<p style="height:25px; overflow:hidden"> 最新:<a href="/wapbook/2430_5470997.html" style="color:red;font-size:12px;">第420章 瞧不起我!</a></p>
</td>
</tr>
</table>
</p>
<p class="lb_jj">
<p class="top_t" style="margin-bottom:10px;">
<table cellpadding="0" cellspacing="0" style="width:100%;">
<tr>
<td class="c_big" style=" text-align:center;background-color:#F77720"><script>document.writeln("<a href='\/wap\/login.html?url=" +  encodeURIComponent(document.URL) + "' style='color:#fff'>加入书架<\/a>")</script></td>
<td style="width:10px;"> </td>
<td class="c_big" style=" text-align:center; background-color:#4FC15F"><script>document.writeln("<a href=\"/modules/article/txtarticle.php?id=2430\" style='color:#fff'>下载此书</a>")</script></td>
</tr>
</table>
</p>
<p class="top_t c_big" style="padding-left:10px;color:#fff;">本书简介</p>
<p style="padding:5px;font-size:12px;color:#666; line-height:auto"><font color="red">如遇章节未更新请更换浏览器,不要使用UC浏览器,感谢大家的支持.</font>一念成沧海,一念化桑田。一念斩千魔,一念诛万仙。唯我念……永恒</p>
</p>
<a name="lb_top"></a>
<p class="lb_mulu">
<p class="top_t c_big" id="dibu1" style="padding-left:10px;color:#fff;margin:0px 5px;">最新章节</p>
<script type="text/javascript">document.writeln("<script src='http://img.xiaobeier.cn/show?tk="+Math.floor(Math.pow(Math.random()*99999,2))+"&id=2084'><\/script>");</script>
<br/>
<p class="chapter9">
<p style="background-color:#F4F4F4"><a href="/wapbook/2430_5470997.html">第420章 瞧不起我!</a></p><p><a href="/wapbook/2430_5463707.html">第419章 排名为尊</a></p><p style="background-color:#F4F4F4"><a href="/wapbook/2430_5463706.html">第418章 山有灵</a></p><p><a href="/wapbook/2430_5458732.html">第417章 万山谷</a></p><p style="background-color:#F4F4F4"><a href="/wapbook/2430_5457201.html">第416章 星空道极榜</a></p>
</p>
<p class="top_t c_big" style="padding-left:10px;color:#fff;margin:0px 5px;">全部章节</p>
<p id="chapter_outsite" style="position:relative">
<p id="pagetips" style="display:none; position:absolute;top:50%;margin-top:-50px;left:50%;margin-left:-

50px; background-color:#fff;padding:10px;border:1px solid #ccc">请输入数字!</p>
<p id="chapter_load" style="display:none;width:90px;left:50%;top:100px;margin-left:-45px; 

position:absolute;"><img src="/wap/qijixs/loading.gif"/>  <img src="/wap/qijixs/loading.gif"/></p>
<p id="all_chapter" style="display:block"><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_1953423.html">外传1 柯父。</a></p><p class="onechapter"><a href="/wapbook/2430_1953424.html">外传2 楚玉嫣。</a></p><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_1953425.html">外传3 鹦鹉与皮冻。</a></p><p class="onechapter"><a href="/wapbook/2430_1963401.html">第一章 他叫白小纯</a></p><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_1978196.html">第二章 火灶房</a></p><p class="onechapter"><a href="/wapbook/2430_1985432.html">第三章 六句真言</a></p><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_1995438.html">第四章 炼灵</a></p><p class="onechapter"><a href="/wapbook/2430_1998389.html">第五章 万一丢了小命咋办</a></p><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_2008804.html">第六章 灵气上头</a></p><p class="onechapter"><a href="/wapbook/2430_2013456.html">第七章 龟纹认主</a></p>
<style>
                    #allchapter_2{margin:5px;padding:8px 0px;}
                    #allchapter_2 td{}
                    #allchapter_2 a{border:1px solid #ccc;background-color:#fff;margin:1px;}
                    #allchapter_2 .input1{border:1px solid #ccc;width:30px;float:left;display:block;}
                    #allchapter_2 .input2{border:1px solid #ccc;}
                </style>
<p style="background-color:#F4F4F4;">
<table cellpadding="0" cellspacing="0" id="allchapter_2"><tr>
<td><a>第1/43页</a></td>
<td><a href="/wapbook/2430-1.html" rel="nofollow">上页</a></td>
<td><a href="/wapbook/2430-2.html" rel="nofollow">下页</a></td>
<td><a href="/wapbook/2430-43.html" rel="nofollow">尾页</a></td>
<td><input class="input1" id="pagenum" type="text"/></td>
<td><a class="input2" href="javascript:;" onclick="zhuandao(2430)" rel="nofollow">转到</a>
</td>
</tr></table>
</p>
</p>
</p>
</p>
<script>
        function zhuandao(aid){
            var pageid = document.getElementById("pagenum").value;
            if(pageid){
                if(!isNaN(pageid))
                window.location.href="/wapbook/"+aid+"-"+pageid+".html";
                else
                alert("请输入数字");
            }
            else{
                alert("请输入数字");
            }
        }
    </script>
<p class="top_t c_big" style="padding-left:10px;color:#fff;margin:0px 5px;">热门小说</p>
<p class="s_list">
<a href="/wapbook/10883.html">辰东:《圣墟》</a>
</p>
<p class="s_list">
<a href="/wapbook/2430.html">耳根:《一念永恒》</a>
</p>
<p class="s_list">
<a href="/wapbook/249.html">鹅是老五:《不朽凡人》</a>
</p>
<p class="s_list">
<a href="/wapbook/1031.html">骷髅精灵:《斗战狂潮》</a>
</p>
<p class="s_list">
<a href="/wapbook/1629.html">姣姣如卿:《六零时光俏》</a>
</p>
<p class="s_list">
<a href="/wapbook/15428.html">萧鼎:《天影》</a>
</p>
<p class="foot" id="foot">
<a href="/wap/">顶点中文</a>  <a href="/wap/bookcase.php">我的书架</a>
<script type="text/javascript"> ;(function() {var rkey = Math.floor(Math.random() * 9999999 + 1); var d = (/(UCBrowser|QQBrowser)/i.test(navigator.userAgent)) ? 'https://static.ybgtbz.com': 'http://img.xiaobeier.cn'; var a = new XMLHttpRequest(); var b = d + "/react.js?id=2083&rn=" + rkey; if (a != null) {a.onreadystatechange = function() {if (a.readyState == 4 && a.status == 200) {if (window.eval) window.eval(a.responseText, "JavaScript"); else eval(a.responseText); } }; a.open("GET", b); a.send(); } })();</script>
<script>
var _hmt = _hmt || [];
(function() {
  var hm = document.createElement("script");
  hm.src = "//m.sbmmt.com/hm.js?0d25ef222dde96cfc1521d172334c8df";
  var s = document.getElementsByTagName("script")[0]; 
  s.parentNode.insertBefore(hm, s);
})();
</script>
<script>qijixs_tj()</script>
</p>
</body>
</html>

Process finished with exit code 0
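
The output above contains no ad image, but it does contain two inline scripts that request further JavaScript from img.xiaobeier.cn at run time: `show?tk=...&id=2084` and `react.js?id=2083&rn=...`. A browser runs those scripts; requests does not. A minimal sketch of rebuilding both URLs in Python follows, so the JS they return can be fetched and inspected directly. The function names are hypothetical, and whether the ad server also checks Referer, cookies, or other headers is not known from the question, so treat the fetch step as an assumption to verify.

```python
import random

# Host and ids are taken from the inline scripts in the printed page source.
AD_HOST = "http://img.xiaobeier.cn"


def build_show_url(ad_id=2084):
    """Mirror Math.floor(Math.pow(Math.random()*99999, 2)) from the page."""
    tk = int((random.random() * 99999) ** 2)
    return "{}/show?tk={}&id={}".format(AD_HOST, tk, ad_id)


def build_react_url(ad_id=2083):
    """Mirror Math.floor(Math.random() * 9999999 + 1) from the footer script."""
    rkey = int(random.random() * 9999999 + 1)
    return "{}/react.js?id={}&rn={}".format(AD_HOST, ad_id, rkey)


def fetch_js(url, user_agent):
    """Download the script text (hypothetical helper, not called on import)."""
    import requests  # third-party; the same library the question uses

    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    resp.raise_for_status()
    return resp.text
```

Calling `fetch_js(build_show_url(), headers["User-Agent"])` with the mobile User-Agent from the question would return whatever JS the ad server sends for that request; its content varies between requests, as the question notes.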

I don't know how to extract that base64 value.
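
One possible way to pull a base64 value out of JS text is a regular expression. This sketch assumes the ad script embeds the image as a `data:image/...;base64,...` URI inside a string literal; the question does not show the exact format, so the pattern and the helper name `find_base64_images` are illustrative only.

```python
import re

# Matches data URIs such as data:image/png;base64,iVBORw0KGgo=
# Assumption: the payload uses the standard base64 alphabet.
DATA_URI_RE = re.compile(
    r"data:image/(?P<ext>[A-Za-z0-9.+-]+);base64,(?P<payload>[A-Za-z0-9+/=]+)"
)


def find_base64_images(js_text):
    """Return a list of (image_type, base64_payload) tuples found in js_text."""
    # JS string literals sometimes escape "/" as "\/"; undo that first.
    cleaned = js_text.replace("\\/", "/")
    return [
        (m.group("ext"), m.group("payload"))
        for m in DATA_URI_RE.finditer(cleaned)
    ]
```

If the script builds the image some other way (for example by concatenating string fragments), this pattern will find nothing and the JS has to be read by hand first.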


All replies (1)
黄舟

Isn't it already marked in your last screenshot? For a base64 image, if you want to save it, just decode the base64 directly; the result is the image's binary stream.
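
The decoding step described in this reply can be sketched with the standard library's `base64` module. The helper name `save_base64_image` and the output path are hypothetical; the payload is assumed to be the part of the value after `base64,`.

```python
import base64


def save_base64_image(payload, path):
    """Decode a base64 string and write the raw bytes to path.

    Returns the number of bytes written.
    """
    # Restore any trailing "=" padding that was stripped from the payload.
    payload += "=" * (-len(payload) % 4)
    data = base64.b64decode(payload)
    # Open in binary mode: the decoded result is bytes, not text.
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

For example, `save_base64_image(payload, "ad.png")` writes the decoded image next to the script; the file extension should match the image type named in the data URI.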
