When I crawled a web page, I noticed that its page turning was implemented by such a function. After turning the page, the page URL did not change:
<input class="buttonJump" name="goto2" onclick="dirGroupMblogToPage(document.getElementById('dirGroupMblogcp2').value)" type="button" value="Go"/>
</input>
function dirGroupMblogToPage(currentPage){
jQuery.post("dirGroupMblog.action", {"page.currentPage":currentPage,gid:MI.TalkBox.gid}, function(data){$("#talkMain").html(data);
window.scrollTo(0, $css.getY(MI.talkList._body)-65);
});
}
Written a function like this to try to achieve page turning:
def login_page(login_url, content_url, usr_name="******@126.com", passwd="******"):
# 实现登录, 返回Session对象和获得的页面
post_data = {'r': 'on', 'u': usr_name, 'p': passwd}
s = requests.Session()
s.post(login_url, post_data)
r = s.get(content_url)
return s, r
def turn_page(s, next_page, content_url):
post_url = "http://sns.icourses.cn/dirGroupMblog.action"
post_headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
"X-Requested-With":"XMLHttpRequest"}
post_data = {"page.currentPage": next_page, "gid": 2632}
s.post(post_url, data=post_data, headers = post_headers)
res = s.get(content_url)
return res
But page turning failed after calling turn_page(). How should we solve this problem? Also, what kind of knowledge do we need to learn by ourselves to solve this kind of problem? Thank you!
Recommended to use selenium
For example, if you need to click the next page button on the interface, or you need to enter the up, down, left, and right keys, the page can be turned, selenium webdriver can do it, and give a reference (I used to crawl the novels on Qidian Chinese website )
Selenium can interact with the page, click, double-click, enter, wait for the page to load (implicit wait, and explicit wait). . . .
There are several situations,
1. The page can be turned by sliding or clicking through the js effect;
2. The page can be turned by clicking on the hyperlink;
You can use the network analysis in Chrome's developer tools to get the result, whether it is an html page or a feedback json rendering.
json is easier to handle, just get the result directly. Ordinary html pages need to use regular matching to page breaks. Then put the link into the pool to be crawled.
/a/11...