Python Instant Web Crawler: API Description

高洛峰

Released: 2016-11-22 16:24:15

API description - Download gsExtractor content extractor

1, Interface name

Download content extractor

2, Interface description

If you write a web crawler, you will find that most of the time goes into debugging the content extraction rules for web pages. Leaving aside how odd regular expression syntax is, even with XPath you have to write and debug the expressions one by one.

If you want to extract many fields from a web page, debugging the XPath expressions one by one is very time-consuming. Through this interface you can directly download an extractor script that has already been debugged. The script is a standard XSLT program: you only need to run it against the DOM of the target web page to get the results in XML format, with all fields extracted in one pass.

This XSLT extractor can be one you generated yourself with the MS Mooshutai software, or one that someone else has shared with you. As long as you have read permission, you can download and use it.

In web crawlers used for data analysis and data mining, the content extractor is the main obstacle to generality, because it has to be written for each target site. If the extractor is obtained from the API instead, the crawler itself can be written as a general-purpose framework.

3, Interface specification

3.1, Interface address (URL)

http://www.gooseeker.com/api/getextractor

3.2, Request type (contentType)

No limit

3.3, Request method

HTTP GET

3.4, Request parameters

key Required: Yes; Type: String; Description: The AppKey assigned when you applied for the API

theme Required: Yes; Type: String; Description: Extractor name, that is, the rule name defined with MS Mooshutai

middle Required: No; Type: String; Description: Rule number. Must be filled in if multiple rules are defined under the same rule name

bname Required: No; Type: String; Description: Sorting box name. Must be filled in if the rule contains multiple sorting boxes
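
As an illustration of how these parameters fit together, here is a minimal sketch that builds the request URL. The helper name build_extractor_url and the sample values are assumptions made for this example and are not part of the API; only the address and the four parameter names come from the specification above.

```python
from urllib.parse import urlencode

API_URL = "http://www.gooseeker.com/api/getextractor"


def build_extractor_url(key, theme, middle=None, bname=None):
    """Build the GET URL; optional parameters are sent only when given."""
    params = {"key": key, "theme": theme}
    if middle is not None:
        params["middle"] = middle
    if bname is not None:
        params["bname"] = bname
    # urlencode percent-encodes the values, so non-ASCII rule names are safe
    return API_URL + "?" + urlencode(params)


# Hypothetical key, rule name and sorting box name, for illustration only
print(build_extractor_url("my-app-key", "demo_rule", bname="list"))
```

Encoding the values matters in practice because rule names and sorting box names may contain non-ASCII characters, which cannot be placed in a URL as they are.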

3.5, Return type (contentType)

text/xml; charset=UTF-8

3.6, Return parameters

Parameters in the HTTP message header are as follows:

more-extractor Type: String; Description: The number of extractors under the same rule name. You usually only need to check this parameter when the optional request parameters were left out: it tells the client that there are multiple rules or sorting boxes, and the client then decides whether to send explicit parameters with the request
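
A minimal sketch of reading this header on the client side follows. The helper name more_extractor and the header value "2" are made up for the example. The value is returned as a plain string because the specification only says the type is String and does not describe its exact format.

```python
from email.message import Message


def more_extractor(headers):
    """Return the more-extractor header value, or None if it is absent."""
    value = headers.get("more-extractor")
    return value.strip() if value is not None else None


# Stand-in for resp.headers from urllib.request.urlopen, which is also an
# email.message.Message subclass with case-insensitive header lookup.
headers = Message()
headers["more-extractor"] = "2"  # made-up value, for illustration only
print(more_extractor(headers))
```

With a real response, the same function can be called as more_extractor(resp.headers).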

3.7, Return error messages

Message layer errors are returned with HTTP 400. For example, the parameters in the URL do not comply with this specification.

Application layer errors are returned with HTTP 200 OK. The specific error code is placed in the message body as XML, with the following structure:

<return>
    <code>specific error code</code>
</return>

The specific code values are as follows:

keyError: permission verification failed
paramError: a parameter passed in the URL is wrong, for example an incorrect parameter name or value
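
Because application layer errors arrive with HTTP 200 OK, a client has to inspect the body to tell an error apart from a downloaded extractor. The sketch below shows one way to do that with the standard library. The helper name parse_error_code is an assumption for this example, and so is the idea that any body whose root element is not <return> is an extractor.

```python
import xml.etree.ElementTree as ET


def parse_error_code(body):
    """Return the application-layer error code, or None if body is not an error."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None  # not well-formed XML, so not the error structure
    if root.tag != "return":
        return None  # e.g. an XSLT stylesheet, i.e. a successful download
    code = root.find("code")
    return code.text.strip() if code is not None and code.text else None


error_body = b"<return>\n    <code>keyError</code>\n</return>"
print(parse_error_code(error_body))
```

Message layer errors (HTTP 400) do not reach this function when the request is made with urllib.request.urlopen, which raises urllib.error.HTTPError for them.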

4, Usage example (Python)

Sample code:

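# -*- coding: utf-8 -*-
from urllib import request

# Replace YOUR_KEY and YOUR_EXTRACTOR_NAME with your AppKey and extractor (rule) name
url = 'http://www.gooseeker.com/api/getextractor?key=YOUR_KEY&theme=YOUR_EXTRACTOR_NAME'

# Send the HTTP GET request and read the response body (bytes)
resp = request.urlopen(url)
content = resp.read()
if content:
    print(content)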

Next, I will test this API.