Use Scrapy crawler to capture paper data in the field of Deep Learning

WBOY

Released: 2023-06-23 09:33:22

Deep learning is one of the most active research directions in artificial intelligence. For researchers and practitioners in the field, collecting data on the published literature is an important prerequisite for in-depth study. However, most high-quality deep learning papers appear at top international conferences (such as NeurIPS, ICLR, and ICML), and gathering them in bulk by hand is tedious. This article therefore shows how to use the Scrapy crawler framework to collect paper data in the field of deep learning.

First, we need to choose the target website. Popular sites that host deep learning papers include arXiv and OpenReview; in this article we crawl arXiv. arXiv is a repository of scientific papers covering many fields, deep learning among them. It also provides a public API, which lets our crawler retrieve paper metadata easily.

Next, we can start writing the Scrapy crawler program. First, enter the following command in the terminal to create a Scrapy project:

scrapy startproject deep_learning_papers

After creation, enter the project directory and create a Spider:

cd deep_learning_papers
scrapy genspider arXiv_spider arxiv.org

Here we name the Spider "arXiv_spider" and restrict crawling to the arxiv.org domain. After creation, open the arXiv_spider.py file, which contains the following code:

import scrapy


class ArxivSpiderSpider(scrapy.Spider):
    name = 'arXiv_spider'
    allowed_domains = ['arxiv.org']
    start_urls = ['http://arxiv.org/']

    def parse(self, response):
        pass

This is the simplest Spider template. We need to implement the parse method so that it captures paper information. Since the paper information is obtained through the arXiv API, we need to send a GET request, and we can use Python's requests module to do so. Calls made with requests are blocking and bypass Scrapy's own scheduler and settings, but they keep the example simple. Here we write a function that sends the request:

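import requests

def get_papers_data(start, max_results):
    # Request one page of results; the arXiv API responds with Atom XML, not JSON,
    # so no JSON Content-Type header is needed.
    url = 'http://export.arxiv.org/api/query?search_query=all:deep+learning&start=' + str(start) + '&max_results=' + str(max_results)
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on an HTTP error
    return response.content  # raw bytes of the XML feed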

The get_papers_data function takes two parameters: the starting offset and the maximum number of results to return. We pass "all:deep learning" as the search_query parameter (the space is written as "+" in the URL), which searches all metadata fields for these terms. After sending the GET request with requests, the raw response body is available in response.content.
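To illustrate how the start and max_results parameters page through the results, the sketch below builds the same kind of query URL with urllib.parse.urlencode instead of string concatenation. The helper names build_query_url and page_urls are hypothetical and not part of the original spider, and safe=':' is passed only so that the output matches the hand-built URL format used in this article.

```python
from urllib.parse import urlencode

API_URL = 'http://export.arxiv.org/api/query'  # endpoint used in this article


def build_query_url(start, max_results, search_query='all:deep learning'):
    """Return the API URL for one page of results (hypothetical helper)."""
    params = {
        'search_query': search_query,  # the space is encoded as '+'
        'start': start,                # zero-based offset of the first record
        'max_results': max_results,    # number of records in this page
    }
    # safe=':' leaves the colon unescaped, matching the hand-built URL
    return API_URL + '?' + urlencode(params, safe=':')


def page_urls(total, page_size):
    """Yield one URL per page until `total` records are covered."""
    for start in range(0, total, page_size):
        yield build_query_url(start, page_size)


urls = list(page_urls(300, 100))
print(urls[0])
print(len(urls))
```

Letting urlencode assemble the query string avoids manual str() conversions and escaping mistakes when the search terms change.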

In the parse method, we parse the returned data. The API returns an Atom feed, so every tag belongs to the Atom namespace, each paper is an entry element, and the abstract is stored in a summary element. We can use namespace-aware path expressions to extract the content. The specific code is as follows:

    # Requires "from lxml import etree" at the top of arXiv_spider.py.
    def parse(self, response):
        # Map a prefix to the Atom namespace used by every tag in the feed.
        ns = {'atom': 'http://www.w3.org/2005/Atom'}
        for i in range(0, 50000, 100):
            papers = get_papers_data(i, 100)
            xml = etree.XML(papers)

            # Yield one item per <entry>, never one per XML element.
            for entry in xml.findall('atom:entry', ns):
                yield {
                    'title': entry.findtext('atom:title', namespaces=ns),
                    'name': entry.findtext('atom:author/atom:name', namespaces=ns),
                    'abstract': entry.findtext('atom:summary', namespaces=ns),
                }

Here the loop runs 500 times, with the offset starting at 0 and increasing by 100 each time, so it requests at most 50,000 records. For each response, etree.XML parses the feed into an element tree. We then visit each entry element, read its title, the name of its first author, and its summary, and use yield to return them as one item per paper.
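The arXiv API returns an Atom feed, in which every tag carries the Atom namespace and the abstract is stored in a summary element, so plain tag names such as 'title' do not match without namespace handling. The following sketch shows this on a small hand-written fragment. The sample feed is made up for illustration and is not real arXiv output, the function name extract_papers is hypothetical, and the standard library's xml.etree.ElementTree is used in place of lxml so that the example runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an Atom feed (not real arXiv data).
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample query result</title>
  <entry>
    <title>First sample paper</title>
    <summary>Abstract of the first paper.</summary>
    <author><name>Alice Example</name></author>
    <author><name>Bob Example</name></author>
  </entry>
  <entry>
    <title>Second sample paper</title>
    <summary>Abstract of the second paper.</summary>
    <author><name>Carol Example</name></author>
  </entry>
</feed>"""

ATOM = {'atom': 'http://www.w3.org/2005/Atom'}


def extract_papers(feed_bytes):
    """Return one dict per <entry>, ignoring the feed-level <title>."""
    root = ET.fromstring(feed_bytes)
    papers = []
    for entry in root.findall('atom:entry', ATOM):
        papers.append({
            'title': entry.findtext('atom:title', namespaces=ATOM),
            'authors': [n.text for n in entry.findall('atom:author/atom:name', ATOM)],
            'abstract': entry.findtext('atom:summary', namespaces=ATOM),
        })
    return papers


papers = extract_papers(SAMPLE_FEED)
print(len(papers))
print(papers[0]['authors'])
```

Iterating over entry elements keeps each title paired with its own authors and abstract, and it skips the feed-level title that describes the query rather than a paper.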

Finally, we need to start the crawler program:

scrapy crawl arXiv_spider -o deep_learning_papers.csv

The "-o" option specifies the output file, and Scrapy infers the export format from the file extension. Here we choose the CSV format, so the output file is named "deep_learning_papers.csv".
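Once the crawl finishes, the exported file can be read back with Python's csv module. This is a minimal sketch, assuming the file has the title, name, and abstract columns yielded by the spider. The two rows below are placeholder data written to an in-memory buffer so that the example runs without the real output file, and load_papers is a hypothetical helper name.

```python
import csv
import io

# Placeholder rows standing in for deep_learning_papers.csv (not real data).
sample_csv = io.StringIO(
    'title,name,abstract\n'
    'First sample paper,Alice Example,"An abstract, with a comma."\n'
    'Second sample paper,Bob Example,Another abstract.\n'
)


def load_papers(csv_file):
    """Read the exported rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))


rows = load_papers(sample_csv)
print(len(rows))
print(rows[0]['abstract'])

# With the real file, open it instead of using the in-memory buffer:
# with open('deep_learning_papers.csv', newline='', encoding='utf-8') as f:
#     rows = load_papers(f)
```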

With Scrapy, we can collect paper information in the field of deep learning with little code. Combined with other data processing tools, this data can support more in-depth analysis of research in the field.
