Search results for "crawling" - php.cn

Detailed Tutorial: Crawling GitHub Repository Folders Without API

Course Introduction: This ultra-detailed tutorial, authored by Shpetim Haxhiu, walks you through crawling GitHub repository folders programmatically without relying on the GitHub API. It includ...

2024-12-16 comment 0 1265

Multithreaded web link crawling with Scrapy

Course Introduction: This article is a tutorial on multithreaded web link crawling with the Scrapy framework. It shows how Scrapy simplifies the crawling process and provides ready-to-run sample code that crawls all links from a specified URL and saves the results to a CSV file. It also briefly introduces Scrapy's LinkExtractor and CrawlSpider classes to give readers a deeper understanding of what Scrapy can do.

2025-09-12 comment 0 951
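
The core task named in the teaser above (collect every link on a page and save the results to CSV) can be sketched without Scrapy. The block below is a standard-library stand-in, not the article's Scrapy code: the names LinkCollector, extract_links, and links_to_csv are hypothetical, and the HTML is assumed to be already downloaded into a string.

```python
import csv
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # kept as a list to preserve first-seen order

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)
                if absolute not in self.links:  # skip duplicate links
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique absolute links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def links_to_csv(links, out):
    """Write one link per row, under a 'url' header, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["url"])
    writer.writerows([link] for link in links)
```

In Scrapy itself, LinkExtractor plays roughly the role of extract_links here, and a feed export would replace links_to_csv.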

How to efficiently handle timed data crawling: the best strategy for deduplication and data filling?

Course Introduction: This article discusses how to handle scheduled data crawling efficiently, covering strategies for deduplication and data filling, and...

2025-04-01 comment 0 1141
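
The teaser above does not say which strategy the article uses, so the following is only one possible reading of "deduplication and data filling": keep the latest value per timestamp, then carry the last known value forward into empty time slots. The function name dedupe_and_fill and the (timestamp, value) sample shape are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def dedupe_and_fill(samples, start, end, step):
    """Return one (timestamp, value) pair for every slot from start to end.

    samples: iterable of (timestamp, value) pairs, possibly with duplicates.
    Duplicates are resolved by keeping the value seen last; slots with no
    sample reuse the most recent earlier value (None before the first one).
    """
    latest = {}
    for ts, value in samples:
        latest[ts] = value  # a later duplicate overwrites an earlier one

    filled = []
    last_value = None
    ts = start
    while ts <= end:
        if ts in latest:
            last_value = latest[ts]
        filled.append((ts, last_value))  # forward-fill when the slot is empty
        ts += step
    return filled


t0 = datetime(2025, 1, 1, 0, 0)
step = timedelta(minutes=10)
rows = dedupe_and_fill([(t0, 1), (t0, 1), (t0 + 2 * step, 3)], t0, t0 + 2 * step, step)
```

Here rows contains three slots: the duplicate sample at t0 collapses to one entry, and the missing slot at t0 + 10 minutes is filled with the value 1.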

Coping with dynamic content crawling: Google CSE API application in Dermnet image crawling

Course Introduction: This article addresses the failure of traditional BeautifulSoup or Selenium approaches when crawling images from websites such as Dermnet that load content dynamically with JavaScript. Analyzing the network requests in the browser developer tools shows that such websites often obtain image data through the Google Custom Search Engine (CSE) API. The tutorial explains in detail how to identify and directly request that API endpoint, parse the returned JSON to extract image URLs efficiently, and handle pagination so that dynamically loaded images are crawled accurately.

2025-10-04 comment 0 689

Dynamic web crawling using API and Selenium: Taking Naver Comics as an example

Course Introduction: This article addresses the IndexError: list index out of range error encountered when crawling Naver comic information with BeautifulSoup. Because the target page's content is generated dynamically by JavaScript, traditional static crawling does not work. The article shows how to obtain the data by analyzing the site's API endpoints and how to use Selenium to simulate browser behavior for dynamic content, with corresponding Python code examples.

2025-08-18 comment 0 431

Scala Tutorial

Course Elementary 13955

Course Introduction: Scala is a multi-paradigm programming language designed to integrate features of object-oriented programming and functional programming.

CSS Online Manual

Course Elementary 82523

Course Introduction: The "CSS Online Manual" is the official CSS online reference manual. It covers CSS properties, definitions, usage, and worked examples, and serves as a handy online reference for web programming learners and developers. CSS (Cascading Style Sheets) is a language used to describe the presentation of documents written in HTML.

SVG Tutorial

Course Elementary 13294

Course Introduction: SVG is a markup language for vector graphics in HTML5. It offers powerful drawing capabilities together with a high-level interface for manipulating graphics by operating directly on DOM nodes. This "SVG Tutorial" aims to help students master the SVG language and some of its related APIs and, combined with knowledge of 2D drawing, render and control complex graphics on the page.

AngularJS Chinese Reference Manual

Course Elementary 24774

Course Introduction: As the "AngularJS Chinese Reference Manual" explains, AngularJS extends HTML with new attributes and expressions. AngularJS can be used to build single-page applications (SPAs). AngularJS is very easy to learn.

Go language tutorial manual

Course Elementary 27599

Course Introduction: Go is a new language: concurrent, garbage-collected, and fast to compile. A large Go program can be compiled in a few seconds on a single computer. Go provides a model for software construction that makes dependency analysis easier and avoids most C-style include files and library headers. Go is a statically typed language, and its type system has no hierarchy, so users do not need to spend time defining relationships between types, which feels more lightweight than typical object-oriented languages. Go is fully garbage-collected and provides basic support for concurrent execution and communication. By design, Go is intended as a way to build system software on multi-core machines.

How to prevent malicious DDoS-style crawling on nginx

First of all, I have no objection to others crawling the content of my website, and I don't necessarily limit their crawling strictly. However, some people's crawling has no bottom line at all. They use one script or even multiple scripts to crawl a website concurrently, which to the server is no different from a DDoS attack. My server is currently experiencing this...

2017-05-16 17:30:17

python - pyspider scheduled crawling problem

When writing the crawler, I set every in the code. After one crawl on the 21st, I saw today that the result had not been updated, and the lastcrawltime was still the 21st. Is my parameter setting incorrect?

2017-05-18 10:53:29

javascript - Problem with the jQuery first-child selector when crawling a web page

When crawling a website, h2 and h3 appear to have the same structure. Why can h2:first-child get data, but h3 cannot? The final results h2_1 and h2_2 are the same, as expected. h3_1 is fine, but h3_2 is empty. Why is this? The code is as follows: {code...}

2017-05-16 13:28:41

How to implement headless crawling using python + selenium + chromedriver

While using selenium to crawl 12306, I found that phantomjs cannot be used, but chromedriver can. Presumably phantomjs is detected and banned by the website. However, chromedriver displays a browser window, which makes crawling inefficient. I now have two questions. I have googled them for a long time but I still can't find the answers...

2017-05-18 10:53:13

Python multi-threaded file crawling: how to set timeouts and reconnection

When crawling data with python, I run multiple threads in a single process. Since the work is I/O-intensive, I don't use multiple processes. The code is as follows: {code...} However, whenever a thread's request does not return, that thread keeps waiting and never writes its result, so there is a problem that the main process is not blocked...

2017-05-18 11:02:31
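
The question above is about worker threads that wait forever on a stalled request. A minimal sketch of one common remedy follows: give every request a timeout and retry a bounded number of times. The asker's own code is elided, so this is not a fix for it. It uses the standard library's urllib so that it has no third-party dependency; with the requests library, the counterpart would be the timeout argument of requests.get. The names download, fetch_with_retry, and crawl are hypothetical.

```python
import concurrent.futures
import time
import urllib.request


def download(url, timeout=10.0):
    """Fetch one URL. The timeout makes a stalled socket raise instead of hanging."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=download, retries=3, backoff=0.5):
    """Call fetch(url), retrying on OSError (covers URLError and socket timeouts)."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries:
                raise  # give up after the last attempt
            time.sleep(backoff * attempt)  # wait a little longer each time


def crawl(urls, fetch=download, workers=8):
    """Fetch all URLs in a thread pool; map each URL to its body or its final error."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_with_retry, url, fetch): url for url in urls}
        for future in concurrent.futures.as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except OSError as exc:
                results[url] = exc  # record the failure instead of losing it
    return results
```

Note that the urlopen timeout bounds each blocking socket operation rather than the total download time, so a very slow but steady response can still take longer than the timeout value.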
