
How do I build a simple PHP crawler to extract links and content from a website?

Linda Hamilton
Release: 2024-11-07 19:04:02


Creating a Simple PHP Crawler

Crawling websites and extracting data is a common task in web programming. PHP's built-in DOM extension can load and parse remote web pages, which is enough to build a basic crawler without any external libraries.

To create a simple PHP crawler that collects links and content from a given web page, you can utilize the following approach:

Using a DOM Parser:

<?php
function crawl_page($url, $depth = 5)
{
    // Prevent endless recursion and circular references
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    // Mark the URL as seen
    $seen[$url] = true;

    // Load the web page using DOM
    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    // Iterate over all anchor tags (<a>)
    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');

        // Convert relative URLs to absolute URLs
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                // Guard against base URLs with no path (e.g. "http://example.com")
                $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
                $href .= $dir . $path;
            }
        }

        // Recursively crawl the linked page
        crawl_page($href, $depth - 1);
    }

    // Output the crawled page's URL and content
    echo "URL: " . $url . PHP_EOL . "CONTENT: " . PHP_EOL . $dom->saveHTML() . PHP_EOL . PHP_EOL;
}
crawl_page("http://example.com", 2);
?>

This crawler uses a DOM parser to walk the page's HTML, finds every anchor tag, and follows the links they contain, up to the given depth. It prints each crawled page's URL and HTML to standard output; you can redirect this output to a text file to save the collected data locally.
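If you want structured data rather than raw HTML dumps, the same DOM approach can collect just the pieces you care about. The sketch below is a hypothetical variant (the function name `extract_page_data` is not part of the crawler above) that pulls the page title and every anchor's href from already-fetched markup:

```php
<?php
// Hypothetical helper: extract the <title> text and all anchor hrefs
// from an HTML string, instead of dumping the whole document.
function extract_page_data(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings caused by imperfect real-world markup
    @$dom->loadHTML($html);

    $titleNodes = $dom->getElementsByTagName('title');
    $title = $titleNodes->length > 0
        ? trim($titleNodes->item(0)->textContent)
        : '';

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return ['title' => $title, 'links' => $links];
}

$data = extract_page_data(
    '<html><head><title>Demo</title></head>' .
    '<body><a href="/one">1</a><a href="/two">2</a></body></html>'
);
print_r($data);
```

Returning an array instead of echoing makes the result easy to encode as JSON or insert into a database.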

Additional Features:

  • Prevents crawling the same URL multiple times.
  • Handles relative URLs correctly.
  • Preserves the scheme, HTTP auth credentials, and port number when building absolute URLs, using http_build_url() from the http PECL extension when available, with a manual fallback otherwise.
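The manual fallback in the crawler can be pulled out into a standalone helper. The sketch below (the name `resolve_href` is hypothetical, not part of the original code) follows the same logic: absolute links pass through untouched, and relative links are rebuilt from the base URL's scheme, credentials, host, port, and directory:

```php
<?php
// Hypothetical helper: resolve a (possibly relative) href against a base
// URL, mirroring the manual fallback branch of the crawler above.
function resolve_href(string $base, string $href): string
{
    // Already absolute? Return as-is.
    if (0 === strpos($href, 'http')) {
        return $href;
    }

    $path = '/' . ltrim($href, '/');
    $parts = parse_url($base);

    $abs = $parts['scheme'] . '://';
    if (isset($parts['user']) && isset($parts['pass'])) {
        $abs .= $parts['user'] . ':' . $parts['pass'] . '@';
    }
    $abs .= $parts['host'];
    if (isset($parts['port'])) {
        $abs .= ':' . $parts['port'];
    }

    // Resolve relative to the base URL's directory, if it has a path
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
    return $abs . $dir . $path;
}

echo resolve_href('http://example.com/docs/index.html', 'about.html'), PHP_EOL;
// → http://example.com/docs/about.html
echo resolve_href('http://example.com', '/login'), PHP_EOL;
// → http://example.com/login
```

Note that this is directory-based resolution only; it does not normalize `..` segments or handle query-only hrefs the way a full RFC 3986 resolver would.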

source: php.cn