When attempting to scrape the HTML content of a web page using cURL, it's common to encounter redirections or "page moved" errors. This can often be attributed to specially encoded characters in the query string.
To effectively retrieve the page content without encountering these issues, optimize your cURL code as follows:
<code class="php">function get_web_page($url) { $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'; $options = array( CURLOPT_CUSTOMREQUEST => "GET", CURLOPT_POST => false, CURLOPT_USERAGENT => $user_agent, CURLOPT_COOKIEFILE => "cookie.txt", CURLOPT_COOKIEJAR => "cookie.txt", CURLOPT_RETURNTRANSFER => true, CURLOPT_HEADER => false, CURLOPT_FOLLOWLOCATION => true, CURLOPT_ENCODING => "", CURLOPT_AUTOREFERER => true, CURLOPT_CONNECTTIMEOUT => 120, CURLOPT_TIMEOUT => 120, CURLOPT_MAXREDIRS => 10, ); $ch = curl_init($url); curl_setopt_array($ch, $options); $content = curl_exec($ch); $err = curl_errno($ch); $errmsg = curl_error($ch); $header = curl_getinfo($ch); curl_close($ch); $header['errno'] = $err; $header['errmsg'] = $errmsg; $header['content'] = $content; return $header; }</code>
Retrieve and handle potential errors while reading the page:
<code class="php">$result = get_web_page($url); if ($result['errno'] != 0) // Error handling for invalid URL, timeout, or redirect loops. if ($result['http_code'] != 200) // Error handling for issues like missing page, permission denial, or unavailability. $page = $result['content'];</code>
The above is the detailed content of How to Efficiently Extract Page Content Using cURL with Error Handling?. For more information, please follow other related articles on the PHP Chinese website!