
Extracting hyperlink URLs and images from a web page with PHP regular expressions

WBOY
2016-07-25 08:52:57
    function match_links($document) {
        // Capture groups: 1 = optional opening quote, 2 = URL written inside quotes,
        // 3 = URL written without quotes, 4 = link text.
        preg_match_all("'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1)(.*?)\\1|([^\s>]+))[^>]*>?(.*?)</a>'isx", $document, $links);
        $match = array('link' => array(), 'content' => array(), 'all' => array());
        foreach ($links[2] as $val) { // quoted URLs
            if (!empty($val)) {
                $match['link'][] = $val;
            }
        }
        foreach ($links[3] as $val) { // unquoted URLs
            if (!empty($val)) {
                $match['link'][] = $val;
            }
        }
        foreach ($links[4] as $val) { // link text
            if (!empty($val)) {
                $match['content'][] = $val;
            }
        }
        foreach ($links[0] as $val) { // the complete <a ...>...</a> tags
            if (!empty($val)) {
                $match['all'][] = $val;
            }
        }
        return $match;
    }
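To see what the conditional subpattern in a link-matching expression of this kind does, the following minimal sketch runs one on a small HTML fragment. The pattern is an assumption chosen to agree with the capture groups used above (group 2 for quoted URLs, group 3 for unquoted ones, group 4 for the link text); the sample markup and the helper name `extract_links_demo` are hypothetical.

```php
<?php
// Hypothetical helper: returns array(url, text) pairs for every <a> tag found.
function extract_links_demo($html) {
    // ([\"\'])?    optional opening quote (group 1)
    // (?(1)A|B)    if a quote was seen, read up to the same quote (group 2),
    //              otherwise read up to whitespace or '>' (group 3)
    // (.*?)</a>    the link text (group 4)
    $pattern = "'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1)(.*?)\\1|([^\s>]+))[^>]*>?(.*?)</a>'isx";
    preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);

    $pairs = array();
    foreach ($sets as $set) {
        // Only one of group 2 and group 3 takes part in a given match.
        $url = $set[2] !== '' ? $set[2] : $set[3];
        $pairs[] = array($url, $set[4]);
    }
    return $pairs;
}

// Assumed sample input: one quoted and one unquoted href.
$html = '<p><a href="http://example.com/a.html">First</a> '
      . '<a class=x href=/b.html>Second</a></p>';
$links = extract_links_demo($html);
foreach ($links as $pair) {
    echo $pair[0], ' => ', $pair[1], "\n";
}
```

Regular expressions like this cope with simple markup only; for pages with nested or malformed tags an HTML parser is usually the more robust choice.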

The heart of the task is the regular expression. For comparison, here is a well-tested ASP.NET (C#) method that uses a regular expression to get the links on a page:

    // Requires: using System.Text.RegularExpressions;
    public string GetHref(string HtmlCode)
    {
        string MatchVale = "";
        // Group 6 captures the URL: word characters, \, /, ., :, - and _.
        string Reg = @"(h|H)(r|R)(e|E)(f|F) *= *('|"")?((\w|\\|/|\.|:|-|_)+)('|""| *|>)?";
        foreach (Match m in Regex.Matches(HtmlCode, Reg))
        {
            // Collect the URLs as a "||"-separated list.
            MatchVale += m.Groups[6].Value + "||";
        }
        return MatchVale;
    }

Example 2: PHP code that uses a regular expression to download the remote images referenced in content

This code uses a PHP regular expression to find the images referenced in the content, then downloads and saves those that are not hosted under the site's own domain. It is an important part of a content scraper (a so-called "thief program").

Only the part that downloads the remote images is shown here.

    // Match absolute http(s) image URLs; '#' delimiters avoid escaping the slashes.
    if (preg_match_all('#https?://[^\s"\']+?\.(?:jpe?g|gif|png)\b#ui', stripcslashes($content), $aliurl)) {
        $i = 0; // counter that keeps file names unique within one run
        foreach ($aliurl[0] as $v) {
            // Skip pictures that come from your own website
            if (stripos($v, "jbxue.com") !== false) {
                continue;
            }
            $filetype = pathinfo($v, PATHINFO_EXTENSION); // get the file extension
            $ff = @file_get_contents($v); // get the binary file content
            if (!empty($ff)) { // continue only if the file was fetched
                $dir = "upload/" . date("Ymd") . "/"; // new storage path
                if (!file_exists($dir)) { // create the directory if it is missing
                    @mkdir($dir, 0777, true); // nested directories; octal 0777 is 511 in decimal
                }
                $nfn = $dir . date("YmdHis") . $i . "." . $filetype; // build the new file name
                $nf = @fopen($nfn, "wb"); // create the file
                if ($nf !== false) {
                    fwrite($nf, $ff); // write the file
                    fclose($nf); // close the file
                }
                $i++;
                $content = str_replace($v, $nfn, $content); // point the content at the local copy
            } else { // the image could not be fetched: use the default image
                $content = str_replace($v, "/upload/201204/20120417213810742.gif", $content);
            }
        }
    }
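The two decisions in this example, which strings count as image URLs and which of them belong to another site, can be tried out without downloading anything. The sketch below is a minimal illustration under stated assumptions: the helper name `find_remote_images`, the `$ownHost` parameter, and the sample hosts are hypothetical, and the host comparison uses `parse_url()` instead of a substring test.

```php
<?php
// Hypothetical helper: list the image URLs in $content that are NOT hosted
// on $ownHost or one of its subdomains. Nothing is downloaded here.
function find_remote_images($content, $ownHost) {
    // The extension is an alternation group, not a character class, so only
    // URLs that really end in .jpg, .jpeg, .gif or .png are matched.
    $pattern = '#https?://[^\s"\'<>]+?\.(?:jpe?g|gif|png)\b#i';
    preg_match_all($pattern, $content, $m);

    $own = strtolower($ownHost);
    $remote = array();
    foreach (array_unique($m[0]) as $url) {
        $host = strtolower((string) parse_url($url, PHP_URL_HOST));
        // Own if the host equals $ownHost or ends with ".$ownHost".
        $isOwn = ($host === $own)
              || (substr($host, -strlen('.' . $own)) === '.' . $own);
        if (!$isOwn) {
            $remote[] = $url;
        }
    }
    return $remote;
}

// Assumed sample content with two remote images and one local image.
$content = '<img src="http://img.example.org/a/cat.JPG"> '
         . '<img src="http://www.mysite.test/upload/logo.png"> '
         . '<img src=\'https://cdn.example.net/b/dog.jpeg?x=1\'>';
$remote = find_remote_images($content, 'mysite.test');
print_r($remote);
```

Comparing parsed host names avoids false positives when the site's domain merely appears somewhere in the path or query string of a foreign URL.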

Example 3: Downloading images to the local disk with PHP and a regular expression

    <?php
    /*
     * Limitation: if an image path in the web page is not an absolute URL,
     * the image cannot be crawled.
     */
    set_time_limit(0); // crawling is not subject to the script time limit

    $URL = 'http://pp.baidu.com/'; // any URL

    get_pic($URL);

    function get_pic($pic_url) {
        // Fetch the HTML of the page
        $data = CurlGet($pic_url);
        /* Use a regular expression to get the image links */
        $pattern_src = '/<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+?\.(?:gif|jpe?g|png|bmp))["\'][^>]*>/i';
        $num = preg_match_all($pattern_src, $data, $match_src);
        $arr_src = $match_src[1]; // array of image URLs
        get_name($arr_src);

        echo "\nfinished!!!";

        return 0;
    }

    /* Get the picture type and save each picture into the current directory */
    function get_name($pic_arr) {
        // Picture type
        $pattern_type = '/(\.(jpg|bmp|jpeg|gif|png))/i';

        foreach ($pic_arr as $pic_item) { // loop over the address of each picture
            if (!preg_match($pattern_type, $pic_item, $match_type)) {
                continue; // no recognizable extension
            }
            $pic_name = get_unique() . $match_type[1]; // name the file after the microsecond time
            // Save the picture as a binary stream
            $write_fd = @fopen($pic_name, "wb");
            if ($write_fd === false) {
                continue; // the file could not be created
            }
            @fwrite($write_fd, CurlGet($pic_item));
            @fclose($write_fd);
            echo "[OK]..!";
        }
        return 0;
    }

    // Get a unique ID from the microsecond time
    function get_unique() {
        list($msec, $sec) = explode(" ", microtime());
        return $sec . intval($msec * 1000000);
    }

    // Fetch the content of a URL
    function CurlGet($url) {
        $url = str_replace('&amp;', '&', $url); // decode HTML-escaped ampersands
        $curl = curl_init();
        curl_setopt($curl, CURLOPT_URL, $url);
        curl_setopt($curl, CURLOPT_HEADER, false);

        //curl_setopt($curl, CURLOPT_REFERER, $url);

        curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; SeaPort/1.2; Windows NT 5.1; SV1; InfoPath.2)");
        curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookie.txt');
        curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookie.txt');
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 0);
        $values = curl_exec($curl);
        curl_close($curl);
        return $values;
    }
    ?>

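The limitation noted at the top of Example 3 (images with relative paths cannot be crawled) could be worked around by resolving each `src` value against the page URL before fetching it. The following is a minimal sketch, not a full RFC 3986 resolver: the helper name `resolve_url` and the sample URLs are hypothetical, and it assumes the base URL contains a scheme and a host.

```php
<?php
// Hypothetical helper: turn an image reference found on a page into an
// absolute URL. Handles absolute, protocol-relative, root-relative and
// document-relative references; trailing slashes are not preserved.
function resolve_url($base, $src) {
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $src)) {
        return $src; // already absolute
    }
    $p = parse_url($base);
    $scheme = isset($p['scheme']) ? $p['scheme'] : 'http';
    $host = isset($p['host']) ? $p['host'] : '';
    $port = isset($p['port']) ? ':' . $p['port'] : '';

    if (substr($src, 0, 2) === '//') {
        return $scheme . ':' . $src; // protocol-relative: reuse the scheme
    }
    if (substr($src, 0, 1) === '/') {
        $path = $src; // root-relative: replace the whole path
    } else {
        // Document-relative: append to the directory of the base path.
        $basePath = isset($p['path']) ? $p['path'] : '/';
        $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $dir . $src;
    }
    // Collapse "." and ".." segments.
    $out = array();
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    return $scheme . '://' . $host . $port . '/' . implode('/', $out);
}

// Assumed page URL for illustration.
$page = 'http://example.com/gallery/index.html';
echo resolve_url($page, 'img/a.png'), "\n";
echo resolve_url($page, '../b.gif'), "\n";
echo resolve_url($page, '/c.jpg'), "\n";
```

In a crawler like the one in Example 3, each extracted address would be passed through such a helper together with the page URL before the download step.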

