
    Using Ruby and Nokogiri to Simulate a Crawler and Export an RSS Feed: A Worked Example

    Y2J · 2017-05-02 09:42:48 · Original · 1384 views
    # encoding: utf-8
    require 'thread'
    require 'nokogiri'
    require 'open-uri'
    # rss is a bundled gem since Ruby 3.0; run `gem install rss` if it is missing.
    require 'rss/maker'
     
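    # Thread-safe queue that collects one row per project from the worker threads.
    $result = Queue.new

    # Fetch a project's frameset page, follow its second <frame> to the README,
    # and queue [no, name, readme_url, last_modified, text] for the feed.
    def extract_readme_header(no, name, url)
      # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs.
      frame = Nokogiri::HTML(URI.open(url))
      node = frame.css('frame')[1]
      return unless node && node['src']
      readme = $url + node['src']
      URI.open(readme) do |f|
        doc = Nokogiri::HTML(f.read)
        # Use the first five paragraphs of the README as the item description.
        text = doc.css("div#content div#filecontents p")[0..4].map { |c| c.content }.join(" ").strip
        return if text.empty?
        # Skip projects whose README text matches "rails" or "activ_" (case-insensitive).
        if text !~ /(rails)|(activ_)/i
          puts "========= #{no} #{name} : #{text[0..50]}"
          date = f.last_modified
          $result << [no, name, readme, date, text]
        end
      end
    rescue StandardError => e
      # Report the failure and move on, so one bad page does not abort the run.
      puts e.message
    end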
     
    def make_rss(items)
      RSS::Maker.make("2.0") do |m|
        m.channel.title = "GitHub recently updated projects"
        m.channel.link = "http://localhost"
        m.channel.description = "GitHub recently updated projects"
        m.items.do_sort = true
        items.each do |no,name,url,date,descr|
          i = m.items.new_item
          i.title = name
          i.link = url
          i.description=descr
          i.date = date
        end
      end
    end
     
    ############################## M A I N ########################
     
    ############# Scan list of recent projects
     
    lth=[]
    $url="http://rdoc.info"
    puts "get url #{$url}..."
    doc = Nokogiri::HTML(URI.open($url))
    # Start one worker thread per project listed in the second ul.libraries block.
    doc.css('ul.libraries')[1].css('li').each_with_index do |li,i|
      aname =li.css('a').first
      name=aname.content
      purl=$url+aname['href']
      lth << Thread.new(i,name,purl) { |j,n,u| extract_readme_header(j,n,u)  }
    end
     
    ################ wait until all READMEs have been read
     
    lth.each { |th| th.join() }
     
    ################ dequeue results and sort them by list position
    ################ (RSS::Maker re-orders the items by date because do_sort is true)
     
    result=[]
    result << $result.shift while $result.size>0
    result.sort!  { |a,b| a[0] <=> b[0] }
     
     
    ################ format results in rss
     
    File.open("RubyFeeds.rss","w") do |file|
      file.write make_rss(result)
    end
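
The script above starts one thread per project and has each thread push its row onto a shared Queue, which the main thread drains after joining all workers. The following is a minimal, network-free sketch of that fan-out and collect pattern. The names `fake_fetch` and `collect_in_parallel` are hypothetical stand-ins introduced for illustration; they do not appear in the original script, and `fake_fetch` replaces the real HTTP request.

```ruby
# Hypothetical stand-in for the network fetch in the original script.
def fake_fetch(name)
  "README for #{name}"
end

# Run one thread per name, collect [index, name, text] rows through a
# thread-safe Queue, and return them ordered by their original position.
def collect_in_parallel(names)
  queue = Queue.new
  threads = names.each_with_index.map do |name, index|
    # Pass the values as thread arguments so each thread gets its own copies.
    Thread.new(index, name) do |i, n|
      begin
        queue << [i, n, fake_fetch(n)]
      rescue StandardError => e
        # A failing worker is reported and skipped, as in the original script.
        warn "#{n}: #{e.message}"
      end
    end
  end
  threads.each(&:join)

  # All workers are done, so the queue can be drained without blocking.
  results = []
  results << queue.shift until queue.empty?
  results.sort_by(&:first)
end

rows = collect_in_parallel(%w[alpha beta gamma])
rows.each { |i, name, text| puts "#{i} #{name}: #{text}" }
```

Threads finish in an unpredictable order, which is why the rows carry their index and are sorted after the queue has been drained.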
