python - Problem when using the nextSibling method in BeautifulSoup 3.2.1
巴扎黑
巴扎黑 2017-04-17 11:43:52

I was following the documentation and tried out its example.

The example in the documentation is:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
print soup.head.nextSibling.name

This prints body.

But if I change it to the following:

from BeautifulSoup import BeautifulSoup
html = '''
            <html>
             <head>
              <title>
               Page title
              </title>
             </head>
             <body>
              <p id="firstpara" align="center">
               This is paragraph
               <b>
                one
               </b>
               .
              </p>
              <p id="secondpara" align="blah">
               This is paragraph
               <b>
                two
               </b>
               .
              </p>
             </body>
            </html>

        '''
soup = BeautifulSoup(html)
print soup.head.nextSibling.name

it fails with this error:

File "/Library/Python/2.7/site-packages/BeautifulSoup.py", line 473, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'name'

However, if I change the last line of the code above, the one that uses nextSibling, to this:

print soup.head.nextSibling.nextSibling.name

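it works again and prints body.

The pages we get back when scraping usually look like the second case, though. I also looked at some open-source web-service code that scrapes other sites, and it likewise calls nextSibling twice in a row to reach the next sibling element. Could someone explain why, in the second case, nextSibling has to be used twice to get the next sibling DOM element? Isn't nextSibling supposed to mean the next sibling element? Why does it take two calls here, and why does a single call produce the error above?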


All replies (1)
大家讲道理

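soup.head.nextSibling is most likely returning the text node that follows the head tag, that is, the newline and indentation between </head> and <body>, right? A NavigableString has no name attribute, hence the error. In the first example the strings are joined with nothing between </head> and <body>, so body is the immediate sibling.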
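
To make the whitespace effect concrete, here is a minimal sketch. It uses the standard library's xml.dom.minidom instead of BeautifulSoup 3 (an assumption made only so the snippet runs without extra packages); in the respect that matters here the two behave alike, keeping whitespace between tags as a text node. The two small documents are made up for illustration.

```python
from xml.dom import minidom

# Compact markup: nothing sits between </head> and <body>.
compact = minidom.parseString("<html><head></head><body></body></html>")

# Indented markup: a newline and spaces sit between </head> and <body>.
indented = minidom.parseString(
    "<html>\n  <head></head>\n  <body></body>\n</html>"
)

compact_head = compact.getElementsByTagName("head")[0]
indented_head = indented.getElementsByTagName("head")[0]

# Without whitespace, the next sibling is the <body> element itself.
print(compact_head.nextSibling.nodeName)

# With indentation, the next sibling is a whitespace-only text node,
# whose nodeName is "#text" rather than a tag name.
print(indented_head.nextSibling.nodeName)

# <body> is one step further along.
print(indented_head.nextSibling.nextSibling.nodeName)
```

The second lookup is the minidom counterpart of hitting a NavigableString in BeautifulSoup 3: the node exists, but it is text, not a tag.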

In my tests, BeautifulSoup 4 needed only one .nextSibling to reach the <body> element.
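
If the goal is simply "the next tag", counting nextSibling calls by hand is fragile, because the number of text or comment nodes in between depends on how the page is formatted. One way around this is a small helper that skips anything that is not an element. The sketch below uses the standard library's xml.dom.minidom as a stand-in for BeautifulSoup so that it runs without extra packages; next_element_sibling is a hypothetical helper name, not part of either library, and the sample document is made up.

```python
from xml.dom import minidom


def next_element_sibling(node):
    """Return the next sibling that is an element, or None if there is none."""
    sibling = node.nextSibling
    # Step over text nodes, comments, and any other non-element nodes.
    while sibling is not None and sibling.nodeType != sibling.ELEMENT_NODE:
        sibling = sibling.nextSibling
    return sibling


doc = minidom.parseString(
    "<html>\n <head></head>\n <!-- note -->\n <body></body>\n</html>"
)
head = doc.getElementsByTagName("head")[0]

# Whitespace and the comment are skipped; the result is the <body> element.
print(next_element_sibling(head).nodeName)
```

In BeautifulSoup itself the same need is covered by findNextSibling (find_next_sibling in version 4), for example soup.head.findNextSibling('body'), which searches the following siblings for a matching tag instead of returning whatever node comes first.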
