I used Python 3.6 to crawl some data, but what was finally printed was a list containing span tags. When I called get_text, contents, etc. on it, an error was raised. Why is this?
The initial results returned are as follows:
[2017.5.2] [2017.4.26] [2017.4.24] [2017.4.19] [2017.3.23] [2017.3.17] [2017.2.14] [2017.2.9] [2017.2.6] [2017.2.6]
My code is as follows:
import requests
from bs4 import BeautifulSoup
import re

# def url_list():
#     for number in range(1,21):
#         url_links=[]
#         url="X".format(i=number)
#         url_links.append(url)

h={"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36"}
r=requests.get("url",headers=h)
soup=BeautifulSoup(r.text,'lxml')
for data in soup.find("p",{"class":"list-main-eventset-finan"}).find_all("li"):
    content=data.find("i",{"class":"cell date"}).find_all("span")
    print(time)
I don't remember the bs API very clearly, but there should be a function that gets the text directly; I think it is get_text(). Since you used find_all(), what you get back is a list of tags rather than a single tag, so you need to iterate over the returned result and call get_text() on each element.
In addition, you can use a regular expression to match the text directly, with a pattern like
(.*?)<
but you still have to iterate over the contents list as above.

The questioner can try the text_content() method (this is a method of lxml.html elements; BeautifulSoup tags do not provide it).

Regular expressions or split plus substring operations also work; use whichever fits the data.
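
To make the suggestions above concrete, here is a minimal sketch of both approaches. The sample markup, the use of a ul container, the function names and the exact regular expression are assumptions made for illustration, since the real page is not shown in the question. It uses the built-in html.parser so it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped roughly like what the question's code expects.
SAMPLE_HTML = """
<ul class="list-main-eventset-finan">
  <li><i class="cell date"><span>2017.5.2</span></i></li>
  <li><i class="cell date"><span>2017.4.26</span></i></li>
</ul>
"""


def extract_dates(html):
    """Collect the text of every span by calling get_text() per element."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("ul", {"class": "list-main-eventset-finan"})
    dates = []
    for item in container.find_all("li"):
        spans = item.find("i", {"class": "cell date"}).find_all("span")
        # find_all() returns a list-like ResultSet, which has no get_text();
        # the method lives on each individual tag, so loop over the result.
        for span in spans:
            dates.append(span.get_text(strip=True))
    return dates


def extract_dates_regex(html):
    """Alternative: pull the span contents out with a regular expression.

    The pattern is an assumed one; it only works for simple, attribute-free spans.
    """
    return re.findall(r"<span>(.*?)</span>", html)


print(extract_dates(SAMPLE_HTML))
print(extract_dates_regex(SAMPLE_HTML))
```

If each i element is known to contain exactly one span, calling find("span") instead of find_all("span") returns a single tag, and get_text() can then be called on it directly without the inner loop.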