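I'm trying to parse an HTML document with Beautiful Soup and its findAll method, but I can't seem to isolate the information I need. I've looked at the documentation and a few tutorials, and maybe it's because I'm a junior developer, but I can't isolate the numbers and the links.
Here is a dummy HTML table with the basic information: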
    <tbody>
      <tr class="results_row2">
        <td align="left"> Text is here ispssgjj sgdhjksgd jhsgd sgd </td>
        <td align="left"> GHJSFAGHJSFA GAFGSH AGSHSAGJH </td>
        <td align="left"> hdjk sgdhjk fdhjk sdhjk sdghjk </td>
        <td align="center"> 11/10/1964 </td>
        <td align="left"> </td>
        <td align="center"> 5 </td>
        <td align="center">
          <a href="javascript:confirm_delete('informatjon I need to ignore IS HERE')">Delete</a> <br>
          <a href="javascript:PBC('information I need to grab via parse comes from here ')">LINK TITLE</a> <br>
        </td>
      </tr>
    </tbody>
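When I run the program, I need it to extract the following for each row (the sample above is one row): the date (but rearranged to YYMMDD, i.e. 641110) and the string where it says "LINK GOES HERE" (in the sample above, the argument passed to PBC(...)). I then have to concatenate that string with another string to make it a valid link.
I don't need any of the other information, such as the link title or the gibberish text (e.g. Hjkhjksgd).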
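The date rearrangement on its own can be sketched with the standard library. This is a minimal sketch, assuming the cell text is always in MM/DD/YYYY form as in the sample row; the function name `to_yymmdd` is made up for illustration.

```python
from datetime import datetime


def to_yymmdd(date_text: str) -> str:
    """Convert 'MM/DD/YYYY' (the format in the sample table) to 'YYMMDD'."""
    # strip() removes the padding spaces that surround the text inside the <td>
    parsed = datetime.strptime(date_text.strip(), "%m/%d/%Y")
    # %y is the two-digit year, %m and %d are zero-padded month and day
    return parsed.strftime("%y%m%d")


print(to_yymmdd(" 11/10/1964 "))  # prints 641110
```

If a cell does not contain a date, `strptime` raises `ValueError`, which is a convenient way to skip non-date cells.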
Edit: I also need to be able to log in to the web location with the proper credentials (I have the username and password).
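For the login part, one common approach is to post the login form once through a session object and then reuse that session, so the cookies travel with every later request. The following is only a sketch under assumptions: the form field names `username` and `password`, the URLs in the usage comment, and the helper name `login` are all hypothetical, and `session` is expected to behave like a `requests.Session`.

```python
def login(session, login_url, username, password):
    """Post the login form so that `session` keeps the resulting cookies.

    `session` is expected to behave like requests.Session.
    The form field names are guesses; check the real login form for the actual names.
    """
    payload = {"username": username, "password": password}
    response = session.post(login_url, data=payload)
    # Fail early on a 4xx/5xx answer instead of parsing an error page later
    response.raise_for_status()
    return session


# Usage (needs the third-party requests package; URLs are placeholders):
#   import requests
#   session = login(requests.Session(), "https://example.com/login", "me", "secret")
#   page = session.get("https://example.com/results")
```

A 200 status alone does not prove the login worked, since many sites return the login page again with status 200, so it is still worth checking the page title or some other marker afterwards.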
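Hopefully my code is clear enough; I left some prints in to help me understand the variables. I'm also open to other approaches (I can't seem to figure out Beautiful Soup, pandas, or Selenium). So far I've got this: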
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    #label the file location
    file_location = r"Destination goes here"

    #open the file up
    with open(file_location, 'r') as f:
        file = f.read()

    #create a soup
    soup= BeautifulSoup(file, "html.parser")
    #print(f"soup is {soup}")

    #find all the tags that match what we want
    script = soup.findAll('td', id='center')

    print('begning loop')
    #this is to find the date I am going to make a separate loop to find the print certificate
    #loop through the tags and check for what we want
    for i in range (0, len(script)):
        #these two variables are me trying to convert the tag to a variable to be used to check
        scriptString = str(script[i])
        scriptInt = int(script[i])
        #print(f'Starting loop i is: {i}')
        # Every 7th cell seems to be a number....
        if((i+4)%7 == 0):
            print(f'Starting IF i is: {i}')
            print(f'int test is {scriptInt}')
            #print(f'script is {script[i]} quote end')
            #this was to find out which part of the string was a number and it's 80% accurate
            #for j in range (0, len(scriptString)):
                #print(f' j is {j} and string is {scriptString[j]}')
            #this printed the YYMMDD
            print(f'Rewritten the string is: "{scriptString[41]}{scriptString[42]}{scriptString[33]}{scriptString[34]}{scriptString[36]}{scriptString[37]}" quote end')
    print("end")
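I tried to pull the string out of the table, but it doesn't appear to be an int, and the string is very messy. Because it is so messy, I can't compare it against what I'm looking for. And since there are multiple td tags, I can't isolate it by td alone.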
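Rather than picking characters out of the tag string by fixed position, the wanted part of each href can be matched with a regular expression. This is a minimal sketch using only the standard library: `BASE_URL` is a hypothetical prefix (the real one depends on the site and is not given here), and the helper names are made up for illustration.

```python
import re

# Hypothetical prefix; replace with whatever makes a valid link on the real site.
BASE_URL = "https://example.com/view?id="

# Captures whatever sits between PBC(' and ') in a javascript: href
PBC_PATTERN = re.compile(r"PBC\('(.*?)'\)")


def extract_pbc_argument(href):
    """Return the argument passed to PBC(...), or None for other links."""
    match = PBC_PATTERN.search(href)
    # strip() drops stray spaces such as the trailing one in the sample row
    return match.group(1).strip() if match else None


def build_link(href, base_url=BASE_URL):
    """Concatenate the base URL and the PBC argument into a full link."""
    argument = extract_pbc_argument(href)
    return None if argument is None else base_url + argument
```

Because the pattern names `PBC` explicitly, the `confirm_delete(...)` link in the same cell is skipped without any extra check. With Beautiful Soup, the `href` values would come from the `a` tags inside the cells selected with `align="center"`.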
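For anyone trying to do something similar, here is my code with plain-English placeholders; because of the placeholders it will not run as is. Thanks a lot for the help in the answers!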
    '''
    To start this program you will need to go to Google Developer Tools while you are in the website that you want to access
    Go to the tab Network and right click Copy as Curl (Bash)
    Copy that info to curlconverter.com to get the required code
    '''
    print('begin program')

    #Import libraries
    import requests
    import os
    from bs4 import BeautifulSoup
    import pdfkit
    import re
    from datetime import datetime

    '''
    Get cookies by opening Google Chrome Developer Tools network tab, then sign in (or sign out and sign in if you are already signed in)
    and then click on login (probably the top one) and right click and then click copy (and as of this comment), it has an arrow and click on copy as curl (bash)
    Curl your website using https://curlconverter.com/
    it'll generate you some python code, copy and paste it below, but move the response to below
    '''
    #PLACEHOLDER: the params, cookies, headers and data generated by curlconverter go here

    # normally you will use url, but we want the welcome page to check to see if we have made a successful connection
    response = requests.post(
        'YOUR URL GOES HERE',
        params=params,
        cookies=cookies,
        headers=headers,
        data=data,
    )

    #store the file into soup and then parse it below to see if it goes to the login screen or welcome screen
    soup = BeautifulSoup(response.content, 'html.parser')

    #we're putting a try here because soup.find might return None, which isn't iterable, so that will also end the program if not caught
    try:
        #Check if it is stuck on the Login page to see if your cookies didn't work
        if "LOGIN PAGE DISCRIMINATOR " in soup.find('title'):
            print("ERROR cookies not getting you past the login screen you will have problems finding the PDFS")
            raise UnicodeError

        #PLACEHOLDER: list the pages you want to loop through
        PDFArray = [First, second, etc.]
        tempFileLocation = r"FILE LOCATION PATH GOES HERE"
        totalErrors = 0
        #the counters have to exist before the loops below add to them
        grandTotalNumPDFs = 0
        totalNumDelete = 0
        duplicateNum = 0
        #these are the dates to search between.
        lowerYear = 2020
        higherYear = 2023

        print('Connection & Login successful, looping through PDFs Array\n')

        #this loop goes through each of the Pages in the array
        for i in (range(0,len(PDFArray))):
            #print(f'beginning {PDFArray[i]}\'s loop which is {i+1} in the Array')
            #declare the variables for this loop
            PDFFromArray = PDFArray[i]
            #resetting these variables back to 0 for this loop, as this is now a new PDF so reset the totals etc.
            totalNumPDFs = 0
            previousDateObj = 0

            #updating the URL and folder location to draw from and save to
            url = f'Specific URL with your needed alterations to download PDFs'
            folder_location = os.path.join(r"PATH TO THE FOLDER",str(PDFFromArray))
            if not os.path.exists(folder_location):os.mkdir(folder_location)

            #We want to pull the HTML from the website to parse
            response = requests.post(
                url,
                params=params,
                cookies=cookies,
                headers=headers,
                data=data,
            )

            #store the HTML in the soup from the response
            soup = BeautifulSoup(response.content, 'html.parser')

            #Let the Pages know if the connection was successful or not (the second pair of braces is a PLACEHOLDER)
            print(f'For PDF:{PDFFromArray} it\'s {response.status_code==200} a connection was successful and it\'s { the test you did above to see if it is stuck at the login page or not} that login was successful')

            #to ensure that we are logging in and not stuck at the login screen
            if "Your login specific info for the IF statement" in soup.find('title'):
                print(f"ERROR cookies not getting you past the login screen you will have problems finding the PDFS")
                #if it comes up with a timberlake login then you want to stop processing to prevent useless PDFs
                raise TypeError

            #parse the first part by finding all the td tags whose align attribute is center
            script = soup.findAll("td", align="center")

            #this is to show the Pages stuff is happening etc. but not super useful and kinda cumbersome so commented out but used for debugging
            #print(f"Beginning the script loop for {PDFFromArray} it is {len(script)} tags long")

            #loop through everything in the script to parse it
            for i in script:
                # We want to grab the date for later use.
                try:
                    date_obj = datetime.strptime(i.text.strip(), "%m/%d/%Y")
                    month = str(date_obj.month).zfill(2) # zero padding
                    day = str(date_obj.day).zfill(2) # zero padding
                #Catch the value error so it doesn't crash
                except ValueError:
                    pass

                #get rid of everything that's NOT an "a" tag
                a_tags = i.findAll("a")

                #further parsing of the "a" tag
                if a_tags:
                    # parsing JavaScript
                    for a in a_tags:
                        pattern = r"\('(.*?)'\)"
                        #look for the particular stuff that has the javascript in it
                        match = re.search(pattern, a["href"])
                        if match:
                            #match group 1 gets the first parenthesis group e.g. in the following it will pick only javascript 1 javascript(1), javascript(2)
                            content = match.group(1)
                            #now we want ONLY the javascript that has our text which fortunately has this in it
                            if "The str to find only your content" in content:
                                #next check to see if the certificate is inside the date range we want
                                if(date_obj.year >= lowerYear and date_obj.year <= higherYear):
                                    #Check to see if we need to add a suffix (e.g., -1 -2 -3 etc.) to the filename and if so increase it otherwise reset to 0
                                    if(date_obj == previousDateObj):
                                        duplicateNum = duplicateNum + 1
                                    else:
                                        duplicateNum = 0
                                    #save the current date to check the next one
                                    previousDateObj = date_obj
                                    #increase the print certificates by one
                                    totalNumPDFs = totalNumPDFs + 1
                                    #set the correct URL to obtain the html version of the certificate to then be converted to PDF
                                    url = f"your URL goes here "
                                    #if it's a duplicate date go ahead and add a dash and number otherwise just add the normal stuff
                                    if duplicateNum > 0:
                                        filename = f"{PDFFromArray}{str(date_obj.year)[-2:]}{month}{day}-{duplicateNum}.pdf"
                                    else:
                                        filename = f"{PDFFromArray}{str(date_obj.year)[-2:]}{month}{day}.pdf"
                                    #set the file path
                                    file_path = os.path.join(folder_location,os.path.basename(filename))
                                    #We've already done the cookies above so we just make a new request with the updated URL
                                    response = requests.post(
                                        url,
                                        params=params,
                                        cookies=cookies,
                                        headers=headers,
                                        data=data,
                                    )
                                    #if the response was good go ahead otherwise print an error
                                    if(response.status_code == 200):
                                        #write the content to the tempFileLocation to be converted to a PDF in the next step
                                        with open(tempFileLocation, 'wb') as f:
                                            f.write(response.content)
                                        #in case a file is open or other problem
                                        try:
                                            #write the file to a PDF
                                            pdfkit.from_file(tempFileLocation, file_path)
                                        except:
                                            print(f'ERROR: Couldn\'t save {filename} because {file_path} is NOT VALID... or some other reason')
                                            totalErrors = totalErrors + 1
                                            #this isn't a critical error so just flag it and move on
                                            pass
                                    #else if this was something besides 200 (a valid one) print an error
                                    else:
                                        print(f"ERROR: STATUS CODE WAS NOT 200 for {PDFFromArray} with the url {url} when trying to write out the temporary file")
                            #else if this does not have a print certificate in it, then add one to the totalNumDelete
                            else:
                                totalNumDelete = totalNumDelete + 1

            #add this page's count to the grand total once the page is finished
            grandTotalNumPDFs = grandTotalNumPDFs + totalNumPDFs
            #this is the end of the loop for the Pages, and loops to the next Pages in the array
            print(f'End of searching through PDF:{PDFFromArray}\'s page, {totalNumPDFs} valid PDFs were found, between the dates of {lowerYear} - {higherYear} \n')

        print(f'\nEnd of the program, program searched {len(PDFArray)} Pages and found, and downloaded {grandTotalNumPDFs} PDFs between the dates of {lowerYear} - {higherYear} and had {totalErrors} error(s) saving documents')

    ## These are to catch all the errors from the beginning of the program
    except TypeError:
        print("PROGRAM END WITH ERROR: TYPE ERROR, meaning the Cookie probably is NO GOOD! or expired. and you got stuck at the login screen, and to prevent a bunch of PDFs with the login screen the program has been aborted")
    except:
        print('This is a general catch all ERROR no idea what went wrong here')
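The duplicate-date suffix logic in the script above can be pulled out into a small helper, which makes it easier to check on its own. This is a sketch rather than the exact behaviour of the script: the function name `numbered_filenames` and the prefix `ABC` are made up, and it assumes repeated dates arrive consecutively, as they do when rows are processed in table order.

```python
from datetime import date


def numbered_filenames(prefix, dates):
    """Build 'PREFIXYYMMDD.pdf' names, adding -1, -2, ... when a date repeats consecutively."""
    names = []
    previous = None
    duplicate_num = 0
    for current in dates:
        # Count up while the same date keeps repeating, otherwise start over
        duplicate_num = duplicate_num + 1 if current == previous else 0
        previous = current
        stem = f"{prefix}{current.strftime('%y%m%d')}"
        suffix = f"-{duplicate_num}" if duplicate_num else ""
        names.append(f"{stem}{suffix}.pdf")
    return names


same_day = date(2021, 3, 4)
print(numbered_filenames("ABC", [same_day, same_day, date(2022, 1, 15)]))
# prints ['ABC210304.pdf', 'ABC210304-1.pdf', 'ABC220115.pdf']
```

Keeping the counter inside the helper also means it cannot leak from one page to the next.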
I used the datetime module and the re module to try to do what you asked. I hope it helps. Here is the code: