Using Beautiful Soup to extract data when there are no divs
P粉818306280 2024-02-26 16:22:47

I am trying to extract table data from several thousand HTML files (or from the live site), but the tables are not wrapped in divs that would make this easy, and I am new to Beautiful Soup. Right now I manually edit all of the HTML, convert it to CSV, and load that into my database to create the tables, but I would prefer to pull the data straight from what I already have.

<body style="margin-top:140px;">    
<div id="container">
 <!-- Left div -->
 <div>
  &nbsp;
 </div>
 <!-- Center div -->
 <div>
  <!-- Image Link -->
  <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a>
 </div>
 <!-- Right div -->
 <div>
  &nbsp;
 </div>
</div>
<A Name = "Top"></A>
<H1>5k Run</H1>
<H1>Overall Finish List</H1>
<H2>September 24, 2022</H2>
<HR noshade>
<B><I> </I></B>
<HR noshade>
<table border=0 cellpadding=0 cellspacing=0 class="racetable">
  <tr>
    <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td>
  </tr>
  <tr>
    <td class=h11>Place</td>
    <td class=h12>Name</td>
    <td class=h12>City</td>
    <td class=h11>Bib No</td>
    <td class=h11>Age</td>
    <td class=h11>Gender</td>
    <td class=h11>Age Group</td>
    <td class=h11>Total Time</td>
    <td class=h11>Pace</td>
  </tr>
  <tr>
    <td class=d01>1</td>
    <td class=d02>Runner 1</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>390</td>
    <td class=d01>52</td>
    <td class=d01>M</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   18:43.93</td>
    <td class=d01>6:03/M</td>
  </tr>
  <tr>
    <td class=d01>2</td>
    <td class=d02>Runner 2</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>380</td>
    <td class=d01>33</td>
    <td class=d01>M</td>
    <td class=d01>1:19-39</td>
    <td class=d01>   19:31.27</td>
    <td class=d01>6:18/M</td>
  </tr>
  <tr>
    <td class=d01>3</td>
    <td class=d02>Runner 3</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>389</td>
    <td class=d01>65</td>
    <td class=d01>F</td>
    <td class=d01>1:Overall</td>
    <td class=d01>   45:45.20</td>
    <td class=d01>14:46/M</td>
  </tr>
  <tr>
    <td class=d01>4</td>
    <td class=d02>Runner 4</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>381</td>
    <td class=d01>18</td>
    <td class=d01>F</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   53:28.84</td>
    <td class=d01>17:15/M</td>
  </tr>
  <tr>
    <td class=d01>5</td>
    <td class=d02>Runner 5</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>382</td>
    <td class=d01>41</td>
    <td class=d01>F</td>
    <td class=d01>1:40-59</td>
    <td class=d01>   53:30.48</td>
    <td class=d01>17:16/M</td>
  </tr>
  <tr>
    <td class=d01>6</td>
    <td class=d02>Runner 6</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>384</td>
    <td class=d01>14</td>
    <td class=d01>M</td>
    <td class=d01>1: 1-18</td>
    <td class=d01>   57:38.66</td>
    <td class=d01>18:36/M</td>
  </tr>
  <tr>
    <td class=d01>7</td>
    <td class=d02>Runner 7</td>
    <td class=d02>ANYTOWN  PA</td>
    <td class=d01>385</td>
    <td class=d01>72</td>
    <td class=d01>F</td>
    <td class=d01>1:60-99</td>
    <td class=d01>   57:40.11</td>
    <td class=d01>18:36/M</td>
  </tr>
</table>
 
<HR noshade>
<p>
<!-- 0c17  22.0 2e9 -->
</BODY>
</HTML>

I have tried adding divs, without much success.


All replies (1)
P粉463291248

BeautifulSoup lets you search for any element, not just divs.
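As a small illustration of that point, here is a minimal sketch that finds a table by its tag name and class, with no wrapping div involved. The HTML string is a trimmed-down stand-in for the page in the question (an assumption made for brevity), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (hypothetical sample).
html = """
<h1>5k Run</h1>
<table class="racetable">
  <tr><td class=h11>Place</td><td class=h12>Name</td></tr>
  <tr><td class=d01>1</td><td class=d02>Runner 1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Look the table up by tag name plus class; no wrapping div is needed.
table = soup.find("table", class_="racetable")

# The same lookup written as a CSS selector.
same_table = soup.select_one("table.racetable")

# Individual cells can be targeted by their own class as well.
names = [td.get_text(strip=True) for td in table.find_all("td", class_="d02")]
print(names)
```

Either lookup style should work here; `find` takes keyword filters, while `select_one` takes a CSS selector string.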

Assuming the HTML you posted, and that you want to end up with one record per runner, you can do something like this.

from bs4 import BeautifulSoup

file_path = 'scrap.html'

# Simulate the response of an HTTP request by simply reading a local .html file
with open(file_path, 'r', encoding='utf-8') as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table', {"class": "racetable"})  # Find the table with the 'racetable' class
rows_table = table.find_all('tr')[1:]  # All rows except the first one, which only holds the race title

columns_name = [
    cell.get_text().strip() for cell in rows_table[0].find_all('td')
]  # Column names, taken from the header row

runners = []
for row in rows_table[1:]:  # Loop over every row after the header row
    data = [
        elem.get_text().strip() for elem in row.find_all('td')
    ]
    runner = {
        "place": data[columns_name.index("Place")],
        "name": data[columns_name.index("Name")],
        "city": data[columns_name.index("City")],
        "bib_no": data[columns_name.index("Bib No")],
        "age": data[columns_name.index("Age")],
        "gender": data[columns_name.index("Gender")],
        "age_group": data[columns_name.index("Age Group")],
        "total_time": data[columns_name.index("Total Time")],
        "pace": data[columns_name.index("Pace")]
    }
    print(runner)
    runners.append(runner)

The printed output looks like this:

{'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN  PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
{'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN  PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
{'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN  PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
{'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN  PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
{'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN  PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
{'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN  PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
{'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN  PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}
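
Since the question mentions several thousand HTML files that currently get converted to CSV by hand, the same idea can be wrapped in a loop. The following is only a sketch under a few assumptions: every file contains a `racetable` table with the same column layout, the files are UTF-8 and sit in one folder, and the names `parse_race_table`, `export_folder` and the extra `source_file` column are hypothetical, introduced here for illustration.

```python
import csv
from pathlib import Path

from bs4 import BeautifulSoup


def parse_race_table(html):
    """Return one dict per runner row, keyed by the header cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="racetable")
    if table is None:  # page without a results table
        return []
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]
    # Drop single-cell rows such as the race title; the first remaining
    # row is then the one holding the column names.
    rows = [cells for cells in rows if len(cells) > 1]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, cells)) for cells in body]


def export_folder(html_dir, csv_path):
    """Parse every .html file in html_dir and write all runners to one CSV.

    Assumes all files share the same columns; returns the number of rows written.
    """
    runners = []
    for path in sorted(Path(html_dir).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        for runner in parse_race_table(html):
            runner["source_file"] = path.name  # remember where the row came from
            runners.append(runner)
    if not runners:
        return 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(runners[0]))
        writer.writeheader()
        writer.writerows(runners)
    return len(runners)
```

A call such as `export_folder("results_html", "runners.csv")` (both paths are placeholders) would then produce a single CSV that can be loaded into the database. Keying each row by the header text with `zip` avoids repeating `columns_name.index(...)` for every field, at the cost of keeping the original column titles as keys.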