Converting an HTML log with nested tables to a CSV file

Question

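I am trying to convert an HTML file that contains multiple nested tables to a .csv file. One of the columns holds a nested table, and I want that table converted to plain text. I have been trying to do this in Python with BeautifulSoup, without success.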

P粉662614213 · Answer

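Converting an HTML file with nested tables to CSV while preserving its structure can be tricky. BeautifulSoup is a good library for parsing HTML, but nested tables need extra handling, because its searches look through all descendants of an element by default and therefore also return the rows and cells of inner tables.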
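To illustrate the pitfall, here is a minimal sketch. The sample markup and the variable names are made up for the demonstration and are not taken from the original log file. It shows that a plain `find_all('tr')` on the outer table also returns the nested table's row, and one possible way to keep only the outer table's own rows.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a parent table with a nested table in one cell
html = """
<table id="outer">
  <tr><th>Name</th><th>Details</th></tr>
  <tr><td>job1</td><td><table><tr><td>a</td><td>b</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("table")

# find_all searches all descendants, so the nested table's row is included
all_rows = outer.find_all("tr")

# Keep only rows whose closest enclosing table is the outer one
own_rows = [row for row in all_rows if row.find_parent("table") is outer]

print(len(all_rows), len(own_rows))  # prints "3 2"
```

Filtering with `find_parent` works whether or not the markup wraps its rows in `<tbody>`, which is why it is used here instead of `recursive=False` on the table itself.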

To get the output you want, you can combine BeautifulSoup with some custom Python code that parses the HTML, extracts the data, and arranges it as CSV. Here is a step-by-step approach:

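Parse the HTML file with BeautifulSoup.
Find the parent table and extract its headers.
Find all rows of the parent table.
For each row, look for a nested table in the relevant column (if one exists).
Extract the data from the nested table and append it to the corresponding cell of the parent table.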

Here is a Python snippet to get you started:

from bs4 import BeautifulSoup
import csv

def extract_nested_table_data(table_cell):
    # Helper function to extract the data from a nested table cell
    nested_table = table_cell.find('table')
    if not nested_table:
        return ''

    # Process the nested table and extract its data as plain text
    nested_rows = nested_table.find_all('tr')
    nested_data = []
    for row in nested_rows:
        nested_cells = row.find_all(['td', 'th'])
        nested_data.append([cell.get_text(strip=True) for cell in nested_cells])
    
    # Convert nested_data to a formatted plain text representation
    nested_text = '\n'.join(','.join(row) for row in nested_data)
    return nested_text

def convert_html_to_csv(html_filename, csv_filename):
    with open(html_filename, 'r', encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')

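        parent_table = soup.find('table')
        # Keep only the rows whose closest enclosing table is the parent table,
        # so rows of nested tables are not written out as rows of their own
        rows = [row for row in parent_table.find_all('tr')
                if row.find_parent('table') is parent_table]
        # Read headers from the first parent row only; nested <th> cells are ignored
        headers = [cell.get_text(strip=True)
                   for cell in rows[0].find_all(['td', 'th'], recursive=False)]

        with open(csv_filename, 'w', newline='', encoding='utf-8') as csv_file:
            csv_writer = csv.writer(csv_file)
            csv_writer.writerow(headers)

            for row in rows[1:]:  # Skipping the header row
                row_data = []
                for cell in row.find_all(['td', 'th'], recursive=False):
                    # Flatten the nested table (if it exists), then detach it so
                    # its text is not repeated in the cell's own text
                    nested_data = extract_nested_table_data(cell)
                    nested_table = cell.find('table')
                    if nested_table:
                        nested_table.extract()
                    own_text = cell.get_text(strip=True)
                    row_data.append('\n'.join(filter(None, [own_text, nested_data])))

                csv_writer.writerow(row_data)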

if __name__ == '__main__':
    html_filename = 'input.html'
    csv_filename = 'output.csv'
    convert_html_to_csv(html_filename, csv_filename)

This code joins the cells of each nested table row with commas and the rows with newlines. If your nested table data already contains commas, consider a different separator so the flattened text stays readable. Complex HTML structures may require further adjustments to this code, depending on the specifics of your data. Nonetheless, this should serve as a good starting point to tackle the task.
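As a rough sketch of what adjusting the separator could look like, the helper below takes the cell and row separators as parameters. The name `flatten_table`, its default separators, and the sample markup are all assumptions made for this illustration, not part of the snippet above.

```python
import csv
import io

from bs4 import BeautifulSoup


def flatten_table(table, cell_sep=" | ", row_sep="; "):
    """Join a table's cells with cell_sep and its rows with row_sep.

    Hypothetical helper; both default separators are arbitrary choices.
    """
    lines = []
    for row in table.find_all("tr"):
        cells = row.find_all(["td", "th"])
        lines.append(cell_sep.join(cell.get_text(strip=True) for cell in cells))
    return row_sep.join(lines)


# Hypothetical nested table whose first value contains a comma
html = "<table><tr><td>x, y</td><td>1</td></tr><tr><td>z</td><td>2</td></tr></table>"
table = BeautifulSoup(html, "html.parser").find("table")

flat = flatten_table(table)

# Write one CSV row to an in-memory buffer instead of a file
buffer = io.StringIO()
csv.writer(buffer).writerow(["job1", flat])
print(buffer.getvalue())  # job1,"x, y | 1; z | 2"
```

Note that `csv.writer` quotes any field containing the delimiter, so the output is valid CSV even with commas inside a cell. Choosing other separators mainly keeps the flattened text easier to read and to split again later.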