Converting an HTML log with nested tables to a .csv file

Question

I am trying to convert an HTML file that contains several nested tables into a CSV file. One column of the parent table holds another table, and I want to turn that nested table into plain text. I tried to do this with Beautiful Soup in Python, but without success.

P粉662614213 · Answer

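Converting an HTML file with nested tables to CSV while preserving its structure can be a little tricky. BeautifulSoup is a good library for parsing HTML, but nested tables need some extra handling to come out correctly.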
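To illustrate why nested tables need extra handling, here is a minimal sketch. The sample HTML is made up for this example. It shows that find_all() descends into nested tables, and two ways to restrict a lookup to the parent table: filtering with find_parent() and passing recursive=False.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second cell of the data row holds a nested table
SAMPLE_HTML = """
<table>
<tr><th>Name</th><th>Details</th></tr>
<tr><td>Job 1</td><td><table><tr><td>a</td><td>b</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
parent_table = soup.find('table')

# find_all() searches all descendants, so the nested row is included too
all_rows = parent_table.find_all('tr')

# Keep only the rows whose closest enclosing table is the parent table
own_rows = [row for row in all_rows if row.find_parent('table') is parent_table]

# Without recursive=False the cells of the nested table are returned as well
all_cells = own_rows[1].find_all(['td', 'th'])
own_cells = own_rows[1].find_all(['td', 'th'], recursive=False)

print(len(all_rows), len(own_rows), len(all_cells), len(own_cells))
```

If the nested rows and cells are not filtered out, they end up as extra rows and columns in the CSV.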

To get the output you want, you can use BeautifulSoup together with some custom Python code that parses the HTML, extracts the data, and arranges it as CSV. Here is a step-by-step approach:

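Parse the HTML file with BeautifulSoup.
Find the parent table and extract its headers.
Find all rows in the parent table.
For each row, find the nested table in the relevant column (if there is one).
Extract the data from the nested table and append it to the corresponding cell of the parent row.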

Here is a Python snippet to get you started:

from bs4 import BeautifulSoup
import csv

def extract_nested_table_data(table_cell):
    # Helper function to extract the data from a nested table cell
    nested_table = table_cell.find('table')
    if not nested_table:
        return ''

    # Process the nested table and extract its data as plain text
    nested_rows = nested_table.find_all('tr')
    nested_data = []
    for row in nested_rows:
        nested_cells = row.find_all(['td', 'th'])
        nested_data.append([cell.get_text(strip=True) for cell in nested_cells])
    
    # Convert nested_data to a formatted plain text representation
    nested_text = '\n'.join(','.join(row) for row in nested_data)
    return nested_text

def convert_html_to_csv(html_filename, csv_filename):
    with open(html_filename, 'r', encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')

        parent_table = soup.find('table')
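        # Take the headers from the first row only; find_all('th') on the whole
        # table would also return header cells of nested tables
        header_row = parent_table.find('tr')
        headers = [header.get_text(strip=True)
                   for header in header_row.find_all(['th', 'td'], recursive=False)]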

        with open(csv_filename, 'w', newline='', encoding='utf-8') as csv_file:
            csv_writer = csv.writer(csv_file)
            csv_writer.writerow(headers)

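            # Keep only the rows that belong to the parent table itself,
            # not the rows of tables nested inside its cells
            rows = [r for r in parent_table.find_all('tr')
                    if r.find_parent('table') is parent_table]
            for row in rows[1:]:  # Skipping the header row
                # recursive=False leaves out the cells of nested tables
                cells = row.find_all(['td', 'th'], recursive=False)
                row_data = []
                for cell in cells:
                    # Flatten the nested table (if it exists), then detach it so
                    # that get_text() does not return its text a second time
                    nested_data = extract_nested_table_data(cell)
                    for nested_table in cell.find_all('table'):
                        nested_table.extract()
                    row_data.append(cell.get_text(strip=True) + nested_data)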

                csv_writer.writerow(row_data)

if __name__ == '__main__':
    html_filename = 'input.html'
    csv_filename = 'output.csv'
    convert_html_to_csv(html_filename, csv_filename)

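This code joins the cells of each nested row with commas and the nested rows with newlines. csv.writer quotes such fields automatically, so the file stays valid, but a cell that contains commas and line breaks can be hard to read. If that is a problem, adjust the separators in extract_nested_table_data accordingly.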
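As a small illustration of the separator question, the sketch below uses only the standard library. The row contents and the alternative separators (';' and ' | ') are made up for this example. It writes one field that contains commas and a newline, reads it back, and then builds the same nested data with separators that do not collide with CSV syntax.

```python
import csv
import io

# Hypothetical row: the second field is a flattened nested table, with
# commas between cells and a newline between nested rows
row = ['Job 1', 'a,b\nc,d']

buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_text = buffer.getvalue()  # the second field is wrapped in double quotes

# Reading the text back yields the original two fields
parsed = next(csv.reader(io.StringIO(csv_text, newline='')))

# Alternative: separators that do not need quoting in a comma-delimited file
nested_rows = [['a', 'b'], ['c', 'd']]
flat = ' | '.join(';'.join(cells) for cells in nested_rows)

print(repr(csv_text))
print(parsed)
print(flat)
```

Either form round-trips through the csv module. The second one keeps each record on a single line, which is easier to inspect in a text editor.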

Remember that handling complex HTML structures may require further adjustments to this code, depending on the specifics of your data. Nonetheless, this should serve as a good starting point to tackle the task.
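One such adjustment is handling tables nested more than one level deep. The following is only a sketch under assumptions: flatten_table is a hypothetical helper, the sample HTML is made up, and the ';', ' | ' and bracket notation is an arbitrary choice for this example. Note that it detaches nested tables from the parsed tree as it goes.

```python
from bs4 import BeautifulSoup

def flatten_table(table):
    """Render a table as text, descending into tables nested in its cells."""
    lines = []
    for row in table.find_all('tr'):
        # Skip rows that belong to a nested table; the recursion handles those
        if row.find_parent('table') is not table:
            continue
        fields = []
        for cell in row.find_all(['td', 'th'], recursive=False):
            # Tables whose closest enclosing table is the current one
            nested = [t for t in cell.find_all('table')
                      if t.find_parent('table') is table]
            nested_text = ''.join('[' + flatten_table(t) + ']' for t in nested)
            # Detach them so get_text() returns only the cell's own text
            for t in nested:
                t.extract()
            fields.append(cell.get_text(strip=True) + nested_text)
        lines.append(';'.join(fields))
    return ' | '.join(lines)

# Hypothetical sample with two levels of nesting
SAMPLE_HTML = (
    '<table><tr><td>outer</td><td>x'
    '<table><tr><td>mid</td><td>'
    '<table><tr><td>deep</td></tr></table>'
    '</td></tr></table>'
    '</td></tr></table>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
result = flatten_table(soup.find('table'))
print(result)
```

Each level of nesting is wrapped in square brackets, so the depth stays visible in the flattened text.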