BeautifulSoup: Combine top-level text with classic tag lookup functionality?
P粉471207302 2023-09-15 09:16:45

I'm trying to use BeautifulSoup to extract information from a non-uniformly structured HTML block. I'm looking for a way to include the blocks of text that sit between tags in the search/filter output. For example, given this HTML:

<span>
    <strong>Description</strong>
    Section1
    <ul>
        <li>line1</li>
        <li>line2</li>
        <li>line3</li>
    </ul>
    <strong>Section2</strong>
    Content2    
</span>

I want to create an output list that ignores certain types of tags (ul and li in the example above) but captures the top-level untagged text. The closest I've found is .select(':not(ul,li)') or .find_all(['strong']), but neither of them captures the untagged top-level text and the target tags at the same time. The ideal behavior would be something like this:

.find_all(['strong','UNTAGGED'])

which would produce the following output:

[
<strong>Description</strong>,
Section1,
<strong>Section2</strong>,
Content2
]


All replies (1)
P粉905144514

To get this output, you can first select each <strong> element and then take its next_sibling, which is the untagged text that follows it.

Example
from bs4 import BeautifulSoup
html = '''
<span>
    <strong>Description</strong>
    Section1
    <ul>
        <li>line1</li>
        <li>line2</li>
        <li>line3</li>
    </ul>
    <strong>Section2</strong>
    Content2    
</span>
'''
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

data = []

for e in soup.select('strong'):
    # the node right after each <strong> is the untagged text; strip its whitespace
    data.extend([e, e.next_sibling.strip()])

data
Output
[<strong>Description</strong>,
 'Section1',
 <strong>Section2</strong>,
 'Content2']
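The next_sibling approach relies on every <strong> being followed directly by a text node. As a hedged alternative, here is a minimal sketch that walks the direct children of the <span> and keeps both the non-empty text nodes and the tags named in a whitelist, which is closer to the imagined .find_all(['strong','UNTAGGED']). The helper name top_level_items and its keep_tags parameter are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '''
<span>
    <strong>Description</strong>
    Section1
    <ul>
        <li>line1</li>
        <li>line2</li>
        <li>line3</li>
    </ul>
    <strong>Section2</strong>
    Content2
</span>
'''


def top_level_items(parent, keep_tags=('strong',)):
    """Hypothetical helper: collect whitelisted child tags and bare text.

    Only direct children of `parent` are inspected, so text nested
    inside skipped tags (such as the <li> items) is never collected.
    """
    items = []
    for child in parent.children:
        if isinstance(child, NavigableString):
            # untagged text: keep it only if something remains after stripping
            text = child.strip()
            if text:
                items.append(text)
        elif isinstance(child, Tag) and child.name in keep_tags:
            # a tag we explicitly asked for
            items.append(child)
    return items


soup = BeautifulSoup(html, 'html.parser')
result = top_level_items(soup.span)
print(result)
```

Because the loop only looks at .children, the order of the output follows the document order, and a <strong> with no text after it (or text with no <strong> before it) is handled without special cases. Note that HTML comments are also NavigableString subclasses in bs4, so they would be collected as text unless filtered out separately.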