Python program to extract strings between HTML tags

WBOY

Released: 2023-08-19 09:37:19

HTML tags define the structure of a web page, and the content a page displays is carried as strings enclosed in those tags. The tag around a string determines how the browser displays and interprets it. Extracting these strings is therefore a common step in data manipulation and processing, and it also helps in analyzing the structure of an HTML document.

These strings carry the actual content of a web page. In this article, our task is to extract the strings that appear between HTML tags.

Understanding the Problem

We need to extract all strings between HTML tags. The target strings are surrounded by different types of tags, and only the content should be retrieved. Let us look at an example.

Input and output scenarios

Let us consider a string -

Input:
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"

The input string contains several HTML tags, and we need to extract the text between them.

Output: [" This is a test string,  Let's code together "]

As we can see, the "<h1>" and "<p>" tags are removed and the strings are extracted. Now that we understand the problem, let's discuss a few solutions.

Using iteration and replace()

This method removes the HTML tags by replacing them. We start with the input string and a list of the HTML tags to remove, and store the string as the only element of a list.

We loop through each entry in the tag list and check whether it occurs in the string. The "pos" variable holds the index of the string inside the list; because the list has a single element, it is 0 throughout the loop.

We use the "replace()" method to replace each tag with a space, which leaves a string without HTML tags.

Example

The following is an example for extracting strings between HTML tags -

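Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
# Every tag to strip; opening and closing forms are listed separately
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
print(f"This is the original string: {Inp_STR}")
ExStr = [Inp_STR]   # the string is stored as the only element of a list
pos = 0             # index of that element

for tag in tags:
   # Replace the tag with a space if it occurs in the string
   if tag in ExStr[pos]:
      ExStr[pos] = ExStr[pos].replace(tag, " ")

print(f"The extracted string is : {ExStr}")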

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is : [" This is a test string,  Let's code together "]
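
The result above keeps the spaces that replaced each tag, including a double space in the middle. As a minimal sketch, assuming single-spaced output is wanted, the same replace() loop can be wrapped in a function that collapses the extra whitespace afterwards. The function name strip_tags is hypothetical and not part of the original example.

```python
def strip_tags(text, tags):
    """Replace each listed tag with a space, then collapse extra whitespace."""
    for tag in tags:
        text = text.replace(tag, " ")
    # split() with no arguments splits on runs of whitespace and drops
    # leading/trailing blanks, so join() yields single-spaced text.
    return " ".join(text.split())


Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["<h1>", "</h1>", "<p>", "</p>", "<b>", "</b>", "<br>"]
print(strip_tags(Inp_STR, tags))
```

Note that this approach still merges the contents of all tags into one string, so it suits cases where only the plain text matters and not which tag it came from.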

Using the regular expression module and findall()

In this method, we use the regular expression module "re" to match a specific pattern. For each tag we build the pattern "<tag>(.*?)</tag>", which matches an opening tag, captures the text that follows it, and stops at the nearest matching closing tag. Here, "tag" is a variable whose value is taken from the tag list on each iteration.

The "findall()" function finds all occurrences of the pattern in the original string and returns the captured text. We use the "extend()" method to add all "matches" to a new list. In this way, we extract the strings contained in the HTML tags.
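
The "?" in "(.*?)" makes the group non-greedy, which matters as soon as the same tag occurs more than once. The short sketch below illustrates the difference; the sample string is a hypothetical input chosen only for this comparison.

```python
import re

# Hypothetical input with two <p> elements, used only for this illustration.
sample = "<p>first</p><p>second</p>"

# Greedy: (.*) runs to the LAST </p>, swallowing the tags in between.
greedy = re.findall("<p>(.*)</p>", sample)

# Non-greedy: (.*?) stops at the NEAREST </p>, giving one match per element.
lazy = re.findall("<p>(.*?)</p>", sample)

print(greedy)  # ['first</p><p>second']
print(lazy)    # ['first', 'second']
```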

Example

The following is an example -

import re
Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]   # tag names without angle brackets
print(f"This is the original string: {Inp_STR}")
ExStr = []

for tag in tags:
   # Non-greedy group captures the text between <tag> and the nearest </tag>
   seq = "<"+tag+">(.*?)</"+tag+">"
   matches = re.findall(seq, Inp_STR)
   ExStr.extend(matches)
print(f"The extracted string is: {ExStr}")

Output

This is the original string: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
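
The loop above needs a list of tag names and returns results grouped by tag rather than in document order. As an alternative sketch, and not the method used in this article, a single pattern with a backreference can match any simple tag pair in one pass. It assumes tags without attributes and without nesting of the same tag name.

```python
import re

Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"

# (\w+) captures the tag name; \1 requires the closing tag to repeat it.
# re.DOTALL lets "." match newlines, so content may span several lines.
pattern = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

# With two groups, findall() returns (tag_name, content) tuples.
ExStr = [content for _tag, content in pattern.findall(Inp_STR)]
print(ExStr)
```

Because the pattern is applied once from left to right, the extracted strings appear in the order they occur in the input.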

Using iteration and find()

In this method, we use the "find()" method to get the first occurrence of an opening tag and of its closing tag in the original string. We iterate through each element in the tag list and look up its position in the string.

A while loop keeps searching for further occurrences of the tag. If an opening tag has no closing tag after it, "find()" returns -1 and the loop stops. Otherwise, the search position is moved forward on each iteration to find the next pair of opening and closing tags.

For every pair found, we use string slicing with the two index values to extract the string between the tags and append it to the result list.

Example

The following is an example -

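Inp_STR = "<h1>This is a test string,</h1><p>Let's code together</p>"
tags = ["h1", "p", "b", "br"]   # tag names without angle brackets
ExStr = []
print(f"The original string is: {Inp_STR}")

for tag in tags:
   # Index of the first opening tag, or -1 if the tag is absent
   tagpos1 = Inp_STR.find("<"+tag+">")
   while tagpos1 != -1:
      # Look for the closing tag after the opening tag
      tagpos2 = Inp_STR.find("</"+tag+">", tagpos1)
      if tagpos2 == -1:
         break   # no closing tag: stop searching for this tag
      # Skip "<", the tag name and ">" (len(tag) + 2 characters)
      ExStr.append(Inp_STR[tagpos1 + len(tag)+2: tagpos2])
      # Continue with the next opening tag after this closing tag
      tagpos1 = Inp_STR.find("<"+tag+">", tagpos2)

print(f"The extracted string is: {ExStr}")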

Output

The original string is: <h1>This is a test string,</h1><p>Let's code together</p>
The extracted string is: ['This is a test string,', "Let's code together"]
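
To see how the loop behaves with repeated and unclosed tags, the same find() logic can be sketched as a reusable function for a single tag. The function name extract_between and the sample string are hypothetical and chosen only for this illustration.

```python
def extract_between(text, tag):
    """Return the contents of every <tag>...</tag> pair found in text."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    results = []
    start = text.find(open_tag)
    while start != -1:
        content_start = start + len(open_tag)
        end = text.find(close_tag, content_start)
        if end == -1:
            break  # opening tag without a closing tag: stop searching
        results.append(text[content_start:end])
        # Resume the search just past the closing tag
        start = text.find(open_tag, end + len(close_tag))
    return results


# Two complete <p> pairs, one <b> pair, and a final <p> that is never closed
sample = "<p>one</p><b>bold</b><p>two</p><p>unclosed"
print(extract_between(sample, "p"))  # ['one', 'two']
```

The unclosed final "<p>" is skipped because find() returns -1 for its closing tag, which ends the loop.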

Conclusion

In this article, we discussed three ways to extract strings between HTML tags. We started with the simplest solution, locating the tags and replacing them with spaces. We then used the regular expression module and its findall() function to find matching patterns. Finally, we used the find() method together with string slicing.
