How to remove html tags in java-javaTutorial-php.cn

Home

Java

javaTutorial

How to remove html tags in java

藏色散人

Mar 29, 2021 am 10:59 AM

java

java去掉html标签的方法：1、通过纯正则方法去掉html标签；2、使用“javax.swing.text.html.HTMLEditorKit”去掉html标签；3、通过使用Jsoup框架去掉html标签等等。

How to remove html tags in java

本文操作环境：windows7系统、Java8.0&&HTML5版，DELL G3电脑

Java去掉html标签的各种姿势

一、背景

业务开发中可能需要将html的标签全部去掉，本文将多种方法综合在这里，供大家参考。

二、方法

2.1 纯正则方法

import java.util.regex.Matcher; 
import java.util.regex.Pattern; 

public class HTMLSpirit{ 
    public static String delHTMLTag(String htmlStr){ 
        String regEx_script="<script[^>]*?>[\\s\\S]*?<\\/script>"; //定义script的正则表达式 
        String regEx_style="<style[^>]*?>[\\s\\S]*?<\\/style>"; //定义style的正则表达式 
        String regEx_html="<[^>]+>"; //定义HTML标签的正则表达式 
         
        Pattern p_script=Pattern.compile(regEx_script,Pattern.CASE_INSENSITIVE); 
        Matcher m_script=p_script.matcher(htmlStr); 
        htmlStr=m_script.replaceAll(""); //过滤script标签 
         
        Pattern p_style=Pattern.compile(regEx_style,Pattern.CASE_INSENSITIVE); 
        Matcher m_style=p_style.matcher(htmlStr); 
        htmlStr=m_style.replaceAll(""); //过滤style标签 
         
        Pattern p_html=Pattern.compile(regEx_html,Pattern.CASE_INSENSITIVE); 
        Matcher m_html=p_html.matcher(htmlStr); 
        htmlStr=m_html.replaceAll(""); //过滤html标签 

        return htmlStr.trim(); //返回文本字符串 
    } 
}

2.2 使用 javax.swing.text.html.HTMLEditorKit

import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.util.List;
import java.util.ArrayList;

import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.MutableAttributeSet;

public class HTMLUtils {
  private HTMLUtils() {}

  public static List<String> extractText(Reader reader) throws IOException {
    final ArrayList<String> list = new ArrayList<String>();

    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback() {
      public void handleText(final char[] data, final int pos) {
        list.add(new String(data));
      }
      public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) { }
      public void handleEndTag(Tag t, final int pos) {  }
      public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
      public void handleComment(final char[] data, final int pos) { }
      public void handleError(final java.lang.String errMsg, final int pos) { }
    };
    parserDelegator.parse(reader, parserCallback, true);
    return list;
  }

  public final static void main(String[] args) throws Exception{
    FileReader reader = new FileReader("java-new.html");
    List<String> lines = HTMLUtils.extractText(reader);
    for (String line : lines) {
      System.out.println(line);
    }
  }
}

【推荐：java视频教程】

2.3 使用Jsoup框架

import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.io.BufferedReader;
import org.jsoup.Jsoup;

public class HTMLUtils {
  private HTMLUtils() {}

  public static String extractText(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder();
    BufferedReader br = new BufferedReader(reader);
    String line;
    while ( (line=br.readLine()) != null) {
      sb.append(line);
    }
    String textOnly = Jsoup.parse(sb.toString()).text();
    return textOnly;
  }

  public final static void main(String[] args) throws Exception{
    FileReader reader = new FileReader
          ("C:/RealHowTo/topics/java-language.html");
    System.out.println(HTMLUtils.extractText(reader));
  }

2.4 使用Apache Tika

mport java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ParseHTMLWithTika {
  public static void main(String args[]) throws Exception {

    InputStream is = null;
    try {

         is = new FileInputStream("C:/Temp/java-x.html");
        WriteOutContentHandler contenthandler = new WriteOutContentHandler(100000000);
         Metadata metadata = new Metadata();
         Parser parser = new AutoDetectParser();
         parser.parse(is, contenthandler, metadata, new ParseContext());
         System.out.println(contenthandler.toString());
    }
    catch (Exception e) {
      e.printStackTrace();
    }
    finally {
        if (is != null) is.close();
    }
  }
}

注意这里经过本人实验有个小坑，WriteOutContentHandler参数是限制的字符数，这个如果不设置默认是1万，超过会报异常。

具体的jar包请自行到中央仓库里搜索依赖配置

https://search.maven.org/ 和 https://mvnrepository.com/

三、提供一个工具类

可以将资源路径的文本类型文件（如json/html）读取成字符串

public class ResourceUtil {
    /**
     * 根据当前类路径，获取资源文件夹对应文件的所有字符串
     *
     * @param currentClass 如 this.class
     * @param resourcePath 如 /data/json/xxx.json （相对于resources文件夹）
     */
    public static String resource2String(Class currentClass, String resourcePath) throws IOException {
        return IOUtils.toString(new FileReader(new File(currentClass.getResource(resourcePath).getFile())));
    }

}

四、总结

这里提供了多种去除html标签的方式，建议先测试好再实际使用。

测试时读取资源文件可以使用第三节提供的工具类。

如果正则表达式无法满足你的需求，自己进一步优化即可。

如果其他方式仍然有特殊情况没有考虑到，可以自己先用正则去除这种特殊情况。

总之这里只是一种参考，提供了多种解决方案。

The above is the detailed content of How to remove html tags in java. For more information, please follow other related articles on the PHP Chinese website!

Statement

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

How to get the current date and time in Java 8?Jul 23, 2025 am 04:06 AM

In Java 8, you recommend using the classes in the java.time package; 1. Get the full date and time to use LocalDateTime.now(); 2. Get the date only with LocalDate.now(); 3. Get the time only with LocalTime.now(); 4. Format output needs to be matched with DateTimeFormatter; 5. Specify the time zone and pass the ZoneId parameters, such as ZoneId.of("Asia/Shanghai"); these are more intuitive, thread-safe and easier to use than the old Date and Calendar.

How to read Java bytecode?Jul 23, 2025 am 04:05 AM

To understand Javabytecode, you can use the JDK-provided Javap tool to disassemble the bytecode to view the bytecode; 1. Use javac to compile the class file and view the method instruction list through the javap-c command; 2. Understand the stack-based bytecode structure and the operating mechanisms of common instructions such as iconst, store, iload, iadd, etc.; 3. You can use graphical tools such as BytecodeViewer or IDE plug-in to assist in analyzing the class structure and field information; 4. Pay attention to the actual conversion form of Java syntax sugar in bytecode, such as switch string support, try-with-resources and lambda expressions. Mastering these key points is helpful

how to sort an array in javaJul 23, 2025 am 04:03 AM

A common way to sort arrays in Java is to use Arrays.sort(). For basic data type arrays, such as int[] or double[], call Arrays.sort() directly to achieve ascending sort; if you need to sort in descending order, you need to use a wrapper class (such as Integer) and pass it into the Collections.reverseOrder() comparator. String arrays are sorted in dictionary order by default, and case-insensitive sorting can be achieved through String.CASE_INSENSITIVE_ORDER. When sorting custom object arrays, you need to make the class implement a Comparable interface or provide a Comparator, for example, according to Pe

Java Blockchain Development: Smart Contracts and DLTJul 23, 2025 am 04:00 AM

To develop blockchain in Java, the focus is on smart contract interaction and distributed ledger technology (DLT) applications. 1. Although Java does not directly write smart contracts, HyperledgerFabric's Chaincode can be called through SDK (such as fabric-gateway-java); 2. Java is suitable for building intermediate-layer services based on HyperledgerFabric and Corda, supporting business logic encapsulation, permission control, etc.; 3. During development, you need to pay attention to key details such as SDK version matching, identity authentication configuration, log debugging and performance optimization. By mastering these key points, Java developers can effectively carry out their work in enterprise-level blockchain projects.

Implementing Event-Driven Architecture with Java and Apache KafkaJul 23, 2025 am 03:51 AM

Understand core components: Producers publish events to Topics, Consumers subscribe and process events, KafkaBroker manages message storage and delivery; 2. Locally build Kafka: Use Docker to quickly start ZooKeeper and Kafka services, expose port 9092; 3. Java integration Kafka: introduce kafka-clients dependencies, or use SpringKafka to improve development efficiency; 4. Write Producer: configure KafkaProducer to send JSON format order events to orders topic; 5. Write Consumer: Subscribe to o through KafkaConsumer

Surviving the Java Coding Interview: Data Structures and AlgorithmsJul 23, 2025 am 03:46 AM

Master the core data structure and its applicable scenarios, such as the selection of HashMap and TreeMap, and the expansion mechanism of ArrayList; 2. Practice algorithms from the Java perspective, proficient in double pointer, sliding window, DFS/BFS and other modes and can be clearly implemented; 3. Write clean and robust Java code, pay attention to naming, boundary processing and language features (such as generics and final); 4. Prepare the practical question of "why use Java" and understand the impact of StringBuilder, GC, etc. on performance; maintain practice and clear expression to stand out.

Implementing Event-Driven Architecture with Java KafkaJul 23, 2025 am 03:43 AM

The core points of using Java and Kafka to implement event-driven architecture include: 1. Design a clear event model, use Avro SchemaRegistry to manage structure changes, unify naming specifications and include necessary information; 2. Set reliability parameters, asynchronous sending and log callbacks when building producers, and consumers use Group to achieve expansion, control offset submission and idempotence processing; 3. Use KafkaStreams to implement real-time processing logic, such as window aggregation statistics; 4. Design an error retry mechanism, catch exceptions and retry failure messages, use DLQ to handle multiple failure events, and improve system robustness.

Mastering the Builder Design Pattern in JavaJul 23, 2025 am 03:42 AM

The Builder mode solves the problem of too many construct parameters and mutability by building complex objects in step by step; 2. When implementing, set the class to final and initialize the fields through Builder in a private construct; 3. Create a static internal Builder class, and each setting method returns this to support chain calls; 4. Verify the required fields in build() to ensure object consistency; 5. Applicable to multiple parameters, especially objects with optional parameters, to improve readability and maintenance, and avoid telescoping constructors or destroying immutability setters.

See all articles