How Can I Preserve Line Breaks When Converting HTML to Plain Text Using Jsoup?

Oct 30, 2024 pm 11:24 PM

Preserving Line Breaks Using Jsoup: A Comprehensive Guide

When converting HTML to plain text, preserving line breaks is crucial for readability. Jsoup, a popular Java HTML parser library, makes it easy to extract text from HTML, but keeping the line structure of that text takes a little extra work.

This guide looks at preserving line breaks when using Jsoup's Jsoup.parse(str).text() method. This method extracts the text content of the HTML, but it normalizes whitespace along the way, so <br> tags, paragraph boundaries, and newlines in the source all come out as single spaces.
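The whitespace behavior described above can be seen with a minimal sketch. It assumes jsoup is on the classpath; the class and method names (TextCollapseDemo, normalizedText) are made up for illustration.

```java
import org.jsoup.Jsoup;

final class TextCollapseDemo {
    // Returns jsoup's whitespace-normalized text for the given HTML.
    static String normalizedText(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        // Two paragraphs, a <br>, and a real newline between the paragraphs.
        String html = "<p>First line<br>Second line</p>\n<p>Next paragraph</p>";
        // Everything is printed on a single line: no newline survives text().
        System.out.println(normalizedText(html));
    }
}
```

None of the three kinds of break in the input (the <br>, the paragraph boundary, the source newline) appears as a newline in the result.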

Utilizing TextNode.getWholeText()

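The original question first explored Jsoup's TextNode.getWholeText() method. This approach proved ineffective: the method returns the unnormalized text of a single text node, so it keeps newlines that are literally present in the source but does nothing to turn tags such as <br> or <p> into line breaks.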
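A small sketch of that limitation, assuming jsoup is on the classpath. WholeTextDemo and rawParagraphText are hypothetical names, and the helper assumes the input contains at least one <p> element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

final class WholeTextDemo {
    // Concatenates the raw (unnormalized) text of the direct text-node
    // children of the first <p>. Assumes the HTML contains a <p>.
    static String rawParagraphText(String html) {
        Element p = Jsoup.parse(html).select("p").first();
        StringBuilder sb = new StringBuilder();
        for (TextNode node : p.textNodes()) {
            sb.append(node.getWholeText());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "one" and "two" are separated by a real newline in the source,
        // "two" and "three" only by a <br> tag.
        String raw = rawParagraphText("<p>one\ntwo<br>three</p>");
        System.out.println(raw.contains("one\ntwo")); // source newline is kept
        System.out.println(raw.contains("twothree")); // the <br> adds nothing
    }
}
```

The source newline survives, but the <br> is not a text node, so nothing marks the break between "two" and "three".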

The Effective Solution

The solution is a two-stage approach: first mark where the line breaks belong in the parsed document, then strip the tags while leaving those marks and the original whitespace intact.

The code snippet takes the following steps:

  1. Parses the HTML string using Jsoup.
  2. Disables HTML pretty printing so that html() keeps the original line breaks and spacing.
  3. Appends a literal \n marker (a backslash followed by n) to each <br> tag and prepends \n\n to each <p> tag.
  4. Replaces each literal \n marker with an actual newline character.
  5. Cleans the modified HTML with an empty whitelist to strip all remaining tags.
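Step 4 hinges on a string-escaping detail that is easy to get wrong: the marker is the two characters backslash and n, not a newline, so the regex has to match a literal backslash. The sketch below uses only the standard library; MarkerReplaceDemo and its method names are illustrative.

```java
final class MarkerReplaceDemo {
    // Turns each literal backslash-n marker (two characters) into a real newline.
    static String markersToNewlines(String s) {
        // The Java literal "\\\\n" is the regex \\n: an escaped backslash, then 'n'.
        return s.replaceAll("\\\\n", "\n");
    }

    // For contrast: the Java literal "\\n" is the regex \n, which matches only
    // real newline characters, so literal markers pass through unchanged.
    static String newlineRegexOnly(String s) {
        return s.replaceAll("\\n", "\n");
    }

    public static void main(String[] args) {
        String marked = "line one\\nline two"; // backslash + 'n', not a newline
        System.out.println(markersToNewlines(marked).contains("\n")); // true
        System.out.println(newlineRegexOnly(marked).equals(marked));  // true
    }
}
```

Because no regex features are needed here, the non-regex String.replace("\\n", "\n") would be an equivalent and arguably clearer choice.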

Implementation

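<code class="java">import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // renamed to Safelist in newer jsoup releases

public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // Makes html() preserve line breaks and spacing.
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    // Insert literal backslash-n markers; they are swapped for real newlines below.
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    // The regex \\n matches a literal backslash followed by 'n'.
    String s = document.html().replaceAll("\\\\n", "\n");
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}</code>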

Satisfied Requirements

The provided solution fulfills the following requirements:

  • Preserves existing newlines (\n) in the HTML.
  • Converts <br> and <p> tags into newlines.
  • Removes any unwanted formatting or tags in the resulting text.

By implementing this solution, you can effectively preserve line breaks when converting HTML to plain text using Jsoup, ensuring accurate and readable results.
