Java HTML Parse Jsoup getPlainTextFromHtml(String html)

Here you can find the source of getPlainTextFromHtml(String html)

Description

Extracts plain text from given HTML String.

License

Open Source License

Parameter

Parameter Description
html HTML String to be used for plain text extraction

Return

Extracted plain text

Declaration

public static String getPlainTextFromHtml(String html) 

Method Source Code


//package com.java2s;
//License from project: Open Source License 

import org.jsoup.Jsoup;

public class Main {
    /**//ww  w  . j a  va 2 s .  com
     * Extracts plain text from given HTML String.
     *
     * @param html HTML String to be used for plain text extraction
     * @return Extracted plain text
     */
    public static String getPlainTextFromHtml(String html) {
        // keep line breaks of b-tags after other tags
        html = html.replaceAll("(?i)>\\s*\\n*\\s*<b>", "><br><b>");

        // replace b-tags with space to prevent line breaks
        html = html.replaceAll("(?i)\\s*\\n*\\s*</?b>\\s*", " ");

        // replace br-tags and line breaks with placeholder
        String breakTagPlaceholder = "%BREAK%";
        html = html.replaceAll("(?i)(<br[^>]*>|\\n)", breakTagPlaceholder);

        // replace p-tags with placeholder
        html = html.replaceAll("(?i)(<p>|\\n)", breakTagPlaceholder);

        // clean html
        html = Jsoup.parse(html).text();

        // replace non-breaking space with normal whitespace
        html = html.replaceAll("\\u00A0", " ");

        // replace placeholder with real line breaks
        html = html.replaceAll(breakTagPlaceholder, "\n");

        return html;
    }
}

Related

  1. getHtmlInTag(String html, String tag)
  2. getImageCredit(String html)
  3. getJSFileLinks(String html)
  4. getMetaValue(String html, String metaKey)
  5. getPlainText(String htmlText)
  6. getTitle(String htmlContent)
  7. htmlArray2textArray(List htmlArray)
  8. isHTMLEmpty(String textToCheck)
  9. parse(final String html)