This Jsoup clean HTML example shows how to clean HTML using Jsoup. It also shows how to remove HTML tags from a String and how to retain specific tags by using a whitelist while cleaning the HTML.
How to remove HTML tags by cleaning the HTML using Jsoup?
You can remove HTML tags from a String using the clean method of the Jsoup class.

    static String clean(String strHTML, Whitelist whitelist)
This method removes all HTML tags from the HTML string except the tags included in the specified whitelist. Jsoup provides the following whitelists out of the box.
1) none
All HTML tags are removed; only the text nodes are kept.
2) simpleText
This whitelist allows only text formatting HTML tags b, em, i, strong and u. All other tags are removed.
3) basic
Basic whitelist allows a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul tags. All other tags are removed. It does not allow images.
4) basicWithImages
As the name suggests, this whitelist allows all tags included in the basic whitelist plus image (img tag).
5) relaxed
This is the most accommodating whitelist which allows a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul tags.
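To make the difference between these whitelists concrete, the sketch below cleans one small fragment with four of them. The class name WhitelistComparison, the helper cleanWith and the sample HTML are made up for illustration. It assumes a jsoup release that still ships org.jsoup.safety.Whitelist; from jsoup 1.14.1 onwards the class is named Safelist and offers the same factory methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class WhitelistComparison {

    // Cleans the given HTML with the built-in whitelist chosen by name.
    // Assumption: the jsoup version on the classpath still provides Whitelist
    // (newer releases call the same class Safelist).
    static String cleanWith(String name, String html) {
        Whitelist whitelist;
        switch (name) {
            case "none":       whitelist = Whitelist.none(); break;
            case "simpleText": whitelist = Whitelist.simpleText(); break;
            case "basic":      whitelist = Whitelist.basic(); break;
            case "relaxed":    whitelist = Whitelist.relaxed(); break;
            default: throw new IllegalArgumentException("Unknown whitelist: " + name);
        }
        return Jsoup.clean(html, whitelist);
    }

    public static void main(String[] args) {
        // Illustrative input: a heading and a paragraph with bold text
        String html = "<h1>Title</h1><p>Hello <b>world</b></p>";

        for (String name : new String[] {"none", "simpleText", "basic", "relaxed"}) {
            System.out.println(name + ":");
            System.out.println(cleanWith(name, html));
            System.out.println();
        }
    }
}
```

With this input, none should leave only the text, simpleText should additionally keep the b tag, basic should also keep the p tag, and only relaxed should keep the h1 tag. The exact line breaks in the printed result depend on the jsoup version.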
How to clean HTML using a whitelist?
Create an appropriate whitelist object and pass it to the clean method to clean the HTML while retaining the tags specified in the whitelist, as given below.
    package com.javacodeexamples.libraries.jsoup;

    import org.jsoup.Jsoup;
    import org.jsoup.safety.Whitelist;

    public class JsoupCleanHTMLExample {

        public static void main(String[] args) {

            // sample HTML document to clean
            String strHTML = "<html>"
                    + "<head>"
                    + "<title>your title here</title>"
                    + "</head>"
                    + "<body bgcolor=\"ffffff\">"
                    + "<center><img src=\"clouds.jpg\" align=\"bottom\"> </center>"
                    + "<hr>"
                    + "<a href=\"http://www.google.com\">Google</a>"
                    + "<h1>heading 1</h1>"
                    + "<h2>heading2</h2>"
                    + "<p>Para tag</p>"
                    + "<p><b>bold paragraph</b>"
                    + "<br><b><i>bold italics text.</i></b>"
                    + "<hr>Horizontal line"
                    + "</body>"
                    + "</html>";

            // clean HTML using none whitelist (remove all HTML tags)
            String cleanedHTML = Jsoup.clean(strHTML, Whitelist.none());
            System.out.println("None whitelist");
            System.out.println(cleanedHTML);

            System.out.println("");

            // clean HTML using relaxed whitelist
            cleanedHTML = Jsoup.clean(strHTML, Whitelist.relaxed());
            System.out.println("Relaxed whitelist");
            System.out.println(cleanedHTML);
        }
    }
Output
    None whitelist
    your title here Googleheading 1heading2Para tagbold paragraphbold italics text.Horizontal line

    Relaxed whitelist
    your title here
    <img align="bottom">
    <a href="http://www.google.com">Google</a>
    <h1>heading 1</h1>
    <h2>heading2</h2>
    <p>Para tag</p>
    <p><b>bold paragraph</b><br><b><i>bold italics text.</i></b></p>Horizontal line
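Note that in the relaxed output the img tag has lost its src attribute. The two-argument clean method has no base URI to resolve relative URLs against, so a relative value such as clouds.jpg is removed. The sketch below, which is only an illustration, uses the three-argument overload clean(String, String, Whitelist) to supply a base URI. The class name BaseUriSketch, the helper names and the base URI http://example.com/ are assumptions made for the example, and it again assumes a jsoup version that still provides Whitelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class BaseUriSketch {

    // No base URI: the relative src cannot be resolved to http/https,
    // so the attribute is dropped while the img tag itself is kept.
    static String cleanWithoutBase(String html) {
        return Jsoup.clean(html, Whitelist.relaxed());
    }

    // With a base URI: the relative src is resolved to an absolute URL and kept.
    static String cleanWithBase(String html, String baseUri) {
        return Jsoup.clean(html, baseUri, Whitelist.relaxed());
    }

    public static void main(String[] args) {
        String html = "<img src=\"clouds.jpg\" align=\"bottom\">";

        System.out.println(cleanWithoutBase(html));

        // http://example.com/ is a hypothetical base URI used only for this sketch
        System.out.println(cleanWithBase(html, "http://example.com/"));
    }
}
```

If the relative form of the URL should survive unchanged, Whitelist also has a preserveRelativeLinks(true) setting, which still needs a base URI to validate the link against.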
How to retain specific tags while cleaning the HTML document?
The default whitelists come with pre-configured tags. What if you want to retain only particular tags and remove all other HTML tags? The Whitelist class provides the addTags method, which lets you add as many tags as you want to retain, as given below.

    public Whitelist addTags(String... tags)
This method adds HTML tags to the whitelist.
The below example shows how to retain only <div> tags and remove all other HTML tags from the HTML String.
    String strHTML = "<html>"
            + "<head>"
            + "<title>your title here</title>"
            + "</head>"
            + "<body bgcolor=\"ffffff\">"
            + "<a href=\"http://www.google.com\">Google</a>"
            + "<h1>heading 1</h1>"
            + "<div>div tag content</div>"
            + "</body>"
            + "</html>";

    // start from the none whitelist and allow only the div tag
    String str = Jsoup.clean(strHTML, Whitelist.none().addTags("div"));

    System.out.println(str);
Output
    your title hereGoogleheading 1
    <div>
     div tag content
    </div>
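Since addTags takes a variable number of tag names, the same idea can be wrapped in a small helper. The sketch below is illustrative only: the class RetainTagsSketch and the helpers keepOnly and keepDivWithClass are hypothetical names, and it assumes a jsoup version that still provides Whitelist. It also uses addAttributes(String tag, String... attributes), because a tag added with addTags keeps none of its attributes unless they are allowed as well.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class RetainTagsSketch {

    // Hypothetical helper: keeps only the given tags, everything else is reduced to text.
    static String keepOnly(String html, String... tags) {
        return Jsoup.clean(html, Whitelist.none().addTags(tags));
    }

    // Hypothetical helper: keeps div tags together with their class attribute.
    static String keepDivWithClass(String html) {
        Whitelist whitelist = Whitelist.none()
                .addTags("div")
                .addAttributes("div", "class");
        return Jsoup.clean(html, whitelist);
    }

    public static void main(String[] args) {
        // retain two tags at once; the h1 tag is removed but its text stays
        System.out.println(keepOnly("<h1>Title</h1><em>note</em> <code>x</code>", "em", "code"));

        // the class attribute is kept, the id attribute is not whitelisted and is removed
        System.out.println(keepDivWithClass("<div class=\"box\" id=\"main\">content</div>"));
    }
}
```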
Please also see the example on how to remove HTML tags from a string in Java using Jsoup.
This example is a part of the Jsoup tutorial with examples.
Please let me know your views in the comments section below.
Hi,
Is there a way to remove elements in a given context, for example bold inside bold?
For example, if I have:
<b>text <b>1</b><b> text</b> <b>2</b></b>
the result after cleaning should be:
<b>text 1 text 2</b>
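One possible approach to the question above, offered as a sketch rather than a definitive answer: the clean method only decides which tags survive, so it does not look at where a tag is nested. Flattening nested tags may be easier on the parsed document, by selecting every b element that sits inside another b element and unwrapping it with Elements.unwrap(), which removes the element but keeps its children. The class name NestedBoldSketch and the helper flattenNestedBold are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class NestedBoldSketch {

    // Hypothetical helper: removes b tags nested inside another b tag, keeping their content.
    static String flattenNestedBold(String html) {
        Document doc = Jsoup.parseBodyFragment(html);

        // disable pretty printing so the whitespace of the input is kept as-is
        doc.outputSettings().prettyPrint(false);

        // "b b" matches every b element that has a b ancestor
        doc.select("b b").unwrap();

        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<b>text <b>1</b><b> text</b> <b>2</b></b>";
        System.out.println(flattenNestedBold(html)); // prints <b>text 1 text 2</b>
    }
}
```

The returned string could then be passed to Jsoup.clean with a whitelist if other tags also need to be removed.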