This Jsoup tutorial with examples will help you understand how to use Jsoup in an easy way. In it, I will use Jsoup examples to show you that web scraping has never been easier. Jsoup is an open-source library for parsing HTML content and web scraping, distributed under the MIT license. That means you are free to download, use, and distribute it.
Why should you use Jsoup instead of regular expressions for web scraping?
Real-world HTML content may not be well-formed or consistent. For example, some programmers write <br> while others prefer <br /> for line breaks in HTML pages. In this situation, parsing the HTML with regular expressions either does not yield the desired results or becomes too complicated. Plus, writing patterns for all such combinations is error-prone and resource-intensive.
All these problems can be easily avoided by using an HTML parser like Jsoup instead of trying to parse the content using regular expressions.
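To make the problem concrete, here is a minimal sketch that uses only the JDK's java.util.regex package. The class name BrTagRegexDemo and the sample string are made up for illustration. It shows a naive pattern missing most line breaks, and a more tolerant pattern that already looks cryptic for something as simple as a br tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BrTagRegexDemo {

    // Counts how many times the regular expression matches in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Four line breaks, each written in a different style.
        String html = "one<br>two<br />three<BR/>four<br\n/>five";

        // The naive pattern finds only the first style.
        System.out.println(countMatches("<br>", html)); // 1

        // A more tolerant pattern finds all four, but it is harder to read
        // and still knows nothing about attributes, comments, or nesting.
        System.out.println(countMatches("(?i)<br\\s*/?>", html)); // 4
    }
}
```

Every additional variation (attributes on the tag, tags inside comments, and so on) needs yet another tweak to the pattern, which is exactly the maintenance burden an HTML parser removes.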
Given below are some of the main capabilities of the Jsoup parser.
- Jsoup can parse HTML directly from a URL, from a file, or even from a String variable.
- Jsoup allows HTML element structure manipulation like adding, changing or removing elements. It also allows adding and removing attributes easily.
- Finding data in elements or attributes is very easy using Jsoup.
- Jsoup supports basic authentication using a user name and password.
- If you are behind a proxy, no problem! Jsoup works with a proxy as well.
- Jsoup supports cleaning the HTML. You can specify what tags you want to retain in the parsed HTML using the whitelist.
- Jsoup can output tidy HTML from the parsed HTML.
These are some of the main features of Jsoup. It provides many other features that are very useful in real-world scenarios. Plus, selecting an element from Jsoup-parsed HTML is very easy because it supports jQuery-style selectors. For example, to select all td elements from all the table rows of an HTML document, you can write a selector like document.select("table tr td"), which returns all the matching td elements.
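As a quick illustration of that selector, the sketch below parses a small table from a string and collects the text of each matching cell. The class name TableCellSelectDemo and the table content are invented for this example; the Jsoup calls used are Jsoup.parse, select, and text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class TableCellSelectDemo {

    // Returns the text of every td element found inside a table row.
    static List<String> cellTexts(String html) {
        Document document = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element td : document.select("table tr td")) {
            texts.add(td.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical sample table, not taken from any real page.
        String html = "<table>"
                + "<tr><td>Java</td><td>1995</td></tr>"
                + "<tr><td>Kotlin</td><td>2011</td></tr>"
                + "</table>";

        System.out.println(cellTexts(html)); // [Java, 1995, Kotlin, 2011]
    }
}
```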
How to download and use Jsoup in your project?
You can download the binary distribution (the Jsoup jar file) directly from the download section of the Jsoup website. Once you have downloaded the library, put it in your build path to start using it. If you use Maven in your project, add the following Jsoup Maven dependency.
Jsoup Maven:
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.12.1</version>
    </dependency>
Jsoup Gradle:
    compile 'org.jsoup:jsoup:1.12.1'
Jsoup does not have any other dependencies.
How to parse HTML using Jsoup?
Jsoup is capable of scraping and parsing HTML content from a file, a URL, or string. I will show you each one.
How to parse HTML from a URL using Jsoup?
Use the connect method of the Jsoup class to connect to a URL, and then the get method to fetch and parse the HTML from that URL.

    try{

        //Get the content of the Google home page using Jsoup
        Document document = Jsoup.connect("http://www.google.com").get();

        //get the webpage text
        System.out.println( document.text() );

    }catch(IOException ioe){
        System.out.println("Unable to connect to the URL");
        ioe.printStackTrace();
    }
Output
    Google Search Images Maps Play YouTube News Gmail Drive More » Web History | Settings | Sign in Advanced searchLanguage...

I have truncated the above output.
How to parse HTML from a file (local file)?
If you have a local file containing the HTML and you want to parse it, you can use the parse method of the Jsoup class.

    try{

        //location of the local html file
        File htmlFile = new File("E:/example-html-file.html");

        //Get content of the local file using Jsoup in UTF-8 encoding
        Document document = Jsoup.parse(htmlFile, "UTF-8");

        System.out.println( document.text() );

    }catch(IOException ioe){
        System.out.println("Unable to parse local html file");
        ioe.printStackTrace();
    }
The parse method shown above uses the location of the file to resolve any relative URLs in the HTML file. For example, suppose you have downloaded the HTML content of the domain http://www.example.com and it references an image using a relative URL like '/favicon.ico'. If you want to download this image while parsing the HTML, you need the absolute URL of the image. If you have already downloaded that image and placed it at the same location as the HTML file, the default behavior is fine.
However, if you downloaded only the HTML file and need to fetch all the other resources from the domain, use the overloaded parse method given below, which takes a baseURI parameter.

    public static Document parse(File file, String charsetName, String baseURI)

Now all the relative URLs found in the HTML document will be resolved relative to the given baseURI.
How to parse HTML from a String?
If you want to parse HTML from a Java String, use the parse method that takes a String argument.

    String strHTML = "<html><body>Java Web Scraping</body></html>";

    //parse html from Java String variable
    Document document = Jsoup.parse(strHTML);

    System.out.println( document.text() );
Output
    Java Web Scraping

Again, as described above, you can use the overloaded parse method that takes the string content and a baseURI parameter to resolve any relative URLs in the HTML string.
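The effect of the base URI can be seen in a short sketch. It assumes the two-argument Jsoup.parse(String html, String baseUri) overload together with the absUrl method of Element, which returns an attribute value as an absolute URL. The class name BaseUriDemo and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriDemo {

    // Parses the HTML against the given base URI and returns the
    // absolute URL of the first img element's src attribute.
    static String firstImageUrl(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        Element img = document.select("img").first();
        return img == null ? "" : img.absUrl("src");
    }

    public static void main(String[] args) {
        String html = "<html><body><img src=\"/favicon.ico\"></body></html>";

        // The relative src is resolved against the base URI.
        System.out.println(firstImageUrl(html, "http://www.example.com/"));
        // prints http://www.example.com/favicon.ico

        // Without a usable base URI, absUrl returns an empty string.
        System.out.println(firstImageUrl(html, "").isEmpty()); // true
    }
}
```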
Understanding the Jsoup Connection, Request, and Response
The Connection interface of the Jsoup package provides methods for connecting to and fetching URLs, executing GET and POST requests, and getting the Request and Response objects. All HTTP request configuration is done through the Connection.
Below given are some of the basic HTTP configurations you can do with the Jsoup Connection.
How to follow server redirects?
If you request a webpage that has been moved to another location, the server sends an HTTP 301 or 302 redirect response specifying the new location of the webpage. The Jsoup connection follows server redirects by default and fetches the requested document from the new URL. If you want to turn this off, use the followRedirects method and pass false.

    Connection followRedirects(boolean followRedirects)

Example:

    Connection connection = Jsoup.connect("http://www.example.com")
        .followRedirects(false);
How to send request headers?
Use the header method to set a request header.

    Connection header(String headerName, String headerValue)

Example:

    Connection connection = Jsoup.connect("http://www.google.com")
        .header("header1", "value1")
        .header("header2", "value2");

The above example sends the header1 and header2 headers while requesting the URL. If you have multiple headers stored in a Map object, you can use the headers method to specify all of them at once, instead of invoking the header method multiple times, as given below.

    //map containing all the request headers
    Map<String, String> headerMap = new HashMap<String, String>();
    headerMap.put("header1", "value1");
    headerMap.put("header2", "value2");

    //specify all the headers at once using the headers method
    Connection connection = Jsoup.connect("http://www.google.com")
        .headers(headerMap);
How to ignore the document’s Content-type?
By default, Jsoup checks the response's content type before parsing and throws an UnsupportedMimeTypeException (a subclass of IOException) if it does not recognize the type, for example when the response is a binary image rather than text or XML. If you want to parse the response regardless of the document's content type, use the ignoreContentType method and pass true (the default is false).

    Connection ignoreContentType(boolean ignoreContentType)

Example:

    //this will ignore document's content type while parsing
    Connection connection = Jsoup.connect("http://www.example.com")
        .ignoreContentType(true);
How to ignore HTTP error codes while making a connection?
Jsoup throws an HttpStatusException (a subclass of IOException) if the request results in an HTTP error like "404 – Not found", "500 – Internal server error", or any other HTTP error. If you want to ignore these HTTP errors, use the ignoreHttpErrors method and pass true.

    Connection ignoreHttpErrors(boolean ignoreHttpErrors)

Example:

    //this will ignore HTTP errors while connecting to the URL
    Connection connection = Jsoup.connect("http://www.example.com")
        .ignoreHttpErrors(true);

With this setting, no exception is thrown for the errors mentioned above. Instead, the response is populated with the error body, and the response status reflects the error.
How to set the proxy for the Jsoup connection?
If you connect to the internet through a proxy server, the Jsoup connection needs to be configured to use that proxy too. There are several ways to configure the proxy for Jsoup, but the simplest one is to use the built-in proxy method as given below.

    Connection proxy(String host, int port)

This method sets the specified host and port as a proxy for the current request.
Example:

    //this will set the proxy for the current Jsoup connection
    Connection connection = Jsoup.connect("http://www.example.com")
        .proxy("192.168.0.1", 8080);

You can also use the overloaded proxy method that accepts a Proxy object instead. The proxy method is available from Jsoup version 1.9.1 onwards.
If you are using an older version than that, there are other options to set the proxy. Please visit the full how to set Jsoup proxy example to know about these options.
How to set the request referrer (referer) header?
Many web servers check the request referrer before serving the content. If the referer header is missing, they may send an error instead of the requested HTML document. In this case, you can send the referer header along with the request using the referrer method.

    Connection referrer(String referrer)

This method sets the referer header to the given string value.

    //this will set the referer header for the current Jsoup connection
    Connection connection = Jsoup.connect("http://www.example.com/page1")
        .referrer("http://www.example.com");

The above example sets the HTTP referer header to "http://www.example.com" while requesting the "http://www.example.com/page1" HTML page. Refer to the full example of how to set Jsoup referer to know more.
How to set the user-agent header?
Just as with the referrer, many web servers send back a 403 Forbidden error or an internal server error if the HTTP request does not contain a valid user agent. This also happens if the user-agent header is empty, if it matches a known spam bot, or if the server detects that the request is machine-generated.
You can set the user-agent header for the request using the userAgent method as given below.

    //this will set the user-agent header for the current Jsoup connection
    Connection connection = Jsoup.connect("http://www.example.com")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36");

Refer to the full example of how to set user agent using Jsoup to know more.
Tip: Always set the HTTP referrer and user-agent headers when web scraping to avoid forbidden and internal server error responses. Plus, always wait at least a couple of seconds between consecutive requests. Please refer to the example on how to fix 403 – Forbidden error while using Jsoup.
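The waiting part of the tip can be sketched with plain JDK code. The class name PoliteDelay, the helper millisToWait, and the two URLs below are hypothetical, and the 2000 ms gap is only one reading of "a couple of seconds". The actual Jsoup request is left as a comment so that the sketch runs without network access.

```java
class PoliteDelay {

    // Returns how long to wait (in milliseconds) so that at least
    // minGapMillis passes between two consecutive requests.
    static long millisToWait(long lastRequestAtMillis, long nowMillis, long minGapMillis) {
        long elapsed = nowMillis - lastRequestAtMillis;
        return Math.max(0, minGapMillis - elapsed);
    }

    public static void main(String[] args) throws InterruptedException {
        long minGapMillis = 2000; // assumed gap between requests
        long lastRequestAt = 0;   // no request made yet, so the first one does not wait

        String[] urls = {"http://www.example.com/page1", "http://www.example.com/page2"};
        for (String url : urls) {
            Thread.sleep(millisToWait(lastRequestAt, System.currentTimeMillis(), minGapMillis));
            lastRequestAt = System.currentTimeMillis();

            // The Jsoup.connect(url) ... execute() call would go here.
            System.out.println("Requesting " + url);
        }
    }
}
```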
How to set the request timeout?
The default request timeout for Jsoup is 30 seconds. This means Jsoup will wait up to 30 seconds for the response before throwing a SocketTimeoutException. If you want to specify a custom duration, use the timeout method.

    //this will set the timeout of 60 seconds i.e. 60000 in milliseconds
    Connection connection = Jsoup.connect("http://www.example.com")
        .timeout(60000);

Note that the timeout is in milliseconds. Please refer to the examples on how to fix ConnectionError: UnsupportedMimeTypeException and how to fix SocketTimeoutException while using Jsoup to know more.
How to get a Response object from the connection?
The Response object gives access to information about the response received from the Jsoup connection, such as the response body, cookies, etc. The execute method of the Connection executes the request and returns a Response as given below.

    //get the response
    Response response = Jsoup.connect("http://www.example.com")
        .execute();
How to get cookies from the response?
Web servers often send cookies back to the browser in response to HTTP requests, for example, a login cookie or a cookie containing the last visited page. You can get these cookies using the cookies method of the Response class as given below.

    //get the response
    Response response = Jsoup.connect("http://www.google.com")
        .timeout(60000)
        .execute();

    /*
     * To get the response cookies, use the cookies method
     */
    Map<String, String> responseCookies = response.cookies();
    System.out.println("Response cookies received: " + responseCookies.size());

Output

    Response cookies received: 2
How to send cookies in a request?
If you want to send a cookie along with the HTTP request, use the cookie method of the Connection.

    Connection cookie(String strCookieName, String strCookieValue)

Jsoup request cookie example:

    Response response = Jsoup.connect("http://www.example.com")
        .cookie("mycookie", "cookieValue")
        .execute();

The above example will send the cookie "mycookie" along with the request. If you have multiple cookies, you can store them in a Map object and send them in the HTTP request using the cookies method as given below.

    //map containing all the request cookies
    Map<String, String> cookieMap = new HashMap<String, String>();
    cookieMap.put("cookie1", "value1");
    cookieMap.put("cookie2", "value2");

    /*
     * To send several cookies with a request, use
     * the cookies method of the Connection
     */
    Response response = Jsoup.connect("http://www.example.com")
        .cookies(cookieMap)
        .execute();
How to set the request method (GET or POST)?
If you do not set an HTTP method for the Jsoup connection, the request uses GET by default. If you want to set the HTTP method explicitly, use the method method of the Connection.

    Connection method(Connection.Method httpMethod)

Connection.Method is an enum that defines the constants given below, one for each valid HTTP method.

    DELETE
    GET
    HEAD
    OPTIONS
    PATCH
    POST
    PUT
    TRACE

The example given below will send an HTTP POST request to the given URL.

    /*
     * This will send a request using the HTTP POST method to
     * the requested URL
     */
    Response response = Jsoup.connect("http://www.example.com")
        .method(Connection.Method.POST)
        .execute();
How to send GET or POST request parameters?
The most common thing one needs to do while scraping websites is to pass request parameters. If the HTTP request method is GET, the parameters are appended to the URL like "http://www.example.com?param1=val1&param2=val2". Here the question mark (?) separates the URL from the GET parameters, and the individual parameters are separated by an ampersand sign (&). The whole string "param1=val1&param2=val2" containing the parameters and their values is called the query string, and it is visible in the browser's URL bar.
If the request method is POST, the parameters are sent in the request body and are not visible in the URL bar of the browser.
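To show how such a query string is put together, here is a small JDK-only sketch. The class name QueryStringDemo and the helper toQueryString are hypothetical, and this is an illustration of the format rather than Jsoup's own implementation. Names and values are URL-encoded so that characters like a space or an ampersand inside a value do not break the string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringDemo {

    // Joins the parameters as name=value pairs separated by '&',
    // URL-encoding every name and value.
    static String toQueryString(Map<String, String> params) {
        StringBuilder query = new StringBuilder();
        try {
            for (Map.Entry<String, String> entry : params.entrySet()) {
                if (query.length() > 0) {
                    query.append('&');
                }
                query.append(URLEncoder.encode(entry.getKey(), "UTF-8"))
                     .append('=')
                     .append(URLEncoder.encode(entry.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("param1", "val1");
        params.put("param2", "val 2&more");

        System.out.println("http://www.example.com?" + toQueryString(params));
        // prints http://www.example.com?param1=val1&param2=val+2%26more
    }
}
```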
Jsoup supports sending request parameters regardless of the method being used. Use the data method of the Connection to send the parameter name-value pairs.

    Connection data(String paramName, String paramValue)

This method adds a request parameter to the current HTTP request.

    /*
     * This will send a request parameter param1 having value1
     * to the requested URL using the default GET method.
     */
    Response response1 = Jsoup.connect("http://www.example.com")
        .data("param1", "value1")
        .execute();

    /*
     * This will send a request parameter param1 having value1
     * and param2 having value2
     * to the requested URL using the POST method.
     */
    Response response2 = Jsoup.connect("http://www.example.com")
        .method(Connection.Method.POST)
        .data("param1", "value1")
        .data("param2", "value2")
        .execute();

You can also pass a Map object containing all the parameter names and values to the overloaded data method to send all parameters at once, as given below.

    //map containing all the request parameters
    Map<String, String> paramMap = new HashMap<String, String>();
    paramMap.put("param1", "value1");
    paramMap.put("param2", "value2");

    Response response = Jsoup.connect("http://www.example.com")
        .data(paramMap)
        .execute();

Please refer to the full example of how to post form data using Jsoup to know more.
Putting it all together
Most of the Connection methods mentioned above return the Connection object, so we can chain them together in a single call as in the example below. This is more or less what your connection code should look like, depending on your requirements.

    //request headers
    Map<String, String> headerMap = new HashMap<String, String>();
    headerMap.put("header1", "value1");
    headerMap.put("header2", "value2");

    //request cookies
    Map<String, String> cookieMap = new HashMap<String, String>();
    cookieMap.put("cookie1", "value1");
    cookieMap.put("cookie2", "value2");

    //request parameters
    Map<String, String> paramMap = new HashMap<String, String>();
    paramMap.put("param1", "value1");
    paramMap.put("param2", "value2");

    /*
     * The below given connection does the following things
     * - sends the headers
     * - sends the cookies
     * - sends the request parameters
     * - sets the HTTP request method as POST
     * - will follow the server redirects
     * - will ignore any HTTP errors
     * - will set the request timeout to 60 seconds
     * - sets the referer header and
     * - sets the user-agent header
     */
    Response response = Jsoup.connect("http://www.example.com/page1")
        .headers(headerMap)
        .cookies(cookieMap)
        .data(paramMap)
        .method(Connection.Method.POST)
        .followRedirects(true)
        .ignoreHttpErrors(true)
        .timeout(60000)
        .referrer("http://www.example.com")
        .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36")
        .execute();
Special cases while connecting to a URL:
- If the webpage you want to scrape needs basic authentication using a username and password, please refer to how to do basic authentication using Jsoup example.
- If the website you want to scrape needs login, please refer to how to login to a website using Jsoup example.
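For the basic authentication case, one common approach is to send an Authorization request header whose value is the word Basic followed by the Base64 encoding of username:password. The sketch below only builds that header value with the JDK. The class name BasicAuthHeaderDemo and the credentials are made up, and the linked example may take a different route.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuthHeaderDemo {

    // Builds the value of the Authorization header for HTTP basic authentication.
    static String basicAuthValue(String username, String password) {
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Hypothetical credentials, for illustration only.
        System.out.println(basicAuthValue("user", "pass")); // Basic dXNlcjpwYXNz
    }
}
```

The resulting string can then be passed to the header method shown earlier, using "Authorization" as the header name.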
Understanding the Attribute, Node, Element, and Document classes
Now that we have seen how to connect to a URL and get a response using Jsoup, in this part of the tutorial I will show you how to parse the response and extract data from the HTML.
There are 4 main Jsoup classes we need to understand for scraping a webpage and extracting data from it: Attribute, Node, Element, and Document. In terms of class hierarchy, Document extends Element, and Element extends Node, while an Attribute represents a single name-value pair on an element.
Once you get the Document object from the response, Jsoup provides DOM-like methods, for example, getElementById or getElementsByTag, to extract the data from the HTML. Jsoup also supports simple yet more powerful jQuery- or CSS-like selectors to extract the data from the HTML. I will show how to use both of them.
How to navigate the HTML document and find elements using Jsoup?
I will be using the example HTML code given below to extract data for the rest of the tutorial. I have saved this file at the E:/example-html-file.html location.

    <html>
        <head>
            <title>Jsoup Tutorial with Examples</title>
        </head>
        <body>

            <h1>Page Heading 1</h1>

            <p class="intro">Paragraph 1</p>

            <div id="greeting">Hello reader</div>

            <ul class="list languages">
                <li data-index="1">Java</li>
                <li data-index="2">C++</li>
            </ul>

            <ul class="list mobiles">
                <li data-name="nokia">Nokia</li>
                <li data-name="samsung">Samsung</li>
            </ul>

            <a href="#" data-location="wikipedia.org" target="_blank">
                <img src="/wiki.jpg" alt="wikipedia image">
            </a>

            <p>
                This is the last paragraph <br />
                <ol>
                    <li>One</li>
                    <li>Two</li>
                    <li class="mid-li">Three</li>
                    <li>Four</li>
                    <li>Five</li>
                </ol>
            </p>

            <img src="/end.png">

            <div>Bye</div>

        </body>
    </html>

I am loading the local HTML file using the code given below and will be using the same Document object for extracting data from it.

    File htmlFile = new File("E:/example-html-file.html");
    Document document = Jsoup.parse(htmlFile, "UTF-8");
How to select HTML tags by id?
    //this will select element with id greeting
    Element element1 = document.getElementById("greeting");
    System.out.println(element1);

    //this does the same but it is in selector syntax
    Elements elements = document.select("#greeting");
    if(elements.size() > 0)
        System.out.println(elements.get(0));

Output

    <div id="greeting">
     Hello reader
    </div>
    <div id="greeting">
     Hello reader
    </div>
How to select HTML tags by name?
    //this will select webpage title
    Elements elements = document.getElementsByTag("title");
    if(elements.size() > 0)
        System.out.println(elements.get(0));

    //this will also fetch webpage title tag but it is in selector syntax
    Elements elements1 = document.select("title");
    if(elements1.size() > 0)
        System.out.println(elements1.get(0));

Output

    <title>Jsoup Tutorial with Examples</title>
    <title>Jsoup Tutorial with Examples</title>
How to select HTML tags by CSS class name?
    System.out.println("Fetching using DOM syntax");

    //this will select element with class name = list
    Elements elements = document.getElementsByClass("list");
    if(elements.size() > 0)
        System.out.println(elements.get(0));

    System.out.println("Fetching using CSS syntax");

    //this does the same but it is in selector syntax
    Elements elements1 = document.select(".list");
    if(elements1.size() > 0)
        System.out.println(elements1.get(0));

Output

    Fetching using DOM syntax
    <ul class="list languages">
     <li data-index="1">Java</li>
     <li data-index="2">C++</li>
    </ul>
    Fetching using CSS syntax
    <ul class="list languages">
     <li data-index="1">Java</li>
     <li data-index="2">C++</li>
    </ul>
You can also specify multiple class names while extracting data using Jsoup, as given below. The selector form is the more dependable of the two, since getElementsByClass is documented as taking a single class name.

    System.out.println("Fetching using DOM syntax");

    Elements elements = document.getElementsByClass("list mobiles");
    if(elements.size() > 0)
        System.out.println(elements.get(0));

    System.out.println("Fetching using CSS syntax");

    //this does the same but it is in selector syntax
    Elements elements1 = document.select(".list.mobiles");
    if(elements1.size() > 0)
        System.out.println(elements1.get(0));

Output

    Fetching using DOM syntax
    <ul class="list mobiles">
     <li data-name="nokia">Nokia</li>
     <li data-name="samsung">Samsung</li>
    </ul>
    Fetching using CSS syntax
    <ul class="list mobiles">
     <li data-name="nokia">Nokia</li>
     <li data-name="samsung">Samsung</li>
    </ul>
You can also fetch all the HTML tags of a given name that have a specified class name or a specified id.

    //this will fetch all p tags with class name = "intro"
    System.out.println("Selecting HTML tag name having specified class name");
    Elements elements = document.select("p.intro");
    if(elements.size() > 0)
        System.out.println(elements.get(0));

    System.out.println("Selecting HTML tag having specified id");
    //or all div tags with id = "greeting"
    Elements elements1 = document.select("div#greeting");
    if(elements1.size() > 0)
        System.out.println(elements1.get(0));

Output

    Selecting HTML tag name having specified class name
    <p class="intro">Paragraph 1</p>
    Selecting HTML tag having specified id
    <div id="greeting">
     Hello reader
    </div>
How to select HTML elements by attributes?
Select elements having a specified attribute:

    //this will fetch all elements having src attribute
    System.out.println("Selecting HTML elements having src attribute");
    Elements elements = document.getElementsByAttribute("src");
    if(elements.size() > 0)
        System.out.println(elements.get(0));

    System.out.println("Selecting HTML elements having src attribute using selector");
    Elements elements1 = document.select("[src]");
    if(elements1.size() > 0)
        System.out.println(elements1.get(0));

Output

    Selecting HTML elements having src attribute
    <img src="/wiki.jpg" alt="wikipedia image">
    Selecting HTML elements having src attribute using selector
    <img src="/wiki.jpg" alt="wikipedia image">
You can also select a specific element type having a specified attribute as given below.

    System.out.println("Selecting all div having id attribute");
    Elements elements1 = document.select("div[id]");
    if(elements1.size() > 0)
        System.out.println(elements1.get(0));

Output

    Selecting all div having id attribute
    <div id="greeting">
     Hello reader
    </div>
Select elements having an attribute name starting with a text:
    System.out.println("Selecting HTML elements having attribute name starting with data-");
    Elements elements1 = document.getElementsByAttributeStarting("data-");
    if(elements1.size() > 0)
        System.out.println(elements1.get(0));

    //or using a selector
    System.out.println("Selecting HTML elements having attribute name starting with data- using selector");
    Elements elements2 = document.select("[^data-]");
    if(elements2.size() > 0)
        System.out.println(elements2.get(0));

Output

    Selecting HTML elements having attribute name starting with data-
    <li data-index="1">Java</li>
    Selecting HTML elements having attribute name starting with data- using selector
    <li data-index="1">Java</li>
Select elements having a specified attribute with a given value:

    System.out.println("Selecting HTML elements having specific attribute value");
    Elements elements1 = document.getElementsByAttributeValue("data-index", "2");
    if(elements1.size() > 0)
        System.out.println(elements1.get(0));

    //or using a selector
    System.out.println("Selecting HTML elements having specific attribute value using selector");
    Elements elements2 = document.select("[data-index=2]");
    if(elements2.size() > 0)
        System.out.println(elements2.get(0));

Output

    Selecting HTML elements having specific attribute value
    <li data-index="2">C++</li>
    Selecting HTML elements having specific attribute value using selector
    <li data-index="2">C++</li>
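Printing an element shows its whole markup. To pull out just the data, the attr and text methods of Element can be used on the selected elements. The sketch below is a self-contained illustration: the class name AttributeValueDemo is made up, and the HTML is a fragment of the example file inlined as a string so the code runs on its own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

class AttributeValueDemo {

    // Maps the data-index attribute value of each li to the li's text.
    static Map<String, String> indexToText(String html) {
        Document document = Jsoup.parse(html);
        Map<String, String> result = new LinkedHashMap<>();
        for (Element li : document.select("li[data-index]")) {
            result.put(li.attr("data-index"), li.text());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<ul class=\"list languages\">"
                + "<li data-index=\"1\">Java</li>"
                + "<li data-index=\"2\">C++</li>"
                + "</ul>";

        System.out.println(indexToText(html)); // {1=Java, 2=C++}
    }
}
```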
Select elements having a specified attribute with a matching value:
    System.out.println("Selecting HTML elements having data-location attribute value starting with wiki");
    Elements elements1 = document.getElementsByAttributeValueStarting("data-location", "wiki");
    if(elements1.size() > 0)
        System.out.println(elements1.get(0));

    //or using a selector
    System.out.println("Selecting HTML elements having data-location attribute value starting with wiki using selector");
    Elements elements2 = document.select("[data-location^=wiki]");
    if(elements2.size() > 0)
        System.out.println(elements2.get(0));

    System.out.println("Selecting HTML elements having class attribute value ending with tro");
    Elements elements3 = document.getElementsByAttributeValueEnding("class", "tro");
    if(elements3.size() > 0)
        System.out.println(elements3.get(0));

    //or using a selector
    System.out.println("Selecting HTML elements having class attribute value ending with tro using selector");
    Elements elements4 = document.select("[class$=tro]");
    if(elements4.size() > 0)
        System.out.println(elements4.get(0));

    System.out.println("Selecting HTML elements having id attribute value containing eeti");
    Elements elements5 = document.getElementsByAttributeValueContaining("id", "eeti");
    if(elements5.size() > 0)
        System.out.println(elements5.get(0));

    //or using a selector
    System.out.println("Selecting HTML elements having attribute value containing eeti using selector");
    Elements elements6 = document.select("[id*=eeti]");
    if(elements6.size() > 0)
        System.out.println(elements6.get(0));

    System.out.println("Selecting HTML elements having data-name attribute value matching regex");
    Elements elements7 = document.getElementsByAttributeValueMatching("data-name", ".*sung$");
    if(elements7.size() > 0)
        System.out.println(elements7.get(0));

    //or using a selector
    System.out.println("Selecting HTML elements having data-name attribute value matching regex using selector");
    Elements elements8 = document.select("[data-name~=.*sung$]");
    if(elements8.size() > 0)
        System.out.println(elements8.get(0));

Output

    Selecting HTML elements having data-location attribute value starting with wiki
    <a href="#" data-location="wikipedia.org" target="_blank"> <img src="/wiki.jpg" alt="wikipedia image"> </a>
    Selecting HTML elements having data-location attribute value starting with wiki using selector
    <a href="#" data-location="wikipedia.org" target="_blank"> <img src="/wiki.jpg" alt="wikipedia image"> </a>
    Selecting HTML elements having class attribute value ending with tro
    <p class="intro">Paragraph 1</p>
    Selecting HTML elements having class attribute value ending with tro using selector
    <p class="intro">Paragraph 1</p>
    Selecting HTML elements having id attribute value containing eeti
    <div id="greeting">
     Hello reader
    </div>
    Selecting HTML elements having attribute value containing eeti using selector
    <div id="greeting">
     Hello reader
    </div>
    Selecting HTML elements having data-name attribute value matching regex
    <li data-name="samsung">Samsung</li>
    Selecting HTML elements having data-name attribute value matching regex using selector
    <li data-name="samsung">Samsung</li>
How to get children of HTML elements using Jsoup?

    //get element having class name languages
    Elements elements9 = document.getElementsByClass("languages");

    if(elements9.size() > 0){

        //get the first UL
        Element ulElement = elements9.get(0);

        System.out.println("Selecting all HTML child elements");
        //get all child li elements
        Elements childElements = ulElement.children();
        System.out.println(childElements);

        System.out.println("Selecting first child element by index");
        System.out.println( ulElement.child(0) );

        System.out.println("Selecting last child element by index");
        System.out.println( ulElement.child( ulElement.children().size() - 1 ) );
    }

    //using the selector style

    //select all child li elements of ul having class name languages
    System.out.println("Selecting all HTML child elements using selector");
    Elements childElements = document.select("ul.languages > li");
    System.out.println(childElements);

    //select first child element
    System.out.println("Selecting first child element using selector");
    Element firstChild = document.select("ul.languages > li").first();
    System.out.println(firstChild);

    //select last child element
    System.out.println("Selecting last child element using selector");
    Element lastChild = document.select("ul.languages > li").last();
    System.out.println(lastChild);

    //select child element by index
    System.out.println("Selecting child element by index using selector");
    Element child = document.select("ul.languages > li").get(0);
    System.out.println(child);

Output

    Selecting all HTML child elements
    <li data-index="1">Java</li>
    <li data-index="2">C++</li>
    Selecting first child element by index
    <li data-index="1">Java</li>
    Selecting last child element by index
    <li data-index="2">C++</li>
    Selecting all HTML child elements using selector
    <li data-index="1">Java</li>
    <li data-index="2">C++</li>
    Selecting first child element using selector
    <li data-index="1">Java</li>
    Selecting last child element using selector
    <li data-index="2">C++</li>
    Selecting child element by index using selector
    <li data-index="1">Java</li>
The code below selects all img elements inside the body element, including nested ones.
```java
//get HTML body element
Element bodyElement = document.getElementsByTag("body").first();
System.out.println( document.select("body img") );
```
Output
```
<img src="/wiki.jpg" alt="wikipedia image">
<img src="/end.png">
```
If you want to select only direct child elements, use the following syntax.
```java
//get HTML body element
Element bodyElement = document.getElementsByTag("body").first();
//this will only select direct children of the body element, notice the > sign
System.out.println( document.select("body > img") );
```
Output
```
<img src="/end.png">
```
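To see the two selectors side by side, here is a minimal, self-contained sketch. The HTML string and the class name ChildVsDescendant are made up for illustration and are not the sample page used in this tutorial.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ChildVsDescendant {

    // Made-up page: one img nested inside a link, one directly under body
    static final String HTML =
        "<html><body>"
      + "<a href='#'><img src='/nested.png'></a>"
      + "<img src='/direct.png'>"
      + "</body></html>";

    // Counts the elements matched by the given CSS query
    static int count(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // descendant selector: matches img at any depth inside body
        System.out.println(count("body img"));   // 2
        // child selector: matches only img directly under body
        System.out.println(count("body > img")); // 1
    }
}
```

The space in "body img" matches at any depth, while the > sign restricts the match to direct children.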
How to get siblings of HTML elements using Jsoup?
```java
//get li with class "mid-li"
Element liElement = document.getElementsByClass("mid-li").first();

//get all siblings
System.out.println("Selecting all sibling elements of li");
Elements siblings = liElement.siblingElements();
System.out.println(siblings);

//get first sibling element
System.out.println("First sibling element");
System.out.println( liElement.firstElementSibling() );

//get last sibling element
System.out.println("Last sibling element");
System.out.println( liElement.lastElementSibling() );

//get all previous sibling elements
System.out.println("All previous sibling elements");
System.out.println( liElement.previousElementSiblings() );

//get all next sibling elements
System.out.println("All next sibling elements");
System.out.println( liElement.nextElementSiblings() );

//get previous sibling element
System.out.println("Previous sibling element");
System.out.println( liElement.previousElementSibling() );

//get next sibling element
System.out.println("Next sibling element");
System.out.println( liElement.nextElementSibling() );
```
Output
```
Selecting all sibling elements of li
<li>One</li>
<li>Two</li>
<li>Four</li>
<li>Five</li>
First sibling element
<li>One</li>
Last sibling element
<li>Five</li>
All previous sibling elements
<li>Two</li>
<li>One</li>
All next sibling elements
<li>Four</li>
<li>Five</li>
Previous sibling element
<li>Two</li>
Next sibling element
<li>Four</li>
```
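One thing to watch out for: previousElementSibling and nextElementSibling return null when the element has no such sibling, so it is safer to check the result before using it. Here is a small sketch of that check. The three-item list, the class name, and the helper textOrNone are all made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SiblingNullCheck {

    // Made-up list, for illustration only
    static final String HTML = "<ol><li>One</li><li>Two</li><li>Three</li></ol>";

    // Hypothetical helper: returns the element text, or "(none)" for null
    static String textOrNone(Element element) {
        return element == null ? "(none)" : element.text();
    }

    // Text of the sibling that follows the li at the given position
    static String nextOf(int index) {
        Document document = Jsoup.parse(HTML);
        Element li = document.select("li").get(index);
        return textOrNone(li.nextElementSibling());
    }

    // Text of the sibling that precedes the li at the given position
    static String previousOf(int index) {
        Document document = Jsoup.parse(HTML);
        Element li = document.select("li").get(index);
        return textOrNone(li.previousElementSibling());
    }

    public static void main(String[] args) {
        System.out.println(previousOf(0)); // (none)
        System.out.println(nextOf(0));     // Two
        System.out.println(nextOf(2));     // (none)
    }
}
```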
How to select a parent element of an element?
```java
//get li with class "mid-li"
Element liElement = document.getElementsByClass("mid-li").first();

//get the parent element
System.out.println("Select parent element");
Element parent = liElement.parent();
System.out.println(parent);
```
Output
```
Select parent element
<ol>
 <li>One</li>
 <li>Two</li>
 <li class="mid-li">Three</li>
 <li>Four</li>
 <li>Five</li>
</ol>
```
Advanced Pseudo Selectors
The Jsoup selector syntax offers advanced pseudo selectors to find elements. Finding these elements is either not possible or not easy using the DOM-style methods.
How to select elements with sibling index less than, greater than or equal to the given index?
```java
System.out.println("Selecting all li child with index less than 3");
/*
 * Get all child li element of ol whose index is
 * less than 3 (i.e. at index 0, 1, and 2)
 */
Elements lis = document.select("ol").first().select(":lt(3)");
System.out.println(lis);

System.out.println("Selecting all li child with index greater than 3");
/*
 * Get all child li element of ol whose index is
 * greater than 3 (i.e. at index 4 in this five-item list)
 */
Elements lis1 = document.select("ol").first().select("li:gt(3)");
System.out.println(lis1);

System.out.println("Select child li element having index 1");
/*
 * Get the child li element of ol whose index equal to 1
 */
Elements li = document.select("ol").first().select("li:eq(1)");
System.out.println(li);
```
Output
```
Selecting all li child with index less than 3
<li>One</li>
<li>Two</li>
<li class="mid-li">Three</li>
Selecting all li child with index greater than 3
<li>Five</li>
Select child li element having index 1
<li>Two</li>
```
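Since the sibling index is zero-based, it is easy to be off by one with these selectors. The following sketch runs the three index selectors against a made-up five-item list (not the tutorial's sample page) and collects only the text of each match. The class name and the helper texts are assumptions for this example.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class IndexSelectors {

    // Made-up list, for illustration only
    static final String HTML =
        "<ol><li>One</li><li>Two</li><li>Three</li><li>Four</li><li>Five</li></ol>";

    // Returns the text of every element matched by the given CSS query
    static List<String> texts(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("li:lt(3)")); // [One, Two, Three]
        System.out.println(texts("li:gt(3)")); // [Five]
        System.out.println(texts("li:eq(1)")); // [Two]
    }
}
```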
How to find elements containing specified other elements?
The below example shows how to find elements containing a specific element, for example, all link elements containing images.
```java
/*
 * This will return all link (a elements) containing images
 */
Elements links = document.select("a:has(img)");
System.out.println(links);
```
Output
```
<a href="#" data-location="wikipedia.org" target="_blank"> <img src="/wiki.jpg" alt="wikipedia image"> </a>
```
How to find elements not matching the specified selector?
```java
System.out.println("div elements without having id attribute");
/*
 * This will return all divs not having id attribute
 */
Elements divs = document.select("div:not([id])");
System.out.println(divs);

System.out.println("UL elements without having given class names");
/*
 * This will select all ul elements not having
 * class "list languages"
 */
Elements uls = document.select("ul:not([class=list languages])");
System.out.println(uls);
```
Output
```
div elements without having id attribute
<div>
 Bye
</div>
UL elements without having given class names
<ul class="list mobiles">
 <li data-name="nokia">Nokia</li>
 <li data-name="samsung">Samsung</li>
</ul>
```
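Note that the attribute selector [class=list languages] compares the whole attribute value, so it depends on the order in which the class names are written. If you only care whether a class is present, :not(.languages) is probably the more robust choice. The sketch below uses a made-up snippet in which the class order is reversed on purpose. The class name NotSelector is also just for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class NotSelector {

    // Made-up markup: the first list has its class names in reversed order
    static final String HTML =
        "<ul class='languages list'><li>Java</li></ul>"
      + "<ul class='list mobiles'><li>Nokia</li></ul>";

    // Counts the elements matched by the given CSS query
    static int count(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // whole-value comparison: "languages list" is not "list languages",
        // so neither list is excluded
        System.out.println(count("ul:not([class=list languages])")); // 2

        // class selector: excludes the list regardless of class order
        System.out.println(count("ul:not(.languages)")); // 1
    }
}
```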
How to find elements containing specified text?
```java
System.out.println("ul containing text nokia");
/*
 * This will return all ul elements having text nokia
 */
Elements uls = document.select("ul:contains(nokia)");
System.out.println(uls);

System.out.println("div containing text read");

/*
 * This will return all div elements having text read
 */
Elements divs = document.select("div:contains(read)");
System.out.println(divs);
```
Output
```
ul containing text nokia
<ul class="list mobiles">
 <li data-name="nokia">Nokia</li>
 <li data-name="samsung">Samsung</li>
</ul>
div containing text read
<div id="greeting">
 Hello reader
</div>
```
The :contains selector shown above matches an element if the element itself or any of its child elements contains the matching text. If you want to search only the element's own text, excluding the text of child elements, use the :containsOwn selector instead of the :contains selector.
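Here is a minimal sketch of the difference between the two selectors, assuming a made-up snippet in which the text sits inside a p element nested in a div. The class name is hypothetical as well.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ContainsVsContainsOwn {

    // Made-up markup: the div has no text of its own, only the nested p does
    static final String HTML = "<div id='greeting'><p>Hello reader</p></div>";

    // Counts the elements matched by the given CSS query
    static int count(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // the div matches because its child p contains the text
        System.out.println(count("div:contains(reader)"));    // 1
        // the div has no matching text of its own
        System.out.println(count("div:containsOwn(reader)")); // 0
        // the p holds the text directly
        System.out.println(count("p:containsOwn(reader)"));   // 1
        // the text search is case insensitive
        System.out.println(count("div:contains(HELLO)"));     // 1
    }
}
```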
How to find elements containing text matching with regex?
```java
System.out.println("ul containing matching text with regex");

/*
 * This will return all ul elements with text matching
 * with regex pattern "C[+]?" (Character C followed
 * by zero or one + sign)
 */
Elements uls = document.select("ul:matches(C[+]?)");
System.out.println(uls);
```
Output
```
ul containing matching text with regex
<ul class="list languages">
 <li data-index="1">Java</li>
 <li data-index="2">C++</li>
</ul>
```
Use the :matchesOwn selector to match the text of the given element only, excluding the text of the child elements.
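The same idea in a short sketch, again with a made-up list rather than the tutorial's sample page. The pattern C[+]+ (a C followed by one or more + signs) and the class name are chosen just for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class MatchesVsMatchesOwn {

    // Made-up list, for illustration only
    static final String HTML = "<ul><li>Java</li><li>C++</li></ul>";

    // Counts the elements matched by the given CSS query
    static int count(String cssQuery) {
        Document document = Jsoup.parse(HTML);
        return document.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // the ul matches through the text of its child li
        System.out.println(count("ul:matches(C[+]+)"));    // 1
        // the ul has no text of its own
        System.out.println(count("ul:matchesOwn(C[+]+)")); // 0
        // only the li holding "C++" matches
        System.out.println(count("li:matchesOwn(C[+]+)")); // 1
    }
}
```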
There are many more interesting selectors which I am skipping to keep the length of this tutorial reasonable. You can find them on the Jsoup selector syntax page.
How to extract data from HTML using the Jsoup?
Once you have found the elements you want to extract the data from, it's a fairly easy task to extract the data.
How to get the id of an element?
Use the id method to get the id attribute of the HTML element.
```java
//get the first div inside body tag
Element divElement = document.getElementsByTag("div").first();

//get the id of the div
System.out.println( "first div id: " + divElement.id() );
```
Output
```
first div id: greeting
```
How to get the tag name of an element?
Use the tagName method to get the tag name of the element.
```java
//get the first child of the body tag
Element firstElement = document.getElementsByTag("body").first().children().first();

//get the name of the first child of the body tag
System.out.println( "first child tag: " + firstElement.tagName() );
```
Output
```
first child tag: h1
```
How to get CSS class names of an element?
Use the className method to get the value of the class attribute of the element. If the element has multiple classes, they are returned in space separated format.
```java
//get the first paragraph (p) tag
Element firstPElement = document.select("p").first();

//get the class attribute value
System.out.println( "CSS class name: " + firstPElement.className() );

//get the first ul tag
Element firstListElement = document.select("ul").first();

//get the class attribute value
System.out.println( "CSS class name: " + firstListElement.className() );
```
Output
```
CSS class name: intro
CSS class name: list languages
```
As you can see from the output, in the case of multiple classes, the class names are returned in the same string separated by space. If you want the individual class names, use the classNames method as given below.
```java
//get the first ul tag
Element firstListElement = document.select("ul").first();

System.out.println("Get multiple class names");

Set<String> classNames = firstListElement.classNames();

for(String strClassName : classNames){
    System.out.println(strClassName);
}
```
Output
```
Get multiple class names
list
languages
```
The classNames method returns a Set of String elements containing individual class names. If the element contains duplicate class names in the class attribute, they will be removed (because the Set does not allow duplicate elements).
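The duplicate removal can be seen in a short sketch. The markup below is made up and repeats the class name list on purpose. The class name ClassNamesExample and its helper methods are assumptions for this example. The sketch also uses the hasClass method, which is handy when you only need to test for a single class.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ClassNamesExample {

    // Made-up element; the class name "list" is repeated on purpose
    static final String HTML = "<ul class='list languages list'><li>Java</li></ul>";

    // Parses the snippet and returns its first ul element
    static Element firstUl() {
        return Jsoup.parse(HTML).select("ul").first();
    }

    // Copies the Set returned by classNames() into a List
    static List<String> individualNames() {
        return new ArrayList<>(firstUl().classNames());
    }

    public static void main(String[] args) {
        // the raw attribute value keeps the duplicate
        System.out.println(firstUl().className());           // list languages list
        // the Set drops the duplicate
        System.out.println(individualNames());               // [list, languages]
        // test for one class without splitting anything yourself
        System.out.println(firstUl().hasClass("languages")); // true
    }
}
```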
How to get the text of an element?
Use the text method to get the element text.
```java
//get the website title
Element title = document.select("title").first();
//get the title text
System.out.println("Website title: " + title.text());

//get the website heading h1
Element h1 = document.select("h1").first();
//get the h1 text
System.out.println("Website heading h1: " + h1.text());

/*
 * If the element has child elements within it, the text
 * method returns the text of child elements too.
 */

//get the last list element
Element ul = document.select("ul").last();

//get the text, this will include text of child li elements
System.out.println( "List element text: " + ul.text() );
```
Output
```
Website title: Jsoup Tutorial with Examples
Website heading h1: Page Heading 1
List element text: Nokia Samsung
```
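If you need only the text that belongs to the element itself, without the text of its child elements, Jsoup also has the ownText method. Here is a small sketch comparing the two on a made-up paragraph. The class name and the helper are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class TextVsOwnText {

    // Made-up paragraph with a nested b element, for illustration only
    static final String HTML = "<p>Hello <b>there</b> now!</p>";

    // Parses the snippet and returns its first p element
    static Element paragraph() {
        return Jsoup.parse(HTML).select("p").first();
    }

    public static void main(String[] args) {
        // text() includes the text of the nested b element
        System.out.println(paragraph().text());    // Hello there now!
        // ownText() skips the text of child elements
        System.out.println(paragraph().ownText()); // Hello now!
    }
}
```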
How to get the inner HTML of an element?
Use the html method to get the element’s inner HTML code.
```java
//get the ol list element
Element ol = document.select("ol").first();

//get the inner html
System.out.println( "List element html:\n" + ol.html() );
```
Output
```
List element html:
<li>One</li>
<li>Two</li>
<li class="mid-li">Three</li>
<li>Four</li>
<li>Five</li>
```
How to get the outer HTML of an element?
Use the outerHtml method to get the element’s outer HTML code.
```java
//select all ul list elements
Elements uls = document.select("ul");

//this will print outer HTML of all list elements combined
System.out.println(uls.outerHtml());
```
Output
```
<ul class="list languages">
 <li data-index="1">Java</li>
 <li data-index="2">C++</li>
</ul>
<ul class="list mobiles">
 <li data-name="nokia">Nokia</li>
 <li data-name="samsung">Samsung</li>
</ul>
```
Similarly, you can use the toString method to get the outer HTML of the element(s).
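To compare the inner HTML, the outer HTML, and the toString result directly, here is a minimal sketch on a made-up paragraph. Pretty-printing is switched off in this sketch so that the strings contain no added line breaks or indentation. That setting, the class name, and the helper are choices made for this example, not something the rest of this tutorial relies on.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HtmlVsOuterHtml {

    // Parses a made-up snippet and returns its first p element
    static Element paragraph() {
        Document document = Jsoup.parse("<p>Hi <b>there</b></p>");
        // turn off pretty-printing so no line breaks or indentation are added
        document.outputSettings().prettyPrint(false);
        return document.select("p").first();
    }

    public static void main(String[] args) {
        Element p = paragraph();
        // inner HTML: the content between the p tags
        System.out.println(p.html());      // Hi <b>there</b>
        // outer HTML: the p tag itself plus its content
        System.out.println(p.outerHtml()); // <p>Hi <b>there</b></p>
        // toString gives the same result as outerHtml
        System.out.println(p.toString().equals(p.outerHtml())); // true
    }
}
```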
How to get the attribute value of a specific attribute of any element?
Use the attr method to get the value of the specified attribute of the given element.
```java
//select first link element (a)
Element a = document.select("a").first();

//get the value of href attribute
System.out.println("Link href attribute: " + a.attr("href"));
```
Output
```
Link href attribute: #
```
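Two details are worth knowing when reading attributes: attr returns an empty string (not null) when the attribute is missing, and absUrl resolves a relative link against the base URI of the document. The sketch below assumes a made-up link and uses https://example.com/ as a placeholder base URI. The class name is hypothetical too.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AttrExample {

    // Parses a made-up link; the second argument is the base URI
    // used to resolve relative URLs (a placeholder in this sketch)
    static Element link() {
        Document document = Jsoup.parse("<a href='/docs'>Docs</a>", "https://example.com/");
        return document.select("a").first();
    }

    public static void main(String[] args) {
        // the attribute value exactly as written in the HTML
        System.out.println(link().attr("href"));            // /docs
        // a missing attribute gives an empty string, not null
        System.out.println(link().attr("title").isEmpty()); // true
        // use hasAttr to tell "missing" apart from "empty"
        System.out.println(link().hasAttr("title"));        // false
        // the relative URL resolved against the base URI
        System.out.println(link().absUrl("href"));          // https://example.com/docs
    }
}
```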
How to get all attributes of an element?
Use the attributes method to get all the attributes of an element.
```java
//select first link element (a)
Element a = document.select("a").first();

//get all attributes of an element
Attributes attributes = a.attributes();

System.out.println("Get all attributes of an element a:");

//iterate all attributes
for(Attribute attribute : attributes){
    System.out.println( attribute.getKey() + " => " + attribute.getValue() );
}
```
Output
```
Get all attributes of an element a:
href => #
data-location => wikipedia.org
target => _blank
```
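If you are interested only in the custom data- attributes, the dataset method returns them as a map with the data- prefix removed from the keys. Here is a minimal sketch with a made-up link that mirrors the attributes printed above. The class name and the helper are assumptions for this example.

```java
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class DatasetExample {

    // Made-up link with one data- attribute, for illustration only
    static final String HTML =
        "<a href='#' data-location='wikipedia.org' target='_blank'>Wiki</a>";

    // Returns the data- attributes of the first link in the snippet
    static Map<String, String> data() {
        Element a = Jsoup.parse(HTML).select("a").first();
        return a.dataset();
    }

    public static void main(String[] args) {
        // only the data- attribute is in the map, keyed without the prefix
        System.out.println(data().size());          // 1
        System.out.println(data().get("location")); // wikipedia.org
    }
}
```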
Apart from these methods to extract data from HTML elements, Jsoup also provides methods to manipulate or change the DOM, but those methods are beyond the scope of this tutorial. You can learn about them on the Jsoup site.
Below are some additional Jsoup examples which cover the individual topics in more detail.
Jsoup Examples
- How to post form data using Jsoup
- How to login to any website using Jsoup (POST method)
- How to download images from any webpage using Jsoup
- How to perform basic authentication using Jsoup
- How to remove HTML tags from String using Jsoup
- How to find CSS selector for any HTML element for Jsoup extraction
- How to iterate HTML elements using Jsoup
- How to select elements with multiple CSS classes using Jsoup
- How to preserve new lines while parsing HTML using Jsoup
- How to clean HTML using Jsoup
- How to set character encoding for Jsoup parsing
- How to get absolute URL from HTML relative URL while using Jsoup
- How to fix error 403 Forbidden Exception while using Jsoup
- How to fix Connection error: UnsupportedMimeTypeException while using Jsoup
- How to fix SocketTimeoutException, Read timeout and Connect timeout exceptions while using Jsoup
- How to set proxy for Jsoup
- How to set referer (referrer) for Jsoup connection
- How to set user agent for Jsoup connection
Please let me know if you liked the Jsoup tutorial with examples in the comments section below.
Hi Rahim,
Thank you so much for your Jsoup tutorials with examples.
I am looking for sample code to scrape or crawl a website's content after logging in with a user id and password.
I can see example code to log in with an action URL.
But the website that I want to scrape does not show the action URL in the Elements tab of the Chrome DevTools window.
Could you please share a sample program to log in to a website with a user id and password and fetch the web page contents after a successful login?
Hello Rajakumar,
In that case, open the login page in Chrome. Once it is loaded, open the Chrome dev tools and navigate to the Network tab. Clear all the previous records, if there are any. Then enter your user id and password and click the login button. The Network tab will display the exact HTTP request the webpage is making. Click on the relevant row in the Network tab to see more details like the request type, request parameters, etc.
I hope it answers your question.
Hi RahimV,
Thank you for your reply.