This Jsoup example shows how to set the character encoding used to parse a webpage, for instance ISO-8859-1 or UTF-8.
How to set character encoding using Jsoup?
Jsoup automatically detects the charset of the webpage being crawled. However, many websites do not specify a charset in the Content-Type response header. When you crawl such a webpage, Jsoup has to guess: it looks for a byte order mark or a charset declaration in a meta tag and, if it finds neither, falls back to UTF-8.
That also means that you might not get the expected results, because the charset Jsoup settles on may differ from the one the webpage actually uses. Characters may then be lost or parsed and printed incorrectly.
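To make the effect of a charset mismatch concrete, here is a minimal sketch that uses only the JDK and no Jsoup. The class name CharsetMismatchDemo and the helper decode are made up for this illustration. It encodes a short string as ISO-8859-1, the way a server might send it, and then decodes the same bytes twice, once with the right charset and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Jsoup
public class CharsetMismatchDemo {

    // Decodes raw bytes with the given charset, much like a parser decodes a page body
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an accented e

        // Bytes as a server would send them for an ISO-8859-1 page
        byte[] body = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the charset the page really uses round-trips cleanly
        String right = decode(body, StandardCharsets.ISO_8859_1);

        // Decoding the same bytes as UTF-8 does not: the accented
        // character is not a valid UTF-8 sequence on its own
        String wrong = decode(body, StandardCharsets.UTF_8);

        System.out.println("Correct charset matches original: " + right.equals(original));
        System.out.println("Wrong charset matches original: " + wrong.equals(original));
    }
}
```

The plain ASCII letters survive either way, which is why this kind of problem often goes unnoticed until a page contains accented or non-Latin characters.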
How to set character encoding (charset) if the response does not specify it?
You can open an InputStream from the URL and pass it, together with the name of the desired character set, to the parse method of the Jsoup class as given below.
package com.javacodeexamples.libraries.jsoup;

import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupCharacterEncodingExample {

    public static void main(String[] args) {

        try {

            String strURL = "http://www.example.com";

            //get input stream from the URL
            InputStream inStream = new URL(strURL).openStream();

            //parse document using input stream and specify the charset
            //(pass "UTF-8" instead if the page is UTF-8 encoded)
            Document doc = Jsoup.parse(inStream, "ISO-8859-1", strURL);

            //..do your processing

        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Please also make sure that you set the proper user agent and referer headers.
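One way to do that when working with a plain stream is to configure a URLConnection before reading from it. The sketch below uses only the JDK. The class name RequestHeadersExample, the helper openWithHeaders, and the header values are placeholders chosen for illustration, so substitute values that suit your crawler.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper class, not part of Jsoup
public class RequestHeadersExample {

    // Prepares a connection with User-Agent and Referer headers.
    // No request is sent until the caller asks for the input stream.
    static URLConnection openWithHeaders(String url, String userAgent, String referer) {
        try {
            URLConnection conn = new URL(url).openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setRequestProperty("Referer", referer);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder header values, assumed for this example only
        URLConnection conn = openWithHeaders(
                "http://www.example.com",
                "Mozilla/5.0 (compatible; ExampleCrawler/1.0)",
                "http://www.example.com/");

        System.out.println(conn.getRequestProperty("User-Agent"));

        // conn.getInputStream() would perform the request; that stream can then
        // be passed to Jsoup.parse along with the charset name and the base URL
    }
}
```

The stream returned by getInputStream() can take the place of the one obtained from openStream() in the example above.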
This example is a part of the Jsoup tutorial with examples.