This example shows how to remove non-ASCII characters from a String in Java using regular expression patterns and the String replaceAll method.
How to remove non-ASCII characters from a String in Java?
You often need to remove non-ASCII characters from a string. Consider the following string, which contains non-ASCII characters.
String strValue = "Ã string çöntäining nön äsçii çhäräçters";
To remove them, we are going to use the “[^\\x00-\\x7F]” regular expression pattern, where:
^ - negation (not), when it is the first character inside the brackets
\\x00 - 0 in hexadecimal
\\x7F - 127 in hexadecimal
So the pattern “[^\\x00-\\x7F]” means “any character not in the range 0 to 127”, which is the range of the ASCII characters. Here is an example program using this pattern.
String strValue = "Ã string çöntäining nön äsçii çhäräçters";
System.out.println( strValue.replaceAll( "[^\\x00-\\x7F]", "" ) );
Output
string ntining nn sii hrters
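If the same cleanup runs many times, the pattern can be compiled once and reused, since String.replaceAll compiles its regular expression on every call. The following is a minimal sketch of that idea; the class name NonAsciiRemover and the helper removeNonAscii are hypothetical names chosen for illustration, and the sample string is written with Unicode escapes only so that the source compiles under any file encoding.

```java
import java.util.regex.Pattern;

public class NonAsciiRemover {

    // Compiled once and reused across calls.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    // Hypothetical helper: returns the input with every non-ASCII character removed.
    public static String removeNonAscii(String input) {
        return NON_ASCII.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // The article's sample string, written with Unicode escapes.
        String strValue = "\u00C3 string \u00E7\u00F6nt\u00E4ining n\u00F6n \u00E4s\u00E7ii \u00E7h\u00E4r\u00E4\u00E7ters";
        // Brackets make the leading space visible.
        System.out.println("[" + removeNonAscii(strValue) + "]"); // prints "[ string ntining nn sii hrters]"
    }
}
```

Note that the result starts with a space: the first character of the sample string is removed, but the space that follows it is ASCII and is kept.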
Alternatively, you can also use the “\\P{InBasic_Latin}” pattern, which matches any character outside the Basic Latin Unicode block (U+0000 to U+007F), as given below.
String strValue = "Ã string çöntäining nön äsçii çhäräçters";
System.out.println(strValue.replaceAll("\\P{InBasic_Latin}", ""));
Output
string ntining nn sii hrters
How to replace non-ASCII characters with their ASCII equivalents?
What if you want to replace “ä” with “a” instead of removing it? You can do that by first normalizing the string to its decomposed form (NFD), which splits each accented character into a base letter followed by a combining mark, and then removing the non-ASCII characters.
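The listing for this step is not present in this copy of the article. A minimal sketch of what it could look like is given here, assuming it reuses the “[^\\x00-\\x7F]” pattern from the previous section; the class name AsciiFold and the helper toAscii are hypothetical, and the sample string is written with Unicode escapes so that the source compiles under any file encoding.

```java
import java.text.Normalizer;

public class AsciiFold {

    // Hypothetical helper: decompose accented characters, then drop everything outside ASCII.
    public static String toAscii(String input) {
        // NFD splits, for example, a-diaeresis (U+00E4) into "a" plus the combining diaeresis (U+0308).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // The base letters are ASCII and are kept; the combining marks are not and are removed.
        return decomposed.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // The article's sample string, written with Unicode escapes.
        String strValue = "\u00C3 string \u00E7\u00F6nt\u00E4ining n\u00F6n \u00E4s\u00E7ii \u00E7h\u00E4r\u00E4\u00E7ters";
        System.out.println(toAscii(strValue));
    }
}
```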
Output
A string containing non ascii characters
Alternatively, you can also use the “[^\\p{ASCII}]” pattern as given below.
import java.text.Normalizer;
String strValue = "Ã string çöntäining nön äsçii çhäräçters";
String str = Normalizer.normalize(strValue, Normalizer.Form.NFD);
System.out.println( str.replaceAll( "[^\\p{ASCII}]", "" ) );
Output
A string containing non ascii characters
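If the text also contains non-ASCII characters that you want to keep, such as letters from other scripts, use the “[\\p{M}]” pattern instead of the “[^\\p{ASCII}]” pattern. It removes only the combining marks (accents) produced by the normalization and leaves all other characters untouched, as given below.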
String strValue = "Ã string çöntäining nön äsçii çhäräçters";
String str = Normalizer.normalize(strValue, Normalizer.Form.NFD);
System.out.println( str.replaceAll( "[\\p{M}]", "" ) );
Output
A string containing non ascii characters
In a regular expression, the “\\p{M}” pattern matches any Unicode combining mark (such as an accent), while the “\\P{M}” pattern matches any character that is not a combining mark (such as the base letter).
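The difference between the two patterns only shows up when the string contains non-ASCII characters that are not accents. The following sketch illustrates this with a German word containing the sharp s, which has no decomposition; the class name MarkStripper and both helper names are hypothetical.

```java
import java.text.Normalizer;

public class MarkStripper {

    // Removes only combining marks after decomposition; other non-ASCII characters are kept.
    public static String stripMarks(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // Removes every non-ASCII character after decomposition.
    public static String stripNonAscii(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // A word spelled with u-diaeresis (U+00FC) and sharp s (U+00DF).
        String word = "Gr\u00FC\u00DFe";
        // u-diaeresis decomposes to "u" plus a combining mark; sharp s stays as it is.
        System.out.println(stripMarks(word).equals("Gru\u00DFe")); // true: sharp s is kept
        System.out.println(stripNonAscii(word).equals("Grue"));    // true: sharp s is dropped
    }
}
```

Which variant fits depends on the goal: removing marks keeps the text readable in its own script, while removing all non-ASCII characters guarantees a pure ASCII result at the cost of losing letters.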
Finally, if you are using the Apache Commons Lang library, you can use the stripAccents method of the StringUtils class to remove accents from the Unicode characters as given below.
import org.apache.commons.lang3.StringUtils;
String strValue = "Ã string çöntäining nön äsçii çhäräçters";
System.out.println(StringUtils.stripAccents(strValue));
Output
A string containing non ascii characters
How to remove only non-printable characters?
If you want to keep only the printable characters and remove all the non-printable characters from the string, you can use the code given below.
//remove all non printable characters
String strValue = "Ã string çöntäining nön äsçii çhäräçters";
System.out.println( strValue.replaceAll("\\P{Print}", "") );
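Please note that the above code also removes the \t (tab), \n (new line), and \r (carriage return) characters. By default, “\\p{Print}” covers only the printable US-ASCII characters, so the non-ASCII characters are removed as well.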
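If the tab and line break characters should survive, they can be added to a negated character class together with “\\p{Print}”. The following is a minimal sketch of that variant, not part of the original example; the class name PrintableFilter and both helper names are hypothetical.

```java
public class PrintableFilter {

    // Removes everything that is not printable US-ASCII, including tab, new line and carriage return.
    public static String keepPrintable(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    // Variant: additionally keeps tab, new line and carriage return.
    public static String keepPrintableAndWhitespace(String input) {
        return input.replaceAll("[^\\p{Print}\\t\\n\\r]", "");
    }

    public static void main(String[] args) {
        // 'a', TAB, 'b', BEL (U+0007, a control character), 'c', LF
        String raw = "a\tb\u0007c\n";
        System.out.println(keepPrintable(raw).equals("abc"));                  // true
        System.out.println(keepPrintableAndWhitespace(raw).equals("a\tbc\n")); // true
    }
}
```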
This example is a part of the Java String tutorial and Java RegEx tutorial.
Please let me know your views in the comments section below.
If I want to remove only some of the non-ASCII characters from a String, what should I use?
Hello Shital,
In that case, I believe you need to replace them individually. You can still use the regular expression, but instead of specifying the range, you need to provide the exact characters you want to replace.
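As a rough sketch of that approach, the characters to remove can be listed explicitly inside the brackets, or replaced one by one with the replace method. The two characters chosen here (ç and ö) are only an example, and the class name SelectiveRemover and its helper names are hypothetical.

```java
public class SelectiveRemover {

    // Removes only c-cedilla (U+00E7) and o-diaeresis (U+00F6); every other character is kept.
    // The characters are listed explicitly in the class instead of using a range.
    public static String removeSelected(String input) {
        return input.replaceAll("[\u00E7\u00F6]", "");
    }

    // Replaces the same two characters with chosen substitutes, one at a time.
    public static String replaceSelected(String input) {
        return input.replace('\u00E7', 'c').replace('\u00F6', 'o');
    }

    public static void main(String[] args) {
        // First word of the article's sample string; a-diaeresis (U+00E4) is left alone.
        String word = "\u00E7\u00F6nt\u00E4ining";
        System.out.println(removeSelected(word).equals("nt\u00E4ining"));    // true
        System.out.println(replaceSelected(word).equals("cont\u00E4ining")); // true
    }
}
```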
I hope it helps.
Thanks.
thanks very much, it’s perfect
Glad you liked it. Thanks.