2

I'm calling a service which errors, telling me there is an encoding problem with the below String:

Universal®

It is my understanding that this String is "utf8" encoded. Is this a correct understanding of utf8 encoding? If so, does this indicate that I should remove the utf8 encoding? If so, any suggestions on how I can de-encode a utf8 String in Java?

Or am I wrong, and the above String is not utf8 encoded? If so, any suggestions how to encode it?

Cœur
asked Sep 11, 2015 at 13:55
7
  • 2
    "an encoding problem" - can you be more specific? Is it not displayed properly, or has it given you a particular error? Commented Sep 11, 2015 at 13:59
  • Sorry, that is part of my issue, the error I receive is precisely that vague. It says "Encoding problem". Which is why I'm wondering if it's implying that I'm utf8 encoded when I shouldn't be, or that I'm not utf8 encoded and I should be. Commented Sep 11, 2015 at 14:03
  • "calling a service" - how? SOAP? java method call? Commented Sep 11, 2015 at 14:09
  • In the context of a Java String object, internally the String is not encoded in UTF-8. It is encoded in UTF-16. That's largely irrelevant, though: the issue is all about how you transfer the string data to the service you are trying to call, and about how that service expects you to do so. Apparently those are mismatched. Commented Sep 11, 2015 at 14:09
  • 1
    Looking at the rendered output of a string doesn't tell anything. It's the underlying binary values that make sense. Joel on Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) Commented Sep 11, 2015 at 14:57

4 Answers

6

How Java stores the string isn't the same as how it is encoded in messages. You can try something like:

String s = "Universal®";
byte[] encoded = s.getBytes(Charset.forName("UTF-8"));

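Charset.forName can throw UnsupportedCharsetException, but that exception is unchecked and UTF-8 is a standard charset that every Java platform must support, so you don't need to catch it here. (The checked UnsupportedEncodingException only comes up with the older s.getBytes("UTF-8") overload, which takes the charset name as a String.)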

Or you may need to declare the encoding in the API you use for sending, for example with the HTTP header Content-Type: text/plain; charset=UTF-8.
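As a rough sketch of that second option: the bytes you send and the charset you declare have to agree. The TextPayload class below is a made-up helper (not part of any library, and not tied to whatever client the service actually needs) that derives both from the same Charset so they cannot drift apart. The text/plain media type is taken from the example header above and may not be what your service expects.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a real API: keeps the encoded body and the
// charset label that describes it together so the two always match.
class TextPayload {
    final byte[] body;         // the encoded bytes to send
    final String contentType;  // header value announcing their encoding

    private TextPayload(byte[] body, String contentType) {
        this.body = body;
        this.contentType = contentType;
    }

    // Encodes the text and builds the matching Content-Type value
    // from the same Charset object.
    static TextPayload of(String text, Charset charset) {
        return new TextPayload(
                text.getBytes(charset),
                "text/plain; charset=" + charset.name());
    }

    public static void main(String[] args) {
        TextPayload p = TextPayload.of("Universal\u00AE", StandardCharsets.UTF_8);
        System.out.println(p.contentType + ", " + p.body.length + " bytes");
    }
}
```

How you attach body and contentType to the request depends on the client library you use to call the service.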

answered Sep 11, 2015 at 14:11

2 Comments

For Java 7+, you can use StandardCharsets.UTF_8 instead of Charset.forName("UTF-8")
@Andreas Sweet! I hadn't caught that change. Now I don't have to get annoyed at useless boilerplate try/catch to get the UTF-8 charset. :)
2

"Universal®" contains ® (U+00AE), which cannot be represented in plain 7-bit ASCII, though it can be in several other charsets/encodings. UTF-8, a universal Unicode encoding, can represent text in any script.

An encoding only applies once the text has been converted to bytes; only then can you say which encoding it is in.

In Java, a String is Unicode (UTF-16) internally and can hold any character.

The encoding of a Java source file is not fixed, however, so it must match the encoding that the Java compiler javac assumes when reading it. Alternatively, you can use \u escapes, which use only ASCII to represent special symbols (as UTF-16 code units):

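import java.nio.charset.StandardCharsets;

String s = "Universal\u00AE";                          // U+00AE is the registered sign
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);     // encode: String -> UTF-8 bytes
String t = new String(bytes, StandardCharsets.UTF_8);  // decode: UTF-8 bytes -> String
assert t.equals(s);                                    // round trip is lossless (run with -ea)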
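To make the point about bytes concrete, here is a small sketch of what a sender/receiver mismatch can look like. The class name and the hex helper are invented for illustration, and ISO-8859-1 is only an assumed example of "the other" charset; the real service may expect something else.

```java
import java.nio.charset.StandardCharsets;

// Illustrative demo only; class and method names are made up.
class EncodingMismatchDemo {

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Universal\u00AE";

        // The same text becomes different bytes in different encodings.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);         // registered sign -> C2 AE
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);  // registered sign -> AE
        System.out.println(hex(utf8));
        System.out.println(hex(latin1));

        // Decoding with the wrong charset silently corrupts the text.
        String utf8AsLatin1 = new String(utf8, StandardCharsets.ISO_8859_1);
        String latin1AsUtf8 = new String(latin1, StandardCharsets.UTF_8);
        System.out.println(utf8AsLatin1.equals(s)); // false: an extra U+00C2 appears
        System.out.println(latin1AsUtf8.equals(s)); // false: AE alone is invalid UTF-8
    }
}
```

A lone AE byte is not valid UTF-8, so new String substitutes the replacement character U+FFFD, which is one plausible way a strict service ends up reporting an "encoding problem".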
answered Sep 11, 2015 at 14:36

Comments

1

In a very general sense, an encoding is just the scheme for arranging and allocating the bits used to represent strings. See the link below for more detailed information. Most encodings can be converted to one another, but there are a few exceptions. You have probably seen the large blank squares and similar placeholders that mark a symbol that cannot be displayed. This is generally caused by an encoding error (such as the character not existing in that encoding scheme).

https://en.wikipedia.org/wiki/UTF-8

As for your specific problem, the string you listed can be encoded in UTF-8. It may have been saved in another encoding (which may be causing your issue). You could always try converting it to UTF-8 and see what happens.

Edit - With regard to the comments, I expect the issue is that the string is not being encoded properly before it is transferred to the service.
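If you want to check up front whether a given charset can represent the string at all, something along these lines should work. It is only a sketch: the class and method names are made up, and it relies on the standard CharsetEncoder.canEncode method.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper; the class and method names are hypothetical.
class EncodableCheck {

    // Returns true if every character of the text can be encoded
    // in the given charset without loss.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String s = "Universal\u00AE";
        System.out.println(canEncode(s, StandardCharsets.US_ASCII));   // no registered sign in ASCII
        System.out.println(canEncode(s, StandardCharsets.ISO_8859_1)); // has U+00AE
        System.out.println(canEncode(s, StandardCharsets.UTF_8));      // covers all of Unicode
    }
}
```

This only tells you whether the string can be encoded; it says nothing about which encoding the service expects to receive.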

answered Sep 11, 2015 at 14:04

1 Comment

Everything you say is true, but this doesn't seem to actually answer the question very well.
-2

A quick look here: http://www.utf8-chartable.de/ (and we should know that without looking, people) shows that @ is indeed a utf8 character. So, dunno what framework complains about it not being such, but it's wrong

answered Sep 11, 2015 at 13:58

2 Comments

That @ should be ®, however it is still available for UTF-8 (Registered Sign I think)
UTF-8 is an encoding. Yes, it can encode the character ®, but that has no bearing on whether a particular byte sequence encoding that character in fact employs UTF-8, as opposed to any of the several alternatives.
