Can two different strings when encoded with different encodings have the same byte sequence?

Question 1

Can two different strings, when encoded with different encodings, have the same byte sequence? That is, are there some "string one" and "string two" in the example below that, when encoded using two different encodings (Cp1252 and UTF-8 are just examples), cause the test to pass?

import java.io.UnsupportedEncodingException;
import java.util.Arrays;
import org.junit.Assert;
import org.junit.Test;

public class EncodingTest {
    @Test
    public void test() throws UnsupportedEncodingException {
        final byte[] sequence1 = "string one".getBytes("Cp1252");
        final byte[] sequence2 = "string two".getBytes("UTF-8");
        Assert.assertTrue(Arrays.equals(sequence1, sequence2));
    }
}

A bug in my code hashes the byte sequence generated from a String with the JVM's default encoding, and I need to verify whether that can cause hash collisions when the code is run with different strings and different JVM file encodings (which can happen when it runs on Windows and Linux, for example).

Since an encoding is a mapping between byte sequences and characters, I think there may be some strings and encodings that pass the above test. I just wanted to know whether there are any well-known examples, or some good reasons why I shouldn't rely on hash collisions not happening.

Thanks

PS: This is only for encodings supported by JDK 1.6 and not by some made up ones.

Question 2

Using the "default encoding" is often .. suspect.
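To make that comment concrete, here is a minimal sketch of pinning the charset before hashing, so the bytes no longer depend on the platform default. The class name and stableHash are made up for illustration, and Arrays.hashCode only stands in for whatever hash the real code uses. String.getBytes(Charset) is available from Java 6.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class ExplicitCharsetHash {
    // UTF-8 is one of the charsets every Java platform is required to support.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Hypothetical stand-in for the real hash; only the getBytes call matters here.
    static int stableHash(String s) {
        return Arrays.hashCode(s.getBytes(UTF8));
    }

    public static void main(String[] args) {
        String s = "caf\u00E9"; // "cafe" with an acute accent on the e
        // The first byte array can differ between machines; the second cannot.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default bytes:   " + Arrays.toString(s.getBytes()));
        System.out.println("UTF-8 bytes:     " + Arrays.toString(s.getBytes(UTF8)));
        System.out.println("stable hash:     " + stableHash(s));
    }
}
```

With the charset fixed, the same string gives the same bytes on Windows and Linux, so only genuinely different inputs can differ in their hashed bytes.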

Question 3

Note that this question is asking the inverse of what some answers are responding to: it is not asking whether two identical strings with different encodings can generate the same byte sequence, it is asking whether two different strings with different encodings can generate the same byte sequence. (And more specifically, whether there is a "known case" of such a collision.)

Question 4

Why do hash collisions matter? Hash codes don't need to be unique.

Question 5

You do realize that unless you live in a world with at most 4 billion possible Strings, there will, by definition, be hash collisions?

Question 6

Yes. To take a simple example, the string "¡" (the inverted exclamation mark) encoded as ISO-8859-1 and the string "Ą" (capital A with ogonek) encoded as ISO-8859-2 both become the single-byte sequence A1 (hex). It is more or less obvious that such things happen with the very simple encodings that map characters to single bytes; otherwise they would not be different encodings. It can surely happen when more complicated encoding schemes are involved, too.
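The collision described above can be checked with a short sketch. The class and method names are made up for illustration, and it assumes the JVM ships ISO-8859-2 (it is not among the charsets every Java platform must support, though common JDKs include it).

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class Latin1VsLatin2 {
    // Encode with a named charset; Charset.forName throws an unchecked
    // exception if the JVM does not know the name.
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] a = encode("\u00A1", "ISO-8859-1"); // inverted exclamation mark
        byte[] b = encode("\u0104", "ISO-8859-2"); // capital A with ogonek
        // Both arrays hold the single byte 0xA1, which Java prints as -95.
        System.out.println(Arrays.toString(a) + " vs " + Arrays.toString(b));
        System.out.println("equal: " + Arrays.equals(a, b));
    }
}
```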

Question 7

Here's an easy one: most code pages and UTF-8 share the ASCII range (0x00 to 0x7F). If your text is plain English, there's a good chance its bytes are pure ASCII whatever the declared encoding, since it uses mostly plain, unaccented characters.
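A small sketch of the shared ASCII range, using only charsets every Java platform must support (the class and method names are illustrative). Note that it compares the same string under different encodings, so on its own it does not show two different strings colliding.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class SharedAsciiRange {
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String plain = "plain English";
        // Every character is below U+0080, so all three encodings agree.
        System.out.println(Arrays.equals(encode(plain, "US-ASCII"), encode(plain, "UTF-8")));
        System.out.println(Arrays.equals(encode(plain, "ISO-8859-1"), encode(plain, "UTF-8")));

        String accented = "caf\u00E9"; // "cafe" with an acute accent on the e
        // U+00E9 is one byte in ISO-8859-1 but two bytes in UTF-8.
        System.out.println(Arrays.equals(encode(accented, "ISO-8859-1"), encode(accented, "UTF-8")));
    }
}
```

The first two comparisons are true and the third is false: the encodings only diverge once a character falls outside 0x00 to 0x7F.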

Question 8

How does that lead to two different input strings that end up with the same bytes from getBytes()?

Question 9

If the source string contains characters that the target encoding cannot represent (for example, text that needs a multi-byte encoding being written to a single-byte character set), it seems reasonable that one could get a collision, since every unmappable character has to be replaced by something the target can express.

For example, if the input strings are written in Chinese and the target character set is US-ASCII, many Chinese characters will certainly be mapped to the same US-ASCII representation. In Java, String.getBytes() substitutes the charset's replacement byte, which for US-ASCII is '?'.
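A minimal sketch of that lossy mapping (class and method names are illustrative): two different two-character CJK strings are encoded to US-ASCII, and because String.getBytes replaces each unmappable character with the charset's replacement byte, both come out as two '?' bytes.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class UnmappableToQuestionMark {
    static byte[] toAscii(String s) {
        // getBytes never throws on unmappable characters; it substitutes
        // the charset's replacement byte, '?' (0x3F) for US-ASCII.
        return s.getBytes(Charset.forName("US-ASCII"));
    }

    public static void main(String[] args) {
        byte[] first = toAscii("\u4E2D\u6587");  // two CJK characters
        byte[] second = toAscii("\u65E5\u672C"); // two different CJK characters
        System.out.println(Arrays.toString(first) + " vs " + Arrays.toString(second));
        System.out.println("equal: " + Arrays.equals(first, second));
    }
}
```

This collision comes from data loss within a single target encoding, so it can occur even when both strings are encoded the same way.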

Question 10

This code should produce an example eventually:

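    // Needs java.util.Arrays and java.util.Random.
    Random r = new Random();
    byte[] bytes = new byte[4];
    while (true) {
        r.nextBytes(bytes);
        try {
            String raw = Arrays.toString(bytes);
            String utf8 = new String(bytes, "UTF-8");
            String latin1 = new String(bytes, "ISO-8859-1");
            // new String(...) does not throw on malformed input; it substitutes U+FFFD.
            // Skip those, and skip pure-ASCII draws where both strings are identical.
            if (utf8.indexOf('\uFFFD') >= 0 || utf8.equals(latin1)) {
                continue;
            }
            System.out.println(raw + " is " + utf8 + " or " + latin1);
            break;
        } catch (java.io.UnsupportedEncodingException e) {
            // Both charsets are mandatory on every JVM, so this should not happen.
            throw new RuntimeException(e);
        }
    }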

Question 11

Yes, it is possible, at least for strings of different lengths.

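The string "\u2020" (or "†") is encoded as 0x20,0x20 in UTF-16 when no byte-order mark is written (in Java, "UTF-16BE" or "UTF-16LE"; plain "UTF-16" prepends a BOM when encoding). This is also what "\x20\x20" (a string of two ASCII spaces) is encoded to in ASCII.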

Of course, the dagger doesn't come up in language very often [=^_^=], but some standard [non-Latin] alphabets could generate similar byte sequences that map onto standard (non-control-character) ASCII, and many more if the restriction about control characters were relaxed.

It would be more interesting to find a case where two similar "realistic" strings (e.g. same length and "sensible data") could map onto the same byte sequence with different encodings ..
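The dagger example can be reproduced with a short sketch (class and method names are illustrative). It uses "UTF-16BE" because Java's plain "UTF-16" writes a byte-order mark first when encoding, which would prevent the collision.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class DaggerVsSpaces {
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] dagger = encode("\u2020", "UTF-16BE"); // dagger, no byte-order mark
        byte[] spaces = encode("  ", "US-ASCII");     // two ASCII spaces
        System.out.println(Arrays.toString(dagger) + " vs " + Arrays.toString(spaces));
        System.out.println("equal: " + Arrays.equals(dagger, spaces));
        // With plain "UTF-16" the encoded form is longer because of the BOM.
        System.out.println("UTF-16 length: " + encode("\u2020", "UTF-16").length);
    }
}
```

Since both bytes of U+2020 are 0x20, "UTF-16LE" gives the same two bytes as "UTF-16BE" here.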

score 2 · Accepted Answer · 2012-07-20 04:47:11Z

Yes. To take a simple example, the string "¡" (the inverted exclamation mark) encoded as ISO-8859-1 and the string "Ą" (capital A with ogonek) encoded as ISO-8859-2 both become the single-byte sequence A1 (hex). It is more or less obvious that such things happen with the very simple encodings that map characters to single bytes; otherwise they would not be different encodings. It can surely happen when more complicated encoding schemes are involved, too.
