Can two different strings, when encoded with different encodings, have the same byte sequence? That is, are there some "string one" and "string two" that, when encoded with two different encodings as in the example below (Cp1252 and UTF-8 are just examples), cause the test to pass?
import java.io.UnsupportedEncodingException;
import java.util.Arrays;
import org.junit.Assert;
import org.junit.Test;

public class EncodingTest {
    @Test
    public void test() throws UnsupportedEncodingException {
        final byte[] sequence1 = "string one".getBytes("Cp1252");
        final byte[] sequence2 = "string two".getBytes("UTF-8");
        Assert.assertTrue(Arrays.equals(sequence1, sequence2));
    }
}
A bug in my code hashes the byte sequence generated from a String using the JVM's default encoding, and I need to verify whether that can cause hash collisions when the code is run with different strings and different JVM file encodings (which can happen when it is run on Windows and Linux, for example).
Since an encoding is a mapping between byte sequences and characters, I think there may be some strings and encodings that pass the above test. I just wanted to know whether there are any well-known examples, or some good reasons why I shouldn't rely on hash collisions not happening.
Thanks
PS: This is only about encodings supported by JDK 1.6, not made-up ones.
- Using the "default encoding" is often .. suspect. – user166390, Jul 20, 2012 at 2:01
- Note that this question is asking the inverse of what some answers are responding to; it is not if two identical strings with different encodings can generate the same byte-sequence, it is asking if two different strings with different encodings can generate the same byte-sequence. (And more specifically, if there is a "known case" of such collisions.) – user166390, Jul 20, 2012 at 2:03
- why do hash collisions matter? hash codes don't need to be unique. – jtahlborn, Jul 20, 2012 at 3:23
- you do realize that unless you live in a world with at most 4 billion possible Strings, then there will, by definition, be hash collisions. – jtahlborn, Jul 20, 2012 at 3:29
5 Answers
Yes. To take a simple example, the string "¡" (the inverted exclamation mark) encoded as ISO-8859-1 and the string "Ą" (capital A with ogonek) encoded as ISO-8859-2 both become the single-byte sequence A1 (hex). It is more or less obvious that such things happen with the very simple encodings that map characters to single bytes; otherwise they would not be different encodings. It can surely happen when more complicated encoding schemes are involved, too.
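The example above can be checked with a few lines of Java. This is only a sketch: the class name and the `encode` helper are made up for illustration, and it assumes the JVM ships the ISO-8859-2 charset (only ISO-8859-1 is guaranteed on every Java platform).

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class SingleByteCollision {
    // Hypothetical helper: encodes a string with the named charset.
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file's encoding.
        byte[] latin1 = encode("\u00A1", "ISO-8859-1"); // inverted exclamation mark
        byte[] latin2 = encode("\u0104", "ISO-8859-2"); // capital A with ogonek
        // Both arrays hold the single byte 0xA1, which Java shows as the signed value -95.
        System.out.println(Arrays.toString(latin1) + " " + Arrays.toString(latin2));
        System.out.println(Arrays.equals(latin1, latin2));
    }
}
```

The two input strings are different characters, yet the byte arrays compare equal, which is exactly the condition the test in the question checks for.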
Here's an easy one: most codepages and UTF-8 share the ASCII encoding (0x00 to 0x7F). If your text is in plain English, there's a big chance that it's effectively ASCII, whatever the declared encoding, since it'd use mostly plain, non-accented characters.
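A small sketch of that shared ASCII range, assuming the JVM provides the windows-1252 charset (the name Java uses for Cp1252); the class and method names are hypothetical. Note that this shows the same string producing the same bytes, which is the inverse of what the question asks.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class AsciiSubset {
    // Hypothetical helper: true if s encodes to identical bytes under both charsets.
    static boolean sameBytes(String s, String charsetA, String charsetB) {
        byte[] a = s.getBytes(Charset.forName(charsetA));
        byte[] b = s.getBytes(Charset.forName(charsetB));
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        // Pure ASCII text: one byte per character in both encodings.
        System.out.println(sameBytes("plain English text", "windows-1252", "UTF-8"));
        // U+00E9 (e with acute) is one byte in windows-1252 but two bytes in UTF-8.
        System.out.println(sameBytes("caf\u00E9", "windows-1252", "UTF-8"));
    }
}
```

For the hashing scenario in the question, this means ASCII-only input hashes the same under either default encoding, while the differences start with the first non-ASCII character.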
If the source string is in an encoding that supports multi-byte characters and the target encoding is one that does not support multi-byte characters, it seems reasonable that one could get a collision since multi-byte characters will require a mapping to a single byte character set.
For example, if the input strings are written in Chinese and the target character set is US-ASCII, many Chinese characters will certainly be mapped to the same US-ASCII representation.
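In Java this lossy mapping is easy to observe, because `String.getBytes` replaces characters the charset cannot represent with the charset's replacement bytes, which for US-ASCII is `?` (0x3F). A minimal sketch, with a hypothetical class and helper name:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class LossyAsciiCollision {
    // Hypothetical helper: encodes to US-ASCII; unmappable characters become '?'.
    static byte[] toAscii(String s) {
        return s.getBytes(Charset.forName("US-ASCII"));
    }

    public static void main(String[] args) {
        byte[] first = toAscii("\u4E2D");  // a Chinese character
        byte[] second = toAscii("\u6587"); // a different Chinese character
        // Each character is replaced by a single '?' byte, so the arrays are equal.
        System.out.println(Arrays.toString(first) + " " + Arrays.toString(second));
        System.out.println(Arrays.equals(first, second));
    }
}
```

Here the collision happens even within a single encoding, so any two strings of unmappable characters with the same length would hash identically.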
This code should produce an example eventually:
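// Imports needed: java.nio.ByteBuffer, java.nio.charset.CharacterCodingException,
// java.nio.charset.Charset, java.util.Arrays, java.util.Random
Random r = new Random();
while (true) {
    byte[] bytes = new byte[4];
    r.nextBytes(bytes);
    try {
        String raw = Arrays.toString(bytes);
        // A CharsetDecoder reports malformed input by default, unlike
        // new String(bytes, "UTF-8"), which silently substitutes U+FFFD.
        String utf8 = Charset.forName("UTF-8").newDecoder()
                .decode(ByteBuffer.wrap(bytes)).toString();
        // Java's canonical name for Latin-1 is "ISO-8859-1"; every byte sequence is valid in it.
        String latin1 = new String(bytes, Charset.forName("ISO-8859-1"));
        if (!utf8.equals(latin1)) { // skip pure-ASCII draws, which decode identically
            System.out.println(raw + " is " + utf8 + " or " + latin1);
            break;
        }
    } catch (CharacterCodingException e) {
        // Not valid UTF-8; try another random sequence.
    }
}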
Yes, it is possible, at least for strings of different lengths.
The string "\u2020" (or "†") is encoded as 0x20,0x20 in UTF-16. This is also what "\x20\x20" (a string of two ASCII spaces) is encoded to in ASCII.
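A short Java sketch of the dagger example, with a hypothetical class and helper name. One caveat when reproducing it: Java's "UTF-16" charset prepends a byte-order mark when encoding, so the sketch uses "UTF-16BE", which writes the two bytes without a BOM.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DaggerCollision {
    // Hypothetical helper: encodes a string with the named charset.
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] dagger = encode("\u2020", "UTF-16BE"); // one character, two bytes: 0x20 0x20
        byte[] spaces = encode("  ", "US-ASCII");     // two spaces, one byte each: 0x20 0x20
        System.out.println(Arrays.toString(dagger) + " " + Arrays.toString(spaces));
        System.out.println(Arrays.equals(dagger, spaces));
        // With plain "UTF-16" the BOM makes the result four bytes long, so it no longer matches.
        System.out.println(encode("\u2020", "UTF-16").length);
    }
}
```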
Of course, the dagger doesn't come up in language very often [=^_^=], but some standard [non-Latin] alphabets could generate similar byte-sequences that map onto a standard (non-control character) ASCII encoding .. and many more if the restriction about control characters is relaxed.
It would be more interesting to find a case where two similar "realistic" strings (e.g. same length and "sensible data") could map onto the same byte-sequence with different encodings ..