3

Can two different strings, when encoded with different encodings, produce the same byte sequence? That is, are there some "string one" and "string two" in the example below that, when encoded using two different encodings (Cp1252 and UTF-8 are just examples), will cause the test to pass?

import java.io.UnsupportedEncodingException;
import java.util.Arrays;
import org.junit.Assert;
import org.junit.Test;

public class EncodingTest {
    @Test
    public void test() throws UnsupportedEncodingException {
        final byte[] sequence1 = "string one".getBytes("Cp1252");
        final byte[] sequence2 = "string two".getBytes("UTF-8");
        Assert.assertTrue(Arrays.equals(sequence1, sequence2));
    }
}

A bug in my code hashes a byte sequence generated from a String with the JVM's default encoding, and I need to verify whether that can cause hash collisions when the code is run with different strings and different JVM file encodings (which can happen when it runs on Windows and Linux, for example).

Since an encoding is a mapping between byte sequences and characters, I think there may be some strings and encodings that pass the above test. But I wanted to know whether there are any well-known examples, or some good reasons why I shouldn't rely on hash collisions not happening.

Thanks

PS: This is only about encodings supported by JDK 1.6, not made-up ones.

asked Jul 20, 2012 at 1:26
4
  • Using the "default encoding" is often .. suspect. Commented Jul 20, 2012 at 2:01
  • 1
    Note that this question is asking the inverse of what some answers are responding to; it is not if two identical strings with different encodings can generate the same byte-sequence, it is asking if two different strings with different encodings can generate the same byte-sequence. (And more specifically, if there is a "known case" of such a collisions.) Commented Jul 20, 2012 at 2:03
  • why do hash collisions matter? hash codes don't need to be unique. Commented Jul 20, 2012 at 3:23
  • you do realize that unless you live in a world with at most 4 billion possible Strings, then there will, by definition, be hash collisions. Commented Jul 20, 2012 at 3:29

5 Answers

2

Yes. To take a simple example, the string "¡" (the inverted exclamation mark) encoded as ISO-8859-1 and the string "Ą" (capital A with ogonek) encoded as ISO-8859-2 both become the single-byte sequence A1 (hex). It is more or less obvious that such things happen with the very simple encodings that map characters to single bytes: if two of them never assigned the same byte to different characters, they would not be different encodings. It can surely happen when more complicated encoding schemes are involved, too.
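
A minimal sketch of that example in Java, in case it helps to check it locally. The class and method names (`SingleByteCollision`, `encode`) are made up for illustration, and the characters are written as Unicode escapes so the result does not depend on the source file's encoding:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class SingleByteCollision {
    // Hypothetical helper: encodes s with the named charset and returns the raw bytes.
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] latin1 = encode("\u00A1", "ISO-8859-1"); // inverted exclamation mark
        byte[] latin2 = encode("\u0104", "ISO-8859-2"); // capital A with ogonek

        // Both arrays should hold the single byte 0xA1.
        System.out.println(Arrays.toString(latin1) + " vs " + Arrays.toString(latin2));
        System.out.println("same bytes: " + Arrays.equals(latin1, latin2));
    }
}
```

Any hash computed over those two byte arrays would be identical, even though the strings differ.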

answered Jul 20, 2012 at 4:47

Comments

1

Here's an easy one: most codepages and UTF-8 share the ASCII range (0x00 to 0x7F). If your text is in plain English, there's a good chance it is pure ASCII, whatever the declared encoding, since it uses mostly plain, non-accented characters.
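
As a rough illustration of the ASCII overlap, here is a small sketch that compares the bytes one string produces under two charsets. The class and method names (`AsciiOverlap`, `sameBytes`) are hypothetical; `windows-1252` is the canonical name of what the question calls Cp1252:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class AsciiOverlap {
    // Hypothetical helper: true if s encodes to the same bytes under both charsets.
    static boolean sameBytes(String s, String charsetA, String charsetB) {
        byte[] a = s.getBytes(Charset.forName(charsetA));
        byte[] b = s.getBytes(Charset.forName(charsetB));
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        // Pure ASCII text: the two encodings agree byte for byte.
        System.out.println(sameBytes("plain English", "windows-1252", "UTF-8"));
        // One accented character (e with acute) is enough to make them diverge.
        System.out.println(sameBytes("caf\u00E9", "windows-1252", "UTF-8"));
    }
}
```

Note that this shows the same string encoding identically, which by itself is not a collision between two different strings.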

answered Jul 20, 2012 at 1:40

1 Comment

How does that lead to two different input strings that end up with the same bytes from getBytes()?

1

If the source string contains characters that need multiple bytes in some encodings and the target encoding is a single-byte one that cannot represent them, it seems reasonable that one could get a collision, since each of those characters has to be mapped to something in the single-byte character set. In Java, String.getBytes() replaces every unmappable character with the charset's replacement byte, which is typically '?'.

For example, if the input strings are written in Chinese and the target character set is US-ASCII, many Chinese characters will certainly be mapped to the same US-ASCII representation.
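
A short sketch of that effect, assuming the replacement behaviour of `String.getBytes(Charset)`. The class and method names (`UnmappableCollision`, `toAscii`) are only illustrative:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class UnmappableCollision {
    // Hypothetical helper: encodes s as US-ASCII; unmappable characters
    // are replaced by the charset's replacement byte.
    static byte[] toAscii(String s) {
        return s.getBytes(Charset.forName("US-ASCII"));
    }

    public static void main(String[] args) {
        byte[] first = toAscii("\u4E2D");   // a Chinese character
        byte[] second = toAscii("\u6587");  // a different Chinese character
        byte[] question = "?".getBytes(Charset.forName("UTF-8"));

        // Two different strings, same bytes under the same lossy encoding.
        System.out.println(Arrays.equals(first, second));
        // A different string under a different encoding, same bytes again.
        System.out.println(Arrays.equals(first, question));
    }
}
```

So with a lossy target charset, collisions do not even require two different encodings.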

answered Jul 20, 2012 at 1:43

Comments

1

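This code should produce an example eventually (it needs java.util.Arrays and java.util.Random, and the enclosing method must declare UnsupportedEncodingException):

    Random r = new Random();
    byte[] bytes = new byte[4];
    while (true) {
        r.nextBytes(bytes);
        String utf8 = new String(bytes, "UTF-8");
        // "ISO-LATIN-1" is not a charset name Java recognizes; use "ISO-8859-1".
        String latin1 = new String(bytes, "ISO-8859-1");
        // Malformed UTF-8 decodes to U+FFFD instead of throwing, so check that the
        // bytes survive a round trip, and that the two decoded strings really differ.
        if (Arrays.equals(utf8.getBytes("UTF-8"), bytes) && !utf8.equals(latin1)) {
            System.out.println(Arrays.toString(bytes) + " is " + utf8 + " or " + latin1);
            break;
        }
    }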
answered Jul 20, 2012 at 2:13

Comments

1

Yes, it is possible, at least for strings of different lengths.

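The string "\u2020" (or "†") is encoded as 0x20,0x20 in UTF-16 without a byte-order mark (in Java that means "UTF-16BE" or "UTF-16LE"; plain "UTF-16" prepends a BOM when encoding). This is also what "\x20\x20" (a string of two ASCII spaces) is encoded to in ASCII.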
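
Here is a small sketch of the dagger example for anyone who wants to try it. The names `DaggerCollision` and `encode` are invented for this illustration; the charset names are standard Java ones:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DaggerCollision {
    // Hypothetical helper: encodes s with the named charset and returns the raw bytes.
    static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] dagger = encode("\u2020", "UTF-16BE"); // one character, two bytes
        byte[] spaces = encode("  ", "US-ASCII");     // two characters, two bytes
        System.out.println(Arrays.equals(dagger, spaces));

        // With plain "UTF-16" the encoder writes a byte-order mark first,
        // so the result is four bytes long and no longer matches.
        byte[] withBom = encode("\u2020", "UTF-16");
        System.out.println(withBom.length);
    }
}
```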

Of course, the dagger doesn't come up in language very often [=^_^=], but some standard [non-Latin] alphabets could generate similar byte sequences that map onto standard (non-control-character) ASCII, and many more if the restriction on control characters is relaxed.

It would be more interesting to find a case where two similar "realistic" strings (e.g. same length and "sensible data") map onto the same byte sequence with different encodings.

answered Jul 20, 2012 at 2:10

Comments
