I have a database that has two columns. The first column is an index, the second is the path to a data file. There are two types of data files, X and Y. These data files are then processed and graphs are created from them. So some examples of the rows look like this:
ID___| FilePath
0001 | /X/datafile1wfre.dat
0023 | /X/datafile89_jncd.dat
2349 | /Y/datafile983jew_un.dat
3984 | /Y/datafileindj389.dat
I am then taking this table, choosing a random row from it, and showing the graph of the data file to the user. After they have spent time looking at the graph, I am then going to ask them, do you think this data file is X or Y?
Let's say someone looks at a graph and this person would like to view that graph at a later point in time. I would then give them an ID of the row. Note: There are ~4000 entries in the table.
Here's the issue, the way to file paths are added to the table, the first half of the table are the paths of X (ID's 0001 - 2000), and the second half of the database are the paths of Y (ID's 2001 - 4000). Someone could easily figure this out and once they see the ID, they would be able to make a prediction of it being X or Y just based on if the ID is above or below 2000.
Here's my goal. I would like to have an algorithm that can take a 4 digit number A and make another (different) 4 digit number B. I want B to be unique to A, no other 4 digit number could make B except A. Here's an example:
0239 would create 9834
7783 would create 3892
9834 is unique to 0239. No matter what 4 digit number you have, the only way to get 9834 is from 0239. Same with 3892, the only way to get 3892 is from 7783.
This way, I can give the generated 4 digit from the algorithm to the user without having them see the Real ID from the table.
-
Could someone help explain how this was unclear? I think I laid out what I'm trying to do clearly but it may need some help, can someone please tell me which part is unclear so I can improve my question?Tom– Tom2016年07月26日 15:11:45 +00:00Commented Jul 26, 2016 at 15:11
-
Do this once: for each number between 1 and 9999, toss a coin. Heads, it's X, Tails, it's Y. Instead of tossing a coin, use a random number generator to deliver numbers between 0 and 1. If the value is < 0.5, it's X otherwise it's Y. If you need the number of Xs and Ys to match, you'll need to tweak the output.Dan Pichelman– Dan Pichelman2016年07月26日 15:13:19 +00:00Commented Jul 26, 2016 at 15:13
-
@DanPichelman That might work but I'd like an algorithm that I can use instead of iterating through every number from 0 to 9999Tom– Tom2016年07月26日 15:14:44 +00:00Commented Jul 26, 2016 at 15:14
-
1Second paragraph is particularly indecipherable.Tulains Córdova– Tulains Córdova2016年07月26日 15:16:52 +00:00Commented Jul 26, 2016 at 15:16
-
3I voted for reopen now that the question is a lot more clear.Tulains Córdova– Tulains Córdova2016年07月26日 15:43:57 +00:00Commented Jul 26, 2016 at 15:43
2 Answers 2
There are fundamentally different solutions to this problem as you've presented it.
A truly random mapping from number A (private ID) to number B (public ID) can be created if you have an independent source of random. Every time you create another row and are assigned number A read from random and create number B. To ensure B is unique you will have to search all existing B's before assigning it. This would be the hardest for anyone to reverse engineer. It's basicly what cryptography calls a one time pad. It's also increasingly prohibitive as you get closer and closer to fully populating the space you've allowed. You eventually get to where there is only 1 number left to assign as a B. You have to wait to find it randomly and you have to search to prove uniqueness on every attempt.
A fixed transformation of number A into B by a function. This avoids becoming prohibitive even when fully populating the space. It also risks the user guessing the algorithm. This can be mitigated if instead of simply using a hash to do this, you encrypt the number A. There are encryption algorithms that produce the same size crypto text as plain text and take a cryptovariable (key). Done this way it wouldn't matter if they guessed how you created B so long as the cryptovariable (key) was still a secret. You would want to use a format preserving encryption. This gives you the ability to predict A from a B but if you index B on the database this shouldn't be needed.
If you feel that is overkill you could look into a shuffle that simply obfuscates A. This risks them guessing the shuffle unless it also uses a crypto variable.
It's also worth considering if A is even still needed. If the only thing A provides is a unique identifier then there is no point in being able to convert back to A and no reason to store A in the database when B is all that is needed. This means all you have to do is uniquely randomize the auto increment ID because this will give you B to start with. Some DB's already provide this. This way you have unique id's that don't predict x or y and avoid an unneeded level of indirection.
Considering that:
- The number of IDs is finite and small (roughly 2k rows in one table and 2k in the other table)
- The "generated alias ID" must be composed of 4 numbers, meaning the number of collisiones will be too high if you try any hash function.
Then I recommend:
- You preload the alias ID in another column of the same table, so you don't have to calculate it in real time.
I created this solution using bash to create the pairs, maybe you can replicate in another language:
I created a file containing strings from 0000 to 4000
$ seq -f "%04g" 0 4000 > /tmp/data.txt
I created a second file with strings 0000 to 9999 (this will be the fake IDs)
$ seq -f "%04g" 0 9999 > /tmp/data2.txt
I scrambled the second file:
$ sort -U /tmp/data2.txt > /tmp/data3.txt
I truncated the resulting file to exactly the first 4001 rows
$ head -n 4001 > /tmp/data4.txt
Then I paired all original IDs with their alias IDs
$ paste data.txt data4.txt > data5.txt
After that you have a file (data5.txt) that you can use to populate both your tables with the sequential ID and it's random 4 number ID.
0000 3675
0001 2464
0002 1808
0003 9569
0004 3309
...
3996 9843
3997 7497
3998 7892
3999 3062
4000 5687