I needed short, unique, seemingly random URLs for a project I'm working on, so I put together the following after some research. I believe it's working as expected, but I want to know if there's a better or simpler way of doing this. Basically, it takes an integer and returns a string of the specified length using the character list. I'm new to Python, so help is much appreciated.
def genkey(value, length=5, chars='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890', prime=694847539):
digits = []
base = len(chars)
seed = base ** (length - 1)
domain = (base ** length) - seed
num = ((((value - 1) + seed) * prime) % domain) + seed
while num:
digits.append(chars[num % base])
num /= base
return ''.join(reversed(digits))
-
1\$\begingroup\$ Without knowing your concrete requirements: What do you think about hashing your value and take the first x characters? Maybe you will need some additional checking for duplicates. \$\endgroup\$mheinzerling– mheinzerling2013年02月21日 08:18:10 +00:00Commented Feb 21, 2013 at 8:18
-
\$\begingroup\$ @mnhg I thought about that as well. This sort of became an exercise for its own sake at some point. How would you suggest handling duplicates? \$\endgroup\$davishmcclurg– davishmcclurg2013年02月21日 16:33:51 +00:00Commented Feb 21, 2013 at 16:33
-
\$\begingroup\$ As this seems not performance critical, I would simple check the existing in the indexed database column that is storing the short url mapping. (Might also be a hashmap.) If I find a conflict I would a attached the current time/spaces/whatever and try it again or just rehash the hash once again. \$\endgroup\$mheinzerling– mheinzerling2013年02月22日 05:02:15 +00:00Commented Feb 22, 2013 at 5:02
1 Answer 1
The trouble with using a pseudo-random number generator to produce your shortened URLs is, what do you do if there is a collision? That is, what happens if there are values v
and w
such that v != w
but genkey(v) == genkey(w)
? Would this be a problem for your application, or would it be entirely fine?
Anyway, if I didn't need to solve the collision problem, I would use Python's built-in pseudo-random number generator for this instead of writing my own. Also, I would add a docstring and some doctests.
import random
import string
def genkey(value, length = 5, chars = string.ascii_letters + string.digits):
"""
Return a string of `length` characters chosen pseudo-randomly from
`chars` using `value` as the seed.
>>> ' '.join(genkey(i) for i in range(5))
'0UAqF i0VpE 76dfZ oHwLM ogyje'
"""
random.seed(value)
return ''.join(random.choice(chars) for _ in xrange(length))
Update: you clarified in comments that you do need to avoid collisions, and moreover you know that value
is a number between 1 and domain
inclusive. In that case, you're right that your transformation is an injection, since prime
is coprime to domain
, so your method is fine, but there are several things that can be done to simplify it:
As far as I can see, there's no need to subtract 1 from
value
.There's no need to use the value
seed
at all. You're using this to ensure that you have at leastlength
digits innum
, but it's easier just to generate exactlylength
digits. (This gives you a bigger domain in any case.)There's no need to call
reversed
: the reverse of a pseudo-random string is also a pseudo-random string.
Applying all those simplifications yields the following:
def genkey(value, length=5, chars=string.ascii_letters + string.digits, prime=694847539):
"""
Return a string of `length` characters chosen pseudo-randomly from
`chars` using `value` as the seed and `prime` as the multiplier.
`value` must be a number between 1 and `len(chars) ** length`
inclusive.
>>> ' '.join(genkey(i) for i in range(1, 6))
'xKFbV UkbdG hVGer Evcgc 15HhX'
"""
base = len(chars)
domain = base ** length
assert(1 <= value <= domain)
n = value * prime % domain
digits = []
for _ in xrange(length):
n, c = divmod(n, base)
digits.append(chars[c])
return ''.join(digits)
A couple of things you might want to beware of:
This pseudo-random scheme is not cryptographically strong: that is, it's fairly easy to go back from the URL to the value that produced it. This can be a problem in some applications.
The random strings produced by this scheme may include real words or names in human languages. This could be unfortunate in some applications, if the resulting words were offensive or otherwise inappropriate.
-
\$\begingroup\$ With the system I have in place now, collisions would be a problem. I'm under the impression that the code I wrote would not produce collisions for values within the
domain
(901,356,496 for length=5 and base=62). Do you know of any ways to test this? \$\endgroup\$davishmcclurg– davishmcclurg2013年02月21日 16:29:09 +00:00Commented Feb 21, 2013 at 16:29 -
\$\begingroup\$ Thanks for the great in-depth answer. How would you go about getting the initial value from the URL? \$\endgroup\$davishmcclurg– davishmcclurg2013年02月21日 20:44:47 +00:00Commented Feb 21, 2013 at 20:44
-
\$\begingroup\$ Extended Euclidean algorithm. \$\endgroup\$Gareth Rees– Gareth Rees2013年02月21日 20:49:18 +00:00Commented Feb 21, 2013 at 20:49