6
\$\begingroup\$

I needed short, unique, seemingly random URLs for a project I'm working on, so I put together the following after some research. I believe it's working as expected, but I want to know if there's a better or simpler way of doing this. Basically, it takes an integer and returns a string of the specified length using the character list. I'm new to Python, so help is much appreciated.

def genkey(value, length=5, chars='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890', prime=694847539):
 digits = []
 base = len(chars)
 seed = base ** (length - 1)
 domain = (base ** length) - seed
 num = ((((value - 1) + seed) * prime) % domain) + seed
 while num:
 digits.append(chars[num % base])
 num /= base
 return ''.join(reversed(digits))
asked Feb 21, 2013 at 5:51
\$\endgroup\$
3
  • 1
    \$\begingroup\$ Without knowing your concrete requirements: What do you think about hashing your value and take the first x characters? Maybe you will need some additional checking for duplicates. \$\endgroup\$ Commented Feb 21, 2013 at 8:18
  • \$\begingroup\$ @mnhg I thought about that as well. This sort of became an exercise for its own sake at some point. How would you suggest handling duplicates? \$\endgroup\$ Commented Feb 21, 2013 at 16:33
  • \$\begingroup\$ As this seems not performance critical, I would simple check the existing in the indexed database column that is storing the short url mapping. (Might also be a hashmap.) If I find a conflict I would a attached the current time/spaces/whatever and try it again or just rehash the hash once again. \$\endgroup\$ Commented Feb 22, 2013 at 5:02

1 Answer 1

5
\$\begingroup\$

The trouble with using a pseudo-random number generator to produce your shortened URLs is, what do you do if there is a collision? That is, what happens if there are values v and w such that v != w but genkey(v) == genkey(w)? Would this be a problem for your application, or would it be entirely fine?

Anyway, if I didn't need to solve the collision problem, I would use Python's built-in pseudo-random number generator for this instead of writing my own. Also, I would add a docstring and some doctests.

import random
import string
def genkey(value, length = 5, chars = string.ascii_letters + string.digits):
 """
 Return a string of `length` characters chosen pseudo-randomly from
 `chars` using `value` as the seed.
 >>> ' '.join(genkey(i) for i in range(5))
 '0UAqF i0VpE 76dfZ oHwLM ogyje'
 """
 random.seed(value)
 return ''.join(random.choice(chars) for _ in xrange(length))

Update: you clarified in comments that you do need to avoid collisions, and moreover you know that value is a number between 1 and domain inclusive. In that case, you're right that your transformation is an injection, since prime is coprime to domain, so your method is fine, but there are several things that can be done to simplify it:

  1. As far as I can see, there's no need to subtract 1 from value.

  2. There's no need to use the value seed at all. You're using this to ensure that you have at least length digits in num, but it's easier just to generate exactly length digits. (This gives you a bigger domain in any case.)

  3. There's no need to call reversed: the reverse of a pseudo-random string is also a pseudo-random string.

Applying all those simplifications yields the following:

def genkey(value, length=5, chars=string.ascii_letters + string.digits, prime=694847539):
 """
 Return a string of `length` characters chosen pseudo-randomly from
 `chars` using `value` as the seed and `prime` as the multiplier.
 `value` must be a number between 1 and `len(chars) ** length`
 inclusive.
 >>> ' '.join(genkey(i) for i in range(1, 6))
 'xKFbV UkbdG hVGer Evcgc 15HhX'
 """
 base = len(chars)
 domain = base ** length
 assert(1 <= value <= domain)
 n = value * prime % domain
 digits = []
 for _ in xrange(length):
 n, c = divmod(n, base)
 digits.append(chars[c])
 return ''.join(digits)

A couple of things you might want to beware of:

  1. This pseudo-random scheme is not cryptographically strong: that is, it's fairly easy to go back from the URL to the value that produced it. This can be a problem in some applications.

  2. The random strings produced by this scheme may include real words or names in human languages. This could be unfortunate in some applications, if the resulting words were offensive or otherwise inappropriate.

answered Feb 21, 2013 at 11:25
\$\endgroup\$
3
  • \$\begingroup\$ With the system I have in place now, collisions would be a problem. I'm under the impression that the code I wrote would not produce collisions for values within the domain (901,356,496 for length=5 and base=62). Do you know of any ways to test this? \$\endgroup\$ Commented Feb 21, 2013 at 16:29
  • \$\begingroup\$ Thanks for the great in-depth answer. How would you go about getting the initial value from the URL? \$\endgroup\$ Commented Feb 21, 2013 at 20:44
  • \$\begingroup\$ Extended Euclidean algorithm. \$\endgroup\$ Commented Feb 21, 2013 at 20:49

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.