Unique short string for URL

Question 1

I needed short, unique, seemingly random URLs for a project I'm working on, so I put together the following after some research. I believe it's working as expected, but I want to know if there's a better or simpler way of doing this. Basically, it takes an integer and returns a string of the specified length using the character list. I'm new to Python, so help is much appreciated.

def genkey(value, length=5, chars='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890', prime=694847539):
 digits = []
 base = len(chars)
 seed = base ** (length - 1)
 domain = (base ** length) - seed
 num = ((((value - 1) + seed) * prime) % domain) + seed
 while num:
 digits.append(chars[num % base])
 num /= base
 return ''.join(reversed(digits))

Question 2

Without knowing your concrete requirements: What do you think about hashing your value and take the first x characters? Maybe you will need some additional checking for duplicates.

Question 3

@mnhg I thought about that as well. This sort of became an exercise for its own sake at some point. How would you suggest handling duplicates?

Question 4

As this seems not performance critical, I would simple check the existing in the indexed database column that is storing the short url mapping. (Might also be a hashmap.) If I find a conflict I would a attached the current time/spaces/whatever and try it again or just rehash the hash once again.

Question 5

The trouble with using a pseudo-random number generator to produce your shortened URLs is, what do you do if there is a collision? That is, what happens if there are values v and w such that v != w but genkey(v) == genkey(w)? Would this be a problem for your application, or would it be entirely fine?

Anyway, if I didn't need to solve the collision problem, I would use Python's built-in pseudo-random number generator for this instead of writing my own. Also, I would add a docstring and some doctests.

import random
import string
def genkey(value, length = 5, chars = string.ascii_letters + string.digits):
 """
 Return a string of `length` characters chosen pseudo-randomly from
 `chars` using `value` as the seed.
 >>> ' '.join(genkey(i) for i in range(5))
 '0UAqF i0VpE 76dfZ oHwLM ogyje'
 """
 random.seed(value)
 return ''.join(random.choice(chars) for _ in xrange(length))

Update: you clarified in comments that you do need to avoid collisions, and moreover you know that value is a number between 1 and domain inclusive. In that case, you're right that your transformation is an injection, since prime is coprime to domain, so your method is fine, but there are several things that can be done to simplify it:

As far as I can see, there's no need to subtract 1 from value.
There's no need to use the value seed at all. You're using this to ensure that you have at least length digits in num, but it's easier just to generate exactly length digits. (This gives you a bigger domain in any case.)
There's no need to call reversed: the reverse of a pseudo-random string is also a pseudo-random string.

Applying all those simplifications yields the following:

def genkey(value, length=5, chars=string.ascii_letters + string.digits, prime=694847539):
 """
 Return a string of `length` characters chosen pseudo-randomly from
 `chars` using `value` as the seed and `prime` as the multiplier.
 `value` must be a number between 1 and `len(chars) ** length`
 inclusive.
 >>> ' '.join(genkey(i) for i in range(1, 6))
 'xKFbV UkbdG hVGer Evcgc 15HhX'
 """
 base = len(chars)
 domain = base ** length
 assert(1 <= value <= domain)
 n = value * prime % domain
 digits = []
 for _ in xrange(length):
 n, c = divmod(n, base)
 digits.append(chars[c])
 return ''.join(digits)

A couple of things you might want to beware of:

This pseudo-random scheme is not cryptographically strong: that is, it's fairly easy to go back from the URL to the value that produced it. This can be a problem in some applications.
The random strings produced by this scheme may include real words or names in human languages. This could be unfortunate in some applications, if the resulting words were offensive or otherwise inappropriate.

Question 6

With the system I have in place now, collisions would be a problem. I'm under the impression that the code I wrote would not produce collisions for values within the domain (901,356,496 for length=5 and base=62). Do you know of any ways to test this?

Question 7

Thanks for the great in-depth answer. How would you go about getting the initial value from the URL?

Question 8

Extended Euclidean algorithm.

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Accepted Answer · 2013-02-21 11:25:57Z

The trouble with using a pseudo-random number generator to produce your shortened URLs is, what do you do if there is a collision? That is, what happens if there are values v and w such that v != w but genkey(v) == genkey(w)? Would this be a problem for your application, or would it be entirely fine?

Anyway, if I didn't need to solve the collision problem, I would use Python's built-in pseudo-random number generator for this instead of writing my own. Also, I would add a docstring and some doctests.

import random
import string
def genkey(value, length = 5, chars = string.ascii_letters + string.digits):
 """
 Return a string of `length` characters chosen pseudo-randomly from
 `chars` using `value` as the seed.
 >>> ' '.join(genkey(i) for i in range(5))
 '0UAqF i0VpE 76dfZ oHwLM ogyje'
 """
 random.seed(value)
 return ''.join(random.choice(chars) for _ in xrange(length))

Update: you clarified in comments that you do need to avoid collisions, and moreover you know that value is a number between 1 and domain inclusive. In that case, you're right that your transformation is an injection, since prime is coprime to domain, so your method is fine, but there are several things that can be done to simplify it:

As far as I can see, there's no need to subtract 1 from value.
There's no need to use the value seed at all. You're using this to ensure that you have at least length digits in num, but it's easier just to generate exactly length digits. (This gives you a bigger domain in any case.)
There's no need to call reversed: the reverse of a pseudo-random string is also a pseudo-random string.

Applying all those simplifications yields the following:

def genkey(value, length=5, chars=string.ascii_letters + string.digits, prime=694847539):
 """
 Return a string of `length` characters chosen pseudo-randomly from
 `chars` using `value` as the seed and `prime` as the multiplier.
 `value` must be a number between 1 and `len(chars) ** length`
 inclusive.
 >>> ' '.join(genkey(i) for i in range(1, 6))
 'xKFbV UkbdG hVGer Evcgc 15HhX'
 """
 base = len(chars)
 domain = base ** length
 assert(1 <= value <= domain)
 n = value * prime % domain
 digits = []
 for _ in xrange(length):
 n, c = divmod(n, base)
 digits.append(chars[c])
 return ''.join(digits)

A couple of things you might want to beware of:

This pseudo-random scheme is not cryptographically strong: that is, it's fairly easy to go back from the URL to the value that produced it. This can be a problem in some applications.
The random strings produced by this scheme may include real words or names in human languages. This could be unfortunate in some applications, if the resulting words were offensive or otherwise inappropriate.

With the system I have in place now, collisions would be a problem. I'm under the impression that the code I wrote would not produce collisions for values within the domain (901,356,496 for length=5 and base=62). Do you know of any ways to test this?
Thanks for the great in-depth answer. How would you go about getting the initial value from the URL?

Stack Exchange Network

Unique short string for URL

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Unique short string for URL

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions