MySQL: Storing unique URLs

Question 1

I am creating a table which will contain user-provided URLs. I want those to be unique, so when the user gives me a URL I will first check whether the URL exists and, if so, return the ID for the entry. If not, I will create a new row with this URL.

Obviously I want this to be fast. What is the best option?

Make the actual URL a varchar that is UNIQUE and look by this url?
Make a hash of the URL and use it as a primary key of sort?
Other ideas?

Question 2

Hope you don't mind. I removed the PS section. We'll let you know if it's not a good fit by closing and/or downvoting!

Question 3

I would definitely go with a hash of the url and make the hash a unique index. A hash has a fixed length, so you can use CHAR to specify the length of the column, which grants a slight performance boost over VARCHAR or TEXT.

But might I suggest using INSERT IGNORE instead of making two calls to the database? Something like:

INSERT IGNORE INTO urlTable VALUES ('urlHash');

This has the benefit of ignoring any duplicate errors that might arise from attempting to insert a duplicate hash, without first having to do a SELECT COUNT(*) query.
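A minimal sketch of that setup, under some assumptions: the columns `id`, `urlHash` and `url`, the index name, and the choice of MD5 (32 hex characters) are illustrative and not from the answer; only the table name `urlTable` is.

```sql
-- Hypothetical schema: column and index names are illustrative.
CREATE TABLE urlTable (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  urlHash CHAR(32)     NOT NULL,   -- a hex MD5 digest is always 32 characters
  url     TEXT         NOT NULL,   -- original URL, kept alongside its hash
  PRIMARY KEY (id),
  UNIQUE KEY urlHash_UNIQUE (urlHash)
) ENGINE=InnoDB;

-- First insert creates the row.
INSERT IGNORE INTO urlTable (urlHash, url)
VALUES (MD5('http://example.com/'), 'http://example.com/');

-- Second insert collides on the unique key; with IGNORE the row is
-- skipped and no error is raised.
INSERT IGNORE INTO urlTable (urlHash, url)
VALUES (MD5('http://example.com/'), 'http://example.com/');

-- Still a single row for this URL.
SELECT COUNT(*) FROM urlTable;
```

Naming the columns in the INSERT keeps the statement valid if more columns are added to the table later.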

Question 4

Your approach is more concise. +1 !!!

Question 5

I need the ID of the row, can I get it when doing insert ignore?

Question 6

Actually, do I still need a separate primary ID? Or should the hash be my primary key? Should I hash in MySQL, or in PHP?
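One possible shape for the "hash as primary key" variant, as a sketch only. MySQL's MD5() and PHP's md5() return the same 32-character lowercase hex digest for the same input bytes, so hashing on either side should work as long as it is done consistently. The table `url_by_hash` and its columns are made up for illustration; it stores the raw 16-byte digest and has no separate auto-increment ID.

```sql
-- Hypothetical table: the hash itself is the primary key.
CREATE TABLE url_by_hash (
  urlHash BINARY(16) NOT NULL,   -- raw MD5 digest, half the size of the hex form
  url     TEXT       NOT NULL,
  PRIMARY KEY (urlHash)
) ENGINE=InnoDB;

-- Hash computed in MySQL; UNHEX() turns the hex digest into 16 raw bytes.
INSERT IGNORE INTO url_by_hash (urlHash, url)
VALUES (UNHEX(MD5('http://example.com/')), 'http://example.com/');

-- Look the row up by recomputing the same hash.
SELECT HEX(urlHash) AS urlHashHex, url
FROM url_by_hash
WHERE urlHash = UNHEX(MD5('http://example.com/'));
```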

Question 7

I just tested that SELECT LAST_INSERT_ID() will not return the ID of the row whose insert was ignored (in the case of a duplicate). So you will either need to do a SELECT id FROM url WHERE urlHash='X', or drop the ambiguous primary key. It depends on your use case. If this table actually has columns other than the URL that you're indexing on, I'd recommend the first option and keeping that auto-incrementing ID.
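The first option can be sketched as follows, assuming a table shaped like the one the query above implies (`url` with `id` and `urlHash`; the exact definition is a guess). The second statement pair is not from this thread: it uses the `ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id)` idiom described in the MySQL manual as an alternative that returns the ID in both cases.

```sql
-- Hypothetical definition matching the names used in the comment above.
CREATE TABLE url (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  urlHash CHAR(32)     NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY urlHash_UNIQUE (urlHash)
) ENGINE=InnoDB;

-- Option 1: insert if missing, then always look the row up by its hash.
INSERT IGNORE INTO url (urlHash) VALUES (MD5('http://example.com/'));
SELECT id FROM url WHERE urlHash = MD5('http://example.com/');

-- Alternative: on a duplicate, pass the existing id through LAST_INSERT_ID()
-- so that the following SELECT returns it whether or not a row was inserted.
INSERT INTO url (urlHash) VALUES (MD5('http://example.com/'))
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SELECT LAST_INSERT_ID();
```

Option 1 costs a second statement but leaves existing rows untouched; the alternative is a single round trip for the write but performs an update on every duplicate.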

Question 8

Is there a way to make mysql fail silently if you try to write a duplicate or do you always have to query the table to check for the hash? Otherwise, I'm getting SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '211c2f38f92d7ad4380031dc533d376a' for key 'guid_hash'
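For what it's worth, the INSERT IGNORE suggested in the answer is the "fail silently" form. A small sketch, assuming a made-up table `guid_demo`; only the key name `guid_hash` and the hash value come from the error message above.

```sql
-- Hypothetical table; the unique key is named guid_hash as in the error.
CREATE TABLE guid_demo (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
  guid_hash CHAR(32)     NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY guid_hash (guid_hash)
) ENGINE=InnoDB;

INSERT INTO guid_demo (guid_hash) VALUES ('211c2f38f92d7ad4380031dc533d376a');

-- A plain INSERT of the same value fails with error 1062 (SQLSTATE 23000):
-- INSERT INTO guid_demo (guid_hash) VALUES ('211c2f38f92d7ad4380031dc533d376a');

-- With IGNORE the duplicate row is skipped and no error is raised.
INSERT IGNORE INTO guid_demo (guid_hash) VALUES ('211c2f38f92d7ad4380031dc533d376a');

-- ROW_COUNT() reports the rows affected by the previous statement:
-- 0 when the insert was ignored, 1 when a new row was added.
SELECT ROW_COUNT();
```

Checking the affected-row count is one way to tell "already there" from "newly inserted" without a separate existence query.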

Question 9

Unless I'm missing something, you should just create a UNIQUE index of the type HASH. I don't see what adding your own hash and triggers would add? And have the field itself NOT NULL.

CREATE TABLE `test`.`bla` (
 `id` INT NOT NULL AUTO_INCREMENT,
 `text` VARCHAR(45) NOT NULL,
 PRIMARY KEY (`id`),
 UNIQUE INDEX `text_UNIQUE` USING HASH (`text`)
);

Question 10

Interesting idea... though according to the docs, HASH isn't available for InnoDB: dev.mysql.com/doc/refman/5.5/en/create-index.html. Oddly, it doesn't throw a warning when the index is created like that, but the docs indicate that InnoDB will silently use BTREE even though the definition says HASH.
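One way to check what was actually built, as a sketch: create the same table explicitly on InnoDB and inspect the index metadata. The table name `bla_innodb` is made up to avoid clashing with the table from the answer, and the `test` schema is assumed to exist.

```sql
-- Same definition as in the answer, with the engine spelled out.
CREATE TABLE `test`.`bla_innodb` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `text` VARCHAR(45) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE INDEX `text_UNIQUE` USING HASH (`text`)
) ENGINE=InnoDB;

-- The Index_type column shows the index type the engine really uses;
-- per the docs linked above, BTREE would be expected here despite USING HASH.
SHOW INDEX FROM `test`.`bla_innodb`;
```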

Question 11

Ah good point! Sounds like something that should be mentioned in whatever answer is approved or maybe in the question itself. In any case, this question sheds some more light on this: dba.stackexchange.com/questions/2817/…

Derek Downey · Accepted Answer · 2012-01-19 17:07:21Z


Your approach is more concise. +1 !!! – RolandoMySQLDBA, Jan 19, 2012 at 17:09

I need the ID of the row, can I get it when doing insert ignore? – nute, Jan 22, 2012 at 9:20

Actually, do I still need a separate primary ID? Or should the hash be my primary key? Should I hash in MySQL, or in PHP? – nute, Jan 22, 2012 at 9:24

I just tested that SELECT LAST_INSERT_ID() will not return the ID on the value that was ignored (in the case of a duplicate). So you will either need to do a SELECT id FROM url WHERE urlHash='X', or drop the ambiguous primary key. It depends on your use-case. If this table actually has other columns other than the URL that you're indexing on, I'd recommend the first option and keep that auto-incrementing ID. – Derek Downey, Jan 23, 2012 at 15:47

Is there a way to make mysql fail silently if you try to write a duplicate or do you always have to query the table to check for the hash? Otherwise, I'm getting SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '211c2f38f92d7ad4380031dc533d376a' for key 'guid_hash' – codecowboy, Mar 8, 2012 at 10:45
