7

I'm trying to design a database for a program that is storing and manipulating tweets off of Twitter.

I'm using code to fragment a tweet into words, usernames, and hashtags, so

I'm meeting @President over coffee to talk about my new job #cabinet #woot

would break down into {meeting, over, coffee, talk, about, new, job} (getting rid of words under 2 letters and after stripping punctuation), {president}, and {cabinet,woot}.

With my lack of experience in this area, I'd like to just be able to grow the number of rows in a column as necessary, but my intuition tells me that this would be a nightmare. Another possibility I've considered is just picking an arbitrary number like 10 columns to store words, 5 for usernames, etc., but this would be roughing it a bit, as I don't know for sure how many there will be, and I'd have to throw out what may be valuable information.

As it stands now, I have 3 varchars of sufficient length which hold "meeting over coffee talk about new job", "president", and "cabinet woot". So, whenever I need to grab the information for a particular tweet, I retrieve the row and break up the strings.

Is this the "correct" route? Am I missing an obvious way that will make my life easier? (I had found something online about using array, but I'm using MySQL and it doesn't seem to support that datatype.

asked Jan 8, 2012 at 7:55

1 Answer 1

10

Edited after a couple of comments by the question asker!:

Better idea for your design: new sample schema

(If you're not familar with these diagrams, that's seven tables, with foreign key links as pictured [the three _Link tables all have two foreign keys referencing the data tables]. It probably looks like I'm overcomplicating things, but trust me, "tokenise once and store in the database" is so much more efficient than "tokenise data each time you retrieve it".)

Sample data, using a simple tweet (ignoring Users for now, it's the same concept though): "test tweet! #howdoesthiswork #newbie"

 Tweet
TweetID WordCount
 1 2
Tweet_Word_Link
TweetID WordID
 1 1
 1 2
 Word
WordID Word
 1 test
 2 tweet
Tweet_Hashtag_Link
TweetID HashtagID
 1 1
 1 2
 Hashtag
HashtagID Hashtag
 1 #howdoesthiswork
 2 #newbie

This allows you to easily get all the users linked in a tweet:

SELECT U.UserName
FROM User U
 INNER JOIN
 Tweet_User_Link TUL ON U.UserID = TUL.UserID
 INNER JOIN
 Tweet T ON TUL.TweetID = T.TweetID
WHERE T.TweetID = <your ID here>
 

(change table names as appropriate for Hashtags and Words)

Or, slightly more complicated, get the top 10 words for all tweets mentioning a specific hashtag:

SELECT TOP 10 W.Word, COUNT(*) AS WordCount
FROM Word W
 INNER JOIN
 Tweet_Word_Link TWL ON TWL.WordID = W.WordID
 INNER JOIN
 Tweet T ON TWL.TweetID = T.TweetID
 INNER JOIN
 Tweet_Hashtag_Link THL ON THL.TweetID = T.TweetID
 INNER JOIN
 Hashtag H ON H.HashtagID = THL.HashtagID
WHERE H.Hashtag = <your hashtag here>
GROUP By W.Word

(I've used ANSI SQL syntax, I'm not 100% sure if MySQL supports all of ANSI SQL - but it should be close enough.)

answered Jan 8, 2012 at 8:48
0

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.