More efficient way to turn a Twitter status into a string of just words

Question 1

I am building an application that fetches tweets and stores them in a database. One column will hold the complete text of the tweet and another will hold only its words (I need the words later to calculate which ones were used most).

I currently do this with 6 different .replaceAll() calls, some of which may run more than once. For example, I have a for loop that removes every hashtag using replaceAll().

The problem is that I will be processing thousands of tweets every few minutes, and I suspect my current approach is not efficient enough.

My requirements, in this order (also written as comments in the code below):

Delete all mentioned usernames
Delete all RT (retweet) flags
Delete all mentioned hashtags
Replace all line breaks with spaces
Replace all double spaces with single spaces
Delete all special characters except spaces

Here is a short and compilable example:

public class StringTest {
    public static void main(String[] args) {
        String text = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
                + " at iHeart Awards\n"
                + "\n"
                + "RT!!\n"
                + "\n"
                + "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards"
                + " htt...";
        String[] hashtags = {"#FanArmy", "#LittleMonsters", "#iHeartAwards"};
        System.out.println("Before: " + text + "\n");
        // Delete all mentioned usernames (may run multiple times)
        text = text.replaceAll("@AshStewart09", "");
        System.out.println("First Phase: " + text + "\n");
        // Delete all RT (retweet) flags
        text = text.replaceAll("RT", "");
        System.out.println("Second Phase: " + text + "\n");
        // Delete all mentioned hashtags
        for (String hashtag : hashtags) {
            text = text.replaceAll(hashtag, "");
        }
        System.out.println("Third Phase: " + text + "\n");
        // Replace all line breaks with spaces
        text = text.replaceAll("\n", " ");
        System.out.println("Fourth Phase: " + text + "\n");
        // Replace all double spaces with single spaces
        text = text.replaceAll(" +", " ");
        System.out.println("Fifth Phase: " + text + "\n");
        // Delete all special characters except spaces
        text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();
        System.out.println("Finally: " + text);
    }
}

h.j.k. · Accepted Answer · 2014-04-17 02:16:55Z

(subject to further changes)

In your simple example, how are the hashtags and usernames actually derived from the tweet?

My suggestion is to tokenize the tweet on whitespace first, then look at the individual tokens to decide whether each one should be stored ("Vote") or discarded ("#LittleMonsters").
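One way the tokenizing idea could look is sketched below. It is only an illustration, not the asker's final code: the class name `TweetTokenizer` and the method `extractWords` are hypothetical, and it assumes that mentions start with `@`, hashtags start with `#`, and a token that reduces to exactly `RT` is a retweet flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not part of the original post.
final class TweetTokenizer {

    // Compiled once so the regexes are not recompiled for every tweet.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]");

    // Returns the plain words of a tweet, joined by single spaces.
    static String extractWords(String tweet) {
        List<String> words = new ArrayList<>();
        // Splitting on runs of whitespace handles line breaks and double spaces together.
        for (String token : WHITESPACE.split(tweet.trim())) {
            // Assumption: mentions start with '@' and hashtags start with '#'.
            if (token.startsWith("@") || token.startsWith("#")) {
                continue;
            }
            // Strip everything except ASCII letters and digits.
            String word = NON_ALNUM.matcher(token).replaceAll("");
            // Assumption: a token that is exactly "RT" after stripping is a retweet flag.
            if (word.isEmpty() || word.equals("RT")) {
                continue;
            }
            words.add(word);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Prints: Vote for Lady Gaga ART
        System.out.println(extractWords("RT @AshStewart09: Vote for Lady Gaga\n\nRT!! #FanArmy ART!"));
    }
}
```

Because each token is judged as a whole, a word such as "ART!" survives as "ART" while the standalone "RT" and "RT!!" tokens are dropped.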

 // Delete all RT (retweets flags)
 text = text.replaceAll("RT", "");

You do realize that this will turn text like "ART!" into just "A!", right? Tokenizing first should remedy this issue.
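The difference can be seen in a small, self-contained comparison. This is a sketch with hypothetical names (`RetweetFlagDemo` and its methods); the word-boundary regex `\bRT\b` is an alternative added here for illustration and is not something the original post proposes.

```java
// Hypothetical demo class; the names are illustrative and not part of the original post.
final class RetweetFlagDemo {

    // Approach from the question: removes "RT" anywhere, even inside other words.
    static String removeAnywhere(String text) {
        return text.replaceAll("RT", "");
    }

    // Regex alternative: \b only matches at word boundaries, so "ART" is left alone.
    static String removeWholeWord(String text) {
        return text.replaceAll("\\bRT\\b", "");
    }

    // Token-based alternative: drops only tokens that are exactly "RT".
    static String removeTokens(String text) {
        StringBuilder kept = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.equals("RT")) {
                continue;
            }
            if (kept.length() > 0) {
                kept.append(' ');
            }
            kept.append(token);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String text = "RT ART! is SMART";
        System.out.println("[" + removeAnywhere(text) + "]");  // [ A! is SMA]
        System.out.println("[" + removeWholeWord(text) + "]"); // [ ART! is SMART]
        System.out.println("[" + removeTokens(text) + "]");    // [ART! is SMART]
    }
}
```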

On a related note, Apache Incubator Storm's tutorials usually use tweets as an example to demonstrate the Big Data approach. I'm not suggesting that you need such a set-up in your context, but perhaps you can give those a quick read-through to pick up some tips.

I am using the twitter4j library to grab the tweets. Every tweet has a text field, a usernameMentioned array (which holds all the mentioned usernames), and a hashtags array (which holds all the hashtags in the tweet). As for the RT, I could probably get away with using .replaceFirst(), I guess.

Thanks for the explanation. One can always type something before "RT ...", right? The issue will not go away with .replaceFirst() then... :)

I have changed the code a bit (also adding link removal and taking care of the RT issue). The code definitely looks better, but I am not sure it is efficient, since I am still using 5 .replaceAll() calls.
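On the efficiency concern: each call to String.replaceAll() compiles its regular expression again, so one option is to compile the patterns once and reuse them for every tweet. The sketch below is an assumption-laden illustration, not the revised code mentioned above (which is not shown here). The class name `TweetCleaner` is hypothetical, links are assumed to start with http:// or https://, and mentions and hashtags are matched by pattern rather than taken from the twitter4j entity arrays.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not part of the original post.
final class TweetCleaner {

    // Compiled once and reused for every tweet.
    // One alternation covers links, mentions, hashtags, standalone "RT",
    // and any remaining special characters.
    private static final Pattern NOISE = Pattern.compile(
            "https?://\\S+"           // links (assumed to start with http:// or https://)
            + "|[@#]\\w+"             // @mentions and #hashtags
            + "|\\bRT\\b"             // RT as a whole word only
            + "|[^a-zA-Z0-9\\s]+");   // leftover special characters

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes the noise first, then collapses line breaks and repeated spaces.
    static String clean(String tweet) {
        String withoutNoise = NOISE.matcher(tweet).replaceAll("");
        return WHITESPACE.matcher(withoutNoise).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("RT @AshStewart09: Vote for #FanArmy\n\nRT!! ART! http://t.co/abc"));
    }
}
```

This reduces the work to two passes per tweet. Whether that is measurably faster than the chained .replaceAll() calls would need to be checked with a benchmark on real data.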
