I am making an application where I will be fetching tweets and storing them in a database. I will have a column for the complete text of the tweet and another where only the words of the tweet will remain (I need the words to calculate which words were most used later).
I currently do this with 6 different .replaceAll() calls, some of which may run more than once. For example, I have a for loop that removes every hashtag using .replaceAll().
The problem is that I will be processing thousands of tweets fetched every few minutes, and I suspect that this approach will not be very efficient.
My requirements, in this order (also written as comments in the code below), are:
- Delete all usernames mentioned
- Delete all RT (retweets flags)
- Delete all hashtags mentioned
- Replace all line breaks with spaces
- Replace all double spaces with single spaces
- Delete all special characters except spaces
Here is a short and compilable example:
public class StringTest {
    public static void main(String[] args) {
        String text = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
                + " at iHeart Awards\n"
                + "\n"
                + "RT!!\n"
                + "\n"
                + "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards"
                + " htt...";
        String[] hashtags = {"#FanArmy", "#LittleMonsters", "#iHeartAwards"};
        System.out.println("Before: " + text + "\n");

        // Delete all usernames mentioned (may run multiple times)
        text = text.replaceAll("@AshStewart09", "");
        System.out.println("First Phase: " + text + "\n");

        // Delete all RT (retweets flags)
        text = text.replaceAll("RT", "");
        System.out.println("Second Phase: " + text + "\n");

        // Delete all hashtags mentioned
        for (String hashtag : hashtags) {
            text = text.replaceAll(hashtag, "");
        }
        System.out.println("Third Phase: " + text + "\n");

        // Replace all line breaks with spaces
        text = text.replaceAll("\n", " ");
        System.out.println("Fourth Phase: " + text + "\n");

        // Replace all double spaces with single spaces
        text = text.replaceAll(" +", " ");
        System.out.println("Fifth Phase: " + text + "\n");

        // Delete all special characters except spaces
        text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();
        System.out.println("Finally: " + text);
    }
}
1 Answer
(subject to further changes)
In your simple example, how are the hashtags and usernames actually derived from the tweet?
My suggestion is to tokenize the tweet by whitespaces first, then look at the individual words to determine if they must be stored ("Vote") or discarded ("#LittleMonsters").
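A minimal sketch of that idea follows. It assumes that mentions and hashtags can be recognised purely by their leading "@" or "#", and that a token counts as a retweet flag when it is exactly "RT" once punctuation is stripped. The class TweetTokenizer and the method keepWords are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TweetTokenizer {

    // Hypothetical helper: returns only the plain words of a tweet,
    // separated by single spaces.
    static String keepWords(String tweet) {
        List<String> kept = new ArrayList<>();
        // Splitting on runs of whitespace also takes care of line breaks
        // and repeated spaces in one step.
        for (String token : tweet.trim().split("\\s+")) {
            // Assumption: anything starting with '@' or '#' is a mention or hashtag.
            if (token.startsWith("@") || token.startsWith("#")) {
                continue;
            }
            // Keep only ASCII letters and digits, as in the question.
            String word = token.replaceAll("[^a-zA-Z0-9]", "");
            // Drop empty leftovers and the standalone retweet flag.
            if (word.isEmpty() || word.equals("RT")) {
                continue;
            }
            kept.add(word);
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String text = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
                + " at iHeart Awards\n\nRT!!\n\n"
                + "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards htt...";
        System.out.println(keepWords(text));
    }
}
```

Because each token is inspected as a whole, a word such as "ART!" is kept as "ART" instead of being cut down, and the loop makes a single pass over the tweet rather than one pass per replacement.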
// Delete all RT (retweets flags)
text = text.replaceAll("RT", "");
You do realize that this will turn text like "ART!" into just "A!", right? Tokenizing first should remedy this issue.
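To make the substring problem concrete, the sketch below compares the plain replacement with a word-boundary pattern. The word-boundary regex is a different technique from tokenizing and is shown only as an illustration; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class RetweetFlagDemo {

    // Compiled once and reused; \b only matches at a word boundary,
    // so "RT" inside a longer word is left alone.
    private static final Pattern RT_FLAG = Pattern.compile("\\bRT\\b");

    // The approach from the question: removes every occurrence of "RT".
    static String stripNaive(String text) {
        return text.replaceAll("RT", "");
    }

    // Removes "RT" only when it stands as a whole word.
    static String stripWholeWord(String text) {
        return RT_FLAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "RT ART! is SMART";
        System.out.println("[" + stripNaive(text) + "]");     // prints "[ A! is SMA]"
        System.out.println("[" + stripWholeWord(text) + "]"); // prints "[ ART! is SMART]"
    }
}
```

Note that a whole-word match still removes a genuine word "RT" typed anywhere in the tweet, so it narrows the problem rather than eliminating it.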
On a related note, Apache Incubator Storm's tutorials usually use tweets as an example to demonstrate the Big Data approach. I'm not suggesting that you need such a set-up in your context, but perhaps you can give those a quick read-through to pick up some tips.
- I am using the twitter4j library to grab the tweets. Every tweet has the text field, and it also has a usernameMentioned array (where all the mentioned usernames are kept) and a hashtags array where all the hashtags in the tweet are stored. As for the RT, I could probably get away with using .replaceFirst(), I guess. – Aki K, Apr 17, 2014 at 9:18
- Thanks for the explanation. One can always type something before "RT ...", right? The issue will not go away with .replaceFirst() then... :) – h.j.k., Apr 17, 2014 at 10:05
- I have changed the code a bit (also adding link removal functionality and taking care of the RT issue), but even though the code definitely looks better, I am not sure it is efficient, since I am still using 5 .replaceAll() calls. – Aki K, Apr 17, 2014 at 10:29