How can I optimize this code?

Question 1

My current project has us using TreeSet and TreeMap in Java, with an input array of 10514 Song elements read in from a text file. Each Song contains a Artist, Title and Lyric fields. The aim of this project is to conduct fast searches on the lyrics using sets and maps.

First, I iterate over the input Song array, accessing the lyrics field and creating a Scanner object to iterate over the lyric words using this code: commonWords is a TreeSet of words that should not be keys, and lyricWords is the overall map of words to Songs.

public void buildSongMap() {
 for (Song song:songs) {
 //method variables
 String currentLyrics= song.getLyrics().toLowerCase(); 
 TreeSet<Song> addToSet=null;
 Scanner readIn= new Scanner(currentLyrics);
 String word= readIn.next();
 while (readIn.hasNext()) {
 if (!commonWords.contains(word) && !word.equals("") && word.length()>1) {
 if (lyricWords.containsKey(word)) {
 addToSet= lyricWords.get(word);
 addToSet.add(song);
 word=readIn.next();
 } else 
 buildSongSet(word);
 } else 
 word= readIn.next();
 }
 }

In order to build the songSet, I use this code:

public void buildSongSet(String word) { 
 TreeSet<Song> songSet= new TreeSet<Song>();
 for (Song song:songs) {
 //adds song to set 
 if (song.getLyrics().contains(word)) {
 songSet.add(song);
 }
 }
 lyricWords.put(word, songSet);
 System.out.println("Word added "+word);
}

Now, since buildSongSet is called from inside a loop, creating the map executes in N^2 time. When the input array is 4 songs, searches run very fast, but when using the full array of 10514 elements, it can take over 15+ min to build the map on a 2.4GHz machine with 6 GiB RAM. What can I do to make this code more efficient? Unfortunately, reducing the input data is not an option.

Question 2

Could you include the definitions of commonWords and lyricWords

Question 3

added definitions. commonWords is a set of words that should not be keys and lyricWords is the overall structure that searches are run on.

Question 4

It looks like your buildSongSet is doing redundant work. Your block:

if (lyricWords.containsKey(word)) {
 addToSet= lyricWords.get(word);
 addToSet.add(song);
 word=readIn.next();
}

adds a song to an existing set. So, when you find a word you don't know about, just add one song to it. Change buildSongSet to:

public void buildSongSet(String word, Song firstSongWithWord) { 
 TreeSet<Song> songSet= new TreeSet<Song>();
 songSet.add(firstSongWithWord);
 lyricWords.put(word, songSet);
 System.out.println("Word added "+word);
}

the remaining songs left to be iterated will then be added to that songset from the first block of code if they contain that word. I think that should work.

EDIT just saw this was homework... so removed the HashSet recommendations..

Ok.. so suppose you have these Songs in order with lyrics:

Song 1 - foo
Song 2 - foo bar
Song 3 - foo bar baz

Song 1 will see that foo does not contain lyricWords, so it will call buildSongSet and create a set for foo. It will add itself into the set containing foo.

Song 2 will see that foo is in lyricWords, and add itself to the set. It will see bar is not in the set, and create a set and add itself. It doesn't need to traverse previous songs since the first time the word was seen was in Song 2.

Song 3 follows the same logic.

Another thing you can try doing to optimize your code is to figure out a way to not process duplicate words in the lyrics. if your lyrics are foo foo foo foo bar bar bar foo bar then you're going to be doing a lot of unnecessary checks.

EDIT also see rsp's answer - additional speedups there, but the big speedup is getting rid of the inner loop - glad it's down to 15 secs now.

Question 5

My thinking was when a new word is encountered, iterate over the song list and add the matches to the set. If I follow your code, only the songs after that point will be added to the set, but there is the possiblity that the previous songs will also be matches, thus the nested loop.

Question 6

If the previous songs contained that word, wouldn't it have triggered buildSongSet anyway? If you follow the logic, previous songs won't contain any words that you haven't already called buildSongSet for. I'll try to explain in the answer..

Question 7

Alright, makes sense. I altered the code, it executes in ~15 seconds now. Eliminating that second loop did the trick. Unfortunately, using HashSet is not an option, project requires using TreeSet

Question 8

The whole buildSongSet() method is not needed imho, as your main loop already adds songs to the collection by word. The only thing you are missing is the addition of a set for a new word, something like:

if (lyricWords.containsKey(word)) {
 addToSet= lyricWords.get(word);
} else {
 addToSet = new TreeSet();
 lyricWords.put(word, addToSet);
}
addToSet.add(song);

One issue that you did not tackle is that songs end up being added to the set multiple times, for every occurence of the word in the song.

Another issue is that in the case that a song contains just 1 word, you do not add it at all! It is always better to check the condition first:

String word = null;
while (readIn.hasNext()) {
 word = readIn.next();

Your condition is doing one check too many (the empty string has length < 1), and swapping the checks can speed up things too:

if (word.length() > 1 && !commonWords.contains(word)) {

Question 9

"One issue that you did not tackle is that songs end up being added to the set multiple times, for every occurence of the word in the song." This is a Set, not a List. If you try to add a duplicate element, it just returns false.

Question 10

thats why I didn't use a conditional to check. both contains() and add() are O(logN). If the add were O(N), then I would use the check.

Question 11

I marked the other answer the accepted answer because it helped me the most, but if possible I would have accepted yours as well. Thanks for showing me the reordering of the if statement

Question 12

Please, try change TreeSet to HashSet. I can't see where you obtain the benefits of TreeSet.

Question 13

except this is a requirement of the project. From the objectives: Using Java TreeSet and TreeMap and set operations

Question 14

if you want a very extensible, easy way of solving this with performance in the order of a few millisecons. Consider lucene http://lucene.apache.org/

refer to my answer here for example of how to index and search How do I index and search text files in Lucene 3.0.2?

Question 15

For a real-world situation, I might consider your solution, but this is a data structures class project designed to give us hands-on experience with all aspects of structures. Using Lucene would short-circuit this purpose.

NG. NG. 23k6 gold badges57 silver badges62 bronze badges · Accepted Answer · 2010-11-03 16:27:57Z

It looks like your buildSongSet is doing redundant work. Your block:

if (lyricWords.containsKey(word)) {
 addToSet= lyricWords.get(word);
 addToSet.add(song);
 word=readIn.next();
}

adds a song to an existing set. So, when you find a word you don't know about, just add one song to it. Change buildSongSet to:

public void buildSongSet(String word, Song firstSongWithWord) { 
 TreeSet<Song> songSet= new TreeSet<Song>();
 songSet.add(firstSongWithWord);
 lyricWords.put(word, songSet);
 System.out.println("Word added "+word);
}

the remaining songs left to be iterated will then be added to that songset from the first block of code if they contain that word. I think that should work.

EDIT just saw this was homework... so removed the HashSet recommendations..

Ok.. so suppose you have these Songs in order with lyrics:

Song 1 - foo
Song 2 - foo bar
Song 3 - foo bar baz

Song 1 will see that foo does not contain lyricWords, so it will call buildSongSet and create a set for foo. It will add itself into the set containing foo.

Song 2 will see that foo is in lyricWords, and add itself to the set. It will see bar is not in the set, and create a set and add itself. It doesn't need to traverse previous songs since the first time the word was seen was in Song 2.

Song 3 follows the same logic.

Another thing you can try doing to optimize your code is to figure out a way to not process duplicate words in the lyrics. if your lyrics are foo foo foo foo bar bar bar foo bar then you're going to be doing a lot of unnecessary checks.

EDIT also see rsp's answer - additional speedups there, but the big speedup is getting rid of the inner loop - glad it's down to 15 secs now.

My thinking was when a new word is encountered, iterate over the song list and add the matches to the set. If I follow your code, only the songs after that point will be added to the set, but there is the possiblity that the previous songs will also be matches, thus the nested loop.
If the previous songs contained that word, wouldn't it have triggered buildSongSet anyway? If you follow the logic, previous songs won't contain any words that you haven't already called buildSongSet for. I'll try to explain in the answer..
Alright, makes sense. I altered the code, it executes in ~15 seconds now. Eliminating that second loop did the trick. Unfortunately, using HashSet is not an option, project requires using TreeSet

CollectivesTM on Stack Overflow

How can I optimize this code?

4 Answers 4

3 Comments

3 Comments

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

4 Answers 4

3 Comments

3 Comments

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related