5
\$\begingroup\$

When I run this function for outputting an inverted index to a text file in Debug Configuration, it takes nearly two minutes (96 seconds) with a comparatively tiny dataset, 1252 records with the longest being 76 entries.

std::ostream& operator<<(std::ostream& out_stream, const InvertedIndex& rhs) {
 std::ostringstream output_buffer;
 auto& rhs_index = rhs.GetIndex();
 for(auto map_elem : rhs_index) {
 output_buffer << map_elem.first;
 auto& cur_postingset = map_elem.second;
 for(auto set_elem : cur_postingset) {
 output_buffer << " <" << set_elem.GetDocumentId() << ' ' << set_elem.GetTokenFrequency() << ">";
 }
 output_buffer << '\n';
 }
 output_buffer.flush();
 out_stream.write(output_buffer.str().c_str(), output_buffer.str().size());
 return out_stream;
}

The index is a std::map<std::string, std::vector<std::pair<unsigned long, std::string>>> data type. It makes debugging other portions of the program tedious and time-consuming. Is there any way to speed it up or am I going to have to live with it?

Jamal
35.2k13 gold badges134 silver badges238 bronze badges
asked Feb 17, 2015 at 2:16
\$\endgroup\$
2
  • \$\begingroup\$ The only sound advice I can give is to profile the code. That said, GetTokenFrequency is a prime suspect. What does it do in terms of time complexity? Is it just another query, or the actual document scan is involved? \$\endgroup\$ Commented Feb 17, 2015 at 2:42
  • 1
    \$\begingroup\$ Well you don't need to convert it to a string first!! The whole point of the stream is to prevent you doing that in the first place. \$\endgroup\$ Commented Feb 17, 2015 at 3:47

2 Answers 2

2
\$\begingroup\$

In main add:

std::ios_base::sync_with_stdio(false); // do not keep the C and C++ streams synced.
 // No extra works make this quicker.

The quickest way to flush one stream into another is:

stream << otherStream.rdBuf();

So rather than this:

out_stream.write(output_buffer.str().c_str(), output_buffer.str().size());
// prefer
out_stream << output_buffer.rdBuf();

The other things that slows things down is excessive flushing. Don't flush your stream until you really want the output (so its not good practice to flush the stream while you are writing to it). Do it explicitly afterwords.

out_stream << object << std::flush;

But I see no reason for copying all the data into one stream then re-copying the data into another stream.

std::ostream& operator<<(std::ostream& out_stream, const InvertedIndex& rhs)
{
 auto& rhs_index = rhs.GetIndex();
 for(auto map_elem : rhs_index) {
 out_stream << map_elem.first;
 auto& cur_postingset = map_elem.second;
 for(auto set_elem : cur_postingset) {
 out_stream << " <" << set_elem.GetDocumentId() << ' ' << set_elem.GetTokenFrequency() << ">";
 }
 out_stream << '\n';
 }
 return out_stream;
}
answered Feb 17, 2015 at 4:08
\$\endgroup\$
4
  • \$\begingroup\$ No changes to the timing. :( \$\endgroup\$ Commented Feb 17, 2015 at 20:47
  • \$\begingroup\$ That I find suspicious. What is the definition of GetIndex(); Well actually you should probably provide the definition of: InvertedIndex \$\endgroup\$ Commented Feb 17, 2015 at 21:32
  • \$\begingroup\$ Oh. just read your comment. Its only slow in debug. That is not that surprising. Debug has very few optimizations (if any) turned on. So a lot of the copy elitions are not being done (so big structures may be copied where they are built in place after optimization are turned on). But this will all be how you write the InvertedIndex class. So you should really provide more details about that. \$\endgroup\$ Commented Feb 17, 2015 at 21:35
  • \$\begingroup\$ Solved. See new answer. \$\endgroup\$ Commented Feb 18, 2015 at 0:53
1
\$\begingroup\$

The simple creation of an asynchronous task for destroying of a vector of strings of size 500K+ on a separate thread was the culprit (regardless of location in the code). Removed future declarations and calls to get(). Improvement from 40 seconds to less than 2. 95% increase in timing performance.

void OutputIndex(const std::tr2::sys::path& output_path, const InvertedIndex& index) {
 std::ostringstream ss;
 ss << output_path.directory_string() << "/index.out";
 std::cout << "Outputting index results to: " << ss.str() << std::endl;
 std::ofstream out_file;
 out_file.open(ss.str());
 out_file << index; //<-- Thought that THIS was the culprit.
 out_file.close();
}
void DestroyList(std::vector<std::string>* words_list) {
 words_list->clear();
 delete words_list;
}
void Finalization(InvertedIndex& index, std::vector<std::future<InvertedIndex*>>& results, const std::tr2::sys::path& output_path) {
 //Turns out it was these calls...
 //std::future<void> destroy_stop_words_fut = std::async(std::launch::async, DestroyList, words_stop);
 //std::future<void> destroy_proper_words_fut = std::async(std::launch::async, DestroyList, words_proper_nouns);
 //std::future<void> destroy_nonproper_words_fut = std::async(std::launch::async, DestroyList, words_nonproper_nouns);
 std::cout << "\nSorting index...";
 index.Sort();
 std::cout << "Done." << std::endl;
 OutputIndex(output_path, index);
 OutputStatistics(output_path, index);
 //..and these calls...
 //It didn't matter where they were. Even at the end of the program.
 //Well after they should have completed.
 //std::cout << "Cleaning up stop list...";
 //destroy_stop_words_fut.get();
 //std::cout << "Done." << std::endl;
 //std::cout << "Cleaning up nouns list...";
 //destroy_nonproper_words_fut.get();
 //std::cout << "Done." << std::endl;
 //std::cout << "Cleaning up proper nouns list...";
 //destroy_proper_words_fut.get();
 //std::cout << "Done." << std::endl;
}
Jamal
35.2k13 gold badges134 silver badges238 bronze badges
answered Feb 18, 2015 at 0:53
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.