Implementing permuted index in C++

Question 1

I have implemented a program that produces a permuted index. Actually it is this exercise: https://stackoverflow.com/questions/4015016/what-is-a-permuted-index

Could anyone tell me if I did something wrong, and if it could be done better?

An input is for example:

The quick brown fox

And result:

 The quick brown fox 
The quick brown fox 
 The quick brown fox 
 The quick brown fox

main.cpp

#include "myImplementation.cpp"
using namespace std;
vector<string> split(const string &s) {
 vector<string> ret;
 string::size_type i = 0;
 while (i != s.size()) {
 while (i != s.size() && isspace(s[i]))
 ++i;
 string::size_type j = i;
 while (j != s.size() && !isspace(s[j]))
 j++;
 if (i != j) {
 ret.push_back(s.substr(i, j - i));
 i = j;
 }
 }
 return ret;
}
int main() {
 string phrase;
 cout << "Please give me some phrase" << endl;
 getline(cin, phrase);
 vector <string> splitted = split(phrase);
 vector<string> permuted = makePermutedIndex(splitted);
 for (const auto i : permuted) {
 cout << i << endl;
 }
 return 0;
}

myImplementation.cpp

#include <vector>
#include <sstream>
#include <algorithm>
using namespace std;
vector<string> concatLeftAndRight(const vector<string> &left, const vector<string> &right,
 const string::size_type maxLengthOfLeftLine) {
 vector<string> ret;
 for (vector<string>::size_type i = 0; i != left.size(); ++i) {
 std::stringstream ss;
 ss << string(maxLengthOfLeftLine - left[i].size(), ' ')
 << left[i] << " " << right[i];
 ret.push_back(ss.str());
 }
 return ret;
}
vector<string> makePermutedIndex(const vector<string> &splitted) {
 vector<string> sorted(splitted);
 std::sort(sorted.begin(), sorted.end());
 string::size_type maxLengthOfLeftLine = 0;
 vector<string> left;
 vector<string> right;
 for (auto index : sorted) {
 string rightLine, leftLine;
 bool indexEncountered = false;
 for (auto word : splitted) {
 if ((index != word) && !indexEncountered) {
 leftLine += word;
 leftLine += " ";
 }
 else {
 indexEncountered = true;
 rightLine += word;
 rightLine += " ";
 }
 }
 maxLengthOfLeftLine = max(maxLengthOfLeftLine, leftLine.size());
 left.push_back(leftLine);
 right.push_back(rightLine);
 }
 return concatLeftAndRight(left, right, maxLengthOfLeftLine);
}

Question 2

Follow-up question

Question 3

I would do this somewhat differently. First of all, I'd use a class to represent an index, and another to represent an index entry. From there, the details would depend (at least somewhat) on the degree to which efficiency mattered to me. My immediate instinct would probably be to not care much--chances are pretty good that it ends up (at least largely) I/O bound for most inputs. That being the case, it's not usually worth worrying a lot about efficiency of building the index.

Assuming a situation where you decided to go for simplicity over efficiency, I'd probably start by creating one class for an index, and another for an index entry. An index_entry would hold two strings: one of the left and one for the right part of a line. It would have a constructor to create an index entry from a vector of strings representing words, and a location at which to split that vector. It would probably define an operator< to compare two entries (based primarily the right column, and secondarily on the left column). It should probably do case-insensitive comparisons.

An index would hold a vector of index entries. It would have an add to add a line to the index, a sort to sort the entries.

Each of those would also have a stream insertion operator.

I'd probably write split to use a stringstream instead of doing the splitting on my own:

std::vector<std::string> split(std::string const &s) {
 std::istringstream in(s);
 std::vector<std::string> words { std::istream_iterator<std::string>(in),
 std::istream_iterator<std::string>() };
 return words;
}

Reasonably complete code could look something like this:

#include <vector>
#include <string>
#include <iterator>
#include <sstream>
#include <iostream>
#include <iomanip>
#include <algorithm>
#include <fstream>
std::vector<std::string> split(std::string const &s) {
 std::istringstream in(s);
 std::vector<std::string> words { std::istream_iterator<std::string>(in),
 std::istream_iterator<std::string>() };
 return words;
}
struct index_entry {
 std::string left;
 std::string right;
 index_entry(std::vector < std::string> const &in, size_t split) {
 for (int i = 0; i < split; i++)
 left += in[i] + " ";
 for (int i = split; i < in.size(); i++)
 right += in[i] + " ";
 }
 friend std::ostream &operator<<(std::ostream &os, index_entry const &e) { 
 return os << std::setw(40) << e.left << "\t" << e.right;
 }
 bool operator<(index_entry const &other) { 
 if (right < other.right)
 return true;
 if (other.right < right)
 return false;
 return left < other.left;
 }
};
struct index {
 std::vector<index_entry> entries;
 bool sorted;
public:
 index() : sorted(false) { }
 void add(std::string const &in) {
 std::vector<std::string> words = split(in);
 for (size_t i = 0; i < words.size(); i++)
 entries.emplace_back(words, i);
 sorted = false;
 }
 void sort() {
 std::sort(entries.begin(), entries.end());
 }
 friend std::ostream &operator<<(std::ostream &os, index &i) {
 for (auto const &e : i.entries)
 os << e << "\n";
 return os;
 }
};
int main() {
 std::stringstream input { "the quick brown fox\nthe lazy yellow dog\nthe insufferable white cat" };
 std::string line;
 index i;
 while (std::getline(input, line))
 i.add(line);
 i.sort();
 std::cout << i;
}

efficiency

After reading vnp's answer, I feel obliged to comment for a moment on efficiency. My immediate reaction is more or less the opposite of his: in real use, I'd expect:

this to be heavily I/O bound under most circumstances
the memory usage to be inconsequential

It does produce substantially more output than input, and it (currently) stores all that output. In a quick test I gave it about 80 kilobytes (about 33 pages) of input, from which it produced about 1.4 megabytes of output (and that rate of expansion is probably fairly representative, at least for English text). For a reasonably large book, we might multiply that by a factor of, say, 10. Since it's storing the entire output somewhat inefficiently maximum memory usage might get as high as, say, 30 megabytes or so. Even if I decided to build such an index on my phone, that probably wouldn't be a major concern.

At least in my opinion, long before that was a concern, I'd add code to produce a better index, such as not indexing words that are so common they're unlikely to be of any use (e.g., "a", "an", "the", "it"). Each of these currently creates an entirely line of output, and they're probably more common than words that might be meaningful, so adding that would probably reduce memory usage (and index size) by a factor of at least 2 (and quite possibly more).

When/if it really was necessary to address such concerns, I'd probably think in terms of leaving most of that existing code almost as it is, except that it would deal with some sort of string_view instead of a string. The idea would be that a string_view stores (for example) only a pointer to char and a length, and depends on you to manage the fact that the original string it points to needs to stay valid at least as long as the string_view.

Question 4

Thanks for reply! I have changed it to use classes, should I replace my code or create new thread?

Question 5

@Mateusz: A new thread (probably with a link to this one).

Question 6

ok, this is new thread: codereview.stackexchange.com/questions/122382/…

Question 7

The code does not implement the permuted index. Instead of

 The quick brown fox 
The quick brown fox 
 The quick brown fox 
 The quick brown fox

it produces

 The quick brown fox 
 The quick brown fox 
The quick brown fox 
 The quick brown fox

Notice that the order is wrong. The permuted index shall be sorted by the second column.

using namespace std; is considered bad practice.
Never #include a .cpp file. See for example this discussion for details.
Try not use so many strings. It hurts both memory usage and performance. In fact there is no need to break the initial string into words and reconstruct left and right strings. A vector of whitespace locations in the original string suffices.

Question 8

oh yeah I know but this is sort function problem with capital letters

Question 9

thanks for hints. I didnt make a header file because there was no structs, variables etc and just was easier to me to keep it in one file

Jerry Coffin Jerry Coffin 34.1k4 gold badges77 silver badges144 bronze badges · Accepted Answer · 2016-03-08 23:33:50Z

I would do this somewhat differently. First of all, I'd use a class to represent an index, and another to represent an index entry. From there, the details would depend (at least somewhat) on the degree to which efficiency mattered to me. My immediate instinct would probably be to not care much--chances are pretty good that it ends up (at least largely) I/O bound for most inputs. That being the case, it's not usually worth worrying a lot about efficiency of building the index.

Assuming a situation where you decided to go for simplicity over efficiency, I'd probably start by creating one class for an index, and another for an index entry. An index_entry would hold two strings: one of the left and one for the right part of a line. It would have a constructor to create an index entry from a vector of strings representing words, and a location at which to split that vector. It would probably define an operator< to compare two entries (based primarily the right column, and secondarily on the left column). It should probably do case-insensitive comparisons.

An index would hold a vector of index entries. It would have an add to add a line to the index, a sort to sort the entries.

Each of those would also have a stream insertion operator.

I'd probably write split to use a stringstream instead of doing the splitting on my own:

std::vector<std::string> split(std::string const &s) {
 std::istringstream in(s);
 std::vector<std::string> words { std::istream_iterator<std::string>(in),
 std::istream_iterator<std::string>() };
 return words;
}

Reasonably complete code could look something like this:

#include <vector>
#include <string>
#include <iterator>
#include <sstream>
#include <iostream>
#include <iomanip>
#include <algorithm>
#include <fstream>
std::vector<std::string> split(std::string const &s) {
 std::istringstream in(s);
 std::vector<std::string> words { std::istream_iterator<std::string>(in),
 std::istream_iterator<std::string>() };
 return words;
}
struct index_entry {
 std::string left;
 std::string right;
 index_entry(std::vector < std::string> const &in, size_t split) {
 for (int i = 0; i < split; i++)
 left += in[i] + " ";
 for (int i = split; i < in.size(); i++)
 right += in[i] + " ";
 }
 friend std::ostream &operator<<(std::ostream &os, index_entry const &e) { 
 return os << std::setw(40) << e.left << "\t" << e.right;
 }
 bool operator<(index_entry const &other) { 
 if (right < other.right)
 return true;
 if (other.right < right)
 return false;
 return left < other.left;
 }
};
struct index {
 std::vector<index_entry> entries;
 bool sorted;
public:
 index() : sorted(false) { }
 void add(std::string const &in) {
 std::vector<std::string> words = split(in);
 for (size_t i = 0; i < words.size(); i++)
 entries.emplace_back(words, i);
 sorted = false;
 }
 void sort() {
 std::sort(entries.begin(), entries.end());
 }
 friend std::ostream &operator<<(std::ostream &os, index &i) {
 for (auto const &e : i.entries)
 os << e << "\n";
 return os;
 }
};
int main() {
 std::stringstream input { "the quick brown fox\nthe lazy yellow dog\nthe insufferable white cat" };
 std::string line;
 index i;
 while (std::getline(input, line))
 i.add(line);
 i.sort();
 std::cout << i;
}

efficiency

After reading vnp's answer, I feel obliged to comment for a moment on efficiency. My immediate reaction is more or less the opposite of his: in real use, I'd expect:

this to be heavily I/O bound under most circumstances
the memory usage to be inconsequential

It does produce substantially more output than input, and it (currently) stores all that output. In a quick test I gave it about 80 kilobytes (about 33 pages) of input, from which it produced about 1.4 megabytes of output (and that rate of expansion is probably fairly representative, at least for English text). For a reasonably large book, we might multiply that by a factor of, say, 10. Since it's storing the entire output somewhat inefficiently maximum memory usage might get as high as, say, 30 megabytes or so. Even if I decided to build such an index on my phone, that probably wouldn't be a major concern.

At least in my opinion, long before that was a concern, I'd add code to produce a better index, such as not indexing words that are so common they're unlikely to be of any use (e.g., "a", "an", "the", "it"). Each of these currently creates an entirely line of output, and they're probably more common than words that might be meaningful, so adding that would probably reduce memory usage (and index size) by a factor of at least 2 (and quite possibly more).

When/if it really was necessary to address such concerns, I'd probably think in terms of leaving most of that existing code almost as it is, except that it would deal with some sort of string_view instead of a string. The idea would be that a string_view stores (for example) only a pointer to char and a length, and depends on you to manage the fact that the original string it points to needs to stay valid at least as long as the string_view.

Thanks for reply! I have changed it to use classes, should I replace my code or create new thread?
ok, this is new thread: codereview.stackexchange.com/questions/122382/…

Stack Exchange Network

Implementing permuted index in C++

2 Answers 2

efficiency

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Implementing permuted index in C++

2 Answers 2

efficiency

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions