I have implemented a program that produces a permuted index. Actually it is this exercise: https://stackoverflow.com/questions/4015016/what-is-a-permuted-index
Could anyone tell me if I did something wrong, and if it could be done better?
An input is for example:
The quick brown fox
And result:
The quick brown fox
The quick brown fox
The quick brown fox
The quick brown fox
main.cpp
#include "myImplementation.cpp"
using namespace std;
vector<string> split(const string &s) {
vector<string> ret;
string::size_type i = 0;
while (i != s.size()) {
while (i != s.size() && isspace(s[i]))
++i;
string::size_type j = i;
while (j != s.size() && !isspace(s[j]))
j++;
if (i != j) {
ret.push_back(s.substr(i, j - i));
i = j;
}
}
return ret;
}
int main() {
string phrase;
cout << "Please give me some phrase" << endl;
getline(cin, phrase);
vector <string> splitted = split(phrase);
vector<string> permuted = makePermutedIndex(splitted);
for (const auto i : permuted) {
cout << i << endl;
}
return 0;
}
myImplementation.cpp
#include <vector>
#include <sstream>
#include <algorithm>
using namespace std;
vector<string> concatLeftAndRight(const vector<string> &left, const vector<string> &right,
const string::size_type maxLengthOfLeftLine) {
vector<string> ret;
for (vector<string>::size_type i = 0; i != left.size(); ++i) {
std::stringstream ss;
ss << string(maxLengthOfLeftLine - left[i].size(), ' ')
<< left[i] << " " << right[i];
ret.push_back(ss.str());
}
return ret;
}
vector<string> makePermutedIndex(const vector<string> &splitted) {
vector<string> sorted(splitted);
std::sort(sorted.begin(), sorted.end());
string::size_type maxLengthOfLeftLine = 0;
vector<string> left;
vector<string> right;
for (auto index : sorted) {
string rightLine, leftLine;
bool indexEncountered = false;
for (auto word : splitted) {
if ((index != word) && !indexEncountered) {
leftLine += word;
leftLine += " ";
}
else {
indexEncountered = true;
rightLine += word;
rightLine += " ";
}
}
maxLengthOfLeftLine = max(maxLengthOfLeftLine, leftLine.size());
left.push_back(leftLine);
right.push_back(rightLine);
}
return concatLeftAndRight(left, right, maxLengthOfLeftLine);
}
-
\$\begingroup\$ Follow-up question \$\endgroup\$200_success– 200_success2016年03月10日 00:26:09 +00:00Commented Mar 10, 2016 at 0:26
2 Answers 2
I would do this somewhat differently. First of all, I'd use a class to represent an index, and another to represent an index entry. From there, the details would depend (at least somewhat) on the degree to which efficiency mattered to me. My immediate instinct would probably be to not care much--chances are pretty good that it ends up (at least largely) I/O bound for most inputs. That being the case, it's not usually worth worrying a lot about efficiency of building the index.
Assuming a situation where you decided to go for simplicity over efficiency, I'd probably start by creating one class for an index, and another for an index entry. An index_entry would hold two strings: one of the left and one for the right part of a line. It would have a constructor to create an index entry from a vector of strings representing words, and a location at which to split that vector. It would probably define an operator<
to compare two entries (based primarily the right column, and secondarily on the left column). It should probably do case-insensitive comparisons.
An index would hold a vector of index entries. It would have an add
to add a line to the index, a sort
to sort the entries.
Each of those would also have a stream insertion operator.
I'd probably write split
to use a stringstream
instead of doing the splitting on my own:
std::vector<std::string> split(std::string const &s) {
std::istringstream in(s);
std::vector<std::string> words { std::istream_iterator<std::string>(in),
std::istream_iterator<std::string>() };
return words;
}
Reasonably complete code could look something like this:
#include <vector>
#include <string>
#include <iterator>
#include <sstream>
#include <iostream>
#include <iomanip>
#include <algorithm>
#include <fstream>
std::vector<std::string> split(std::string const &s) {
std::istringstream in(s);
std::vector<std::string> words { std::istream_iterator<std::string>(in),
std::istream_iterator<std::string>() };
return words;
}
struct index_entry {
std::string left;
std::string right;
index_entry(std::vector < std::string> const &in, size_t split) {
for (int i = 0; i < split; i++)
left += in[i] + " ";
for (int i = split; i < in.size(); i++)
right += in[i] + " ";
}
friend std::ostream &operator<<(std::ostream &os, index_entry const &e) {
return os << std::setw(40) << e.left << "\t" << e.right;
}
bool operator<(index_entry const &other) {
if (right < other.right)
return true;
if (other.right < right)
return false;
return left < other.left;
}
};
struct index {
std::vector<index_entry> entries;
bool sorted;
public:
index() : sorted(false) { }
void add(std::string const &in) {
std::vector<std::string> words = split(in);
for (size_t i = 0; i < words.size(); i++)
entries.emplace_back(words, i);
sorted = false;
}
void sort() {
std::sort(entries.begin(), entries.end());
}
friend std::ostream &operator<<(std::ostream &os, index &i) {
for (auto const &e : i.entries)
os << e << "\n";
return os;
}
};
int main() {
std::stringstream input { "the quick brown fox\nthe lazy yellow dog\nthe insufferable white cat" };
std::string line;
index i;
while (std::getline(input, line))
i.add(line);
i.sort();
std::cout << i;
}
efficiency
After reading vnp's answer, I feel obliged to comment for a moment on efficiency. My immediate reaction is more or less the opposite of his: in real use, I'd expect:
- this to be heavily I/O bound under most circumstances
- the memory usage to be inconsequential
It does produce substantially more output than input, and it (currently) stores all that output. In a quick test I gave it about 80 kilobytes (about 33 pages) of input, from which it produced about 1.4 megabytes of output (and that rate of expansion is probably fairly representative, at least for English text). For a reasonably large book, we might multiply that by a factor of, say, 10. Since it's storing the entire output somewhat inefficiently maximum memory usage might get as high as, say, 30 megabytes or so. Even if I decided to build such an index on my phone, that probably wouldn't be a major concern.
At least in my opinion, long before that was a concern, I'd add code to produce a better index, such as not indexing words that are so common they're unlikely to be of any use (e.g., "a", "an", "the", "it"). Each of these currently creates an entirely line of output, and they're probably more common than words that might be meaningful, so adding that would probably reduce memory usage (and index size) by a factor of at least 2 (and quite possibly more).
When/if it really was necessary to address such concerns, I'd probably think in terms of leaving most of that existing code almost as it is, except that it would deal with some sort of string_view
instead of a string
. The idea would be that a string_view
stores (for example) only a pointer to char and a length, and depends on you to manage the fact that the original string it points to needs to stay valid at least as long as the string_view
.
-
\$\begingroup\$ Thanks for reply! I have changed it to use classes, should I replace my code or create new thread? \$\endgroup\$Mateusz– Mateusz2016年03月09日 14:51:27 +00:00Commented Mar 9, 2016 at 14:51
-
\$\begingroup\$ @Mateusz: A new thread (probably with a link to this one). \$\endgroup\$Jerry Coffin– Jerry Coffin2016年03月09日 16:08:57 +00:00Commented Mar 9, 2016 at 16:08
-
\$\begingroup\$ ok, this is new thread: codereview.stackexchange.com/questions/122382/… \$\endgroup\$Mateusz– Mateusz2016年03月09日 16:32:26 +00:00Commented Mar 9, 2016 at 16:32
The code does not implement the permuted index. Instead of
The quick brown fox The quick brown fox The quick brown fox The quick brown fox
it produces
The quick brown fox The quick brown fox The quick brown fox The quick brown fox
Notice that the order is wrong. The permuted index shall be sorted by the second column.
using namespace std;
is considered bad practice.Never
#include
a.cpp
file. See for example this discussion for details.Try not use so many strings. It hurts both memory usage and performance. In fact there is no need to break the initial string into words and reconstruct left and right strings. A vector of whitespace locations in the original string suffices.
-
\$\begingroup\$ oh yeah I know but this is sort function problem with capital letters \$\endgroup\$Mateusz– Mateusz2016年03月08日 23:37:44 +00:00Commented Mar 8, 2016 at 23:37
-
\$\begingroup\$ thanks for hints. I didnt make a header file because there was no structs, variables etc and just was easier to me to keep it in one file \$\endgroup\$Mateusz– Mateusz2016年03月08日 23:39:02 +00:00Commented Mar 8, 2016 at 23:39