Introduction
I implemented a class to store a matrix with entries at positions (i,j) that are strings. The matrices I would like to efficiently store in this class have strings as entries and it occurs very often that entries are the same. I am at the moment not interested in reading out the matrix, but to store the overall information as efficiently as possible.
Code
My class looks as follows
#include <vector>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <algorithm>
struct DataClass {
std::vector<uint64_t> matrix_element_is {};
std::vector<uint64_t> matrix_element_js {};
std::vector<uint64_t> matrix_amplitude_ids {};
std::vector<std::string> amplitudes {};
void reserve(uint64_t const s_me_ids, uint64_t const s_ma_ids, uint64_t const r_amps) {
matrix_element_is.reserve(s_me_ids);
matrix_element_js.reserve(s_me_ids);
matrix_amplitude_ids.reserve(s_ma_ids);
amplitudes.reserve(r_amps);
}
void add_matrix_element(uint64_t const i, uint64_t const j, std::string const & elm) {
matrix_element_is.push_back(i);
matrix_element_js.push_back(j);
auto const it = std::find(std::begin(amplitudes),std::end(amplitudes),elm);
matrix_amplitude_ids.push_back(distance(std::begin(amplitudes),it));
if (it == std::end(amplitudes)) amplitudes.push_back(elm);
}
void write_to_file(std::string const & fn) const {
std::ofstream f(fn);
for (auto & e: amplitudes) f << e << "\n";
for (uint64_t i {}; i < matrix_element_is.size(); ++i){
f << matrix_element_is[i] << "," << matrix_element_js[i] << "," << matrix_amplitude_ids[i];
if (i==(matrix_element_is.size()-1)) break;
f << "\n";
}
f << std::flush;
f.close();
}
};
int main () {
DataClass B {};
B.add_matrix_element(0UL,0UL,"h0");
B.add_matrix_element(1UL,1UL,"h1");
B.add_matrix_element(1UL,2UL,"h1");
B.add_matrix_element(0UL,1UL,"h0");
for (auto & e : B.matrix_element_is) std::cout << e << " ";
std::cout << std::endl;
for (auto & e : B.matrix_element_js) std::cout << e << " ";
std::cout << std::endl;
for (auto & e : B.matrix_amplitude_ids) std::cout << e << " ";
std::cout << std::endl;
for (auto & e : B.amplitudes) std::cout << e << " ";
std::cout << std::endl;
B.write_to_file("test.dat");
}
The class has basically three functionalities:
- Reserving memory for the vectors that store the data
- Storing new entries. A new entry (i,j,s) is stored, by appending i and j to the respective vectors. The string is stored in the amplitudes vector, if the string is not yet in the amplitudes vectors. The position of the amplitude in the amplitudes vector is then stored to associate the correct amplitude with the respective (i,j)
- Writing the class to a file
I also included some demo of the class in the main function.
My main question would be:
Do you have ideas how to store the the information (uint64_t,uint64_t,std::string)
more compactly and efficiently?
And do have ideas how to store the struct more efficiently in a file?
Otherwise I would also be interested in feedback:
If there are any evident flaws in the layout of my struct?
There are stylistic improvements I can make to my code?
1 Answer 1
We're missing an include of <iterator>
. This can be made more obvious by sorting the standard library includes.
You seem to have misspelt std::uint64_t
throughout. Are you sure you need exactly 64 bits, or would std::uint_least64_t
be a better choice?
Also misspelt: std::distance
.
The parallel vectors matrix_element_is
, matrix_element_js
and matrix_amplitude_ids
will be more efficient if those are in a single array containing a suitable aggregate type. In any case, the storage should be class-private to avoid breaking its internal consistency.
Linear search in amplitudes
is going to get expensive when this gets large. Consider a pair of std::map
instead. Or a single std::map
if write_to_file
is infrequent.
There's nothing to prevent us adding duplicate entries, and no way to retrieve any element that's stored.
write_to_file()
is usually implemented as operator<<
. Then it's easier to test (as we can write to a std::ostringstream
and examine that), and it has a standard way to report failure (through the fail bit of the stream). This implementation gives us no way of knowing whether write_to_file()
successfully wrote the whole contents to file or not.
There's no way to reliably read back from file, as we have no way to tell where each amplitude element ends and the next one begins.
f << std::flush
is pointless, as that's implied by the immediately following f.close()
.
-
1\$\begingroup\$ I think
std::size_t
is much better thanstd::uint64_t
here. You could store unique "amplitudes" in astd::set
, and store iterators to that set intomatrix_amplitude_ids
. \$\endgroup\$G. Sliepen– G. Sliepen2022年10月26日 20:14:38 +00:00Commented Oct 26, 2022 at 20:14