Sparse String Matrix Storage Class C++

Question 1

Introduction

I implemented a class to store a matrix with entries at positions (i,j) that are strings. The matrices I would like to efficiently store in this class have strings as entries and it occurs very often that entries are the same. I am at the moment not interested in reading out the matrix, but to store the overall information as efficiently as possible.

Code

My class looks as follows

#include <vector>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <algorithm>
struct DataClass {
 std::vector<uint64_t> matrix_element_is {};
 std::vector<uint64_t> matrix_element_js {};
 std::vector<uint64_t> matrix_amplitude_ids {};
 std::vector<std::string> amplitudes {};
 void reserve(uint64_t const s_me_ids, uint64_t const s_ma_ids, uint64_t const r_amps) {
 matrix_element_is.reserve(s_me_ids); 
 matrix_element_js.reserve(s_me_ids); 
 matrix_amplitude_ids.reserve(s_ma_ids); 
 amplitudes.reserve(r_amps);
 }
 void add_matrix_element(uint64_t const i, uint64_t const j, std::string const & elm) {
 matrix_element_is.push_back(i);
 matrix_element_js.push_back(j);
 auto const it = std::find(std::begin(amplitudes),std::end(amplitudes),elm);
 matrix_amplitude_ids.push_back(distance(std::begin(amplitudes),it));
 if (it == std::end(amplitudes)) amplitudes.push_back(elm);
 }
 void write_to_file(std::string const & fn) const {
 std::ofstream f(fn);
 for (auto & e: amplitudes) f << e << "\n";
 for (uint64_t i {}; i < matrix_element_is.size(); ++i){
 f << matrix_element_is[i] << "," << matrix_element_js[i] << "," << matrix_amplitude_ids[i];
 if (i==(matrix_element_is.size()-1)) break;
 f << "\n";
 }
 f << std::flush;
 f.close();
 }
};
int main () {
 DataClass B {};
 B.add_matrix_element(0UL,0UL,"h0");
 B.add_matrix_element(1UL,1UL,"h1");
 B.add_matrix_element(1UL,2UL,"h1");
 B.add_matrix_element(0UL,1UL,"h0");
 for (auto & e : B.matrix_element_is) std::cout << e << " ";
 std::cout << std::endl;
 for (auto & e : B.matrix_element_js) std::cout << e << " ";
 std::cout << std::endl;
 for (auto & e : B.matrix_amplitude_ids) std::cout << e << " ";
 std::cout << std::endl;
 for (auto & e : B.amplitudes) std::cout << e << " ";
 std::cout << std::endl;
 B.write_to_file("test.dat");
}

The class has basically three functionalities:

Reserving memory for the vectors that store the data
Storing new entries. A new entry (i,j,s) is stored, by appending i and j to the respective vectors. The string is stored in the amplitudes vector, if the string is not yet in the amplitudes vectors. The position of the amplitude in the amplitudes vector is then stored to associate the correct amplitude with the respective (i,j)
Writing the class to a file

I also included some demo of the class in the main function.

My main question would be:

Do you have ideas how to store the the information (uint64_t,uint64_t,std::string) more compactly and efficiently?

And do have ideas how to store the struct more efficiently in a file?

Otherwise I would also be interested in feedback:

If there are any evident flaws in the layout of my struct?

There are stylistic improvements I can make to my code?

Question 2

We're missing an include of <iterator>. This can be made more obvious by sorting the standard library includes.

You seem to have misspelt std::uint64_t throughout. Are you sure you need exactly 64 bits, or would std::uint_least64_t be a better choice?

Also misspelt: std::distance.

The parallel vectors matrix_element_is, matrix_element_js and matrix_amplitude_ids will be more efficient if those are in a single array containing a suitable aggregate type. In any case, the storage should be class-private to avoid breaking its internal consistency.

Linear search in amplitudes is going to get expensive when this gets large. Consider a pair of std::map instead. Or a single std::map if write_to_file is infrequent.

There's nothing to prevent us adding duplicate entries, and no way to retrieve any element that's stored.

write_to_file() is usually implemented as operator<<. Then it's easier to test (as we can write to a std::ostringstream and examine that), and it has a standard way to report failure (through the fail bit of the stream). This implementation gives us no way of knowing whether write_to_file() successfully wrote the whole contents to file or not.

There's no way to reliably read back from file, as we have no way to tell where each amplitude element ends and the next one begins.

f << std::flush is pointless, as that's implied by the immediately following f.close().

Question 3

I think std::size_t is much better than std::uint64_t here. You could store unique "amplitudes" in a std::set, and store iterators to that set into matrix_amplitude_ids.

Toby Speight Toby Speight 87.9k14 gold badges104 silver badges325 bronze badges · Accepted Answer · 2022-10-26 12:29:03Z

We're missing an include of <iterator>. This can be made more obvious by sorting the standard library includes.

You seem to have misspelt std::uint64_t throughout. Are you sure you need exactly 64 bits, or would std::uint_least64_t be a better choice?

Also misspelt: std::distance.

The parallel vectors matrix_element_is, matrix_element_js and matrix_amplitude_ids will be more efficient if those are in a single array containing a suitable aggregate type. In any case, the storage should be class-private to avoid breaking its internal consistency.

Linear search in amplitudes is going to get expensive when this gets large. Consider a pair of std::map instead. Or a single std::map if write_to_file is infrequent.

There's nothing to prevent us adding duplicate entries, and no way to retrieve any element that's stored.

write_to_file() is usually implemented as operator<<. Then it's easier to test (as we can write to a std::ostringstream and examine that), and it has a standard way to report failure (through the fail bit of the stream). This implementation gives us no way of knowing whether write_to_file() successfully wrote the whole contents to file or not.

There's no way to reliably read back from file, as we have no way to tell where each amplitude element ends and the next one begins.

f << std::flush is pointless, as that's implied by the immediately following f.close().

I think std::size_t is much better than std::uint64_t here. You could store unique "amplitudes" in a std::set, and store iterators to that set into matrix_amplitude_ids.

Stack Exchange Network

Sparse String Matrix Storage Class C++

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Sparse String Matrix Storage Class C++

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions