I have binary files that need to be efficiently processed. The first 8 bytes correspond to metadata, and all the rest is data. From the first 8 bytes I need the last 4 bytes to determine how to structure the rest of the data.
Since I'm new to Rust, this seemed like a good exercise. The following code compiles and produces results that seem reasonable.
use std::convert::TryInto;
use ndarray::Array2;
use chrono::prelude::*;

fn four_bytes_to_array(barry: &[u8]) -> &[u8; 4] {
    barry.try_into().expect("slice with incorrect length")
}

fn eight_bytes_to_array(barry: &[u8]) -> &[u8; 8] {
    barry.try_into().expect("slice with incorrect length")
}

fn bin_file_to_matrix(file_name: &str) -> ndarray::Array2<f64> {
    // Read in file_content
    let mut file_content = std::fs::read(file_name).expect("Could not read file!");
    // The first 4 bytes are some random information, the second 4 bytes are the
    // number of data-points per spectrum
    let nr_dp_per_spectrum = four_bytes_to_array(&file_content[4..8]);
    // We combine the 4 bytes into an unsigned integer
    let nr_dp_per_spectrum = u32::from_be_bytes(*nr_dp_per_spectrum);
    // Calculate how many spectra there are in this file
    let how_many_spectra = file_content.len() as u32/8/(nr_dp_per_spectrum + 1u32);
    // Create a buffer to write the data to
    let dim = ndarray::Dim([how_many_spectra as usize, nr_dp_per_spectrum as usize]);
    let mut data = Array2::<f64>::zeros(dim);
    // Remove the first 8 bytes that contain metadata we have already processed
    file_content.drain(0..8);
    for i in 0..how_many_spectra {
        for j in 0..nr_dp_per_spectrum {
            let idx = ( (nr_dp_per_spectrum+1) * i + j * 8 ) as usize;
            let tmp = eight_bytes_to_array( &file_content[idx..idx+8] );
            let val = f64::from_be_bytes( *tmp );
            data[ndarray::Ix2(i as usize, j as usize)] = val;
        }
    }
    data
}

fn main() {
    let start = Utc::now();
    let res = bin_file_to_matrix("./data/example.bin");
    let difference = Utc::now() - start;
    println!("Time:\t {:?}", difference);
}
Is there a way to speed up the code?
1 Answer
Is there a way to speed up the code?
The speed is determined only by the nested for loop. Nothing glaring stands out to me. Are you compiling in debug mode (without most optimizations) or in release mode (with optimizations on)?
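A minimal sketch for checking which build profile a measurement came from, assuming the default Cargo profiles (where `debug_assertions` is on for debug builds and off for `--release` builds). The helpers `build_profile` and `time_it` are hypothetical names introduced here, the summing closure is only a stand-in workload, and `std::time::Instant` is swapped in for the `Utc::now()` timing used in the question because it is a monotonic clock.

```rust
use std::time::{Duration, Instant};

/// Reports which kind of build is running. `debug_assertions` is enabled by
/// default in debug builds and disabled by default in `--release` builds.
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

/// Runs `work` once and returns its result together with the elapsed time.
fn time_it<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = work();
    (result, start.elapsed())
}

fn main() {
    // Stand-in workload (an assumption); substitute the real file parsing here.
    let (sum, elapsed) = time_it(|| (0..1_000_000u64).sum::<u64>());
    println!("profile: {}, sum: {}, time: {:?}", build_profile(), sum, elapsed);
}
```

Running the same binary once with `cargo run` and once with `cargo run --release` should show how much of the measured time is due to missing optimizations.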
I cleaned up your code somewhat, though I don't expect it to have any performance impact, mostly reducing the need for casts and adding a generic wrapper around from_be_bytes which automatically advances the reference to the bytes to be read by the correct amount. This also obviates the need for calculating the index in the loop, which may have a slight performance impact, though similar work is instead done by read_be, so the impact is probably small...
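The advancing-reference idea can be seen in isolation in the following sketch. `read_u32_be` is a hypothetical stand-alone helper for a single type, and the byte values are made up for illustration; it is not the generic wrapper itself.

```rust
use std::convert::TryInto;

/// Hypothetical stand-alone helper: reads a big-endian u32 from the front of
/// `input` and advances `input` past the four bytes it consumed.
/// Panics if fewer than four bytes remain.
fn read_u32_be(input: &mut &[u8]) -> u32 {
    let (bytes, rest) = input.split_at(4);
    *input = rest;
    u32::from_be_bytes(bytes.try_into().unwrap())
}

fn main() {
    // Nine made-up bytes: two u32 values followed by one leftover byte.
    let raw = [0u8, 0, 0, 1, 0, 0, 1, 0, 0xFF];
    let mut cursor = &raw[..];
    let first = read_u32_be(&mut cursor);
    let second = read_u32_be(&mut cursor);
    println!("{} {} remaining={}", first, second, cursor.len());
}
```

Each call shortens the slice by the number of bytes it read, so the caller never computes an offset by hand.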
use chrono::prelude::*;
use ndarray::Array2;
use std::convert::TryInto;

trait EndianRead {
    fn read_be(input: &mut &[u8]) -> Self;
}

macro_rules! impl_EndianRead_for_nums (( $($num:ident),* ) => {
    $(
        impl EndianRead for $num {
            fn read_be(input: &mut &[u8]) -> Self {
                // Split off exactly as many bytes as the type needs and
                // advance the caller's slice past them
                let (bytes, rest) = input.split_at(std::mem::size_of::<Self>());
                *input = rest;
                Self::from_be_bytes(bytes.try_into().unwrap())
            }
        }
    )*
});

impl_EndianRead_for_nums!(u32, f64);

fn bin_file_to_matrix(file_name: &str) -> ndarray::Array2<f64> {
    // Read in file_content
    let file_content = std::fs::read(file_name).expect("Could not read file!");
    // The first 4 bytes are some random information
    let mut byte_content = &file_content[4..];
    // the second 4 bytes are the number of data-points per spectrum
    // We combine the 4 bytes into an unsigned integer
    let nr_dp_per_spectrum = <u32 as EndianRead>::read_be(&mut byte_content) as usize;
    // Calculate how many spectra there are in this file
    let spectrum_size = std::mem::size_of::<f64>() * nr_dp_per_spectrum;
    let how_many_spectra = byte_content.len() / spectrum_size;
    // Create a buffer to write the data to
    let dim = ndarray::Dim([how_many_spectra, nr_dp_per_spectrum]);
    let mut data = Array2::<f64>::zeros(dim);
    for i in 0..how_many_spectra {
        for j in 0..nr_dp_per_spectrum {
            let val = <f64 as EndianRead>::read_be(&mut byte_content);
            // i and j are already usize, so no casts are needed here
            data[ndarray::Ix2(i, j)] = val;
        }
    }
    data
}

fn main() {
    let start = Utc::now();
    let _res = bin_file_to_matrix("./data/example.bin");
    let difference = Utc::now() - start;
    println!("Time:\t {:?}", difference);
}
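A different technique that may be worth measuring is to decode the payload in one pass with `chunks_exact` instead of writing element by element. This is a sketch under the assumption that the data section is a plain run of big-endian f64 values; `decode_f64_be` is a hypothetical helper, and the sample bytes are built by hand so the example needs no input file.

```rust
use std::convert::TryInto;

/// Decodes a byte slice into big-endian f64 values, eight bytes at a time.
/// Trailing bytes that do not fill a whole value are ignored.
fn decode_f64_be(bytes: &[u8]) -> Vec<f64> {
    bytes
        .chunks_exact(std::mem::size_of::<f64>())
        .map(|chunk| f64::from_be_bytes(chunk.try_into().unwrap()))
        .collect()
}

fn main() {
    // Two values encoded by hand (made-up sample data).
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&1.5f64.to_be_bytes());
    bytes.extend_from_slice(&(-2.0f64).to_be_bytes());
    let values = decode_f64_be(&bytes);
    println!("{:?}", values);
}
```

If the layout matches that assumption, the resulting vector, truncated to exactly how_many_spectra * nr_dp_per_spectrum elements, could be passed to ndarray's `Array2::from_shape_vec`. Whether this is actually faster than the nested loop should be checked with a release-mode measurement.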
"I cleaned up your code ..." I suspect there are more observations you could make about the code. Could you explain in the answer what about the code made you clean it up? Commented Dec 26, 2021 at 14:28
@pacmaninbw, thanks for your suggestion. It was mostly my impression of too much casting going on, so I'll add that in my answer. hkBst, commented Dec 26, 2021 at 15:35