I am working with a rather large data set (upwards of 1 million entries across 3 variables), and need to isolate the subset of that data with specific values in one of the 3 data fields. The data are hexadecimal strings. I have working code that performs the necessary steps, but the part that checks whether the required data field has the right value is slow, given that it runs once for each data entry (1 million+ times), so filtering the data set down to the desired subset is slow. This is a problem because I have ~140 candidate values for the data entry in question and I don't know which one corresponds to the subset I am looking for (basically, I am having to find a needle in a haystack). The slow part appears to be the string comparison with the == operator. Is there a faster way to do this many comparisons (or otherwise filter the data into the needed subsets) in MATLAB? Alternatively, how easy would it be to write some Python code to do this task with the original text data?
Below is the portion of code that takes nearly all of the run time:
for n = 17:18
    % Column4 Data Concatenation
    Col4Cat = {};
    tic
    for i = 1:length(Data{n,3} (18:length(Data{n,3}),1))
        % Workaround for not being able to index into a temporary array.
        A = Data{n,3} (18:length(Data{n,3}),1);
        tic
        if string(A(i)) == string(ID)
            Col4Cat = cat(1,Col4Cat,[Data{n,1}(i+17),Data{n,3}(i+17),Data{n,4}(i+17)]);
            % Workaround for not being able to index into a temporary array.
            B = Data{n,1} (18:length(Data{n,1}),1);
            C = Data{n,4} (18:length(Data{n,4}),1);
            Column1 = cat(1,Column1,B(i));
            Column3 = cat(1,Column3,A(i));
            Column4 = cat(1,Column4,C(i));
        end
        toc
    end
    toc
end
1 Answer
Although the code in the question is not complete, there are sufficient bad practices in the snippet that I wanted to post an answer.
Repeated computation in a loop
When you do a computation or an indexing operation inside a loop that does not change with every loop iteration, you should move that outside the loop. This:
for i = 1:length(Data{n,3} (18:length(Data{n,3}),1))
% Workaround for not being able to index into a temporary array.
A = Data{n,3} (18:length(Data{n,3}),1);
should be
A = Data{n,3} (18:length(Data{n,3}),1);
for i = 1:length(A)
Similarly for `B` and `C`.
Also, you repeatedly get the length of the same arrays. You could save that to a variable before the loop starts.
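As a minimal sketch of the hoisting described above: everything that does not depend on `i` is computed once before the loop. The contents of `Data` and `ID` here are made-up stand-ins (the question does not show the real data), and `matches` is a hypothetical counter used only to give the loop a body.

```matlab
% Made-up stand-in for one row of the question's Data cell array (assumption).
Data = cell(1, 4);
Data{1,1} = string(1:30)';
Data{1,3} = repmat(["0A"; "1F"; "2B"], 10, 1);  % 30 hypothetical hex IDs
Data{1,4} = string(101:130)';
ID = "1F";
n = 1;

% Loop-invariant indexing, done once instead of once per iteration.
A = Data{n,3}(18:end, 1);
B = Data{n,1}(18:end, 1);
C = Data{n,4}(18:end, 1);
nA = numel(A);          % element count, also computed once

matches = 0;            % hypothetical counter, only for illustration
for i = 1:nA
    if A(i) == ID
        matches = matches + 1;
    end
end
```

With this toy data, `A` holds rows 18 to 30 of the third column, so the loop runs 13 times without re-slicing `Data` on each pass.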
Using `length`
It is bad practice to use `length`, because it is unspecific. For example, say you have an Nx3 array `c` and you want to know N. `length(c)` will give you N as long as N>=3. One day you'll get a shorter array, say N=2, but `length(c)` will give you 3. [This is because, for a non-empty array, `length(c)` is the same as `max(size(c))`.] This is a hard bug to find!
Instead, use `size` to get the size along a specific dimension: `size(c,1)` will give you N no matter what N is.
You can also use `numel` if you know the array is 1D, or if you simply want to iterate over all elements of the array no matter how many dimensions it has. `numel` returns the number of elements in the array, and works nicely in combination with linear indexing.
Instead of
for i = 1:length(A)
I would do
for i = 1:numel(A)
`length` is safe in this case, but it causes so much grief that I avoid it everywhere, even in places where it cannot go wrong.
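The pitfall with `length` can be reproduced with two small arrays. This is only an illustrative sketch; the variable names are made up for the example.

```matlab
tall  = zeros(5, 3);   % N = 5 rows, 3 columns
short = zeros(2, 3);   % N = 2 rows, 3 columns

lenTall   = length(tall);    % 5: happens to equal the number of rows
lenShort  = length(short);   % 3: the number of columns, not rows

rowsTall  = size(tall, 1);   % 5
rowsShort = size(short, 1);  % 2: always the number of rows

nShort    = numel(short);    % 6: total number of elements
```

Only `size(x,1)` answers "how many rows?" for both arrays; `length` silently switches to the column count once the array has fewer rows than columns.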
Using `length` or `size` or `numel` instead of `end`
Use `end` to indicate the last element of an array. You don't need to explicitly get the array's size to index the last element. Instead of
A = Data{n,3}(18:length(Data{n,3}),1);
do
A = Data{n,3}(18:end,1);
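A small sketch of how `end` behaves, using made-up arrays. Inside an indexing expression, `end` stands for the last index along the dimension in which it appears, and it also works when indexing into the contents of a cell, as in the line above.

```matlab
v = (10:10:50)';               % 5x1 column vector: 10, 20, 30, 40, 50

tailExplicit = v(3:length(v)); % explicit size lookup
tailEnd      = v(3:end);       % same elements, no size lookup

M = reshape(1:12, 3, 4);       % 3x4 matrix, filled column by column
lastRow   = M(end, :);         % end means 3 here (row dimension)
lastCol   = M(:, end);         % end means 4 here (column dimension)

D = {v};                       % cell holding the vector
tailCell = D{1}(3:end);        % end refers to the array inside the cell
```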
Repeatedly extending an array
Yes, you can extend an array in MATLAB. The more efficient way to do so is to index outside the array: instead of `Column1 = cat(1,Column1,B(i))`, do `Column1(end+1) = B(i)`. See this Q&A for why this is better.
But it is much, much better to allocate an array of sufficient size before the loop. In this case you don't know how many elements you'll need, but you know the maximum number you could possibly need. It is more efficient to allocate an array of that size, and cut off the unused portion after the loop.
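A minimal sketch of that preallocate-then-trim pattern, assuming made-up data: `A`, `B`, and `ID` are stand-ins for the question's columns, and `count` is a hypothetical counter tracking how many slots have been filled.

```matlab
% Made-up stand-ins for the question's data (assumption).
A  = repmat(["0A"; "1F"; "2B"], 4, 1);   % 12 hypothetical hex IDs
B  = (1:12)';                            % values paired with A
ID = "1F";

% At most numel(A) elements can match, so allocate that many up front.
Column1 = zeros(numel(A), 1);
count = 0;                               % number of slots filled so far

for i = 1:numel(A)
    if A(i) == ID
        count = count + 1;
        Column1(count) = B(i);           % write into preallocated storage
    end
end

% Cut off the unused portion after the loop.
Column1 = Column1(1:count);
```

The array never grows inside the loop. The only resize is the single trim at the end.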
Not using vectorized operations
Loops are rather fast nowadays in MATLAB, so much so that vectorizing an operation (removing the explicit loop in favour of vectorized operations) is not guaranteed to improve speed. Oftentimes vectorized operations are slower because they use more memory.
But in the cases where vectorization simplifies code, there is no reason not to vectorize things. Simpler code is easier to maintain, and will most often run faster.
Your inner loop is easy to vectorize. The vectorized version will most likely run faster, and it avoids the need to preallocate arrays whose final length you don't know.
`string(A)` will create an array of strings, which can be compared to a single string `ID`, producing a logical array of the same size as `A`. This logical array will be `true` wherever an element of `A` matches `ID`.
The inner loop is, as far as I can tell, equivalent to:
A = Data{n,3}(18:end,1);
index = string(A) == string(ID);
if any(index) % skip this part if no strings match
    B = Data{n,1}(18:end,1);
    C = Data{n,4}(18:end,1);
    Column1 = B(index);
    Column3 = A(index);
    Column4 = C(index);
    Col4Cat = [Column1, Column3, Column4];
end
Note I moved the `Col4Cat` definition to the end to avoid repeating the same indexing operation.
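The question mentions roughly 140 candidate values. One possible extension of the vectorized comparison above, not part of the code shown so far, is to classify every entry against all candidates in a single pass with `ismember`. This is a sketch under assumptions: `A`, `B`, and `candidateIDs` are made-up data, and `subsets` is a hypothetical container for the per-ID results.

```matlab
% Made-up stand-ins for the question's data (assumption).
A = ["0A"; "1F"; "2B"; "1F"; "3C"; "0A"];   % hypothetical hex IDs
B = (1:6)';                                 % values paired with A
candidateIDs = ["1F", "0A", "FF"];          % hypothetical desired values

% found(i) is true if A(i) is one of the candidates;
% loc(i) is its position in candidateIDs, or 0 if it is not a candidate.
[found, loc] = ismember(A, candidateIDs);

% Split B into one subset per candidate ID using logical indexing.
subsets = cell(numel(candidateIDs), 1);
for k = 1:numel(candidateIDs)
    subsets{k} = B(loc == k);
end
```

The loop here runs once per candidate ID rather than once per data entry, and a candidate with no matches simply gets an empty subset.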
(What became of `if string(A(i)) == string(ID) Col4Cat = ...`?) – greybeard, Mar 5, 2023 at 13:01
@greybeard MATLAB can do comparisons on arrays of strings, no need to compare them one by one. – Cris Luengo, Mar 5, 2023 at 14:54
`string`s already or are they `char`s? How have you determined it's the string comparison that is slow, have you profiled it on a small dataset with `profile`? You seem to be building an array then wiping it, is that what you want?