Performance Optimization in Matlab

Question 1

I am working with a rather large data set (upwards of 1 million entries across 3 variables), and need to isolate a subset of that data with specific values in one of the 3 data fields. The data are strings in hexadecimal. I have working code that performs the necessary steps, but the part that checks whether the required data field has the right value isn't very fast, and it runs once for each data entry (1 million+ times), so filtering the data set down to the desired subset is slow. This is a problem because I have ~140 candidate values for the data field in question and I don't know which one identifies the subset I am looking for (basically I am having to find a needle in a haystack). The slow part appears to be the string comparison with the == operator. Is there a faster way to do this many comparisons (or otherwise filter the data into the needed subsets) in MATLAB? Alternatively, how easy would it be to write some python code to do this task w/ the original text data?

Below is the portion of code that takes nearly all of the run time:

for n = 17:18
    % Column4 Data Concatenation
    Col4Cat = {};
    tic
    for i = 1:length(Data{n,3} (18:length(Data{n,3}),1))
        % Workaround for not being able to index into a temporary array.
        A = Data{n,3} (18:length(Data{n,3}),1);
        tic
        if string(A(i)) == string(ID)
            Col4Cat = cat(1,Col4Cat,[Data{n,1}(i+17),Data{n,3}(i+17),Data{n,4}(i+17)]);
            % Workaround for not being able to index into a temporary array.
            B = Data{n,1} (18:length(Data{n,1}),1);
            C = Data{n,4} (18:length(Data{n,4}),1);

            Column1 = cat(1,Column1,B(i));
            Column3 = cat(1,Column3,A(i));
            Column4 = cat(1,Column4,C(i));
        end
        toc
    end
    toc

end

Question 2

Welcome to Code Review! The current question title, which states your concerns about the code, is too general to be useful here. Please edit to the site standard, which is for the title to simply state the task accomplished by the code. Please see How to get the best value out of Code Review: Asking Questions for guidance on writing good question titles.

Question 3

Without knowing the structure of the data it's also impossible to say. Is it a ragged list, is that why it's a cell array? Are the elements of the list strings already or are they chars? How have you determined it's the string comparison that is slow, have you profiled it on a small dataset with profile? You seem to be building an array then wiping it, is that what you want?

Question 4

"Alternatively, how easy would it be to write some python code to do this task w/ the original text data?" it would be easy, I guess, but it wouldn’t be faster. The Python interpreter is significantly slower than MATLAB.

Question 5

Someone has voted to close this question because it lacks clarity or details, and that is definitely true. If you told us the problem you are trying to solve we might be able to help you optimize it some other way.

Question 6

Although the code in the question is not complete, there are sufficient bad practices in the snippet that I wanted to post an answer.

Repeated computation in a loop

When you do a computation or an indexing operation inside a loop that does not change with every loop iteration, you should move that outside the loop. This:

for i = 1:length(Data{n,3} (18:length(Data{n,3}),1))
 % Workaround for not being able to index into a temporary array.
 A = Data{n,3} (18:length(Data{n,3}),1);

should be

A = Data{n,3} (18:length(Data{n,3}),1);
for i = 1:length(A)

Similarly for B and C.

Also, you repeatedly get the length of arrays; you could save that to a variable before the loop starts.

Using `length`

It is bad practice to use length, because it is unspecific. For example, say you have an Nx3 array c and you want to know N. length(c) will give you N as long as N>=3. One day you'll get a shorter array, say N=2, and length(c) will give you 3. [This is because, for a non-empty array, length(c) is the same as max(size(c)).] This is a hard bug to find!

Instead, use size to get the size along a specific dimension: size(c,1) will give you N no matter what N is.

You can also use numel if you know the array is 1D, or if you simply want to iterate over all elements of the array no matter how many dimensions it has. numel returns the number of elements in the array, and is nice in combination with linear indexing.

Instead of

for i = 1:length(A)

I would do

for i = 1:numel(A)

length is safe in this case, but it causes so much grief that I avoid it everywhere, even in places where it cannot go wrong.
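To make the difference concrete, here is a small sketch using a hypothetical 2x3 array c (made up for illustration, not taken from the question's data). It shows what each of the three functions reports for the N=2 case described above:

```matlab
% Hypothetical 2-by-3 array: N = 2 rows, 3 columns.
c = zeros(2, 3);

nLength = length(c);   % largest dimension, here the number of columns
nRows   = size(c, 1);  % number of rows, regardless of the other dimensions
nElems  = numel(c);    % total number of elements

fprintf('length: %d, size(c,1): %d, numel: %d\n', nLength, nRows, nElems);
```

Only size(c,1) reports the row count here; length silently switches to the column count once the array has fewer rows than columns.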

Using `length` or `size` or `numel` instead of `end`

Use end to indicate the last element of an array. You don't need to explicitly get the array's size to index the last element. Instead of

A = Data{n,3}(18:length(Data{n,3}),1);

do

A = Data{n,3}(18:end,1);
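As a quick check that the two forms select the same rows, here is a sketch on a hypothetical 20x1 cell array col standing in for one cell of Data (the variable name and its contents are assumptions for illustration):

```matlab
% Hypothetical stand-in for one cell of Data: a 20-by-1 cell array of
% two-digit hexadecimal char vectors ('01', '02', ..., '14').
col = arrayfun(@(k) sprintf('%02X', k), (1:20)', 'UniformOutput', false);

viaLength = col(18:length(col), 1);  % explicit size lookup
viaEnd    = col(18:end, 1);          % end resolves to the size along dimension 1

sameResult = isequal(viaLength, viaEnd);
```

Inside an indexing expression, end always refers to the dimension it appears in, so it does not share the ambiguity of length.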

Repeatedly extending an array

Yes, you can extend an array in MATLAB. The more efficient way to do so is to index outside the array: instead of Column1 = cat(1,Column1,B(i)) do Column1(end+1) = B(i). See this Q&A for why this is better.

But it is much, much better to allocate an array of sufficient size before the loop. In this case you don't know how many elements you'll need, but you know the maximum number you could possibly need. It is more efficient to allocate an array of that size, and cut off the unused portion after the loop.
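A minimal sketch of that preallocate-then-trim pattern, assuming small made-up stand-ins for A, B, and ID (the sample values and the numeric type of B are assumptions; the question does not show the real data):

```matlab
% Hypothetical sample data: A holds the IDs to test, B the values to keep.
A  = {'1A'; '2B'; '1A'; '3C'; '1A'};
B  = [10; 20; 30; 40; 50];
ID = '1A';

nMax    = numel(A);        % upper bound on the number of matches
Column1 = zeros(nMax, 1);  % preallocate for the worst case
count   = 0;               % number of slots filled so far

for i = 1:nMax
    if strcmp(A{i}, ID)
        count = count + 1;
        Column1(count) = B(i);  % write into the preallocated slot
    end
end

Column1 = Column1(1:count);     % cut off the unused portion
```

The array is allocated once and shrunk once, instead of being reallocated on every match.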

Not using vectorized operations

Loops are rather fast nowadays in MATLAB, so much so that vectorizing an operation (removing the explicit loop in favour of vectorized operations) is not guaranteed to improve speed; oftentimes vectorized operations are slower because they use more memory.

But in the cases where vectorization simplifies code, there is no reason not to vectorize things. Simpler code is easier to maintain, and will most often run faster.

Your inner loop is easy to vectorize, will most likely run faster, and will avoid the need to preallocate arrays whose final length you don't know in advance.

string(A) will create an array of strings, which can be compared to a single string ID, producing a logical array of the same size as A. This logical array will be true wherever an element of A matches ID.
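For illustration, here is that comparison on a hypothetical cell array of hexadecimal char vectors (the sample values are made up, and the real element type of A is not shown in the question):

```matlab
% Hypothetical hexadecimal IDs stored as a cell array of char vectors.
A  = {'1A'; '2B'; '1A'; '3C'};
ID = '1A';

index    = string(A) == string(ID);  % 4-by-1 logical array, one entry per row of A
nMatches = nnz(index);               % number of rows that match ID
```

The conversion to string matters: applied to two char vectors, == compares character by character and errors when their lengths are incompatible, whereas on strings it compares whole values.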

The inner loop is, as far as I can tell, equivalent to:

A = Data{n,3}(18:end,1);
index = string(A) == string(ID);
if any(index) % skip this part if no strings match
    B = Data{n,1}(18:end,1);
    C = Data{n,4}(18:end,1);
    Column1 = B(index);
    Column3 = A(index);
    Column4 = C(index);
    Col4Cat = [Column1, Column3, Column4];
end

Note I moved the Col4Cat definition to the end to avoid repeating the same indexing operation.
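The question mentions about 140 candidate values. As a sketch that goes beyond the code above, and assuming the candidates can be collected into one string array, ismember can test every row against all of them in a single call. The name candidateIDs and all sample data below are hypothetical:

```matlab
% Hypothetical sample data standing in for Data{n,1}, Data{n,3}, Data{n,4}.
B = ["r1"; "r2"; "r3"; "r4"];
A = ["1A"; "2B"; "3C"; "1A"];
C = ["x1"; "x2"; "x3"; "x4"];

% Hypothetical list of candidate IDs (the question has about 140 of them).
candidateIDs = ["1A", "3C"];

index = ismember(A, candidateIDs);  % true where A equals any candidate

% Keep only the matching rows, with the same column layout as Col4Cat above.
Col4Cat = [B(index), A(index), C(index)];
```

Whether this fits depends on whether the subsets are needed together or one ID at a time; for the latter, the single-ID comparison above can be run once per candidate.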

Question 7

(What became of if string(A(i)) == string(ID) Col4Cat = ...?)

Question 8

@greybeard MATLAB can do comparisons on arrays of strings, no need to compare them one by one.

Cris Luengo · Answer 1 · 2023-03-01 20:25:22Z
