A project I'm working on requires something akin to run length encoding on a vector in matlab. The data is in the form of a numeric vector, and the output is in the form of two vectors, elems
and lens
that hold the elements and their respective run lengths. For example data = [20 13 12 12 12 18 17 16 16 11]
would give elems = [20 13 12 18 17 16 11]
and lens = [1 1 3 1 1 2 1]
. There is no minimum run length.
My first attempt uses loops:
clear all
data = [2 1 2 3 3 1 2 2 3 3 1 1 1 1 1 2 2 1 1 3 3 3 3 2 1 3 3 2 2 1 1 1 2 2 2 2 3 1 1 1];
n = length(data);
elems = zeros(n, 1);
lens = zeros(n, 1);
len = 1;
elem = data(1);
c = 1;
for k = 2:n
if data(k) == elem
len = len + 1;
else
elems(c) = elem;
lens(c) = len;
c = c + 1;
elem = data(k);
len = 1;
end
end
elems(c) = elem;
lens(c) = len;
lens = lens(lens ~= 0);
elems = elems(lens ~= 0);
It's pretty basic although I couldn't think of a way to efficiently preallocate the elems
and lens
vectors to something smaller than n
. I figured I could always start them small, e.g. n/2
or something, keep track of their size, then reallocate them to something larger, maybe 1.5 times the previous size, if they were running out of room, or something like that.
My second attempt is completely vectorized:
clear all
data = [2 1 2 3 3 1 2 2 3 3 1 1 1 1 1 2 2 1 1 3 3 3 3 2 1 3 3 2 2 1 1 1 2 2 2 2 3 1 1 1];
breaks = [true data(2:end) ~= data(1:end-1)];
run_starts = find([breaks true]);
elems_vec = data(breaks);
lens_vec = diff(run_starts);
Can either of these methods be improved? The algorithms are pretty simple, and Im really just looking for general feedback that can help me learn matlab a bit more.
1 Answer 1
I'm not sure about the performance of my following code however I'm using tabulate()
to do the job. As you can see, the loops for filtering elements not to compute the frequency of elements. In other words, you might want to look at tabulate()
or hist()
to compute the frequent occurrence of elements.
x = [20 13 12 12 12 18 17 16 16 11];
j = 1;
elems = []; % store repeated elements
for i = 1:length(x)
if isempty(find(elems == x(i)))
elems(j) = x(i);
j = j + 1;
end
end
elems
t = tabulate(x); % get frequency table
for i = 1:length(elems)
lens(i) = t(elems(i),2); % get how many each element in elems vector occurs
end
lens
The result is
elems =
20 13 12 18 17 16 11
lens =
1 1 3 1 1 2 1
-
\$\begingroup\$ I'll look into using
hist()
. I don't want to fork over more money to mathworks just for the statistics toolbox (fortabulate()
). \$\endgroup\$Michael A– Michael A2014年10月08日 14:14:05 +00:00Commented Oct 8, 2014 at 14:14
data(2:end) ~= data(1:end-1)
withdiff(data)
. Otherwise looks perfect to me, good job! +1 \$\endgroup\$diff(data)
is definitely not the same asdata(2:end) ~= data(1:end-1)
when using my example data (just run both in matlab to see that). \$\endgroup\$diff
is basically difference between consecutive elements and it would work for your problem too -breaks = [true ; diff(data)];
\$\endgroup\$diff
does, but rundiff(data)
anddata(2:end) ~= data(1:end-1)
on my example data and you can see clearly that you get different results (obviously, because for example, if you have [2 4],diff
will return 2, but mine will return 1). Can you add an example of what you're talking about? If I make your replacement withdiff
in my second code sample, the code will fail because theelems_vec = data(breaks);
will try to indexdata
with a non-logical vector (sincediff
doesn't return 1's and 0's like my code). \$\endgroup\$breaks = [true diff(data)~=0]
and in effect is same asdata(2:end) ~= data(1:end-1)
and thus would result in identical to your results. I am not sure if this is better with performance, which you can find out though. Rest of the code looked perfect to me. \$\endgroup\$