I implemented the following function using the matrix object from the Breeze library. Basically it is just a glorified while loop.
The convolution in Breeze is extremely slow, so I rolled my own, optimized for a kernel that is just a vector. This is already much faster, but I figured there must be something else I can do to make this go faster.
The most expensive operation, according to the profiler, is the instantiation of the `ArrayDeque` (pointed to by the arrow), which I don't really need: all I wanted was a circular buffer, but I could not find one in the library.
The second thing is calling `until` on the `Int`; I don't think that can be avoided.
Boxing the doubles is also taking quite some time; maybe there is a way to specialize?
Finally, the majority of the time is taken by the function call itself (circled in red), and I don't know what that means. Any insight?
import scala.collection.mutable
import breeze.linalg.{DenseMatrix, DenseVector}

def conv(m: DenseMatrix[Int], k: DenseVector[Double]): DenseMatrix[Double] = {
  val kData = k.data
  val dataIter = m.data.iterator // column-major traversal of the matrix
  val height = m.rows
  val convoluted = Array.newBuilder[Double]
  // Sliding window holding the previous k.length - 1 values.
  val prev = mutable.ArrayDeque.empty[Double]
  for (_ <- 1 until k.length) {
    prev.addOne(dataIter.next())
  }
  var count = k.length - 1
  while (dataIter.hasNext) {
    val cur = dataIter.next()
    val slice = prev.append(cur)
    // Only emit a result when the window does not straddle two columns.
    if (count % height >= k.length - 1) {
      var r = 0D
      for (i <- 0 until k.length) {
        r += kData(i) * slice(i)
      }
      convoluted.addOne(r)
    }
    prev.removeHead()
    count += 1
  }
  DenseMatrix.create(m.rows - (k.length - 1), m.cols, convoluted.result())
}
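On the boxing point: `Iterator[Int]` and `mutable.ArrayDeque[Double]` are not specialized for primitives, so every element passes through a box. A minimal Breeze-free sketch of the same per-column sliding window (plain arrays standing in for `m.data` and `k.data`, hypothetical name `convNoBox`) that indexes the column-major array directly and stays entirely in primitives:

```scala
// Sketch: mData is the column-major matrix data, kData the kernel taps.
// Direct while-loop indexing keeps everything in primitive Int/Double.
def convNoBox(mData: Array[Int], rows: Int, cols: Int,
              kData: Array[Double]): Array[Double] = {
  val kLen = kData.length
  val outRows = rows - (kLen - 1)          // valid windows per column
  val out = new Array[Double](outRows * cols)
  var c = 0
  while (c < cols) {
    var r = 0
    while (r < outRows) {
      var acc = 0.0
      var i = 0
      while (i < kLen) {
        acc += kData(i) * mData(c * rows + r + i)
        i += 1
      }
      out(c * outRows + r) = acc
      r += 1
    }
    c += 1
  }
  out
}
```

For a 3×2 matrix stored column-major this produces the same values, in the same order, as the builder in `conv` above.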
Following is the annotated flame chart; please note that `fut` is the function `conv` above. Everything else is unchanged.
1 Answer
all I wanted was a circular buffer
Consider using an array accessed with mod. That is, set `s = k.length - 1`, allocate a "previous" array `p`, and then access `p(i % s)`.
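A minimal sketch of that idea (1-D for brevity, hypothetical name `convRing`; the per-column bookkeeping from the question is omitted):

```scala
// Ring-buffer sketch: p holds the previous s = k.length - 1 samples and is
// overwritten in place via `index % s`, so nothing is allocated per step.
def convRing(data: Array[Double], kData: Array[Double]): Array[Double] = {
  val s = kData.length - 1
  val p = new Array[Double](s)               // the "previous" ring buffer
  var i = 0
  while (i < s) { p(i) = data(i); i += 1 }   // preload the first s samples
  val out = new Array[Double](data.length - s)
  var n = s                                  // index of the current sample
  while (n < data.length) {
    var r = kData(s) * data(n)               // newest sample, last kernel tap
    var j = 0
    while (j < s) {
      // data(n - s + j) was stored at (n - s + j) % s, equal to (n + j) % s
      r += kData(j) * p((n + j) % s)
      j += 1
    }
    out(n - s) = r
    if (s > 0) p(n % s) = data(n)            // current sample becomes "previous"
    n += 1
  }
  out
}
```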
If you're lucky, the compiler may notice that a portion of the code is just copying, and it will issue memcpy-style instructions.
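To make that copying explicit rather than hoping the compiler spots it, a variant keeps the window contiguous and shifts it with `System.arraycopy`, which HotSpot intrinsifies into a bulk copy (sketch, hypothetical name `dotShift`):

```scala
// Alternative sketch: shift a contiguous window by one each step with
// System.arraycopy instead of using a ring buffer with mod.
def dotShift(data: Array[Double], kData: Array[Double]): Array[Double] = {
  val kLen = kData.length
  val w = new Array[Double](kLen)
  System.arraycopy(data, 0, w, 0, kLen)      // initial window
  val out = new Array[Double](data.length - kLen + 1)
  var n = 0
  while (n < out.length) {
    var r = 0.0
    var j = 0
    while (j < kLen) { r += kData(j) * w(j); j += 1 }
    out(n) = r
    if (n + kLen < data.length) {
      System.arraycopy(w, 1, w, 0, kLen - 1) // drop the oldest sample
      w(kLen - 1) = data(n + kLen)           // append the next one
    }
    n += 1
  }
  out
}
```

Whether this beats the mod version depends on the kernel length: the copy is O(k) per step, but it avoids the integer division hidden in `%` on every access.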