Questions tagged [simd]
Single Instruction, Multiple Data describes CPU instructions that process many operands in parallel.
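As a rough illustration of what the tag description means, here is a minimal sketch in C, assuming an x86 or x86-64 target with SSE2. The helper name `add_i32` is hypothetical and is not taken from any question listed here. A single `_mm_add_epi32` adds four 32-bit integers at once, and a scalar loop handles any leftover elements.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86/x86-64 target */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for n 32-bit integers. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);

    size_t i = 0;
    /* Vector part: one _mm_add_epi32 adds four 32-bit lanes at once. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The unaligned load and store intrinsics are used so the sketch makes no assumption about how the caller's arrays are aligned.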
49 questions
4
votes
0
answers
92
views
16x16 integer matrix transpose using SSE2 intrinsics in C
I was inspired by this and this to make a C function that would take an array of 16 __m128i, treat it as a matrix of 16x16 ...
8
votes
1
answer
399
views
SIMD Softmax implementation
I am learning SIMD and looking for feedback. This is column-wise Softmax for matrices stored in row-major format.
Note that matrices come from outside so padding or dimensions being power of 2 can't ...
4
votes
2
answers
165
views
C - SIMD Code to invert a transformation matrix
I am writing a maths library for a raytracer project, and so I'm trying to make my heavy operations (like matrix inverse) more optimised. After doing some research, I discovered this trick to invert a ...
2
votes
3
answers
237
views
Optimizing a for loop for changing pixels values using lookup table
I tried to parallelize the loop, and I got a good result but still not enough. This post is a follow-up to a recent one where I optimized other parts of the code using a lookup table and spatial and ...
5
votes
1
answer
500
views
High Performance Matrix Multiplication is not very high speed, why?
I would appreciate a review of the following Rust implementation of high performance matrix multiplication.
After reviewing available literature, including Anatomy of High Performance Matrix ...
7
votes
1
answer
334
views
AVX2 8x8 Float Matrix Multiply in Rust
I'm interested in a fast 8x8 32-bit float matrix multiply in Rust, assuming availability of AVX2. After learning about the AVX2 intrinsics, here is what I came up with:
...
2
votes
1
answer
96
views
Finding the kth smallest number where all (hexadecimal) digits are different
I'm mostly trying to understand why the simpler char array mask below (to track which digits have been already used) is much ...
1
vote
2
answers
433
views
Count the number of mismatches between two arrays
This function computes the number of unequal elements in two char arrays of length n:
...
5
votes
1
answer
950
views
Speed up strlen using SWAR in x86-64 assembly
The asm function strlen receives a pointer to a string as a char array. To do so, the function may use SWAR on general-purpose registers, but without using ...
2
votes
1
answer
150
views
SIMD Vectorizing C Function Generating Floating-point Range
I have a C function that generates a range from the given start, step_size and end values. I ...
4
votes
2
answers
506
views
Search function using SIMD
I wrote a search function, similar to std::find, that uses SIMD instructions. Since I am new to SIMD, I would appreciate comments on other SIMD instructions I have ...
1
vote
1
answer
356
views
Implementing a 1D Convolution SIMD Friendly in Julia
I want to implement a 1D convolution in Julia using the direct calculation since the conv() function in DSP.jl uses DFT (fft) ...
1
vote
2
answers
538
views
Bilinear interpolation optimized using intrinsics
I have found that a bottleneck of the OpenCV application I use is the bilinear interpolation, so I have tried to optimize it. The bilinear interpolation is in 8D space, so each "color" is an ...
4
votes
1
answer
1k
views
C++ Binary search using SIMD
Recently I found that the binary search (std::ranges::lower_bound and std::ranges::upper_bound) is the main bottleneck in my ...
3
votes
1
answer
1k
views
Sum two vectors in x86 assembly
I recently made a program with C++ and ASM. Can anyone help me make this code more efficient, in the ASM part or both?
I would really appreciate it because I don't know every ASM instruction and ...