Code Review

Questions tagged [sse]

Streaming SIMD Extensions (SSE) is the first generation of Intel's SIMD instruction sets available on modern x86-compatible CPUs. SSE offers single-precision floating-point arithmetic, integer arithmetic (excluding division), and logical operations on packed or single (scalar) operands of sizes from 8 to 64 bits.
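As a rough illustration of the packed versus single-operand distinction in the tag description, here is a minimal sketch in C using the standard `<xmmintrin.h>` intrinsics. The function name `add_floats_sse` and its signature are hypothetical, chosen only for this example. It assumes an x86 target with SSE enabled and makes no alignment assumptions, which is why it uses the unaligned load and store intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, _mm_add_ss, ... */

/* Hypothetical helper (not from any listed question):
 * computes dst[i] = a[i] + b[i] for n single-precision floats. */
void add_floats_sse(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    size_t i = 0;

    /* Packed path: each _mm_add_ps adds four float lanes at once. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned 128-bit load */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }

    /* Scalar path for the remaining 0..3 elements: _mm_add_ss
     * operates on the lowest lane only. */
    for (; i < n; ++i) {
        __m128 va = _mm_load_ss(a + i);
        __m128 vb = _mm_load_ss(b + i);
        _mm_store_ss(dst + i, _mm_add_ss(va, vb));
    }
}
```

Calling `add_floats_sse(out, a, b, 6)` on two six-element arrays would take the packed path for the first four elements and the scalar path for the last two. A plain scalar loop would also work for the tail; the `_ss` intrinsics are used here only to show the single-operand form.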

4 votes
0 answers
77 views

16x16 integer matrix transpose using SSE2 intrinsics in C

I was inspired by this and this to make a C function that would take an array of 16 __m128i, treat it as a matrix of 16x16 ...
7 votes
1 answer
328 views

AVX2 8x8 Float Matrix Multiply in Rust

I'm interested in a fast 8x8 32-bit float matrix multiply in Rust, assuming availability of AVX2. After learning about the AVX2 intrinsics, here is what I came up with: ...
Ana
  • 129
1 vote
2 answers
347 views

Count the number of mismatches between two arrays

This function computes the number of unequal elements between two char arrays of length n: ...
1 vote
1 answer
241 views

Insert an array[4] to an array[8] (C++, SSE)

I have this code to get audio output levels in dB to an array (peak_dB[8]) to be used in real time peakmeter: ...
Juha P
  • 11
5 votes
2 answers
526 views

SSE Assembly vs GCC Compiler - Dot Product

I am currently taking an introductory course in computer architecture. Our goal was to write a dot-product function in x86 Assembly which would use SSE and SIMD (without AVX). I am not to that ...
TVSuchty
  • 605
6 votes
0 answers
119 views

4×4 cofactor in SSE

The cofactor of a 4×4 matrix can be used to convert a "regular geometry" matrix into the matrix that transforms the normals. It's an alternative to the common inverse-transpose pattern. In this post I ...
3 votes
1 answer
107 views

Generic pixel class to seamlessly alpha-blend and convert between different pixel structure layouts

Does what it says in the title. I just finished this and wanted to share with someone. Looking for possible optimizations, bugs (most of it is tested to work) or any constructive criticism. ...
5 votes
1 answer
213 views

Fast Hardy-Weinberg equilibrium simulation

I was very bored over one of my breaks this year, so I built a Hardy-Weinberg equilibrium simulator for two unrelated alleles of the same gene. Hardy-Weinberg equilibrium is when there is no evolution,...
7 votes
1 answer
2k views

SIMD memcpy assembler implementation

I am fairly rusty with assembler, let alone the AT&T syntax. I would appreciate it if someone with more experience could please review the following memcpy implementation. Note that this will only ...
1 vote
1 answer
612 views

Fast affine transformations of many 3D points by one 3×4 matrix

I wrote a function to batch-transform 3D vectors by a single 3x4 matrix using SSE2: ...
S.V.D.
  • 21
7 votes
3 answers
3k views

Converting Array of Floats to UINT8 (`char`) or UINT16 (`unsigned short`) Using SSE4

The problem is: given an image in 32-bit floating-point format (float), how to convert it to UINT8 (char) or UINT16 (...
1 vote
1 answer
924 views

Finding the Minimum and Maximum Value in an Image

Given an image which is padded to support aligning (SSE) I need to find its minimum and maximum value as fast as possible. Mind you the padded values are not defined and can't be assumed to have ...
Royi
  • 582
10 votes
2 answers
17k views

AVX SIMD in matrix multiplication

I have written the following C function for multiplying two NxN matrices, using AVX vectors to speed up the calculation. It works, but the speedup is not what would be expected (some scalar code is ...
10 votes
2 answers
2k views

Vectorized and Multi Threaded Image Convolution

I created code for Image Convolution. The code is in my Image Convolution GitHub Repository. It includes the case for arbitrary Image Convolution and for Separable Kernel Convolution. The code is a ...
11 votes
3 answers
614 views

SSE loop to walk likely primes

This is a continuation of a discussion that was started here. While there are some interesting points there about instruction timing and latency, it is not necessary to read that Question to ...

