Questions tagged [sse]

Streaming SIMD Extensions (SSE) is the first generation of Intel's SIMD instruction sets available on modern x86-compatible CPUs. SSE offers single-precision floating-point arithmetic, integer arithmetic (excluding division), and logical operations on packed or scalar operands of sizes from 8 to 64 bits.

29 questions

4 votes

0 answers

77 views

16x16 integer matrix transpose using SSE2 intrinsics in C

I was inspired by this and this to make a C function that would take an array of 16 __m128i, treat it as a matrix of 16x16 ...

Steve Ward

asked Apr 21 at 21:13

7 votes

1 answer

328 views

AVX2 8x8 Float Matrix Multiply in Rust

I'm interested in a fast 8x8 32-bit float matrix multiply in Rust, assuming availability of AVX2. After learning about the AVX2 intrinsics, here is what I came up with: ...

Ana

asked Jan 18, 2024 at 19:44

1 vote

2 answers

347 views

Count the number of mismatches between two arrays

This function computes the number of unequal elements in two char arrays of length n: ...

HeapUnderStop

asked Jun 10, 2023 at 12:45

1 vote

1 answer

241 views

Insert an array[4] to an array[8] (C++, SSE)

I have this code to get audio output levels in dB into an array (peak_dB[8]) to be used in a real-time peak meter: ...

Juha P

asked Mar 17, 2021 at 12:25

5 votes

2 answers

526 views

SSE Assembly vs GCC Compiler - Dot Product

I am currently taking an introductory course in computer architecture. Our goal was to write a dot-product function in x86 Assembly which would use SSE and SIMD (without AVX). I am not to that ...

TVSuchty

asked Jun 3, 2020 at 19:16

6 votes

0 answers

119 views

4×4 cofactor in SSE

The cofactor of a 4×4 matrix can be used to convert a "regular geometry" matrix into the matrix that transforms the normals. It's an alternative to the common inverse-transpose pattern. In this post I ...

user555045

asked Aug 6, 2019 at 14:00

3 votes

1 answer

107 views

Generic pixel class to seamlessly alpha-blend and convert between different pixel structure layouts

Does what it says in the title. I just finished this and wanted to share with someone. Looking for possible optimizations, bugs (most of it is tested to work) or any constructive criticism. ...

user5434231

asked Sep 5, 2018 at 1:19

5 votes

1 answer

213 views

Fast Hardy-Weinberg equilibrium simulation

I was very bored over one of my breaks this year, so I built a Hardy-Weinberg equilibrium simulator for two unrelated alleles of the same gene. Hardy-Weinberg equilibrium is when there is no evolution,...

computergorl

asked Aug 24, 2018 at 4:37

7 votes

1 answer

2k views

SIMD memcpy assembler implementation

I am fairly rusty with assembler, let alone the AT&T syntax. I would appreciate it if someone with more experience could please review the following memcpy implementation. Note that this will only ...

Geoffrey

asked May 17, 2018 at 9:39

1 vote

1 answer

612 views

Fast affine transformations of many 3D points by one 3×4 matrix

I wrote a function to batch-transform 3D vectors by a single 3x4 matrix using SSE2: ...

S.V.D.

asked Nov 26, 2017 at 13:03

7 votes

3 answers

3k views

Converting Array of Floats to UINT8 (`char`) or UINT16 (`unsigned short`) Using SSE4

The problem is: given an image in 32-bit floating-point format (float), how to convert it to UINT8 (char) or UINT16 (...

Royi

asked Oct 21, 2017 at 19:36

1 vote

1 answer

924 views

Finding the Minimum and Maximum Value in an Image

Given an image which is padded to support alignment (SSE), I need to find its minimum and maximum values as fast as possible. Mind you, the padded values are not defined and can't be assumed to have ...

Royi

asked Oct 11, 2017 at 0:48

10 votes

2 answers

17k views

AVX SIMD in matrix multiplication

I have coded the following C function for multiplying two NxN matrices, using AVX vectors to speed up the calculation. It works, but the speedup is not what is to be expected (some scalar code is ...

Henrik Ståhlberg

asked Oct 10, 2017 at 12:24

10 votes

2 answers

2k views

Vectorized and Multi Threaded Image Convolution

I created code for Image Convolution. The code is in my Image Convolution GitHub Repository. It includes the case for arbitrary Image Convolution and for Separable Kernel Convolution. The code is a ...

Royi

asked Aug 5, 2017 at 10:31

11 votes

3 answers

614 views

SSE loop to walk likely primes

This is a continuation of a discussion that was started here. While there are some interesting points there about instruction timing and latency, it is not necessary to read that Question to ...

David Wohlferd

asked Jul 21, 2017 at 6:46
