
Improving swizzles #1086

Open
@Andersama

Description

Here's an example table-lookup program, with assembly from godbolt:

https://godbolt.org/z/9WP19sfq8

Compare the assembly between the "simdless" versions and the result from the template:

template<typename T>
T categorize(T& out, T& lo, T& hi, const T& in) {
    T l = in & 0xf;
    //l = xsimd::swizzle(lo, l);
    T h = in >> 4;
    //h = xsimd::swizzle(hi, h);
    return out = lookup(lo, l) & lookup(hi, h);
}

The "simdless" version not only produces less instructions to process the same amount of data, it really is just faster! Compiling with MSVC makes the difference worse.

However, modeling the roughly equivalent SIMD swizzle in scalar code and using it as a callback:

std::array<uint8_t, 16> swizzle_like(const std::array<uint8_t, 16>& table,
                                     const std::array<uint8_t, 16>& idxs) {
    std::array<uint8_t, 16> out;
    for (size_t i = 0; i < idxs.size(); i++) {
        out[i] = table[idxs[i] & 0xf];
    }
    return out;
}

produces a similar effect.
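To make the comparison concrete, here is a hedged sketch of a fully scalar analogue of the templated `categorize()` above, built on `swizzle_like`. The function name `categorize_scalar` is mine, not from the benchmark code; it just spells out the low-nibble/high-nibble lookup-and-AND pattern that the SIMD template performs.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar stand-in for the SIMD swizzle: table lookup on the low nibble of each index.
std::array<uint8_t, 16> swizzle_like(const std::array<uint8_t, 16>& table,
                                     const std::array<uint8_t, 16>& idxs) {
    std::array<uint8_t, 16> out{};
    for (size_t i = 0; i < idxs.size(); i++)
        out[i] = table[idxs[i] & 0xf];
    return out;
}

// Hypothetical scalar analogue of the templated categorize():
// classify each byte by AND-ing lookups on its low and high nibbles.
std::array<uint8_t, 16> categorize_scalar(const std::array<uint8_t, 16>& lo,
                                          const std::array<uint8_t, 16>& hi,
                                          const std::array<uint8_t, 16>& in) {
    std::array<uint8_t, 16> l, h;
    for (size_t i = 0; i < in.size(); i++) {
        l[i] = in[i] & 0xf;  // low nibble
        h[i] = in[i] >> 4;   // high nibble
    }
    std::array<uint8_t, 16> lo_hit = swizzle_like(lo, l);
    std::array<uint8_t, 16> hi_hit = swizzle_like(hi, h);
    std::array<uint8_t, 16> out{};
    for (size_t i = 0; i < in.size(); i++)
        out[i] = lo_hit[i] & hi_hit[i];
    return out;
}
```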

There's likely an optimization in not using SIMD except where instructions like pshufb are available.
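As a rough sketch of that idea (my own illustration, not xsimd's dispatch mechanism): select the pshufb path only when the target ISA actually has it, and otherwise keep the scalar loop, which the benchmarks below suggest is the faster choice on SSE2-only targets. This assumes indices already have their high bit clear, as `categorize()` guarantees by masking with `0xf`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#if defined(__SSSE3__)
#include <tmmintrin.h>  // _mm_shuffle_epi8 (pshufb)
#endif

// 16-byte table lookup: pshufb when available, scalar fallback otherwise.
std::array<uint8_t, 16> lookup16(const std::array<uint8_t, 16>& table,
                                 const std::array<uint8_t, 16>& idxs) {
    std::array<uint8_t, 16> out{};
#if defined(__SSSE3__)
    __m128i t  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(table.data()));
    __m128i ix = _mm_loadu_si128(reinterpret_cast<const __m128i*>(idxs.data()));
    // pshufb itself selects on the low 4 bits (and zeroes lanes whose sign bit is set).
    __m128i r = _mm_shuffle_epi8(t, ix);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), r);
#else
    // Scalar path: no byte-shuffle instruction, so a plain loop wins.
    for (size_t i = 0; i < idxs.size(); i++)
        out[i] = table[idxs[i] & 0xf];
#endif
    return out;
}
```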

For context, here are some benchmarks using MSVC:

| ns/op | op/s | err% | total | benchmark |
|--------------------:|--------------------:|--------:|----------:|:----------|
| 325.10 | 3,075,962.91 | 3.0% | 0.01 | `byte lookup (sizzzle_like_but_256_wide)` |
| 2,456.13 | 407,144.73 | 13.7% | 0.01 | :wavy_dash: `simd lookup (categorize)` (Unstable with ~288.1 iters. Increase `minEpochIterations` to e.g. b41) |

NOTE: The results I'm seeing from godbolt appear to indicate that this is the case when using sse2; the generated code falls back to a call:

call xsimd::batch<unsigned char, xsimd::sse2> categorize<xsimd::batch<unsigned char, xsimd::sse2>>(xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2>&, xsimd::batch<unsigned char, xsimd::sse2> const&)

When sse4.1 is enabled the performance is better. (I could not find a way to enable ssse3 with MSVC.)

| ns/op | op/s | err% | total | benchmark |
|--------------------:|--------------------:|--------:|----------:|:----------|
| 1,242.30 | 804,958.97 | 0.6% | 0.01 | `simd lookup (categorize)` |

I also wanted to benchmark against an emulated or generic batch for comparison but I'm not sure how to express that type.

Edit: using clang-cl I managed to enable what I think is ssse3; for reference, clang's performance is:

| ns/op | op/s | err% | total | benchmark |
|--------------------:|--------------------:|--------:|----------:|:----------|
| 38.09 | 26,255,319.15 | 0.3% | 0.01 | `byte lookup (ssse3)` |
| 40.52 | 24,681,447.96 | 1.5% | 0.01 | `byte lookup (sse2)` |
| 13.57 | 73,690,033.38 | 0.0% | 0.01 | `simd lookup (ssse3)` |
| 204.76 | 4,883,829.00 | 7.0% | 0.01 | :wavy_dash: `simd lookup (sse2)` (Unstable with ~5,035.6 iters. Increase `minEpochIterations` to e.g. c4b4) |

Take note that there's a massive performance difference between sse2 and ssse3 when using clang, such that the SIMD version actually starts outperforming the byte lookup! (What we'd hope.)
