Simplified implementations of Array.foreach and Array.map improve performance by ×ばつ #23649

2Pit started this conversation in Feature Requests

Hi!

I've been benchmarking the performance of Array.map and Array.foreach in Scala and found that their implementations can be made significantly faster — map by up to ×ばつ, and foreach by up to ×ばつ — by simplifying the loop structure in cases where the function operates on statically known primitive types.

The current implementation relies on a match over array types, which introduces overhead and makes performance less predictable in cases of chained transformations. In contrast, a plain loop yields consistent scaling and significantly lower latency per element.

Here’s how the current map method can be simplified:

// current version
def oldMap[B](f: A => B)(implicit ct: ClassTag[B]): Array[B] = {
 val len = xs.length
 val ys = new Array[B](len)
 if (len > 0) {
 var i = 0
 (xs: Any @unchecked) match {
 case xs: Array[Int] => while (i < len) { ys(i) = f(xs(i).asInstanceOf[A]); i += 1 }
 case xs: Array[Long] => while (i < len) { ys(i) = f(xs(i).asInstanceOf[A]); i += 1 }
 // other primitive types...
 }
 }
 ys
}
// proposed version
def newMap[B](f: A => B)(implicit ct: ClassTag[B]): Array[B] = {
 val len = xs.length
 val ys = new Array[B](len)
 var i = 0
 while (i < len) {
 ys(i) = f(xs(i))
 i += 1
 }
 ys
}
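The same simplification applies to foreach, which the title also mentions. A minimal sketch of the simplified loop, written here as a hypothetical standalone extension method (newForeach is my name for illustration; the real stdlib method lives in ArrayOps):

```scala
object SimplifiedForeach {
  implicit class ArrayForeachExt[A](val xs: Array[A]) extends AnyVal {
    // plain counted loop, no match on the runtime array type
    def newForeach[U](f: A => U): Unit = {
      val len = xs.length
      var i = 0
      while (i < len) {
        f(xs(i))
        i += 1
      }
    }
  }
}
```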

You can find the full analysis with plots and measurements below.
Would this kind of improvement be something you'd like to see in the standard library? I'd be happy to contribute a PR.

📊 https://github.com/2Pit/scala-benchmarks/blob/main/review/array_map_foreach/README.md

[Figure: map_1 — benchmark plot]

Replies: 5 comments 1 reply


That's certainly interesting. The source code of the standard library (also the one used in Scala 3, for now) is over at https://github.com/scala/scala (Scala 2), there's also some existing infra for writing benchmarks (https://github.com/scala/scala/blob/2.13.x/test/benchmarks/README.md).

It seems to me the old code is trying to prevent primitive boxing / unboxing, I'd be interested to check in detail (in bytecode) if that wasn't actually helpful. It's possible that the situation is different on Scala 2 because Scala 3 doesn't have function specialization (right?).


sjrd Aug 9, 2025
Maintainer

It seems to me the old code is trying to prevent primitive boxing / unboxing, I'd be interested to check in detail (in bytecode) if that wasn't actually helpful. It's possible that the situation is different on Scala 2 because Scala 3 doesn't have function specialization (right?).

More importantly, it avoids performing the dispatch from
https://github.com/scala/scala/blob/177b7453e142d3e0b9e5c4a2dc32736e1272c548/src/library/scala/runtime/ScalaRunTime.scala#L56-L69
in every iteration of the loop.
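For readers following the link, the dispatch in question is essentially a per-element match on the runtime array class. A simplified sketch of its shape (not the exact ScalaRunTime source; several cases elided):

```scala
object ArrayDispatchSketch {
  // Reading a single element through this generic entry point requires
  // matching on the array's runtime class on every call.
  def arrayApply(xs: AnyRef, idx: Int): Any = xs match {
    case x: Array[AnyRef] => x(idx)
    case x: Array[Int]    => x(idx)
    case x: Array[Long]   => x(idx)
    case x: Array[Double] => x(idx)
    // ... remaining primitive array cases elided ...
    case null => throw new NullPointerException
  }
}
```

Doing this once per element is exactly the cost the type switch in the old map hoists out of the loop.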


@lrytz I didn’t notice any significant differences in the bytecode between Scala 2 and Scala 3 for the same code. I think the main difference lies in the implementation of map: the old version produces fairly large bytecode with type-based branching, and according to the JIT logs, the compiler occasionally fails to inline this method (callee is too large / inlining prohibited by policy). This might prevent boxing elimination and other optimizations inside the loop.

Thank you for the benchmark links. It was interesting to go through them.

It seems to me that the current stdlib benchmarks, for example

@Benchmark def mapIntInt(bh: Blackhole): Unit =
  bh.consume(integersA.map(x => 0))

@Benchmark def mapIntString(bh: Blackhole): Unit =
  bh.consume(integersA.map(x => ""))

use rather simple lambdas that the JIT will most likely optimize aggressively. In real-world scenarios, where the functions inside map are more complex, the compiler’s behavior could be different.
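One way to make the benchmark lambdas less trivially optimizable is to give them real per-element work. A sketch in plain Scala rather than the stdlib JMH harness (`heavy` is a hypothetical workload of my own, not from the existing benchmarks):

```scala
object HeavierMapWorkload {
  // Hypothetical non-trivial per-element function: a short chain of
  // dependent operations that the JIT cannot constant-fold away the way
  // it can with `x => 0` or `x => ""`.
  def heavy(x: Int): Int = {
    var acc = x
    var j = 0
    while (j < 8) {
      acc = acc * 31 + j
      j += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val xs = Array.tabulate(1000)(identity)
    // in a JMH benchmark this result would go through Blackhole.consume
    val ys = xs.map(heavy)
    println(ys.length)
  }
}
```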


Here’s a more detailed analysis with proof:

Bytecode in benchmarks.
I analyzed the bytecode for Scala 2 and Scala 3 and found no significant differences at the call site of map.
When passing a method to map, the compiler generates a specialized lambda apply$mcII$sp (int -> int), which avoids boxing at the call site.
When passing a regular lambda, it compiles to Function1<Object, Object>, and boxing occurs during the invocation.

// method -> specialized lambda (identical for oldMap/newMap)
bh.consume(ArrayOpsCompat.ArrayOpsExt$.MODULE$.newMap$extension(
 ArrayOpsCompat$.MODULE$.ArrayOpsExt(this.array()),
 (JFunction1.mcII.sp)(x) -> this.heavy(x),
 scala.reflect.ClassTag$.MODULE$.Int()
));
// regular lambda -> (Object -> Object)
bh.consume(ArrayOpsCompat.ArrayOpsExt$.MODULE$.newMap$extension(
 ArrayOpsCompat$.MODULE$.ArrayOpsExt(this.array()),
 (Object x) -> this.genericHeavy(x),
 scala.reflect.ClassTag$.MODULE$.Object()
));

Bytecode inside method implementations.
The differences appear inside the implementations of oldMap$extension and newMap$extension.
The old version oldMap contains a type switch for all primitive and reference array types.
In the int[] branch, primitives are explicitly boxed via BoxesRunTime.boxToInteger before calling f.apply(...). This large branching structure makes the method big and harder for the JIT to optimize.

// oldMap
if ($this instanceof int[]) {
 for (int[] arr = (int[]) $this; i < len; ++i) {
 ys[i] = f.apply(BoxesRunTime.boxToInteger(arr[i]));
 }
} else if ($this instanceof double[]) {
 // branch for double[]
}
// ... other primitive array branches ...

Here, at runtime, the JVM knows the array's type, but it doesn't know the type of f.

The new version newMap is linear: a single loop without type matching, minimal extra code.

// newMap
for (int i = 0; i < len; i++) {
 ys[i] = f.apply(xs[i]);
}

JIT log summary.
DISCLAIMER: I'm inspecting the log manually, so it is possible that I missed something.

Analysis of the JIT logs shows that specialized lambda calls (apply$mcII$sp) were inlined in all cases.
The key difference is that oldMap sometimes fails to be fully inlined due to JIT policy/size limits (see below). This likely prevents the JIT from performing further optimizations, such as eliminating boxing inside the loop.
In contrast, newMap is smaller, allowing the JIT to inline it completely and apply all possible optimizations.

<!-- context (C2): call to benchmarks.ArrayOpsCompat$ArrayOpsExt$.oldMap$extension -->
<method id='1455' holder='1411' name='oldMap$extension' return='1222' arguments='1222 1453 1454' flags='17' bytes='599' compile_id='781' compiler='c2' level='4' iicount='696'/>
<call method='1455' instr='invokevirtual'/>
<inline_fail reason='inlining prohibited by policy'/>
<!-- context (C1): call to benchmarks.ArrayOpsCompat$ArrayOpsExt$.oldMap$extension -->
<method id='1444' holder='1398' name='oldMap$extension' return='1222' arguments='1222 1442 1443' flags='17' bytes='599' compile_id='767' compiler='c1' level='3' iicount='406'/>
<call method='1444' instr='invokevirtual'/>
<inline_fail reason='callee is too large'/>
<!-- context: call to benchmarks.ArrayOpsCompat$ArrayOpsExt$.newMap$extension -->
<method id='1421' holder='1409' name='newMap$extension' return='1222' arguments='1222 1419 1420' flags='17' bytes='63' compile_id='775' compiler='c2' level='4' iicount='636'/>
<call method='1421' count='113050' prof_factor='1.000000' inline='1'/>
<inline_success reason='inline (hot)'/>
<!-- context: specialized lambda inlining -->
<method id='1507' holder='1424' name='apply$mcII$sp' return='1215' arguments='1215' flags='1' bytes='9' compile_id='753' compiler='c1' level='3' iicount='28090'/>
<call method='1507' count='27579' prof_factor='1.000000' inline='1'/>
<inline_success reason='inline (hot)'/>

@2Pit see here for the history: scala/scala#7241


@lrytz Sorry for the late reply — I got a bit off track, and the benchmark results turned out to be quite surprising.

First of all, I realized that the original method names were confusing. In Szeiger’s PR they were labeled the other way around, so let’s stick to this convention:

  • long map — the version with pattern matching on the array type
  • short map — the version with a simple loop inside


Results from Szeiger (2018)

I focused on this particular benchmark in his PR because it showed the largest difference in performance:

[info] ArrayOpsBenchmark.mapIntIntNew (long) 1000 avgt 405.494 ± 8.233 ns/op
[info] ArrayOpsBenchmark.mapIntIntOld (short) 1000 avgt 1137.030 ± 32.654 ns/op

My results (Scala 2, MacBook Air M3)

When running the benchmarks locally in the Scala 2 project (with the default optimization flags "-opt:l:inline", "-opt-inline-from:scala/**"), I didn’t observe any difference between the two implementations:

[info] ArrayOpsBenchmark.long_mapIntInt 10000 avgt 1241.526 ± 42.451 ns/op
[info] ArrayOpsBenchmark.short_mapIntInt 10000 avgt 1241.768 ± 26.661 ns/op

If I disable the inline options, the difference shows up: the short version gets inlined automatically, the long one does not.

[info] ArrayOpsBenchmark.long_mapIntInt 10000 avgt 17571.405 ± 467.381 ns/op
[info] ArrayOpsBenchmark.short_mapIntInt 10000 avgt 1240.606 ± 26.588 ns/op
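For reference, here is roughly how those optimizer flags would be set in an sbt build (a sketch assuming sbt; the scala/scala build itself wires them up differently):

```scala
// build.sbt (sketch): enable the Scala 2 optimizer's inliner and allow it
// to inline from the standard library, matching the flags quoted above
scalacOptions ++= Seq(
  "-opt:l:inline",
  "-opt-inline-from:scala/**"
)
```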

Results in my separate project (Scala 3)

I observed similar results in my own "speedup-array" project on Scala 3:

[info] RepoBenchmark.long_mapIntInt 10000 avgt 17112.728 ± 18.313 ns/op
[info] RepoBenchmark.short_mapIntInt 10000 avgt 1302.740 ± 26.203 ns/op

Conclusions

  • In Scala 2 with static inlining enabled, there is no measurable difference: both implementations collapse into the same tight loop.
  • Without inlining, the short version is clearly faster because it is easier for the JIT to inline automatically.
  • In Scala 3 the performance depends on whether the JIT succeeds in inlining the lambda, but here again the short version performs better.

So I would suggest sticking with the short version of map — it’s more predictable and performant regardless of compiler options. And besides, as far as I know, Scala 3 does not have special compiler options for inlining.

If anyone has ideas why Szeiger’s 2018 results differ so much from what we see today (changes in compiler/JIT, methodology, or hidden parameters I missed), I’d love to hear your thoughts.
