Simplified implementations of Array.foreach and Array.map improve performance by ×ばつ #23649

2Pit started this conversation in Feature Requests

Hi!

I've been benchmarking the performance of Array.map and Array.foreach in Scala and found that their implementations can be made significantly faster — map by up to ×ばつ, and foreach by up to ×ばつ — by simplifying the loop structure in cases where the function operates on statically known primitive types.

The current implementation relies on a match over array types, which introduces overhead and makes performance less predictable in cases of chained transformations. In contrast, a plain loop yields consistent scaling and significantly lower latency per element.

Here’s how the current map method can be simplified:

// current version
def oldMap[B](f: A => B)(implicit ct: ClassTag[B]): Array[B] = {
 val len = xs.length
 val ys = new Array[B](len)
 if (len > 0) {
 var i = 0
 (xs: Any @unchecked) match {
 case xs: Array[Int] => while (i < len) { ys(i) = f(xs(i).asInstanceOf[A]); i += 1 }
 case xs: Array[Long] => while (i < len) { ys(i) = f(xs(i).asInstanceOf[A]); i += 1 }
 // other primitive types...
 }
 }
 ys
}
// proposed version
def newMap[B](f: A => B)(implicit ct: ClassTag[B]): Array[B] = {
 val len = xs.length
 val ys = new Array[B](len)
 var i = 0
 while (i < len) {
 ys(i) = f(xs(i))
 i += 1
 }
 ys
}
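The same simplification applies to foreach, which the title also mentions. A minimal sketch of the simplified loop, written here as a hypothetical standalone extension method (newForeach is my name for illustration; the real stdlib method lives in ArrayOps):

```scala
object SimplifiedForeach {
  implicit class ArrayForeachExt[A](val xs: Array[A]) extends AnyVal {
    // plain counted loop, no match on the runtime array type
    def newForeach[U](f: A => U): Unit = {
      val len = xs.length
      var i = 0
      while (i < len) {
        f(xs(i))
        i += 1
      }
    }
  }
}
```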

You can find the full analysis with plots and measurements below.
Would this kind of improvement be something you'd like to see in the standard library? I'd be happy to contribute a PR.

📊 https://github.com/2Pit/scala-benchmarks/blob/main/review/array_map_foreach/README.md

[Figure: map_1 — benchmark plot]

Replies: 5 comments 1 reply


That's certainly interesting. The source code of the standard library (also the one used in Scala 3, for now) is over at https://github.com/scala/scala (Scala 2), there's also some existing infra for writing benchmarks (https://github.com/scala/scala/blob/2.13.x/test/benchmarks/README.md).

It seems to me the old code is trying to prevent primitive boxing / unboxing, I'd be interested to check in detail (in bytecode) if that wasn't actually helpful. It's possible that the situation is different on Scala 2 because Scala 3 doesn't have function specialization (right?).


sjrd Aug 9, 2025
Maintainer

It seems to me the old code is trying to prevent primitive boxing / unboxing, I'd be interested to check in detail (in bytecode) if that wasn't actually helpful. It's possible that the situation is different on Scala 2 because Scala 3 doesn't have function specialization (right?).

More importantly, it avoids performing the dispatch from
https://github.com/scala/scala/blob/177b7453e142d3e0b9e5c4a2dc32736e1272c548/src/library/scala/runtime/ScalaRunTime.scala#L56-L69
in every iteration of the loop.
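For readers following the link, the dispatch in question is essentially a per-element match on the runtime array class. A simplified sketch of its shape (not the exact ScalaRunTime source; several cases elided):

```scala
object ArrayDispatchSketch {
  // Reading a single element through this generic entry point requires
  // matching on the array's runtime class on every call.
  def arrayApply(xs: AnyRef, idx: Int): Any = xs match {
    case x: Array[AnyRef] => x(idx)
    case x: Array[Int]    => x(idx)
    case x: Array[Long]   => x(idx)
    case x: Array[Double] => x(idx)
    // ... remaining primitive array cases elided ...
    case null => throw new NullPointerException
  }
}
```

Doing this once per element is exactly the cost the type switch in the old map hoists out of the loop.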


@lrytz I didn’t notice any significant differences in the bytecode between Scala 2 and Scala 3 for the same code. I think the main difference lies in the implementation of map: the old version produces fairly large bytecode with type-based branching, and according to the JIT logs, the compiler occasionally fails to inline this method (callee is too large / inlining prohibited by policy). This might prevent boxing elimination and other optimizations inside the loop.

Thank you for the benchmark links. It was interesting to go through them.

It seems to me that the current stdlib benchmarks, for example

@Benchmark def mapIntInt(bh: Blackhole): Unit =
  bh.consume(integersA.map(x => 0))

@Benchmark def mapIntString(bh: Blackhole): Unit =
  bh.consume(integersA.map(x => ""))

use rather simple lambdas that the JIT will most likely optimize aggressively. In real-world scenarios, where the functions inside map are more complex, the compiler’s behavior could be different.
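One way to make the benchmark lambdas less trivially optimizable is to give them real per-element work. A sketch in plain Scala rather than the stdlib JMH harness (`heavy` is a hypothetical workload of my own, not from the existing benchmarks):

```scala
object HeavierMapWorkload {
  // Hypothetical non-trivial per-element function: a short chain of
  // dependent operations that the JIT cannot constant-fold away the way
  // it can with `x => 0` or `x => ""`.
  def heavy(x: Int): Int = {
    var acc = x
    var j = 0
    while (j < 8) {
      acc = acc * 31 + j
      j += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val xs = Array.tabulate(1000)(identity)
    // in a JMH benchmark this result would go through Blackhole.consume
    val ys = xs.map(heavy)
    println(ys.length)
  }
}
```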


Here’s a more detailed analysis with proof:

Bytecode in benchmarks.
I analyzed the bytecode for Scala 2 and Scala 3 and found no significant differences at the call site of map.
When passing a method to map, the compiler generates a specialized lambda apply$mcII$sp (int -> int), which avoids boxing at the call site.
When passing a regular lambda, it compiles to Function1<Object, Object>, and boxing occurs during the invocation.

// method -> specialized lambda (identical for oldMap/newMap)
bh.consume(ArrayOpsCompat.ArrayOpsExt$.MODULE$.newMap$extension(
 ArrayOpsCompat$.MODULE$.ArrayOpsExt(this.array()),
 (JFunction1.mcII.sp)(x) -> this.heavy(x),
 scala.reflect.ClassTag$.MODULE$.Int()
));
// regular lambda -> (Object -> Object)
bh.consume(ArrayOpsCompat.ArrayOpsExt$.MODULE$.newMap$extension(
 ArrayOpsCompat$.MODULE$.ArrayOpsExt(this.array()),
 (Object x) -> this.genericHeavy(x),
 scala.reflect.ClassTag$.MODULE$.Object()
));

Bytecode inside method implementations.
The differences appear inside the implementations of oldMap$extension and newMap$extension.
The old version oldMap contains a type switch for all primitive and reference array types.
In the int[] branch, primitives are explicitly boxed via BoxesRunTime.boxToInteger before calling f.apply(...). This large branching structure makes the method big and harder for the JIT to optimize.

// oldMap
if ($this instanceof int[]) {
 for (int[] arr = (int[]) $this; i < len; ++i) {
 ys[i] = f.apply(BoxesRunTime.boxToInteger(arr[i]));
 }
} else if ($this instanceof double[]) {
 // branch for double[]
}
// ... other primitive array branches ...

Here, at runtime, the JVM knows the array's type, but it doesn't know the type of f.

The new version newMap is linear: a single loop without type matching, minimal extra code.

// newMap
for (int i = 0; i < len; i++) {
 ys[i] = f.apply(xs[i]);
}

JIT log summary.
DISCLAIMER: I'm inspecting the log manually, so it is possible that I missed something.

Analysis of the JIT logs shows that specialized lambda calls (apply$mcII$sp) were inlined in all cases.
The key difference is that oldMap sometimes fails to be fully inlined due to JIT policy/size limits (see below). This likely prevents the JIT from performing further optimizations, such as eliminating boxing inside the loop.
In contrast, newMap is smaller, allowing the JIT to inline it completely and apply all possible optimizations.

<!-- context (C2): call to benchmarks.ArrayOpsCompat$ArrayOpsExt$.oldMap$extension -->
<method id='1455' holder='1411' name='oldMap$extension' return='1222' arguments='1222 1453 1454' flags='17' bytes='599' compile_id='781' compiler='c2' level='4' iicount='696'/>
<call method='1455' instr='invokevirtual'/>
<inline_fail reason='inlining prohibited by policy'/>
<!-- context (C1): call to benchmarks.ArrayOpsCompat$ArrayOpsExt$.oldMap$extension -->
<method id='1444' holder='1398' name='oldMap$extension' return='1222' arguments='1222 1442 1443' flags='17' bytes='599' compile_id='767' compiler='c1' level='3' iicount='406'/>
<call method='1444' instr='invokevirtual'/>
<inline_fail reason='callee is too large'/>
<!-- context: call to benchmarks.ArrayOpsCompat$ArrayOpsExt$.newMap$extension -->
<method id='1421' holder='1409' name='newMap$extension' return='1222' arguments='1222 1419 1420' flags='17' bytes='63' compile_id='775' compiler='c2' level='4' iicount='636'/>
<call method='1421' count='113050' prof_factor='1.000000' inline='1'/>
<inline_success reason='inline (hot)'/>
<!-- context: specialized lambda inlining -->
<method id='1507' holder='1424' name='apply$mcII$sp' return='1215' arguments='1215' flags='1' bytes='9' compile_id='753' compiler='c1' level='3' iicount='28090'/>
<call method='1507' count='27579' prof_factor='1.000000' inline='1'/>
<inline_success reason='inline (hot)'/>

@2Pit see here for the history: scala/scala#7241


@lrytz Sorry for the late reply — I got a bit off track, and the benchmark results turned out to be quite surprising.

First of all, I realized that the original method names were confusing. In Szeiger’s PR they were labeled the other way around, so let’s stick to this convention:

  • long map — the version with pattern matching on the array type
  • short map — the version with a simple loop inside


Results from Szeiger (2018)

I focused on this particular benchmark in his PR because it showed the largest difference in performance:

[info] ArrayOpsBenchmark.mapIntIntNew (long) 1000 avgt 405.494 ± 8.233 ns/op
[info] ArrayOpsBenchmark.mapIntIntOld (short) 1000 avgt 1137.030 ± 32.654 ns/op

My results (Scala 2, MacBook Air M3)

When running the benchmarks locally in the Scala 2 project (with the default optimization flags "-opt:l:inline", "-opt-inline-from:scala/**"), I didn’t observe any difference between the two implementations:

[info] ArrayOpsBenchmark.long_mapIntInt 10000 avgt 1241.526 ± 42.451 ns/op
[info] ArrayOpsBenchmark.short_mapIntInt 10000 avgt 1241.768 ± 26.661 ns/op

If I disable the inline options, the difference shows up: the short version gets inlined automatically, the long one does not.

[info] ArrayOpsBenchmark.long_mapIntInt 10000 avgt 17571.405 ± 467.381 ns/op
[info] ArrayOpsBenchmark.short_mapIntInt 10000 avgt 1240.606 ± 26.588 ns/op
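For reference, here is roughly how those optimizer flags would be set in an sbt build (a sketch assuming sbt; the scala/scala build itself wires them up differently):

```scala
// build.sbt (sketch): enable the Scala 2 optimizer's inliner and allow it
// to inline from the standard library, matching the flags quoted above
scalacOptions ++= Seq(
  "-opt:l:inline",
  "-opt-inline-from:scala/**"
)
```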

Results in my separate project (Scala 3)

I observed similar results in my own "speedup-array" project on Scala 3:

[info] RepoBenchmark.long_mapIntInt 10000 avgt 17112.728 ± 18.313 ns/op
[info] RepoBenchmark.short_mapIntInt 10000 avgt 1302.740 ± 26.203 ns/op

Conclusions

  • In Scala 2 with static inlining enabled, there is no measurable difference: both implementations collapse into the same tight loop.
  • Without inlining, the short version is clearly faster because it is easier for the JIT to inline automatically.
  • In Scala 3 the performance depends on whether the JIT succeeds in inlining the lambda, but here again the short version performs better.

So I would suggest sticking with the short version of map — it’s more predictable and performant regardless of compiler options. And besides, as far as I know, Scala 3 does not have special compiler options for inlining.

If anyone has ideas why Szeiger’s 2018 results differ so much from what we see today (changes in compiler/JIT, methodology, or hidden parameters I missed), I’d love to hear your thoughts.
