System/Language: I'm using plain Javascript to write this algorithm. Environment: Node 17.4.0 with a system that has ~7000 MiB of Memory and ~400GB of Storage.
Input Data
- An (
AoP
) Array ofPair
s (a Pair is an array of two elements) where the first element of each Pair is aString
and the second element is aNumber
. - The program can ASSUME that the input AoP is sorted in ascending order by the first element of each Pair - the String.
Example of the Input Data
[ ["A", 3], ["A", 2], ["B", 9], ["C", 1], ["C", 2], ["C", 3], ["D", 2], ["E", 0], ["E", 1] ]
Algorithm
Reduce the input AOP such that all Pairs that have the same String element are merged into one pair with the same string. The Number element should be the sum of all Number elements of the Pairs merged.
Example of the Output Data
[ ["A", 5], ["B", 9], ["C", 6], ["D", 2], ["E", 1] ]
Initial Algorithm
I'm more comfortable writing functions in recursive form (coming from a Scheme background) so I quickly came up with the following:
function sumSameKeysSortedRecursive(arr) {
if (arr.length === 0) {
return [];
} else {
let [fst, ...rst] = arr;
let recurAns = sumSameKeysSortedRecursive(rst);
if (recurAns.length === 0) {
return [fst];
} else {
let [f, ...r] = recurAns;
return f[0] === fst[0]
? [[f[0], f[1] + fst[1]], ...r]
: [fst, ...recurAns];
}
}
}
However, my input data has 11 million lines and won't scale with the recursive version (even a tail-recursive version won't work on Node). The second version uses loops:
function sumSameKeysSortedLoop(arr) {
let res = [];
let i = 0; // tracks first occurrence a pair with a unique String
let j = 0; // tracks all subsequent occurrences of Pairs with the same String
let len = arr.length;
while (i < len) {
let sum = 0;
let iname = arr[i][0];
let jname = iname;
while (jname === iname) {
sum = sum + arr[j][1];
j++;
if (j === len) { // have reached beyond the array bound
break;
} else {
jname = arr[j][0];
}
}
res.push([iname, sum]);
i = j; // jump over to the new unique element
if (i === len) {
break;
} else {
iname = arr[i][0];
}
}
return res;
}
Possible Improvements?
I think v2 has space for improvement in at least 2 areas:
- Readability: In v1, I can read and understand exactly what is happening (can improve it further by adding helper functions), while in v2 a future reader of the code might take more time understanding it. (Could be subjective)
- while + break;: Is there an easy way to translate the program such that its functionality stays the same but uses a for-loop instead? Or where it does not use
break;
(a more general version)? Or where it does not have to keep track of so many variables (i, j, sum, iname, jname)?
-
\$\begingroup\$ Of the 11m lines, do you have a sense for how often the string element repeats? Is your output going to be closer to 10m or to 100? \$\endgroup\$Teepeemm– Teepeemm2022年02月09日 21:27:50 +00:00Commented Feb 9, 2022 at 21:27
-
\$\begingroup\$ Expecting my output to be closer to 550k. I'm processing certain kind of transaction on a blockchain history and am assuming 20 such transaction per wallet. \$\endgroup\$Atharva Shukla– Atharva Shukla2022年02月09日 21:40:25 +00:00Commented Feb 9, 2022 at 21:40
2 Answers 2
I read your iterative (loop) solution. It's complicated; it's hard to read.
Here's a simplified version of your code.
function merge(aop) {
const sums = [];
if (aop.length === 0) {
return sums;
}
const str = aop[0][0];
let end = sums.push([str, 0]) - 1;
for (const [str, num] of aop) {
if (str === sums[end][0]) {
sums[end][1] += num;
} else {
end = sums.push([str, num]) - 1;
}
}
return sums;
}
const example = [ ["A", 3], ["A", 2], ["B", 9], ["C", 1], ["C", 2], ["C", 3], ["D", 2], ["E", 0], ["E", 1] ];
console.log(merge(example));
[ [ 'A', 8 ], [ 'B', 9 ], [ 'C', 6 ], [ 'D', 2 ], [ 'E', 1 ] ]
-
\$\begingroup\$ Thank you! This is way cleaner \$\endgroup\$Atharva Shukla– Atharva Shukla2022年02月09日 21:26:07 +00:00Commented Feb 9, 2022 at 21:26
-
1\$\begingroup\$ The
arr.push(x) -1
part was a little confusing but once I understood that it represents the index of the last element of the (updated)arr
, it was easy to understand. \$\endgroup\$Atharva Shukla– Atharva Shukla2022年02月09日 21:42:56 +00:00Commented Feb 9, 2022 at 21:42
Why is it slow
Your first code tries to make a slice on the input array every iteration. It cost too much time to duplicate the list. And it runs in \$O(n^2)\$ complexity.
Suggested rewriting
Your second approach should work as expected. Although it is a bit complex since you need a nested loop and modify the loop variable in the loop with special logic. I would suggest avoiding modifying the loop variable so it becomes easy to understand. You just need to add the previous pair if the current one is equal to the previous one.
function sumSameKeysSortedRecursive(arr) {
if (!arr.length) {
return [];
}
let prev = [arr[0][0], 0];
const output = [prev];
for (let i = 0, l = arr.length; i < l; i++) {
if (arr[i][0] === prev[0]) {
prev[1] += arr[i][1];
} else {
prev = [...arr[i]];
output.push(prev);
}
}
return output;
}
Or using flatMap
if you prefer:
function sumSameKeysSortedRecursive(arr) {
let prev = null;
return arr.flatMap(([code, value]) => {
if (prev && prev[0] === code) {
prev[1] += value;
return [];
} else {
prev = [code, value];
return [prev];
}
});
}