Naive JS implementation of tidyr's gather function

Question 1

Tidyr is an R package for creating tidy data. In short, it provides an interface (gather) for restructuring a table into a "tidy" format. For our purposes here, it suffices to say that each row should contain a value for only one observation. A row may contain several other fields, but these are fixed factors / categorical variables (at least for the sake of this post).

Understanding the transformation

Tidyr is intended for tabular data. However, I'm working with D3.js, in which data rows are represented as objects, not arrays.

We'd like to be able to do the following in JavaScript:

const data = [
  {Factor1: 'x1', A: 1, B: 74, C: 0.3},
  {Factor1: 'x2', A: 2, B: 89, C: 0.12},
  {Factor1: 'x3', A: 3, B: 30, C: 0.5}
];
const fields = ['A', 'B', 'C'];
const out = gather(data, 'Factor2', 'Value', ...fields);
out.forEach(item => { console.log(item); });

...which should produce...

Object {Factor2: "A", Value: 1, Factor1: "x1"}
Object {Factor2: "B", Value: 74, Factor1: "x1"}
Object {Factor2: "C", Value: 0.3, Factor1: "x1"}
Object {Factor2: "A", Value: 2, Factor1: "x2"}
Object {Factor2: "B", Value: 89, Factor1: "x2"}
Object {Factor2: "C", Value: 0.12, Factor1: "x2"}
Object {Factor2: "A", Value: 3, Factor1: "x3"}
Object {Factor2: "B", Value: 30, Factor1: "x3"}
Object {Factor2: "C", Value: 0.5, Factor1: "x3"}

Gather

Again, gather provides an interface for performing such transformations. Here's the signature of the original function, from Tidyr's documentation:

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

Above, key and value are the names of the two resulting columns. The column labeled by the key parameter takes on the headings of the gathered columns; this is Factor2 in the example above. Similarly, the column labeled by the value parameter takes on the values held in those same columns; this is Value in the example above.

Anyway, the code is otherwise the same as originally posted. Here's my version of gather:

function with_fields(record, fields) {
  // Returns record with only properties specified in fields
  return fields.reduce((acc, key) => {
    acc[key] = record[key];
    return acc;
  }, {});
}

function split_record(record, ...fields) {
  let with_f = with_fields(record, fields);
  let other_fields = Object.keys(record)
    .filter(key => !(fields.includes(key)));
  let without_f = with_fields(record, other_fields);
  return [without_f, with_f];
}

function gather(data, key_label, value_label, ...columns) {
  // Convert wide JSON representation of CSV into long format
  let lengthen = record => Object.keys(record)
    .map(key => ({[key_label]: key,
                  [value_label]: record[key]}));
  return data
    // Separate columns to be made long
    .map(record => split_record(record, ...columns))
    .map(([left, right]) => [left, lengthen(right)]) // lengthen
    // nested arrays of long records
    .map(([left, right]) => right
      .map(long_pair => Object.assign({}, long_pair, left)))
    .reduce((acc, arr) => acc.concat(arr), []); // flatten
}

Note

My implementation of gather is not as fully-featured as the original (e.g. no slicing columns out of the data set by index.)
I've also renamed key and value to key_label and value_label so that it's clear that labels - not an actual key:value pair - are being provided as input.
Finally, I had some errors in the original example code that I've fixed. The original post was somewhat broken, so I've rewritten it to avoid a lock. @MikeBrant isn't crazy. I've also implemented a number of his suggestions in this gist.

Question 2

@Jonah: I added a note regarding that, and I added a more detailed use case. The original was meant to be easier on the eyes, but I guess it was overly simplified.

Question 3

In gather-vanilla.js, the variables keptFields and wideFields appear to be swapped in lines 25 and 27.

return data.map(record => {
  let [keptFields, wideFields] = splitRecord(record, ...columns);
  let longFields = lengthen(keptFields); //here
  let nestedArrays = longFields.map(
    longField => Object.assign({}, longField, wideFields) // and here
  );
  return nestedArrays;
}).reduce((acc, arr) => acc.concat(arr), []);

Question 4

First, I would be concerned about the performance of your three consecutive map operations followed by reduce, particularly as the size of data grows in terms of number of items in array.

Right now, you have to iterate data 4 times to complete all those operations. While the 3x mapping operation probably does provide some clarity to the operations by breaking them down to simple transformation steps, I would consider creating a single map callback that could complete all steps of transformation on an item in one pass, minimizing repetitive iteration of the array. Of course you would probably want to run some performance tests for your expected use cases to understand the performance trade-offs here. It might be that a usage pattern like the following makes sense:

data.map(record => {
  // mapping step 1
  // mapping step 2
  // mapping step 3
  return mappedRecord;
});
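As a rough sketch of that single-callback idea (not a benchmarked or definitive version), the whole transformation can be done in one pass over data. The name gatherSinglePass and the camelCase parameter names are hypothetical; the behavior is intended to match the posted gather, including the property order of the output records.

```javascript
// Hypothetical single-pass variant of the posted gather(); the name
// gatherSinglePass is an assumption made for illustration only.
function gatherSinglePass(data, keyLabel, valueLabel, ...columns) {
  const out = [];
  for (const record of data) {
    // Fields that are not gathered are copied onto every long record.
    const kept = {};
    for (const key of Object.keys(record)) {
      if (!columns.includes(key)) {
        kept[key] = record[key];
      }
    }
    // Emit one long record per gathered column.
    for (const column of columns) {
      out.push(Object.assign(
        {[keyLabel]: column, [valueLabel]: record[column]},
        kept
      ));
    }
  }
  return out;
}

const wideRows = [{Factor1: 'x1', A: 1, B: 74}];
console.log(gatherSinglePass(wideRows, 'Factor2', 'Value', 'A', 'B'));
// → [{Factor2: 'A', Value: 1, Factor1: 'x1'},
//    {Factor2: 'B', Value: 74, Factor1: 'x1'}]
```

Whether this is actually faster than the chained version depends on the data size and the engine, so it is worth measuring before switching.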

Maybe the usage example you show is not very meaningful, but I am having a hard time understanding the function signature for gather().

What is the difference between the key and value label parameters and the columns passed via the columns parameter?
The signature does not make it clear to me as to which fields in the input data are to be assigned to the Letter and Value properties (for this example) in the output array.
The parameter naming doesn't seem to make sense. Why is one called key_label and one called value_label when both are used as keys in the output structure?
From looking at this signature, how is one to determine what the logic is that is to be applied in "splitting" apart the input records?

Do the with_fields() and split_record() functions have any value outside the context of the gather() function? If not, consider nesting them inside gather() as "private" functions that exist only in the context of gather().
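A minimal sketch of what that nesting might look like, assuming the helpers are not needed elsewhere. The names gatherWithPrivateHelpers, withFields and splitRecord are illustrative renames, not part of the original post.

```javascript
// Sketch only: gatherWithPrivateHelpers is a hypothetical name, and the
// helpers are renamed to camelCase for illustration.
function gatherWithPrivateHelpers(data, keyLabel, valueLabel, ...columns) {
  // Visible only inside this function: copy the listed fields of a record.
  const withFields = (record, fields) =>
    fields.reduce((acc, key) => {
      acc[key] = record[key];
      return acc;
    }, {});

  // Visible only inside this function: [fields to keep, fields to gather].
  const splitRecord = record => {
    const keptNames = Object.keys(record)
      .filter(key => !columns.includes(key));
    return [withFields(record, keptNames), withFields(record, columns)];
  };

  return data
    .map(splitRecord)
    .map(([kept, wide]) =>
      Object.keys(wide).map(key =>
        Object.assign({[keyLabel]: key, [valueLabel]: wide[key]}, kept)
      )
    )
    .reduce((acc, arr) => acc.concat(arr), []);
}

// The helpers do not leak into the enclosing scope.
console.log(typeof withFields); // "undefined"
```

Because the helpers close over columns, splitRecord no longer needs the rest parameter that split_record takes in the posted code.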

Stylistically:

I don't like your use of snake_case in JavaScript, as it is common practice in the JS world to use camelCase.
Consider sticking to comments placed before the lines of code to which they apply, rather than using comments at the end of the line (which I generally avoid, as I feel they make code harder to read).
There are a few cases where variable naming is not that meaningful - without_f, with_f. Dropping a few characters from a variable is oftentimes not worth the value you get from having clearly understandable variable names.
When you get a more complex or longer return operation on an arrow function, you might consider using bracket syntax and/or line breaks to make the code easier to read.

For example:

.map(([left, right]) => right
 .map(long_pair => Object.assign({}, long_pair, left)))

Could become:

.map(([left, right]) => {
  return right.map(
    long_pair => Object.assign({}, long_pair, left)
  );
});
// or
.map(
  ([left, right]) => right.map(
    long_pair => Object.assign({}, long_pair, left)
  )
);

To me, this makes it much clearer that there is a nested mapping operation happening here. This is much easier than trying to count and balance opening and closing parentheses in your head, as one might have to do with the original code, which ends with three closing parentheses in a row.

Question 5

1. I didn't give much thought to the overhead of obtaining an iterator, thanks for pointing that out! 2. Day job has a lot of Python, hence _. Can update to camelCase. 3. with_f because with is reserved and with_fields is used where it makes sense. That doesn't justify doing that; I'll just have to think up a more meaningful name.

Question 6

I think a doc string would go a long way towards removing confusion regarding the signature of gather... but I think understanding the goal of the function should make this intuitive. Look at how the data is reshaped (or follow the link to the dplyr/tidyr cheatsheet.) The key_label is the label/title for the column of keys/property names, and the value_label is the label for the column of values. In the example, the choice of Value is more or less in keeping with the idea of "tidy" data, and the choice of Letter is due to the pathological choice of test case.
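One possible shape for such a doc string, written as a JSDoc comment. This is only a sketch: the function name gatherDocumented and the camelCase parameter names are hypothetical, and the body is a compact stand-in for the posted implementation that relies on Array.prototype.flatMap (ES2019).

```javascript
/**
 * Reshape wide records into long ("tidy") records.
 *
 * @param {Object[]} data - Wide records, one object per row.
 * @param {string} keyLabel - Name of the output property that holds the
 *   names of the gathered columns.
 * @param {string} valueLabel - Name of the output property that holds the
 *   values of the gathered columns.
 * @param {...string} columns - Names of the input properties to gather.
 * @returns {Object[]} One record per (input record, gathered column) pair;
 *   properties not listed in columns are copied unchanged.
 *
 * @example
 * gatherDocumented([{Factor1: 'x1', A: 1, B: 74}], 'Factor2', 'Value', 'A', 'B');
 * // → [{Factor2: 'A', Value: 1, Factor1: 'x1'},
 * //    {Factor2: 'B', Value: 74, Factor1: 'x1'}]
 */
function gatherDocumented(data, keyLabel, valueLabel, ...columns) {
  return data.flatMap(record => {
    // Collect the properties that stay attached to every long record.
    const kept = {};
    Object.keys(record)
      .filter(key => !columns.includes(key))
      .forEach(key => { kept[key] = record[key]; });
    // One long record per gathered column.
    return columns.map(column =>
      Object.assign({[keyLabel]: column, [valueLabel]: record[column]}, kept)
    );
  });
}
```

The parameter descriptions are the part that addresses the confusion: they say outright that the two labels name output properties, while columns names input properties.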

Question 7

The signature for gather in the tidyr docs is gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE). (I could only post two links due to low rep.) I'll add a revised version to the bottom of the post reflecting your comments. Thanks again!

Question 8

@Ben I am actually even more confused by your new example. Now I don't see how the Letter value passed as 2nd parameter relates to the output data structure at all. Where does Factor3 come from? Maybe it would help to show contents of fields parameter. I am reviewing this without context of what tidyr does, so the overall logic of your transformation still seems unclear. That is actually probably a good perspective when reviewing a portion of code, as you ideally want the purpose and usage of code to be clearly understandable based on how it is structured, commented, and named.

Question 9

@Ben That being said, if you are attempting to mimic a method signature of a known function in your particular programming space, there is probably value in that as well, so you can take my comments in this area with that large grain of salt.

Mike Brant · Accepted Answer · 2016-10-28 17:30:29Z

