Tidyr is an R package for creating tidy data.
(tl;dr)
In short, it provides an interface (gather) for restructuring a table into a "tidy" format. For our purposes here, it suffices to explain that each row should only contain a value for one observation. The row could contain several fields, but these are fixed factors / categorical variables (at least for the sake of this post).
Understanding the transformation
Tidyr is intended for tabular data. However, I'm working with D3.js, in which data rows are represented as objects, not arrays.
We'd like to be able to do the following in JavaScript:
const data = [
  {Factor1: 'x1', A: 1, B: 74, C: 0.3},
  {Factor1: 'x2', A: 2, B: 89, C: 0.12},
  {Factor1: 'x3', A: 3, B: 30, C: 0.5}
];
const fields = ['A', 'B', 'C'];
const out = gather(data, 'Factor2', 'Value', ...fields);
out.forEach(item => { console.log(item); });
...which should produce...
Object {Factor2: "A", Value: 1, Factor1: "x1"}
Object {Factor2: "B", Value: 74, Factor1: "x1"}
Object {Factor2: "C", Value: 0.3, Factor1: "x1"}
Object {Factor2: "A", Value: 2, Factor1: "x2"}
Object {Factor2: "B", Value: 89, Factor1: "x2"}
Object {Factor2: "C", Value: 0.12, Factor1: "x2"}
Object {Factor2: "A", Value: 3, Factor1: "x3"}
Object {Factor2: "B", Value: 30, Factor1: "x3"}
Object {Factor2: "C", Value: 0.5, Factor1: "x3"}
Gather
Again, gather provides an interface for performing such transformations. Here's the signature of the original function, from Tidyr's documentation:
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
Above, key and value are the names of the two resulting columns. The column named by the key parameter holds the headings of the gathered columns; this is Factor2 above. Similarly, the column named by the value parameter holds the values taken from those columns; this is Value in the example above.
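For a single row, the roles of the two labels can be sketched as follows. The helper name gatherOne is hypothetical and exists only for this illustration; it is not part of Tidyr, and it assumes every listed column is present on the record.

```javascript
// Hypothetical helper for illustration only: gathers the named columns of ONE record.
function gatherOne(record, keyLabel, valueLabel, columns) {
  // Fields that are not gathered are copied onto every output row.
  const kept = {};
  for (const name of Object.keys(record)) {
    if (!columns.includes(name)) kept[name] = record[name];
  }
  // One output row per gathered column: the column's heading goes under
  // keyLabel and the column's value goes under valueLabel.
  return columns.map(name => ({
    [keyLabel]: name,
    [valueLabel]: record[name],
    ...kept
  }));
}

const rows = gatherOne({Factor1: 'x1', A: 1, B: 74}, 'Factor2', 'Value', ['A', 'B']);
console.log(rows);
```

Here the headings A and B end up under Factor2, and the numbers that used to sit under those headings end up under Value.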
Anyway, the code is otherwise the same as originally posted. Here's my version of gather:
function with_fields(record, fields) {
  // Returns record with only properties specified in fields
  return fields.reduce((acc, key) =>
    {acc[key] = record[key]; return acc;}, {});
}
function split_record(record, ...fields) {
  let with_f = with_fields(record, fields);
  let other_fields = Object.keys(record)
    .filter(key => !(fields.includes(key)));
  let without_f = with_fields(record, other_fields);
  return [without_f, with_f];
}
function gather(data, key_label, value_label, ...columns) {
  // Convert wide JSON representation of CSV into long format
  let lengthen = record => Object.keys(record)
    .map(key => ({[key_label]: key,
                  [value_label]: record[key]}));
  return data
    // Separate columns to be made long
    .map(record => split_record(record, ...columns))
    .map(([left, right]) => [left, lengthen(right)]) // lengthen
    // nested arrays of long records
    .map(([left, right]) => right
      .map(long_pair => Object.assign({}, long_pair, left)))
    .reduce((acc, arr) => acc.concat(arr), []); // flatten
}
Note
- My implementation of gather is not as fully-featured as the original (e.g. no slicing columns out of the data set by index).
- I've also renamed key and value to key_label and value_label so that it's clear that labels - not an actual key:value pair - are being provided as input.
- Finally, I had some errors in the original example code that I've fixed. The original post was somewhat broken, so I've rewritten it to avoid a lock. @MikeBrant isn't crazy. I've also implemented a number of his suggestions in this gist.
1 Answer
First, I would be concerned about the performance of your three consecutive map operations followed by reduce, particularly as the size of data grows in terms of the number of items in the array.
Right now, you have to iterate data 4 times to complete all those operations. While the 3x mapping operation probably does provide some clarity to the operations by breaking them down to simple transformation steps, I would consider creating a single map callback that could complete all steps of transformation on an item in one pass, minimizing repetitive iteration of the array. Of course you would probably want to run some performance tests for your expected use cases to understand the performance trade-offs here. It might be that a usage pattern like the following makes sense:
data.map(record => {
  // mapping step 1
  // mapping step 2
  // mapping step 3
  return mappedRecord;
});
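As a rough sketch of that single-pass idea, assuming Array.prototype.flatMap and Object.fromEntries are available (ES2019) and that every record contains all of the listed columns. The name gatherSinglePass and the camelCase parameter names are mine, not from the question.

```javascript
// Sketch only: one traversal of `data`, with all three steps done per record.
function gatherSinglePass(data, keyLabel, valueLabel, ...columns) {
  return data.flatMap(record => {
    // Step 1: keep the fields that are not being gathered.
    const kept = Object.fromEntries(
      Object.entries(record).filter(([name]) => !columns.includes(name))
    );
    // Steps 2 and 3: emit one long record per gathered column,
    // merged with the kept fields.
    return columns.map(name =>
      Object.assign({[keyLabel]: name, [valueLabel]: record[name]}, kept)
    );
  });
}

const sample = [
  {Factor1: 'x1', A: 1, B: 74},
  {Factor1: 'x2', A: 2, B: 89}
];
console.log(gatherSinglePass(sample, 'Factor2', 'Value', 'A', 'B'));
```

Whether this is actually faster than the chained version would need to be measured on data of a realistic size.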
Maybe the usage example you show is not very meaningful, but I am having a hard time understanding the function signature for gather().
- What is the difference between the two key and value label parameters and the columns passed in the columns array parameter?
- The signature does not make it clear to me which fields in the input data are to be assigned to the Letter and Value properties (for this example) in the output array.
- The parameter naming doesn't seem to make sense. Why is one called key_label and one called value_label when both are used as keys in the output structure?
- From looking at this signature, how is one to determine what logic is to be applied in "splitting" apart the input records?
Do the with_fields() and split_record() methods have any value outside the context of the gather() function? If not, consider nesting them inside gather() as "private" functions that exist only in the context of gather().
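One way this could look, as a sketch rather than a drop-in replacement. The names pickFields, keptFields, and wideFields are placeholders I chose, and the body is condensed to a single map for brevity.

```javascript
function gather(data, keyLabel, valueLabel, ...columns) {
  // Helpers live inside gather, so they are not visible to other code.
  const pickFields = (record, fields) =>
    fields.reduce((acc, name) => {
      acc[name] = record[name];
      return acc;
    }, {});

  // Returns [fields left untouched, fields to be gathered].
  const splitRecord = record => {
    const otherFields = Object.keys(record)
      .filter(name => !columns.includes(name));
    return [pickFields(record, otherFields), pickFields(record, columns)];
  };

  return data
    .map(record => {
      const [keptFields, wideFields] = splitRecord(record);
      return Object.keys(wideFields).map(name =>
        Object.assign({[keyLabel]: name, [valueLabel]: wideFields[name]}, keptFields)
      );
    })
    .reduce((acc, rows) => acc.concat(rows), []);
}

console.log(gather([{Factor1: 'x1', A: 1, B: 74}], 'Factor2', 'Value', 'A', 'B'));
```

Because the helpers close over columns, splitRecord no longer needs the column list passed to it on every call.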
Stylistically:
- I don't like your use of snake_case in JavaScript, as it is common practice in the JS world to use camelCase.
- Consider sticking to comments placed before the lines of code to which they apply, rather than using comments at the end of the line (which I generally avoid, as I feel they make code harder to read).
- There are a few cases where variable naming is not that meaningful: without_f, with_f. Dropping a few characters from a variable name is oftentimes not worth what you lose in clearly understandable variable names.
- When you get a more complex or longer return operation on an arrow function, you might consider using bracket syntax and/or line breaks to make the code easier to read.
For example:
.map(([left, right]) => right .map(long_pair => Object.assign({}, long_pair, left)))
Could become:
.map(([left, right]) => {
  return right.map(
    long_pair => Object.assign({}, long_pair, left)
  );
});
// or
.map(
  ([left, right]) => right.map(
    long_pair => Object.assign({}, long_pair, left)
  )
);
To me, this makes it much clearer that there is a nested mapping operation happening here. This is much easier than trying to count and balance opening and closing parentheses in your head, as one might have to do when looking at the original code, which ends with three closing parentheses in a row.
- 1. I didn't give much thought to the overhead of obtaining an iterator, thanks for pointing that out! 2. Day job has a lot of Python, hence _. Can update to camelCase. 3. with_f because with is reserved and with_fields is used where it makes sense. That doesn't justify doing that; I'll just have to think up a more meaningful name. – eenblam, Oct 28, 2016 at 17:52
- I think a doc string would go a long way towards removing confusion regarding the signature of gather... but I think understanding the goal of the function should make this intuitive. Look at how the data is reshaped (or follow the link to the dplyr/tidyr cheatsheet.) The key_label is the label/title for the column of keys/property names, and the value_label is the label for the column of values. In the example, the choice of Value is more or less in keeping with the idea of "tidy" data, and the choice of Letter is due to the pathological choice of test case. – eenblam, Oct 28, 2016 at 17:59
- The signature for gather in the tidyr docs is gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE). (I could only post two links due to low rep.) I'll add a revised version to the bottom of the post reflecting your comments. Thanks again! – eenblam, Oct 28, 2016 at 18:02
- @Ben I am actually even more confused by your new example. Now I don't see how the Letter value passed as 2nd parameter relates to the output data structure at all. Where does Factor3 come from? Maybe it would help to show the contents of the fields parameter. I am reviewing this without context of what tidyr does, so the overall logic of your transformation still seems unclear. That is actually probably a good perspective when reviewing a portion of code, as you ideally want the purpose and usage of code to be clearly understandable based on how it is structured, commented, and named. – Mike Brant, Oct 28, 2016 at 18:59
- @Ben That being said, if you are attempting to mimic a method signature of a known function in your particular programming space, there is probably value in that as well, so you can take my comments in this area with that large grain of salt. – Mike Brant, Oct 28, 2016 at 19:00
- In gather-vanilla.js, the variables keptFields and wideFields are changed in lines 25 and 27.
return data.map(record => {
  let [keptFields, wideFields] = splitRecord(record, ...columns);
  let longFields = lengthen(keptFields); //here
  let nestedArrays = longFields.map(
    longField => Object.assign({}, longField, wideFields) // and here
  );
  return nestedArrays;
}).reduce((acc, arr) => acc.concat(arr), []);