Modifying nested column will have no effect #207
Description
To reproduce:
    import torcharrow as ta
    import torcharrow.dtypes as dt

    dtype = dt.Struct(
        [
            dt.Field("labels", dt.int8),
            dt.Field("dense_features", dt.Struct([dt.Field("int_1", dt.int32), dt.Field("int_2", dt.int32)])),
        ]
    )

    df = ta.DataFrame(
        [
            (1, (0, 1)),
            (0, (10, 11)),
            (1, (20, 21)),
        ],
        dtype=dtype)
Now df looks like:
    >>> df
      index    labels  dense_features
    -------  --------  ----------------
          0         1  (0, 1)
          1         0  (10, 11)
          2         1  (20, 21)
    dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0
Trying to change df["dense_features"]["int_1"] fails silently; the values are unchanged:
    >>> df["dense_features"]["int_1"] = df["dense_features"]["int_1"] + 1
    >>> df
      index    labels  dense_features
    -------  --------  ----------------
          0         1  (0, 1)
          1         0  (10, 11)
          2         1  (20, 21)
    dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0
For now, the workaround is to first get the nested DataFrame out, apply the transformation to it, and then assign it back to the parent column.
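The three steps of that workaround can be sketched as a small helper. This is only an illustration: update_nested_field is a hypothetical name, not part of torcharrow, and it assumes nothing beyond the df[...] read and df[...] = ... write syntax already used in the reproduction.

```python
def update_nested_field(df, outer, inner, fn):
    """Hypothetical helper: apply fn to df[outer][inner] via the workaround.

    Relies only on __getitem__/__setitem__, as used in the reproduction above.
    """
    nested = df[outer]                 # 1. get the nested DataFrame out
    nested[inner] = fn(nested[inner])  # 2. apply the transformation to it
    df[outer] = nested                 # 3. put it back into the parent
    return df


# Intended use with the df from the reproduction (not executed here):
# update_nested_field(df, "dense_features", "int_1", lambda col: col + 1)
```

The explicit write-back in step 3 is what the chained assignment df["dense_features"]["int_1"] = ... is missing.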
The problem is that DataFrameCpu._set_field_data generates a new RowVector and copies the column vector pointers. For a nested RowVector, it only updates the leaf-level struct and doesn't propagate the change upwards: https://github.com/facebookresearch/torcharrow/blob/6d2bca82e65f74193360bd06c5ab4f8c761c5342/torcharrow/velox_rt/dataframe_cpu.py#L310-L329
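The effect described above can be imitated with a toy model in plain Python. ToyRowVector and ToyFrame are made-up stand-ins, not the real Velox or torcharrow classes, and the model only mirrors the behaviour as described in this issue: setting a field builds a new row vector and rebinds the wrapper that received the assignment, while the parent keeps pointing at the old child.

```python
class ToyRowVector:
    """Toy stand-in for a Velox RowVector: a bag of named child columns."""
    def __init__(self, children):
        self.children = dict(children)


class ToyFrame:
    """Toy stand-in for DataFrameCpu: a wrapper around one ToyRowVector."""
    def __init__(self, data):
        self._data = data

    def __getitem__(self, name):
        child = self._data.children[name]
        # A nested struct column comes back as a fresh wrapper.
        return ToyFrame(child) if isinstance(child, ToyRowVector) else child

    def __setitem__(self, name, value):
        # Build a NEW row vector, copying references to the existing children.
        children = dict(self._data.children)
        children[name] = value._data if isinstance(value, ToyFrame) else value
        # Only this wrapper is rebound; no parent is told about the new vector.
        self._data = ToyRowVector(children)


inner = ToyRowVector({"int_1": [0, 10, 20], "int_2": [1, 11, 21]})
df = ToyFrame(ToyRowVector({"labels": [1, 0, 1], "dense_features": inner}))

# Chained assignment: the update lands on a temporary wrapper and is lost.
df["dense_features"]["int_1"] = [v + 1 for v in df["dense_features"]["int_1"]]
after_chained = df["dense_features"]["int_1"]

# Workaround: update the nested frame, then write it back explicitly.
nested = df["dense_features"]
nested["int_1"] = [v + 1 for v in nested["int_1"]]
df["dense_features"] = nested
after_workaround = df["dense_features"]["int_1"]
```

In this model after_chained still holds the original values, while after_workaround holds the incremented ones, matching what the issue reports.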
Creating a new RowVector seems necessary, since assigning a column to a DataFrame may change the types of its child columns. One idea would be to allow the wrapped RowColumn to change its delegated RowVector (e.g. something like self._data._reset_data(new_delegate)). Basically, DataFrame would be a thin wrapper and everything would live in RowColumn.
For this to work, DataFrame.dtype should always use the underlying Velox Vector's type as the ground truth.
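One possible reading of this proposal, again as a plain-Python toy rather than real torcharrow code: the parent and the nested wrapper share the same column handle, and a field assignment swaps that handle's delegate in place. ToyRowVector, ToyRowColumn, ToyFrame and field_names are hypothetical names; only _reset_data is taken from the suggestion above, and its real signature is an assumption.

```python
class ToyRowVector:
    """Toy stand-in for a Velox RowVector; its keys play the role of the type."""
    def __init__(self, children):
        self.children = dict(children)


class ToyRowColumn:
    """Hypothetical mutable handle that owns a delegate ToyRowVector."""
    def __init__(self, delegate):
        self._delegate = delegate

    def _reset_data(self, new_delegate):
        # Swap the delegate in place; every holder of this handle sees it.
        self._delegate = new_delegate


class ToyFrame:
    """Thin wrapper: all state lives in the wrapped ToyRowColumn."""
    def __init__(self, column):
        self._data = column

    @property
    def field_names(self):
        # Always derived from the current delegate, never cached.
        return list(self._data._delegate.children)

    def __getitem__(self, name):
        child = self._data._delegate.children[name]
        # Wrap the same handle the parent holds, not a copy of it.
        return ToyFrame(child) if isinstance(child, ToyRowColumn) else child

    def __setitem__(self, name, value):
        children = dict(self._data._delegate.children)
        children[name] = value._data if isinstance(value, ToyFrame) else value
        self._data._reset_data(ToyRowVector(children))


inner = ToyRowColumn(ToyRowVector({"int_1": [0, 10, 20], "int_2": [1, 11, 21]}))
df = ToyFrame(ToyRowColumn(ToyRowVector({"labels": [1, 0, 1], "dense_features": inner})))

# Chained assignment now sticks, because the parent shares the inner handle.
df["dense_features"]["int_1"] = [v + 1 for v in df["dense_features"]["int_1"]]
# Adding a field changes the nested type; the parent sees that as well.
df["dense_features"]["int_3"] = [2, 12, 22]

updated = df["dense_features"]["int_1"]
names = df["dense_features"].field_names
```

Because field_names is read from the delegate on every access, it cannot drift from the data, which is the point of treating the underlying vector's type as the ground truth.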