This repository was archived by the owner on Nov 1, 2024. It is now read-only.

Modifying nested column will have no effect #207

Open · Labels: bug

wenleix opened on Feb 18, 2022

To reproduce:

import torcharrow as ta
import torcharrow.dtypes as dt

# Struct with a scalar column and a nested struct column
dtype = dt.Struct(
    [
        dt.Field("labels", dt.int8),
        dt.Field(
            "dense_features",
            dt.Struct([dt.Field("int_1", dt.int32), dt.Field("int_2", dt.int32)]),
        ),
    ]
)
df = ta.DataFrame(
    [
        (1, (0, 1)),
        (0, (10, 11)),
        (1, (20, 21)),
    ],
    dtype=dtype,
)

Now df looks like:

>>> df
  index    labels  dense_features
-------  --------  ----------------
      0         1  (0, 1)
      1         0  (10, 11)
      2         1  (20, 21)
dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0

Try to change df["dense_features"]["int_1"]. The assignment silently has no effect:

>>> df["dense_features"]["int_1"] = df["dense_features"]["int_1"] + 1
>>> df
  index    labels  dense_features
-------  --------  ----------------
      0         1  (0, 1)
      1         0  (10, 11)
      2         1  (20, 21)
dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0

For now, the workaround is to first take the nested DataFrame out, apply the transformation to it, and then assign it back:

https://github.com/facebookresearch/torcharrow/blob/6d2bca82e65f74193360bd06c5ab4f8c761c5342/torcharrow/test/integration/test_criteo.py#L149-L157

The problem is that DataFrameCpu._set_field_data generates a new RowVector and copies the column vector pointers. For a nested RowVector, it only updates the leaf-level struct and doesn't propagate the change upwards: https://github.com/facebookresearch/torcharrow/blob/6d2bca82e65f74193360bd06c5ab4f8c761c5342/torcharrow/velox_rt/dataframe_cpu.py#L310-L329
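To make the mechanism concrete, here is a minimal pure-Python sketch of the behavior described above. It is a toy model, not TorchArrow or Velox code: the `RowVector` and `DataFrame` classes below, and the `with_child` helper, are hypothetical stand-ins that only mimic "build a new RowVector on assignment and repoint this wrapper". Under that assumption, the nested assignment is lost, while the take-out, modify, put-back workaround takes effect.

```python
class RowVector:
    """Hypothetical stand-in for a Velox RowVector: field names plus child columns."""

    def __init__(self, names, children):
        self.names = list(names)
        self.children = list(children)

    def with_child(self, name, child):
        # Build a NEW RowVector, copying the existing child pointers
        # and replacing (or appending) one child.
        names, children = list(self.names), list(self.children)
        if name in names:
            children[names.index(name)] = child
        else:
            names.append(name)
            children.append(child)
        return RowVector(names, children)


class DataFrame:
    """Hypothetical thin wrapper that owns a pointer to a RowVector."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, name):
        child = self._data.children[self._data.names.index(name)]
        # A nested struct is wrapped in a fresh DataFrame on every access.
        return DataFrame(child) if isinstance(child, RowVector) else child

    def __setitem__(self, name, value):
        child = value._data if isinstance(value, DataFrame) else value
        # Only this wrapper is repointed; any parent still holds the old child.
        self._data = self._data.with_child(name, child)


df = DataFrame(
    RowVector(
        ["labels", "dense_features"],
        [[1, 0, 1], RowVector(["int_1", "int_2"], [[0, 10, 20], [1, 11, 21]])],
    )
)

# Direct nested assignment: updates a temporary wrapper, so df is unchanged.
df["dense_features"]["int_1"] = [v + 1 for v in df["dense_features"]["int_1"]]
after_direct = df["dense_features"]["int_1"]

# Workaround: take the nested frame out, modify it, and assign it back.
dense = df["dense_features"]
dense["int_1"] = [v + 1 for v in dense["int_1"]]
df["dense_features"] = dense
after_workaround = df["dense_features"]["int_1"]

print(after_direct)      # [0, 10, 20]
print(after_workaround)  # [1, 11, 21]
```

In this model the lost update comes from `df["dense_features"]` returning a temporary wrapper: the assignment repoints the temporary, and nothing tells the parent that its child changed.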

Creating a new RowVector seems necessary, since assigning a column to a DataFrame may change the types of its child columns. One idea would be to allow the wrapped RowColumn to change its delegated RowVector (e.g. something like self._data._reset_data(new_delegate)). Basically, DataFrame would be a thin wrapper and everything would live in RowColumn.

For this to work, DataFrame.dtype should always use the underlying Velox Vector's type as the ground truth.
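One possible reading of this idea can be sketched in plain Python. Everything here is an assumption made for illustration: the `RowColumn` and `DataFrame` classes are toy stand-ins, the delegate is modeled as a dict rather than a Velox RowVector, and `_reset_data` only borrows the name suggested above. The sketch assumes parent and child wrappers share the same `RowColumn` object, so swapping its delegate is visible from the parent, and `dtype` is recomputed from the delegate on every access rather than cached.

```python
class RowColumn:
    """Hypothetical column object whose delegate can be swapped in place."""

    def __init__(self, delegate):
        self._delegate = dict(delegate)

    def _reset_data(self, new_delegate):
        # Swap the delegate; every holder of this RowColumn sees the change.
        self._delegate = new_delegate

    @property
    def dtype(self):
        # Always derived from the current delegate (the "ground truth").
        return {
            name: child.dtype
            if isinstance(child, RowColumn)
            else (type(child[0]).__name__ if child else "unknown")
            for name, child in self._delegate.items()
        }


class DataFrame:
    """Hypothetical thin wrapper; all state lives in the RowColumn."""

    def __init__(self, column):
        self._data = column

    @property
    def dtype(self):
        return self._data.dtype

    def __getitem__(self, name):
        child = self._data._delegate[name]
        # Wrap the SAME RowColumn object that the parent holds.
        return DataFrame(child) if isinstance(child, RowColumn) else child

    def __setitem__(self, name, value):
        child = value._data if isinstance(value, DataFrame) else value
        # Still build a new delegate, since the child type may change.
        new_delegate = dict(self._data._delegate)
        new_delegate[name] = child
        self._data._reset_data(new_delegate)


dense = RowColumn({"int_1": [0, 10, 20], "int_2": [1, 11, 21]})
df = DataFrame(RowColumn({"labels": [1, 0, 1], "dense_features": dense}))

# Nested assignment now propagates, because the parent shares the RowColumn.
df["dense_features"]["int_1"] = [v + 1 for v in df["dense_features"]["int_1"]]
updated = df["dense_features"]["int_1"]

# Changing the child type is reflected in the parent's dtype as well.
df["dense_features"]["int_1"] = [float(v) for v in updated]
nested_dtype = df.dtype["dense_features"]["int_1"]

print(updated)       # [1, 11, 21]
print(nested_dtype)  # float
```

The design choice illustrated is that a new delegate is still created on every assignment, but identity is kept at the RowColumn level, so no wrapper has to cache data or type information that could go stale.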

