This repository was archived by the owner on Nov 1, 2024. It is now read-only.

Modifying nested column will have no effect #207

Open
Labels: bug (Something isn't working)
@wenleix

Description

To reproduce:

import torcharrow as ta
import torcharrow.dtypes as dt
dtype = dt.Struct(
    [
        dt.Field("labels", dt.int8),
        dt.Field(
            "dense_features",
            dt.Struct([dt.Field("int_1", dt.int32), dt.Field("int_2", dt.int32)]),
        ),
    ]
)
df = ta.DataFrame(
    [
        (1, (0, 1)),
        (0, (10, 11)),
        (1, (20, 21)),
    ],
    dtype=dtype,
)

Now df looks like:

>>> df
  index    labels  dense_features
-------  --------  ----------------
      0         1  (0, 1)
      1         0  (10, 11)
      2         1  (20, 21)
dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0

Trying to change df["dense_features"]["int_1"] has no effect:

>>> df["dense_features"]["int_1"] = df["dense_features"]["int_1"] + 1
>>> df
  index    labels  dense_features
-------  --------  ----------------
      0         1  (0, 1)
      1         0  (10, 11)
      2         1  (20, 21)
dtype: Struct([Field('labels', int8), Field('dense_features', Struct([Field('int_1', int32), Field('int_2', int32)]))]), count: 3, null_count: 0

For now, the workaround is to first pull the nested DataFrame out, apply the transformation to it, and then assign it back:

https://github.com/facebookresearch/torcharrow/blob/6d2bca82e65f74193360bd06c5ab4f8c761c5342/torcharrow/test/integration/test_criteo.py#L149-L157

The problem is that DataFrameCpu._set_field_data generates a new RowVector and copies the column vector pointers. For a nested RowVector, it only updates the leaf-level struct and doesn't propagate the change upwards: https://github.com/facebookresearch/torcharrow/blob/6d2bca82e65f74193360bd06c5ab4f8c761c5342/torcharrow/velox_rt/dataframe_cpu.py#L310-L329
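To make the failure mode concrete, here is a minimal pure-Python sketch. It is a toy model, not the torcharrow or Velox code: `RowVector` and `ToyDataFrame` are hypothetical stand-ins, and the only behavior they imitate is the one described above, where a field assignment builds a new struct vector and rebinds only the wrapper it was called on.

```python
class RowVector:
    """Toy stand-in for an immutable struct vector (NOT the Velox class)."""

    def __init__(self, fields):
        self.fields = dict(fields)  # field name -> list of values or RowVector

    def replaced(self, name, child):
        # Return a NEW RowVector; the other children are shared by reference.
        new_fields = dict(self.fields)
        new_fields[name] = child
        return RowVector(new_fields)


class ToyDataFrame:
    """Hypothetical wrapper that points directly at a RowVector."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, name):
        child = self._data.fields[name]
        # A nested struct comes back as a fresh wrapper around the same child.
        return ToyDataFrame(child) if isinstance(child, RowVector) else child

    def __setitem__(self, name, value):
        if isinstance(value, ToyDataFrame):
            value = value._data
        # Only THIS wrapper is rebound; any parent keeps its old child pointer.
        self._data = self._data.replaced(name, value)


inner = RowVector({"int_1": [0, 10, 20], "int_2": [1, 11, 21]})
df = ToyDataFrame(RowVector({"labels": [1, 0, 1], "dense_features": inner}))

# Direct nested assignment: the temporary wrapper is updated, then discarded.
df["dense_features"]["int_1"] = [v + 1 for v in df["dense_features"]["int_1"]]
after_direct = df["dense_features"]["int_1"]

# Workaround: pull the nested frame out, modify it, and assign it back.
dense = df["dense_features"]
dense["int_1"] = [v + 1 for v in dense["int_1"]]
df["dense_features"] = dense
after_workaround = df["dense_features"]["int_1"]
```

In this model the direct assignment leaves `after_direct` at the original values, while the three-step workaround changes the parent because the last step assigns on the outer wrapper itself.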

Creating a new RowVector seems necessary, since assigning a column to a DataFrame may change the child column's type. One idea would be to allow the wrapped RowColumn to change its delegated RowVector (e.g. something like self._data._reset_data(new_delegate)). Basically, DataFrame would be a thin wrapper and everything would live in RowColumn.

For this to work, DataFrame.dtype should always use the underlying Velox Vector's type as the ground truth.
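One possible reading of this proposal, again as a pure-Python toy rather than torcharrow code: the wrapper delegates to a `RowColumn` whose `_reset_data` swaps in a new vector. The parent back-reference used to propagate the swap upwards, the `describe` helper, and all class bodies are assumptions made for illustration; the issue only names `_reset_data(new_delegate)` and does not specify how propagation would happen.

```python
class RowVector:
    """Toy immutable struct vector (NOT the Velox class)."""

    def __init__(self, fields):
        self.fields = dict(fields)  # field name -> list of values or RowVector

    def replaced(self, name, child):
        new_fields = dict(self.fields)
        new_fields[name] = child
        return RowVector(new_fields)


def describe(vector):
    """Derive a type description from the vector itself (the ground truth)."""
    if isinstance(vector, RowVector):
        return {name: describe(child) for name, child in vector.fields.items()}
    return type(vector[0]).__name__ if vector else "unknown"


class RowColumn:
    """Hypothetical column that owns the delegate and can swap it out."""

    def __init__(self, delegate, parent=None, name=None):
        self._delegate = delegate
        self._parent = parent  # assumption: back-reference to the owning column
        self._name = name

    def _reset_data(self, new_delegate):
        self._delegate = new_delegate
        if self._parent is not None:
            # Propagate upwards: the parent needs a new RowVector as well.
            parent = self._parent
            parent._reset_data(parent._delegate.replaced(self._name, new_delegate))

    def child(self, name):
        value = self._delegate.fields[name]
        if isinstance(value, RowVector):
            return RowColumn(value, parent=self, name=name)
        return value


class ToyDataFrame:
    """Thin wrapper: all state lives in the RowColumn."""

    def __init__(self, column):
        self._data = column

    @property
    def dtype(self):
        # Never cached: always recomputed from the current delegate.
        return describe(self._data._delegate)

    def __getitem__(self, name):
        value = self._data.child(name)
        return ToyDataFrame(value) if isinstance(value, RowColumn) else value

    def __setitem__(self, name, value):
        if isinstance(value, ToyDataFrame):
            value = value._data._delegate
        self._data._reset_data(self._data._delegate.replaced(name, value))


inner = RowVector({"int_1": [0, 10, 20], "int_2": [1, 11, 21]})
df = ToyDataFrame(RowColumn(RowVector({"labels": [1, 0, 1], "dense_features": inner})))

# Nested assignment now reaches the root because _reset_data walks up.
df["dense_features"]["int_1"] = [v + 1 for v in df["dense_features"]["int_1"]]
values_after = df["dense_features"]["int_1"]

# Assigning values of a different type is reflected in dtype automatically.
df["dense_features"]["int_2"] = [float(v) for v in df["dense_features"]["int_2"]]
dtype_after = df.dtype
```

Because `dtype` is derived from the delegate on every access, it cannot drift from the data after a nested assignment changes a child's type. A real implementation would also have to decide what happens to wrappers that still hold a column detached from its parent, which this sketch does not address.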
