-
Notifications
You must be signed in to change notification settings - Fork 0
refactor: wire string-with-nulls columns to marrow StringArray (remove LegacyObjectData demotion path) #752
Description
Background
LegacyObjectData is the fallback storage arm for columns that marrow can't represent. It is legitimately needed for object_ / datetime64[ns] / timedelta64[ns] columns (tracked separately in #622).
However it is also used for string columns that have nulls, with the comment:
"Marrow cannot build a string + null array"
This comment is no longer accurate. Marrow's StringBuilder has had append_null() support from the start, and the README benchmarks nullable string arrays explicitly.
Current demotion path
Column.set_null_mask (line 5277–5283) demotes a string AnyArray column to LegacyObjectData when nulls are applied, wrapping each String in a PythonObject:
if self.is_string(): var src = self._str_list() var py_data = List[PythonObject](capacity=len(src)) for i in range(len(src)): py_data.append(PythonObject(src[i])) self._storage = ColumnStorage(LegacyObjectData(py_data^, mask^)) return
Fix
Replace the demotion with a proper nullable StringArray build via StringBuilder:
if self.is_string(): var src = self._str_list() var b = StringBuilder(len(src)) for i in range(len(src)): if mask.is_null(i): b.append_null() else: b.append(src[i]) self._storage = ColumnStorage(AnyArray(b.finish())) return
Follow-on cleanup
Once no string columns can land in LegacyObjectData, remove the string-with-nulls special-case branches in:
_str_list()(theLegacyObjectDatastring-extraction path)has_nulls()/null_mask_copy()(comments about string-with-nulls)- Any other guards that branch on
is_string() and _storage.isa[LegacyObjectData]()
LegacyObjectData itself is retained for object/datetime/timedelta columns.
See also #622 for the longer-term migration of datetime/timedelta away from LegacyObjectData.