This repository was archived by the owner on Nov 1, 2024. It is now read-only.
Efficient kernel implementation for drop_duplicates and sort #88
Open
@wenleix
Description
For a single column, delegating to the Arrow array seems like a good initial approach, similar to #64 and #53.
Arrow arrays support unique: https://arrow.apache.org/docs/python/generated/pyarrow.compute.unique.html#pyarrow.compute.unique
>>> import pyarrow as pa
>>> a = pa.array([1, 2, 3, 2])
>>> a.unique()
<pyarrow.lib.Int64Array object at 0x7f89a065ed00>
[
  1,
  2,
  3
]
For sort, it looks like we first need to compute the sort indices (https://arrow.apache.org/docs/python/api/compute.html#sorts-and-partitions) and then reorder the elements with an array selection function (https://arrow.apache.org/docs/python/api/compute.html#selections):
>>> import pyarrow as pa
>>> import pyarrow.compute as pac
>>> a = pa.array([1, 5, 7, 3, 2])
>>> pac.take(a, pac.array_sort_indices(a))
<pyarrow.lib.Int64Array object at 0x7f89a065edc0>
[
  1,
  2,
  3,
  5,
  7
]