ENH: Create Better IntervalDtype using PyArrow structs. #53033

Open
Labels: Arrow (pyarrow functionality), Enhancement, Interval (Interval data type)
@randolf-scholz

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, pandas.IntervalArray suffers from 3 major limitations:

  1. (Struck out) They are limited to data with the same closedness on both sides. (This is apparently no longer the case.)
  2. All datapoints in an array are limited to the same closedness (i.e. the same array can store only closed intervals or only open intervals).
  3. Intervals do not allow missing values.
    • In particular, one cannot represent unbounded intervals for data types that lack an actual infinity value, like int32.
  4. Some dtypes, such as string, are not allowed.
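To make limitation 2 concrete, here is a minimal sketch of how closedness is currently set once for a whole array. It only uses `pd.arrays.IntervalArray.from_tuples` and the `closed` attribute; the sample values are made up for illustration.

```python
import pandas as pd

# Closedness is a property of the whole array (and of its dtype),
# not of the individual intervals.
arr = pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)], closed="left")

print(arr.closed)                  # the single, array-wide setting
print([iv.closed for iv in arr])   # every element reports the same closedness
```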

A practical application of (1) that I am very interested in is storing information about the range of valid values for the columns of another DataFrame.

Feature Description

Given the better integration with pyarrow since pandas 2.0, we can recreate IntervalDtype using pyarrow.struct:

import pyarrow as pa

def arrow_interval_dtype(subtype):
    fields = [
        ("lower_bound", subtype),
        ("upper_bound", subtype),
        ("lower_inclusive", pa.bool_()),
        ("upper_inclusive", pa.bool_()),
    ]
    return pa.struct(fields)

Unlike the current IntervalDtype, this would solve all 3 major problems at once:

  1. Each element of the resulting StructArray can have its own closedness.
  2. PyArrow data types all support missing values.
  3. We can in principle use any ordered data type for the subtype.

Alternative Solutions

None.

Additional Context

A common request is adding extra operations for interval dtypes:

Additionally, one could imagine having an IntervalUnion type that can represent finite unions of intervals, combining the interval type discussed here with the pyarrow list type. This type would naturally arise when performing unions of intervals, such as [0, 2]∪[3, 5]. The nice thing here is that the resulting space is mathematically closed under the standard set operations (union, intersection, complement, difference).
