String dtype: backwards compatibility of selecting "object" vs "str" columns in select_dtypes #61916

Open
Labels: Strings (String extension data type and string data)
@jorisvandenbossche

Description

We provide the DataFrame.select_dtypes() method to easily subset columns based on data types (groups). See https://pandas.pydata.org/pandas-docs/version/2.3/user_guide/basics.html#selecting-columns-based-on-dtype

At the moment, as documented, to select string columns you must use the object dtype:

>>> import pandas as pd
>>> pd.options.future.infer_string = False
>>> df = pd.DataFrame(
...     {
...         "string": list("abc"),
...         "int64": list(range(1, 4)),
...     }
... )
>>> df.dtypes
string    object
int64      int64
dtype: object
>>> df.select_dtypes(include=[object])
  string
0      a
1      b
2      c

On current main, with the string dtype enabled, the above dataframe now has a str column, and so selecting object dtype columns gives an empty result. One can use str instead:

>>> pd.options.future.infer_string = True
>>> df = pd.DataFrame(
...     {
...         "string": list("abc"),
...         "int64": list(range(1, 4)),
...     }
... )
>>> df.dtypes
string      str
int64     int64
dtype: object
>>> df.select_dtypes(include=[object])
Empty DataFrame
Columns: []
Index: [0, 1, 2]
>>> df.select_dtypes(include=[str])
  string
0      a
1      b
2      c

On the one hand, that is an "obvious" behaviour change as a consequence of the column now having a different dtype. But on the other hand, this will also break all code currently using select_dtypes to select string columns (and potentially silently, since it just no longer selects them).

How to write compatible code?

One can select both object and string dtypes, so that the string columns are selected in both older and newer pandas. One gotcha is that df.select_dtypes(include=[str]) is not allowed in pandas<=2.3 ("string dtypes are not allowed, use 'object' instead"), so one has to use "string" instead of "str" (although the default dtype is str ..). This will select opt-in nullable string columns as well as the new default str dtype:

# this gives the same result with infer_string=True or False
>>> df.select_dtypes(include=[object, "string"])
  string
0      a
1      b
2      c
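For code that has to run on both older and newer pandas, the pattern above could be wrapped in a small helper. This is only a sketch: the name select_string_columns is hypothetical and not part of pandas, and it relies on include=[object, "string"] behaving as described in this issue.

```python
import pandas as pd


def select_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not pandas API): select string-like columns.

    Uses the object + "string" combination from this issue, so it should
    pick up object-dtype columns, opt-in nullable string columns, and the
    new default str dtype, whichever of those the pandas version produces.
    """
    return df.select_dtypes(include=[object, "string"])


df = pd.DataFrame(
    {
        "string": list("abc"),
        "int64": list(range(1, 4)),
    }
)
print(select_string_columns(df).columns.tolist())
```

Note that on newer pandas this still selects genuine object columns (for example columns holding arbitrary Python objects), so it is not a strict "strings only" filter.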

TODO: this should be added to the migration guide in https://pandas.pydata.org/docs/dev/user_guide/migration-3-strings.html#the-dtype-is-no-longer-object-dtype (update -> #62403)

Can we make this upgrade experience smoother?

Given that this will essentially break every use case of select_dtypes that involves selecting string columns (and given the fact this is a method, so we are more flexible compared to ser.dtype == object), I am wondering if we should provide some better upgrading behaviour. Some options:

  • For now let select_dtypes(include=[object]) keep selecting string columns as well, for backwards compatibility (and we can later add a warning that we will stop doing that in the future).
  • When a user does select_dtypes(include=[object]) in pandas 3.0, and we see that there are str columns, raise a warning telling the user that they likely want to do include=[str] instead.

In both cases, it gets annoying if you actually want to select object columns, because then you get a (false positive) warning that you can't really do anything about (except ignoring/suppressing it).
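To make the second option more concrete, here is a rough user-side sketch of what such a check might look like. The wrapper name select_dtypes_with_hint, the warning class and the message text are all assumptions for illustration; this is not how pandas implements (or would necessarily implement) it. The check uses isinstance(dtype, pd.StringDtype), which also matches opt-in nullable string columns.

```python
import warnings

import pandas as pd


def select_dtypes_with_hint(df, include):
    """Hypothetical wrapper (not pandas API) around DataFrame.select_dtypes.

    Warns when object dtype is requested while the frame has string-dtype
    columns, since the caller may have meant to select those instead.
    """
    include = list(include) if isinstance(include, (list, tuple)) else [include]
    # Object dtype requested, either as the Python type or by name.
    wants_object = any(
        d is object or (isinstance(d, str) and d in ("object", "O"))
        for d in include
    )
    # Assumption: any StringDtype column counts as a "string column" here.
    has_string_column = any(isinstance(dt, pd.StringDtype) for dt in df.dtypes)
    if wants_object and has_string_column:
        warnings.warn(
            "include=[object] was requested but the DataFrame has string-dtype "
            "columns; you likely want include=[str] (or 'string') instead.",
            UserWarning,
            stacklevel=2,
        )
    return df.select_dtypes(include=include)
```

A caller who really does want only object columns would have to silence the hint, for example with warnings.filterwarnings("ignore", message="include=\\[object\\] was requested"), which is exactly the false-positive annoyance described above.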

And in any case, we should probably still add a warning about this to pandas 2.3 when the string mode is enabled (in case we do a 2.3.2 release).
