What’s new in 3.0.0 (Month XX, 2025)#

These are the changes in pandas 3.0.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

Dedicated string data type by default#

Historically, pandas represented string columns with the NumPy object data type. This representation has numerous problems: it is not specific to strings (any Python object can be stored in an object-dtype array, not just strings), and it is often inefficient in both performance and memory usage.

Starting with pandas 3.0, a dedicated string data type is enabled by default (backed by PyArrow under the hood, if installed, otherwise falling back to being backed by NumPy object-dtype). This means that pandas will start inferring columns containing string data as the new str data type when creating pandas objects, such as in constructors or IO functions.

Old behavior:

>>> ser = pd.Series(["a", "b"])
>>> ser
0    a
1    b
dtype: object

New behavior:

>>> ser = pd.Series(["a", "b"])
>>> ser
0    a
1    b
dtype: str

The string data type used in these scenarios will mostly behave as the NumPy object dtype would, including missing value semantics and general operations on these columns.

The main characteristics of the new string data type:

  • Inferred by default for string data (instead of object dtype)

  • The str dtype can only hold strings (or missing values), in contrast to object dtype (assigning a non-string value raises)

  • The missing value sentinel is always NaN (np.nan) and follows the same missing value semantics as the other default dtypes.

Those intentional changes can have breaking consequences, for example when checking for the .dtype being object dtype or checking the exact missing value sentinel. See the Migration guide for the new string data type (pandas 3.0) for more details on the behaviour changes and how to adapt your code to the new default.
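
For example, code that detects string columns by checking for object dtype will need updating. A minimal sketch of a dtype-agnostic check:

>>> ser = pd.Series(["a", "b"])
>>> ser.dtype == object                        # no longer True under pandas 3.0
False
>>> pd.api.types.is_string_dtype(ser.dtype)    # works for both old and new defaults
True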

Copy-on-Write#

The new "copy-on-write" behaviour in pandas 3.0 brings changes in behavior in how pandas operates with respect to copies and views. A summary of the changes:

  1. The result of any indexing operation (subsetting a DataFrame or Series in any way, including accessing a DataFrame column as a Series) or of any method returning a new DataFrame or Series always behaves as a copy in terms of the user API.

  2. As a consequence, if you want to modify an object (DataFrame or Series), the only way to do this is to directly modify that object itself.
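
A minimal sketch of what these two rules mean in practice:

>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> subset = df["a"]       # any derived object behaves as a copy
>>> subset.iloc[0] = 100   # modifies subset only; df is left unchanged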

The main goal of this change is to make the user API more consistent and predictable. There is now a clear rule: any subset or returned series/dataframe always behaves as a copy of the original, and thus never modifies the original (before pandas 3.0, whether a derived object would be a copy or a view depended on the exact operation performed, which was often confusing).

Because every single indexing step now behaves as a copy, this also means that "chained assignment" (updating a DataFrame with multiple setitem steps) stops working. Because it now consistently never works, the SettingWithCopyWarning has been removed.
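
A minimal sketch of the failing pattern and its replacement:

>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> # chained assignment: the setitem lands on a temporary copy, so df is never updated
>>> # df[df["a"] > 1]["b"] = 10
>>> # instead, perform the update in a single setitem on df itself:
>>> df.loc[df["a"] > 1, "b"] = 10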

The new behavioral semantics are explained in more detail in the user guide about Copy-on-Write.

A secondary goal is to improve performance by avoiding unnecessary copies. As mentioned above, every new DataFrame or Series returned from an indexing operation or method behaves as a copy, but under the hood pandas will use views as much as possible, and only copy when needed to guarantee the "behaves as a copy" behaviour (this is the actual "copy-on-write" mechanism used as an implementation detail).

Some of the behaviour changes described above are breaking changes in pandas 3.0. When upgrading to pandas 3.0, it is recommended to first upgrade to pandas 2.3 to get deprecation warnings for a subset of those changes. The migration guide explains the upgrade process in more detail.

pd.col syntax can now be used in DataFrame.assign() and DataFrame.loc()#

You can now use pd.col to create callables for use in dataframe methods which accept them. For example, if you have a dataframe

In [1]: df = pd.DataFrame({'a': [1, 1, 2], 'b': [4, 5, 6]})

and you want to create a new column 'c' by summing 'a' and 'b', then instead of

In [2]: df.assign(c=lambda df: df['a'] + df['b'])
Out[2]: 
   a  b  c
0  1  4  5
1  1  5  6
2  2  6  8

you can now write:

In [3]: df.assign(c=pd.col('a') + pd.col('b'))
Out[3]: 
   a  b  c
0  1  4  5
1  1  5  6
2  2  6  8
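
Since a pd.col expression evaluates to a callable, it can also be used for boolean row selection with DataFrame.loc; a minimal sketch:

>>> df.loc[pd.col('a') == 1]
   a  b
0  1  4
1  1  5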

New Deprecation Policy#

pandas 3.0.0 introduces a new 3-stage deprecation policy: a DeprecationWarning is used initially, then switched to a FutureWarning for broader visibility in the last minor version before the next major release, and the deprecated functionality is removed in the major release. This was done to give downstream packages more time to adjust to pandas deprecations, which should reduce the number of warnings that a user gets from code that isn’t theirs. See PDEP 17 for more details.

All warnings for upcoming changes in pandas will have the base class pandas.errors.PandasChangeWarning. Users may also use the following subclasses to control warnings.
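
For example, a downstream package or application can silence all such warnings via the base class (a minimal sketch; prefer the narrowest matching subclass where possible):

>>> import warnings
>>> warnings.filterwarnings("ignore", category=pd.errors.PandasChangeWarning)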

Other enhancements#

Notable bug fixes#

These are bug fixes that might have notable behavior changes.

Improved behavior in groupby for observed=False#

A number of bugs have been fixed due to improved handling of unobserved groups (GH 55738). All remarks in this section equally impact SeriesGroupBy.

In previous versions of pandas, a single grouping with DataFrameGroupBy.apply() or DataFrameGroupBy.agg() would pass the unobserved groups to the provided function, resulting in 0 below.

In [4]: df = pd.DataFrame(
   ...:     {
   ...:         "key1": pd.Categorical(list("aabb"), categories=list("abc")),
   ...:         "key2": [1, 1, 1, 2],
   ...:         "values": [1, 2, 3, 4],
   ...:     }
   ...: )

In [5]: df
Out[5]: 
  key1  key2  values
0    a     1       1
1    a     1       2
2    b     1       3
3    b     2       4

In [6]: gb = df.groupby("key1", observed=False)

In [7]: gb[["values"]].apply(lambda x: x.sum())
Out[7]: 
      values
key1        
a          3
b          7
c          0

However, this was not the case when using multiple groupings, resulting in NaN below.

In [1]: gb = df.groupby(["key1", "key2"], observed=False)
In [2]: gb[["values"]].apply(lambda x: x.sum())
Out[2]:
           values
key1 key2        
a    1        3.0
     2        NaN
b    1        3.0
     2        4.0
c    1        NaN
     2        NaN

Now using multiple groupings will also pass the unobserved groups to the provided function.

In [8]: gb = df.groupby(["key1", "key2"], observed=False)

In [9]: gb[["values"]].apply(lambda x: x.sum())
Out[9]: 
           values
key1 key2        
a    1          3
     2          0
b    1          3
     2          4
c    1          0
     2          0

These improvements also fixed certain bugs in groupby.


Backwards incompatible API changes#

Datetime resolution inference#

Converting a sequence of strings, datetime objects, or np.datetime64 objects to a datetime64 dtype now performs inference on the appropriate resolution (AKA unit) for the output dtype. This affects Series, DataFrame, Index, DatetimeIndex, and to_datetime().

Previously, these would always give nanosecond resolution:

In [1]: dt = pd.Timestamp("2024年03月22日 11:36").to_pydatetime()
In [2]: pd.to_datetime([dt]).dtype
Out[2]: dtype('<M8[ns]')
In [3]: pd.Index([dt]).dtype
Out[3]: dtype('<M8[ns]')
In [4]: pd.DatetimeIndex([dt]).dtype
Out[4]: dtype('<M8[ns]')
In [5]: pd.Series([dt]).dtype
Out[5]: dtype('<M8[ns]')

This now infers the microsecond unit "us" from the pydatetime object, matching the scalar Timestamp behavior.

In [10]: dt = pd.Timestamp("2024年03月22日 11:36").to_pydatetime()

In [11]: pd.to_datetime([dt]).dtype
Out[11]: dtype('<M8[us]')

In [12]: pd.Index([dt]).dtype
Out[12]: dtype('<M8[us]')

In [13]: pd.DatetimeIndex([dt]).dtype
Out[13]: dtype('<M8[us]')

In [14]: pd.Series([dt]).dtype
Out[14]: dtype('<M8[us]')

Similarly, when passing a sequence of np.datetime64 objects, the resolution of the passed objects will be retained (or, for lower-than-second resolution, second resolution will be used).
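
For example, a sketch of that rule (assuming the usual import numpy as np):

>>> pd.to_datetime([np.datetime64("2024年03月22日T11:43:01.002", "ms")]).dtype
dtype('<M8[ms]')
>>> pd.to_datetime([np.datetime64("2024年03月22日", "D")]).dtype  # day resolution is below second
dtype('<M8[s]')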

When passing strings, the resolution will depend on the precision of the string, again matching the Timestamp behavior. Previously:

In [2]: pd.to_datetime(["2024年03月22日 11:43:01"]).dtype
Out[2]: dtype('<M8[ns]')
In [3]: pd.to_datetime(["2024年03月22日 11:43:01.002"]).dtype
Out[3]: dtype('<M8[ns]')
In [4]: pd.to_datetime(["2024年03月22日 11:43:01.002003"]).dtype
Out[4]: dtype('<M8[ns]')
In [5]: pd.to_datetime(["2024年03月22日 11:43:01.002003004"]).dtype
Out[5]: dtype('<M8[ns]')

The inferred resolution now matches that of the input strings:

In [15]: pd.to_datetime(["2024年03月22日 11:43:01"]).dtype
Out[15]: dtype('<M8[s]')

In [16]: pd.to_datetime(["2024年03月22日 11:43:01.002"]).dtype
Out[16]: dtype('<M8[ms]')

In [17]: pd.to_datetime(["2024年03月22日 11:43:01.002003"]).dtype
Out[17]: dtype('<M8[us]')

In [18]: pd.to_datetime(["2024年03月22日 11:43:01.002003004"]).dtype
Out[18]: dtype('<M8[ns]')

In cases with mixed-resolution inputs, the highest resolution is used:

In [2]: pd.to_datetime([pd.Timestamp("2024年03月22日 11:43:01"), "2024年03月22日 11:43:01.002"]).dtype
Out[2]: dtype('<M8[ns]')

Changed behavior in DataFrame.value_counts() and DataFrameGroupBy.value_counts() when sort=False#

In previous versions of pandas, DataFrame.value_counts() with sort=False would sort the result by row labels (as was documented). This was nonintuitive and inconsistent with Series.value_counts() which would maintain the order of the input. Now DataFrame.value_counts() will maintain the order of the input.

In [19]: df = pd.DataFrame(
   ....:     {
   ....:         "a": [2, 2, 2, 2, 1, 1, 1, 1],
   ....:         "b": [2, 1, 3, 1, 2, 3, 1, 1],
   ....:     }
   ....: )

In [20]: df
Out[20]: 
   a  b
0  2  2
1  2  1
2  2  3
3  2  1
4  1  2
5  1  3
6  1  1
7  1  1

Old behavior

In [3]: df.value_counts(sort=False)
Out[3]:
a  b
1  1    2
   2    1
   3    1
2  1    2
   2    1
   3    1
Name: count, dtype: int64

New behavior

In [21]: df.value_counts(sort=False)
Out[21]: 
a  b
2  2    1
   1    2
   3    1
1  2    1
   3    1
   1    2
Name: count, dtype: int64

This change also applies to DataFrameGroupBy.value_counts(). Here, there are two options for sorting: one sort passed to DataFrame.groupby() and one passed directly to DataFrameGroupBy.value_counts(). The former will determine whether to sort the groups, the latter whether to sort the counts. All non-grouping columns will maintain the order of the input within groups.

Old behavior

In [5]: df.groupby("a", sort=True).value_counts(sort=False)
Out[5]:
a  b
1  1    2
   2    1
   3    1
2  1    2
   2    1
   3    1
dtype: int64

New behavior

In [22]: df.groupby("a", sort=True).value_counts(sort=False)
Out[22]: 
a  b
1  2    1
   3    1
   1    2
2  2    1
   3    1
   1    2
Name: count, dtype: int64

Changed behavior of pd.offsets.Day to always represent calendar-day#

In previous versions of pandas, offsets.Day represented a fixed span of 24 hours, disregarding Daylight Savings Time transitions. It now consistently behaves as a calendar-day, preserving time-of-day across DST transitions:

Old behavior

In [5]: ts = pd.Timestamp("2025年03月08日 08:00", tz="US/Eastern")
In [6]: ts + pd.offsets.Day(1)
Out[6]: Timestamp('2025年03月09日 09:00:00-0400', tz='US/Eastern')

New behavior

In [23]: ts = pd.Timestamp("2025年03月08日 08:00", tz="US/Eastern")
In [24]: ts + pd.offsets.Day(1)
Out[24]: Timestamp('2025年03月09日 08:00:00-0400', tz='US/Eastern')

This change fixes a long-standing bug in date_range() (GH 51716, GH 35388), but causes several small behavior differences as collateral:

  • pd.offsets.Day(n) no longer compares as equal to pd.offsets.Hour(24*n)

  • offsets.Day no longer supports division

  • Timedelta no longer accepts Day objects as inputs

  • tseries.frequencies.to_offset() on a Timedelta object returns an offsets.Hour object in cases where it used to return a Day object.

  • Adding or subtracting a scalar from a timezone-aware DatetimeIndex with a Day freq no longer preserves that freq attribute.

  • Adding or subtracting a Day offset and a Timedelta is no longer supported.

  • Adding or subtracting a Day offset to or from a timezone-aware Timestamp or datetime-like may lead to an ambiguous or nonexistent time, which will raise, as shown below.
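
A sketch of the last point (the date and zone are chosen to hit a DST fall-back; per the pytz section below, such cases raise ValueError):

>>> ts = pd.Timestamp("2025年11月01日 01:30", tz="US/Eastern")
>>> ts + pd.offsets.Day(1)  # 2025年11月02日 01:30 occurs twice, so this raises ValueError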

Changed treatment of NaN values in pyarrow and numpy-nullable floating dtypes#

Previously, when dealing with a nullable dtype (e.g. Float64Dtype or int64[pyarrow]), NaN was treated as interchangeable with NA in some circumstances but not others. This was done to make adoption easier, but caused some confusion (GH 32265). In 3.0, an option "mode.nan_is_na" (default True) controls whether to treat NaN as equivalent to NA.

With pd.set_option("mode.nan_is_na", True) (again, this is the default), NaN can be passed to constructors, __setitem__, __contains__ and be treated the same as NA. The only change users will see is that arithmetic and np.ufunc operations that previously introduced NaN entries produce NA entries instead:

Old behavior:

In [2]: ser = pd.Series([0, None], dtype=pd.Float64Dtype())
In [3]: ser / 0
Out[3]:
0     NaN
1    <NA>
dtype: Float64

New behavior:

In [25]: ser = pd.Series([0, None], dtype=pd.Float64Dtype())

In [26]: ser / 0
Out[26]: 
0    <NA>
1    <NA>
dtype: Float64

By contrast, with pd.set_option("mode.nan_is_na", False), NaN is always considered distinct and specifically as a floating-point value, so cannot be used with integer dtypes:

Old behavior:

In [2]: ser = pd.Series([1, np.nan], dtype=pd.Float64Dtype())
In [3]: ser[1]
Out[3]: <NA>

New behavior:

In [27]: pd.set_option("mode.nan_is_na", False)
In [28]: ser = pd.Series([1, np.nan], dtype=pd.Float64Dtype())
In [29]: ser[1]
Out[29]: np.float64(nan)

If we had passed pd.Int64Dtype() or "int64[pyarrow]" for the dtype in the latter example, this would raise, as a float NaN cannot be held by an integer dtype.

With "mode.nan_is_na" set to False, ser.to_numpy() (and frame.values and np.asarray(obj)) will convert to object dtype if NA entries are present, where before they would coerce to NaN. To retain a float numpy dtype, explicitly pass na_value=np.nan to Series.to_numpy().

Increased minimum version for Python#

pandas 3.0.0 supports Python 3.11 and higher.

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. The following required dependencies have new minimum versions:

Package   New Minimum Version
numpy     1.26.0
tzdata    2023.3

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package                  New Minimum Version
adbc-driver-postgresql   1.2.0
adbc-driver-sqlite       1.2.0
mypy (dev)               1.9.0
beautifulsoup4           4.12.3
bottleneck               1.4.2
fastparquet              2024.11.0
fsspec                   2024.10.0
hypothesis               6.116.0
gcsfs                    2024.10.0
Jinja2                   3.1.5
lxml                     5.3.0
matplotlib               3.9.3
numba                    0.60.0
numexpr                  2.10.2
qtpy                     2.4.2
openpyxl                 3.1.5
psycopg2                 2.9.10
pyarrow                  13.0.0
pymysql                  1.1.1
pyreadstat               1.2.8
pytables                 3.10.1
python-calamine          0.3.0
pytz                     2024.2
s3fs                     2024.10.0
SciPy                    1.14.1
sqlalchemy               2.0.36
xarray                   2024.10.0
xlsxwriter               3.2.0
zstandard                0.23.0

See Dependencies and Optional dependencies for more.

pytz now an optional dependency#

pandas now uses zoneinfo from the standard library as the default timezone implementation when passing a timezone string to various methods. (GH 34916)

Old behavior:

In [1]: ts = pd.Timestamp(2024, 1, 1).tz_localize("US/Pacific")
In [2]: ts.tz
Out[2]: <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>

New behavior:

In [30]: ts = pd.Timestamp(2024, 1, 1).tz_localize("US/Pacific")
In [31]: ts.tz
Out[31]: zoneinfo.ZoneInfo(key='US/Pacific')

pytz timezone objects are still supported when passed directly, but they will no longer be returned by default from string inputs. Moreover, pytz is no longer a required dependency of pandas, but can be installed via the pip extra: pip install pandas[timezone].

Additionally, pandas no longer throws pytz exceptions for timezone operations leading to ambiguous or nonexistent times. These cases will now raise a ValueError.
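
For example (a sketch; 2025年03月09日 02:30 does not exist in US/Eastern because of the spring-forward transition):

>>> pd.Timestamp("2025年03月09日 02:30").tz_localize("US/Eastern")  # raises ValueError, not a pytz exception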

Other API changes#

  • 3rd party py.path objects are no longer explicitly supported in IO methods. Use pathlib.Path objects instead (GH 57091)

  • read_table()’s parse_dates argument defaults to None to improve consistency with read_csv() (GH 57476)

  • All classes inheriting from builtin tuple (including types created with collections.namedtuple()) are now hashed and compared as builtin tuple during indexing operations (GH 57922)

  • Made dtype a required argument in ExtensionArray._from_sequence_of_strings() (GH 56519)

  • Passing a Series input to json_normalize() will now retain the Series Index, previously output had a new RangeIndex (GH 51452)

  • Removed Index.sort(), which always raised a TypeError. The attribute is no longer defined and accessing it raises an AttributeError (GH 59283)

  • Unused dtype argument has been removed from the MultiIndex constructor (GH 60962)

  • Updated DataFrame.to_excel() so that the output spreadsheet has no styling. Custom styling can still be done using Styler.to_excel() (GH 54154)

  • pickle and HDF (.h5) files created with Python 2 are no longer explicitly supported (GH 57387)

  • pickled objects from pandas version less than 1.0.0 are no longer supported (GH 57155)

  • When comparing the indexes in testing.assert_series_equal(), check_exact defaults to True if an Index is of integer dtype (GH 57386)

  • Index set operations (like union or intersection) will now ignore the dtype of an empty RangeIndex or empty Index with object dtype when determining the dtype of the resulting Index (GH 60797)

  • IncompatibleFrequency now subclasses TypeError instead of ValueError. As a result, joins with mismatched frequencies now cast to object like other non-comparable joins, and arithmetic with indexes with mismatched frequencies align (GH 55782)

  • CategoricalIndex.append() no longer attempts to cast different-dtype indexes to the caller’s dtype (GH 41626)

  • ExtensionDtype.construct_array_type() is now a regular method instead of a classmethod (GH 58663)

  • Comparison operations between Index and Series now consistently return Series regardless of which object is on the left or right (GH 36759)

  • NumPy functions like np.isinf that return a bool dtype when called on an Index object now return a bool-dtype Index instead of np.ndarray (GH 52676); see the sketch below
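
A sketch of the last item (assuming the usual import numpy as np):

>>> idx = pd.Index([1.0, np.inf])
>>> np.isinf(idx)
Index([False, True], dtype='bool')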

Deprecations#

Copy keyword#

The copy keyword argument in the following methods is deprecated and will be removed in a future version:

Copy-on-Write utilizes a lazy copy mechanism that defers copying the data until necessary. Use .copy() to trigger an eager copy. The copy keyword has no effect starting with 3.0, so it can be safely removed from your code.
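
For instance, with a method such as DataFrame.rename() that previously accepted copy (a minimal sketch):

>>> df2 = df.rename(columns=str.upper, copy=True)  # copy is deprecated and has no effect
>>> df3 = df.rename(columns=str.upper).copy()      # use .copy() for an eager copy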

Other Deprecations#

Removal of prior version deprecations/changes#

Enforced deprecation of aliases M, Q, Y, etc. in favour of ME, QE, YE, etc. for offsets#

Renamed the following offset aliases (GH 57986):

offset                   removed alias   new alias
MonthEnd                 M               ME
BusinessMonthEnd         BM              BME
SemiMonthEnd             SM              SME
CustomBusinessMonthEnd   CBM             CBME
QuarterEnd               Q               QE
BQuarterEnd              BQ              BQE
YearEnd                  Y               YE
BYearEnd                 BY              BYE

Other Removals#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

  • Bug in is_year_start where a DatetimeIndex constructed via date_range() with frequency "MS" wouldn't have the correct year or quarter start attributes (GH 57377)

  • Bug in DataFrame raising ValueError when dtype is timedelta64 and data is a list containing None (GH 60064)

  • Bug in Timestamp constructor failing to raise when tz=None is explicitly specified in conjunction with timezone-aware tzinfo or data (GH 48688)

  • Bug in Timestamp constructor failing to raise when given a np.datetime64 object with non-standard unit (GH 25611)

  • Bug in date_range() where the last valid timestamp would sometimes not be produced (GH 56134)

  • Bug in date_range() where using a negative frequency value would not include all points between the start and end values (GH 56147)

  • Bug in tseries.api.guess_datetime_format() would fail to infer time format when "%Y" == "%H%M" (GH 57452)

  • Bug in tseries.frequencies.to_offset() would fail to parse frequency strings starting with "LWOM" (GH 59218)

  • Bug in DataFrame.fillna() raising an AssertionError instead of OutOfBoundsDatetime when filling a datetime64[ns] column with an out-of-bounds timestamp. Now correctly raises OutOfBoundsDatetime. (GH 61208)

  • Bug in DataFrame.min() and DataFrame.max() casting datetime64 and timedelta64 columns to float64 and losing precision (GH 60850)

  • Bug in DataFrame.agg() with a DataFrame containing missing values resulting in IndexError (GH 58810)

  • Bug in DatetimeIndex.is_year_start() and DatetimeIndex.is_quarter_start() not raising on custom business day frequencies bigger than "1C" (GH 58664)

  • Bug in DatetimeIndex.is_year_start() and DatetimeIndex.is_quarter_start() returning False on double-digit frequencies (GH 58523)

  • Bug in DatetimeIndex.union() and DatetimeIndex.intersection() when unit was non-nanosecond (GH 59036)

  • Bug in Index.union() with a pyarrow timestamp dtype incorrectly returning object dtype (GH 58421)

  • Bug in Series.dt.microsecond() producing incorrect results for pyarrow backed Series. (GH 59154)

  • Bug in Timestamp.normalize() and DatetimeArray.normalize() returning incorrect results instead of raising on integer overflow for very small (distant past) values (GH 60583)

  • Bug in Timestamp.replace() failing to update unit attribute when replacement introduces non-zero nanosecond or microsecond (GH 57749)

  • Bug in to_datetime() not respecting dayfirst if an uncommon date string was passed. (GH 58859)

  • Bug in to_datetime() on float array with missing values throwing FloatingPointError (GH 58419)

  • Bug in to_datetime() on a float32 DataFrame with year, month, day, etc. columns leading to precision issues and incorrect results (GH 60506)

  • Bug in to_datetime() reporting an incorrect index in failure scenarios (GH 58298)

  • Bug in to_datetime() with format="ISO8601" and utc=True where naive timestamps incorrectly inherited timezone offset from previous timestamps in a series. (GH 61389)

  • Bug in to_datetime() wrongly converting when arg is a np.datetime64 object with a unit of "ps" (GH 60341)

  • Bug in comparison between objects with np.datetime64 dtype and timestamp[pyarrow] dtypes incorrectly raising TypeError (GH 60937)

  • Bug in comparison between objects with pyarrow date dtype and timestamp[pyarrow] or np.datetime64 dtype failing to consider these as non-comparable (GH 62157)

  • Bug in constructing arrays with ArrowDtype with timestamp type incorrectly allowing Decimal("NaN") (GH 61773)

  • Bug in constructing arrays with a timezone-aware ArrowDtype from timezone-naive datetime objects incorrectly treating those as UTC times instead of wall times like DatetimeTZDtype (GH 61775)

  • Bug in setting scalar values with mismatched resolution into arrays with non-nanosecond datetime64, timedelta64 or DatetimeTZDtype incorrectly truncating those scalars (GH 56410)

Timedelta#

  • Accuracy improvement in Timedelta.to_pytimedelta() to round microseconds consistently for large nanosecond based Timedelta (GH 57841)

  • Bug in Timedelta constructor failing to raise when passed an invalid keyword (GH 53801)

  • Bug in DataFrame.cumsum() which was raising IndexError if dtype is timedelta64[ns] (GH 57956)

  • Bug in multiplication operations with timedelta64 dtype failing to raise TypeError when multiplying by bool objects or dtypes (GH 58054)

Timezones#

  • Bug in DatetimeIndex.union(), DatetimeIndex.intersection(), and DatetimeIndex.symmetric_difference() changing timezone to UTC when merging two DatetimeIndex objects with the same timezone but different units (GH 60080)

  • Bug in Series.dt.tz_localize() with a timezone-aware ArrowDtype incorrectly converting to UTC when tz=None (GH 61780)

  • Fixed bug in date_range() where tz-aware endpoints with calendar offsets (e.g. "MS") failed on DST fall-back. These now respect the ambiguous and nonexistent arguments (GH 52908)

Numeric#

Conversion#

Strings#

Interval#

Indexing#

  • Bug in DataFrame.__getitem__() returning modified columns when called with slice in Python 3.12 (GH 57500)

  • Bug in DataFrame.__getitem__() when slicing a DataFrame with many rows raised an OverflowError (GH 59531)

  • Bug in DataFrame.from_records() throwing a ValueError when passed an empty list in index (GH 58594)

  • Bug in DataFrame.loc() and DataFrame.iloc() returning incorrect dtype when selecting from a DataFrame with mixed data types. (GH 60600)

  • Bug in DataFrame.loc() with inconsistent behavior of loc-set with 2 given indexes to Series (GH 59933)

  • Bug in Index.equals() when comparing between Series with string dtype Index (GH 61099)

  • Bug in Index.get_indexer() and similar methods when NaN is located at or after position 128 (GH 58924)

  • Bug in MultiIndex.insert() when a new value inserted to a datetime-like level gets cast to NaT and fails indexing (GH 60388)

  • Bug in Series.__setitem__() when assigning a boolean Series with a boolean indexer raising LossySetitemError (GH 57338)

  • Bug in printing Index.names and MultiIndex.levels would not escape single quotes (GH 60190)

  • Bug in reindexing of DataFrame with PeriodDtype columns in case of consolidated block (GH 60980, GH 60273)

  • Bug in DataFrame.loc.__getitem__() and DataFrame.iloc.__getitem__() with a CategoricalDtype column with integer categories raising when trying to index a row containing a NaN entry (GH 58954)

  • Bug in Index.__getitem__() incorrectly raising with a 0-dim np.ndarray key (GH 55601)

  • Bug in adding new rows with DataFrame.loc.__setitem__() or Series.loc.__setitem__ which failed to retain dtype on the object’s index in some cases (GH 41626)

  • Bug in indexing on a DatetimeIndex with a timestamp[pyarrow] dtype or on a TimedeltaIndex with a duration[pyarrow] dtype (GH 62277)

Missing#

MultiIndex#

I/O#

Period#

Plotting#

Groupby/resample/rolling#

  • Bug in DataFrameGroupBy.__len__() and SeriesGroupBy.__len__() would raise when the grouping contained NA values and dropna=False (GH 58644)

  • Bug in DataFrameGroupBy.any() that returned True for groups where all Timedelta values are NaT. (GH 59712)

  • Bug in DataFrameGroupBy.groups() and SeriesGroupBy.groups() would fail when the groups were Categorical with an NA value (GH 61356)

  • Bug in DataFrameGroupBy.groups() and SeriesGroupBy.groups() that would not respect the groupby argument dropna (GH 55919)

  • Bug in DataFrameGroupBy.median() where NaT values gave an incorrect result (GH 57926)

  • Bug in DataFrameGroupBy.quantile() with interpolation="nearest" being inconsistent with DataFrame.quantile() (GH 47942)

  • Bug in Resampler.interpolate() on a DataFrame with non-uniform sampling and/or indices not aligning with the resulting resampled index would result in wrong interpolation (GH 21351)

  • Bug in Series.rolling() when used with a BaseIndexer subclass and computing min/max (GH 46726)

  • Bug in DataFrame.ewm() and Series.ewm() when passed times and aggregation functions other than mean (GH 51695)

  • Bug in DataFrame.resample() and Series.resample() were not keeping the index name when the index had ArrowDtype timestamp dtype (GH 61222)

  • Bug in DataFrame.resample() changing index type to MultiIndex when the dataframe is empty and using an upsample method (GH 55572)

  • Bug in DataFrameGroupBy.agg() and SeriesGroupBy.agg() that was returning numpy dtype values when input values are pyarrow dtype values, instead of returning pyarrow dtype values. (GH 53030)

  • Bug in DataFrameGroupBy.agg() that raises AttributeError when there is dictionary input and duplicated columns, instead of returning a DataFrame with the aggregation of all duplicate columns. (GH 55041)

  • Bug in DataFrameGroupBy.agg() where applying a user-defined function to an empty DataFrame returned a Series instead of an empty DataFrame. (GH 61503)

  • Bug in DataFrameGroupBy.apply() and SeriesGroupBy.apply() for empty data frame with group_keys=False still creating output index using group keys. (GH 60471)

  • Bug in DataFrameGroupBy.apply() and SeriesGroupBy.apply() not preserving _metadata attributes from subclassed DataFrames and Series (GH 62134)

  • Bug in DataFrameGroupBy.apply() that was returning a completely empty DataFrame when all return values of func were None instead of returning an empty DataFrame with the original columns and dtypes. (GH 57775)

  • Bug in DataFrameGroupBy.apply() with as_index=False that was returning MultiIndex instead of returning Index. (GH 58291)

  • Bug in DataFrameGroupBy.cumsum() and DataFrameGroupBy.cumprod() where numeric_only parameter was passed indirectly through kwargs instead of passing directly. (GH 58811)

  • Bug in DataFrameGroupBy.cumsum() where it did not return the correct dtype when the label contained None. (GH 58811)

  • Bug in DataFrameGroupBy.transform() and SeriesGroupBy.transform() with a reducer and observed=False coercing dtype to float when there are unobserved categories (GH 55326)

  • Bug in Rolling.apply() for method="table" where column order was not being respected due to the columns getting sorted by default. (GH 59666)

  • Bug in Rolling.apply() where the applied function could be called on fewer than min_period periods if method="table". (GH 58868)

  • Bug in Series.resample() could raise when the date range ended shortly before a non-existent time. (GH 58380)

Reshaping#

Sparse#

ExtensionArray#

  • Bug in Categorical when constructing with an Index with ArrowDtype (GH 60563)

  • Bug in arrays.ArrowExtensionArray.__setitem__() which caused wrong behavior when using an integer array with repeated values as a key (GH 58530)

  • Bug in ArrowExtensionArray.factorize() where NA values were dropped when input was dictionary-encoded even when dropna was set to False (GH 60567)

  • Bug in api.types.is_datetime64_any_dtype() where a custom ExtensionDtype would return False for array-likes (GH 57055)

  • Bug in comparison between object with ArrowDtype and incompatible-dtyped (e.g. string vs bool) incorrectly raising instead of returning all-False (for ==) or all-True (for !=) (GH 59505)

  • Bug in constructing pandas data structures when passing into dtype a string of the type followed by [pyarrow] while PyArrow is not installed would raise NameError rather than ImportError (GH 57928)

  • Bug in various DataFrame reductions for pyarrow temporal dtypes returning incorrect dtype when result was null (GH 59234)

Styler#

  • Bug in Styler.to_latex() where styling column headers failed when combined with a hidden index or hidden index levels.

Other#

Contributors#