
BUG: Test failures on 32-bit x86 with pyarrow installed #57523

Status: Open
Labels: 32bit (32-bit systems), Arrow (pyarrow functionality), Bug
Reported by: @mgorny

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

python -m pytest # ;-)

Issue Description

When running the test suite on 32-bit x86 with pyarrow installed, I'm getting the following test failures (compared to a run without pyarrow):

FAILED pandas/tests/indexes/numeric/test_indexing.py::TestGetIndexer::test_get_indexer_arrow_dictionary_target - AssertionError: numpy array are different
FAILED pandas/tests/interchange/test_impl.py::test_large_string_pyarrow - OverflowError: Python int too large to convert to C ssize_t
FAILED pandas/tests/frame/methods/test_join.py::test_join_on_single_col_dup_on_right[string[pyarrow]] - ValueError: putmask: output array is read-only
FAILED pandas/tests/reshape/merge/test_multi.py::TestMergeMulti::test_left_join_multi_index[False-True] - ValueError: putmask: output array is read-only
Tracebacks
_______________________________________ TestGetIndexer.test_get_indexer_arrow_dictionary_target _______________________________________
[gw8] linux -- Python 3.11.7 /var/tmp/portage/dev-python/pandas-2.2.0-r1/work/pandas-2.2.0-python3_11/install/usr/bin/python3.11
self = <pandas.tests.indexes.numeric.test_indexing.TestGetIndexer object at 0xd9180110>
 def test_get_indexer_arrow_dictionary_target(self):
 pa = pytest.importorskip("pyarrow")
 target = Index(
 ArrowExtensionArray(
 pa.array([1, 2], type=pa.dictionary(pa.int8(), pa.int8()))
 )
 )
 idx = Index([1])
 
 result = idx.get_indexer(target)
 expected = np.array([0, -1], dtype=np.int64)
> tm.assert_numpy_array_equal(result, expected)
E AssertionError: numpy array are different
E 
E Attribute "dtype" are different
E [left]: int32
E [right]: int64
expected = array([ 0, -1], dtype=int64)
idx = Index([1], dtype='int64')
pa = <module 'pyarrow' from '/usr/lib/python3.11/site-packages/pyarrow/__init__.py'>
result = array([ 0, -1], dtype=int32)
self = <pandas.tests.indexes.numeric.test_indexing.TestGetIndexer object at 0xd9180110>
target = Index([1, 2], dtype='dictionary<values=int8, indices=int8, ordered=0>[pyarrow]')
pandas/tests/indexes/numeric/test_indexing.py:406: AssertionError
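The int32-vs-int64 mismatch above is consistent with `get_indexer` returning an `np.intp`-typed indexer, which is 32 bits wide on this platform, while the test hardcodes `int64` in its expectation. A minimal sketch of that platform dependence (my reading of the failure, not a confirmed diagnosis):

```python
import numpy as np

# np.intp tracks the platform pointer size: 4 bytes on 32-bit x86,
# 8 bytes on 64-bit platforms. An indexer built as np.intp therefore
# compares unequal in dtype to a hardcoded int64 expectation on 32-bit,
# even though the values themselves match.
indexer = np.array([0, -1], dtype=np.intp)
expected = np.array([0, -1], dtype=np.int64)

print(np.dtype(np.intp).itemsize)       # 4 on 32-bit x86, 8 on 64-bit
print((indexer == expected).all())      # values agree either way
```

If this is the cause, the test would pass on 32-bit with `dtype=np.intp` in the expected array rather than `np.int64`.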
______________________________________________________ test_large_string_pyarrow ______________________________________________________
[gw8] linux -- Python 3.11.7 /var/tmp/portage/dev-python/pandas-2.2.0-r1/work/pandas-2.2.0-python3_11/install/usr/bin/python3.11
 def test_large_string_pyarrow():
 # GH 52795
 pa = pytest.importorskip("pyarrow", "11.0.0")
 
 arr = ["Mon", "Tue"]
 table = pa.table({"weekday": pa.array(arr, "large_string")})
 exchange_df = table.__dataframe__()
 result = from_dataframe(exchange_df)
 expected = pd.DataFrame({"weekday": ["Mon", "Tue"]})
 tm.assert_frame_equal(result, expected)
 
 # check round-trip
> assert pa.Table.equals(pa.interchange.from_dataframe(result), table)
arr = ['Mon', 'Tue']
exchange_df = <pyarrow.interchange.dataframe._PyArrowDataFrame object at 0xd877c7d0>
expected = weekday
0 Mon
1 Tue
pa = <module 'pyarrow' from '/usr/lib/python3.11/site-packages/pyarrow/__init__.py'>
result = weekday
0 Mon
1 Tue
table = pyarrow.Table
weekday: large_string
----
weekday: [["Mon","Tue"]]
pandas/tests/interchange/test_impl.py:104: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:113: in from_dataframe
 return _from_dataframe(df.__dataframe__(allow_copy=allow_copy),
 allow_copy = True
 df = weekday
0 Mon
1 Tue
/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:136: in _from_dataframe
 batch = protocol_df_chunk_to_pyarrow(chunk, allow_copy)
 allow_copy = True
 batches = []
 chunk = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xccc70410>
 df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xccc70410>
/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:182: in protocol_df_chunk_to_pyarrow
 columns[name] = column_to_array(col, allow_copy)
 allow_copy = True
 col = <pandas.core.interchange.column.PandasColumn object at 0xccc703d0>
 columns = {}
 df = <pandas.core.interchange.dataframe.PandasDataFrameXchg object at 0xccc70410>
 dtype = <DtypeKind.STRING: 21>
 name = 'weekday'
/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:214: in column_to_array
 data = buffers_to_array(buffers, data_type,
 allow_copy = True
 buffers = {'data': (PandasBuffer({'bufsize': 6, 'ptr': 3445199088, 'device': 'CPU'}),
 (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 24, 'ptr': 1546049072, 'device': 'CPU'}),
 (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 2, 'ptr': 1544334624, 'device': 'CPU'}),
 (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
 col = <pandas.core.interchange.column.PandasColumn object at 0xccc703d0>
 data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
/usr/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:396: in buffers_to_array
 data_pa_buffer = pa.foreign_buffer(data_buff.ptr, data_buff.bufsize,
 _ = (<DtypeKind.STRING: 21>, 8, 'u', '=')
 allow_copy = True
 buffers = {'data': (PandasBuffer({'bufsize': 6, 'ptr': 3445199088, 'device': 'CPU'}),
 (<DtypeKind.STRING: 21>, 8, 'u', '=')),
 'offsets': (PandasBuffer({'bufsize': 24, 'ptr': 1546049072, 'device': 'CPU'}),
 (<DtypeKind.INT: 0>, 64, 'l', '=')),
 'validity': (PandasBuffer({'bufsize': 2, 'ptr': 1544334624, 'device': 'CPU'}),
 (<DtypeKind.BOOL: 20>, 8, 'b', '='))}
 data_buff = PandasBuffer({'bufsize': 6, 'ptr': 3445199088, 'device': 'CPU'})
 data_type = (<DtypeKind.STRING: 21>, 8, 'u', '=')
 describe_null = (<ColumnNullType.USE_BYTEMASK: 4>, 0)
 length = 2
 offset = 0
 offset_buff = PandasBuffer({'bufsize': 24, 'ptr': 1546049072, 'device': 'CPU'})
 offset_dtype = (<DtypeKind.INT: 0>, 64, 'l', '=')
 validity_buff = PandasBuffer({'bufsize': 2, 'ptr': 1544334624, 'device': 'CPU'})
 validity_dtype = (<DtypeKind.BOOL: 20>, 8, 'b', '=')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E OverflowError: Python int too large to convert to C ssize_t
pyarrow/io.pxi:1990: OverflowError
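The overflow is raised when the data buffer address (3445199088 in the locals above) is handed to `pa.foreign_buffer`, and that address exceeds the maximum of a signed 32-bit C ssize_t. The arithmetic (the ssize_t width here is my inference from the error message, not verified against the pyarrow source):

```python
# On 32-bit x86, C ssize_t is a signed 32-bit integer.
SSIZE_MAX_32 = 2**31 - 1  # 2147483647

ptr = 3445199088  # data buffer address from the traceback above
print(ptr <= SSIZE_MAX_32)  # → False: the address lies in the upper
                            # half of the 32-bit address space
```

Any allocation that happens to land above 2 GiB would trip this, which would make the failure intermittent on 32-bit hosts.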
________________________________________ test_join_on_single_col_dup_on_right[string[pyarrow]] ________________________________________
[gw5] linux -- Python 3.11.7 /var/tmp/portage/dev-python/pandas-2.2.0-r1/work/pandas-2.2.0-python3_11/install/usr/bin/python3.11
left_no_dup = a b
0 a cat
1 b dog
2 c weasel
3 d horse
right_w_dups = c
a 
<NA> meow
<NA> bark
<NA> um... weasel noise?
<NA> nay
<NA> chirp
e moo
dtype = 'string[pyarrow]'
 @pytest.mark.parametrize("dtype", ["object", "string[pyarrow]"])
 def test_join_on_single_col_dup_on_right(left_no_dup, right_w_dups, dtype):
 # GH 46622
 # Dups on right allowed by one_to_many constraint
 if dtype == "string[pyarrow]":
 pytest.importorskip("pyarrow")
 left_no_dup = left_no_dup.astype(dtype)
 right_w_dups.index = right_w_dups.index.astype(dtype)
> left_no_dup.join(
 right_w_dups,
 on="a",
 validate="one_to_many",
 )
dtype = 'string[pyarrow]'
left_no_dup = a b
0 a cat
1 b dog
2 c weasel
3 d horse
right_w_dups = c
a 
<NA> meow
<NA> bark
<NA> um... weasel noise?
<NA> nay
<NA> chirp
e moo
pandas/tests/frame/methods/test_join.py:169: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/frame.py:10730: in join
 return merge(
 concat = <function concat at 0xec6cdc58>
 how = 'left'
 lsuffix = ''
 merge = <function merge at 0xec226e88>
 on = 'a'
 other = c
a 
<NA> meow
<NA> bark
<NA> um... weasel noise?
<NA> nay
<NA> chirp
e moo
 rsuffix = ''
 self = a b
0 a cat
1 b dog
2 c weasel
3 d horse
 sort = False
 validate = 'one_to_many'
pandas/core/reshape/merge.py:184: in merge
 return op.get_result(copy=copy)
 copy = None
 how = 'left'
 indicator = False
 left = a b
0 a cat
1 b dog
2 c weasel
3 d horse
 left_df = a b
0 a cat
1 b dog
2 c weasel
3 d horse
 left_index = False
 left_on = 'a'
 on = None
 op = <pandas.core.reshape.merge._MergeOperation object at 0xcbf60b30>
 right = c
a 
<NA> meow
<NA> bark
<NA> um... weasel noise?
<NA> nay
<NA> chirp
e moo
 right_df = c
a 
<NA> meow
<NA> bark
<NA> um... weasel noise?
<NA> nay
<NA> chirp
e moo
 right_index = True
 right_on = None
 sort = False
 suffixes = ('', '')
 validate = 'one_to_many'
pandas/core/reshape/merge.py:886: in get_result
 join_index, left_indexer, right_indexer = self._get_join_info()
 copy = None
 self = <pandas.core.reshape.merge._MergeOperation object at 0xcbf60b30>
pandas/core/reshape/merge.py:1142: in _get_join_info
 join_index, left_indexer, right_indexer = _left_join_on_index(
 left_ax = RangeIndex(start=0, stop=4, step=1)
 right_ax = Index([<NA>, <NA>, <NA>, <NA>, <NA>, 'e'], dtype='string', name='a')
 self = <pandas.core.reshape.merge._MergeOperation object at 0xcbf60b30>
pandas/core/reshape/merge.py:2385: in _left_join_on_index
 left_key, right_key, count = _factorize_keys(lkey, rkey, sort=sort)
 join_keys = [<ArrowStringArray>
['a', 'b', 'c', 'd']
Length: 4, dtype: string]
 left_ax = RangeIndex(start=0, stop=4, step=1)
 lkey = <ArrowStringArray>
['a', 'b', 'c', 'd']
Length: 4, dtype: string
 right_ax = Index([<NA>, <NA>, <NA>, <NA>, <NA>, 'e'], dtype='string', name='a')
 rkey = <ArrowStringArray>
[<NA>, <NA>, <NA>, <NA>, <NA>, 'e']
Length: 6, dtype: string
 sort = False
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
lk = <pyarrow.lib.ChunkedArray object at 0xcba48fa0>
[
 [
 "a",
 "b",
 "c",
 "d"
 ]
]
rk = <pyarrow.lib.ChunkedArray object at 0xcbf280f0>
[
 [
 null,
 null,
 null,
 null,
 null,
 "e"
 ]
]
sort = False
 def _factorize_keys(
 lk: ArrayLike, rk: ArrayLike, sort: bool = True
 ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]:
 """
 Encode left and right keys as enumerated types.
 
 This is used to get the join indexers to be used when merging DataFrames.
 
 Parameters
 ----------
 lk : ndarray, ExtensionArray
 Left key.
 rk : ndarray, ExtensionArray
 Right key.
 sort : bool, defaults to True
 If True, the encoding is done such that the unique elements in the
 keys are sorted.
 
 Returns
 -------
 np.ndarray[np.intp]
 Left (resp. right if called with `key='right'`) labels, as enumerated type.
 np.ndarray[np.intp]
 Right (resp. left if called with `key='right'`) labels, as enumerated type.
 int
 Number of unique elements in union of left and right labels.
 
 See Also
 --------
 merge : Merge DataFrame or named Series objects
 with a database-style join.
 algorithms.factorize : Encode the object as an enumerated type
 or categorical variable.
 
 Examples
 --------
 >>> lk = np.array(["a", "c", "b"])
 >>> rk = np.array(["a", "c"])
 
 Here, the unique values are `'a', 'b', 'c'`. With the default
 `sort=True`, the encoding will be `{0: 'a', 1: 'b', 2: 'c'}`:
 
 >>> pd.core.reshape.merge._factorize_keys(lk, rk)
 (array([0, 2, 1]), array([0, 2]), 3)
 
 With the `sort=False`, the encoding will correspond to the order
 in which the unique elements first appear: `{0: 'a', 1: 'c', 2: 'b'}`:
 
 >>> pd.core.reshape.merge._factorize_keys(lk, rk, sort=False)
 (array([0, 1, 2]), array([0, 1]), 3)
 """
 # TODO: if either is a RangeIndex, we can likely factorize more efficiently?
 
 if (
 isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype)
 ) or (lib.is_np_dtype(lk.dtype, "M") and lib.is_np_dtype(rk.dtype, "M")):
 # Extract the ndarray (UTC-localized) values
 # Note: we dont need the dtypes to match, as these can still be compared
 lk, rk = cast("DatetimeArray", lk)._ensure_matching_resos(rk)
 lk = cast("DatetimeArray", lk)._ndarray
 rk = cast("DatetimeArray", rk)._ndarray
 
 elif (
 isinstance(lk.dtype, CategoricalDtype)
 and isinstance(rk.dtype, CategoricalDtype)
 and lk.dtype == rk.dtype
 ):
 assert isinstance(lk, Categorical)
 assert isinstance(rk, Categorical)
 # Cast rk to encoding so we can compare codes with lk
 
 rk = lk._encode_with_my_categories(rk)
 
 lk = ensure_int64(lk.codes)
 rk = ensure_int64(rk.codes)
 
 elif isinstance(lk, ExtensionArray) and lk.dtype == rk.dtype:
 if (isinstance(lk.dtype, ArrowDtype) and is_string_dtype(lk.dtype)) or (
 isinstance(lk.dtype, StringDtype)
 and lk.dtype.storage in ["pyarrow", "pyarrow_numpy"]
 ):
 import pyarrow as pa
 import pyarrow.compute as pc
 
 len_lk = len(lk)
 lk = lk._pa_array # type: ignore[attr-defined]
 rk = rk._pa_array # type: ignore[union-attr]
 dc = (
 pa.chunked_array(lk.chunks + rk.chunks) # type: ignore[union-attr]
 .combine_chunks()
 .dictionary_encode()
 )
 
 llab, rlab, count = (
 pc.fill_null(dc.indices[slice(len_lk)], -1)
 .to_numpy()
 .astype(np.intp, copy=False),
 pc.fill_null(dc.indices[slice(len_lk, None)], -1)
 .to_numpy()
 .astype(np.intp, copy=False),
 len(dc.dictionary),
 )
 
 if sort:
 uniques = dc.dictionary.to_numpy(zero_copy_only=False)
 llab, rlab = _sort_labels(uniques, llab, rlab)
 
 if dc.null_count > 0:
 lmask = llab == -1
 lany = lmask.any()
 rmask = rlab == -1
 rany = rmask.any()
 if lany:
 np.putmask(llab, lmask, count)
 if rany:
> np.putmask(rlab, rmask, count)
E ValueError: putmask: output array is read-only
count = 5
dc = <pyarrow.lib.DictionaryArray object at 0xcb83ffb0>
-- dictionary:
 [
 "a",
 "b",
 "c",
 "d",
 "e"
 ]
-- indices:
 [
 0,
 1,
 2,
 3,
 null,
 null,
 null,
 null,
 null,
 4
 ]
lany = False
len_lk = 4
lk = <pyarrow.lib.ChunkedArray object at 0xcba48fa0>
[
 [
 "a",
 "b",
 "c",
 "d"
 ]
]
llab = array([0, 1, 2, 3])
lmask = array([False, False, False, False])
pa = <module 'pyarrow' from '/usr/lib/python3.11/site-packages/pyarrow/__init__.py'>
pc = <module 'pyarrow.compute' from '/usr/lib/python3.11/site-packages/pyarrow/compute.py'>
rany = True
rk = <pyarrow.lib.ChunkedArray object at 0xcbf280f0>
[
 [
 null,
 null,
 null,
 null,
 null,
 "e"
 ]
]
rlab = array([-1, -1, -1, -1, -1, 4])
rmask = array([ True, True, True, True, True, False])
sort = False
pandas/core/reshape/merge.py:2514: ValueError
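The `putmask: output array is read-only` failure looks like another `np.intp`-width effect: `_factorize_keys` does `.to_numpy().astype(np.intp, copy=False)` on the dictionary indices. On 64-bit that cast changes dtype and so forces a fresh, writable copy; on 32-bit `np.intp` already matches the 32-bit indices, so the zero-copy path can hand back a read-only view of the Arrow buffer. The mechanism can be sketched with plain NumPy (my hypothesis for the root cause, not a confirmed diagnosis):

```python
import numpy as np

# Simulate a read-only buffer, as returned zero-copy from Arrow.
indices = np.array([-1, -1, 4], dtype=np.int32)
indices.setflags(write=False)

# astype(..., copy=False) returns the same read-only array when the
# target dtype already matches (the 32-bit np.intp case); a differing
# dtype (the 64-bit case) would instead force a writable copy.
labels = indices.astype(np.int32, copy=False)

try:
    np.putmask(labels, labels == -1, 5)
    outcome = "writable"
except ValueError:
    outcome = "read-only"
print(outcome)  # → read-only
```

If that reading is right, an unconditional copy (or a writability check before `np.putmask`) in `_factorize_keys` would fix both putmask failures.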
________________________________________ TestMergeMulti.test_left_join_multi_index[False-True] ________________________________________
[gw3] linux -- Python 3.11.7 /var/tmp/portage/dev-python/pandas-2.2.0-r1/work/pandas-2.2.0-python3_11/install/usr/bin/python3.11
self = <pandas.tests.reshape.merge.test_multi.TestMergeMulti object at 0xd20d8ff0>, sort = False, infer_string = True
 @pytest.mark.parametrize(
 "infer_string", [False, pytest.param(True, marks=td.skip_if_no("pyarrow"))]
 )
 @pytest.mark.parametrize("sort", [True, False])
 def test_left_join_multi_index(self, sort, infer_string):
 with option_context("future.infer_string", infer_string):
 icols = ["1st", "2nd", "3rd"]
 
 def bind_cols(df):
 iord = lambda a: 0 if a != a else ord(a)
 f = lambda ts: ts.map(iord) - ord("a")
 return f(df["1st"]) + f(df["3rd"]) * 1e2 + df["2nd"].fillna(0) * 10
 
 def run_asserts(left, right, sort):
 res = left.join(right, on=icols, how="left", sort=sort)
 
 assert len(left) < len(res) + 1
 assert not res["4th"].isna().any()
 assert not res["5th"].isna().any()
 
 tm.assert_series_equal(res["4th"], -res["5th"], check_names=False)
 result = bind_cols(res.iloc[:, :-2])
 tm.assert_series_equal(res["4th"], result, check_names=False)
 assert result.name is None
 
 if sort:
 tm.assert_frame_equal(res, res.sort_values(icols, kind="mergesort"))
 
 out = merge(left, right.reset_index(), on=icols, sort=sort, how="left")
 
 res.index = RangeIndex(len(res))
 tm.assert_frame_equal(out, res)
 
 lc = list(map(chr, np.arange(ord("a"), ord("z") + 1)))
 left = DataFrame(
 np.random.default_rng(2).choice(lc, (50, 2)), columns=["1st", "3rd"]
 )
 # Explicit cast to float to avoid implicit cast when setting nan
 left.insert(
 1,
 "2nd",
 np.random.default_rng(2).integers(0, 10, len(left)).astype("float"),
 )
 
 i = np.random.default_rng(2).permutation(len(left))
 right = left.iloc[i].copy()
 
 left["4th"] = bind_cols(left)
 right["5th"] = -bind_cols(right)
 right.set_index(icols, inplace=True)
 
 run_asserts(left, right, sort)
 
 # inject some nulls
 left.loc[1::4, "1st"] = np.nan
 left.loc[2::5, "2nd"] = np.nan
 left.loc[3::6, "3rd"] = np.nan
 left["4th"] = bind_cols(left)
 
 i = np.random.default_rng(2).permutation(len(left))
 right = left.iloc[i, :-1]
 right["5th"] = -bind_cols(right)
 right.set_index(icols, inplace=True)
 
> run_asserts(left, right, sort)
bind_cols = <function TestMergeMulti.test_left_join_multi_index.<locals>.bind_cols at 0xc9928b18>
i = array([ 6, 40, 33, 38, 7, 46, 28, 45, 5, 34, 12, 18, 27, 3, 9, 39, 42,
 23, 0, 26, 4, 10, 14, 41, 16, 43, 15, 48, 13, 24, 20, 25, 22, 49,
 2, 11, 32, 44, 47, 17, 19, 37, 21, 29, 31, 30, 35, 36, 8, 1])
icols = ['1st', '2nd', '3rd']
infer_string = True
lc = ['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']
left = 1st 2nd 3rd 4th
0 v 8.0 g 701.0
1 NaN 2.0 h 623.0
2 k NaN v 2110.0
3 l 2.0 NaN -9669.0
4 i 4.0 p 1548.0
5 NaN 8.0 s 1783.0
6 z 4.0 e 465.0
7 w NaN b 122.0
8 o 3.0 h 744.0
9 NaN 6.0 NaN -9737.0
10 h 8.0 o 1487.0
11 g 7.0 d 376.0
12 t NaN l 1119.0
13 NaN 1.0 r 1613.0
14 y 8.0 k 1104.0
15 f 0.0 NaN -9695.0
16 y 5.0 z 2574.0
17 NaN NaN r 1603.0
18 j 2.0 k 1029.0
19 b 6.0 e 461.0
20 i 3.0 i 838.0
21 NaN 5.0 NaN -9747.0
22 s NaN x 2318.0
23 w 1.0 u 2032.0
24 z 7.0 i 895.0
25 NaN 4.0 y 2343.0
26 f 6.0 m 1265.0
27 o NaN NaN -9686.0
28 s 9.0 c 308.0
29 NaN 4.0 c 143.0
30 y 2.0 f 544.0
31 l 6.0 w 2271.0
32 n NaN r 1713.0
33 NaN 9.0 NaN -9707.0
34 p 8.0 q 1695.0
35 l 6.0 k 1071.0
36 p 3.0 n 1345.0
37 NaN NaN p 1403.0
38 m 0.0 w 2212.0
39 f 1.0 NaN -9685.0
40 m 3.0 x 2342.0
41 NaN 3.0 p 1433.0
42 b NaN v 2101.0
43 l 5.0 m 1261.0
44 c 6.0 s 1862.0
45 NaN 8.0 NaN -9717.0
46 t 8.0 n 1399.0
47 b NaN f 501.0
48 g 9.0 c 296.0
49 NaN 3.0 b 33.0
right = 5th
1st 2nd 3rd 
z 4.0 e -465.0
m 3.0 x -2342.0
nan 9.0 nan 9707.0
m 0.0 w -2212.0
w NaN b -122.0
t 8.0 n -1399.0
s 9.0 c -308.0
nan 8.0 nan 9717.0
 s -1783.0
p 8.0 q -1695.0
t NaN l -1119.0
j 2.0 k -1029.0
o NaN nan 9686.0
l 2.0 nan 9669.0
nan 6.0 nan 9737.0
f 1.0 nan 9685.0
b NaN v -2101.0
w 1.0 u -2032.0
v 8.0 g -701.0
f 6.0 m -1265.0
i 4.0 p -1548.0
h 8.0 o -1487.0
y 8.0 k -1104.0
nan 3.0 p -1433.0
y 5.0 z -2574.0
l 5.0 m -1261.0
f 0.0 nan 9695.0
g 9.0 c -296.0
nan 1.0 r -1613.0
z 7.0 i -895.0
i 3.0 i -838.0
nan 4.0 y -2343.0
s NaN x -2318.0
nan 3.0 b -33.0
k NaN v -2110.0
g 7.0 d -376.0
n NaN r -1713.0
c 6.0 s -1862.0
b NaN f -501.0
nan NaN r -1603.0
b 6.0 e -461.0
nan NaN p -1403.0
 5.0 nan 9747.0
 4.0 c -143.0
l 6.0 w -2271.0
y 2.0 f -544.0
l 6.0 k -1071.0
p 3.0 n -1345.0
o 3.0 h -744.0
nan 2.0 h -623.0
run_asserts = <function TestMergeMulti.test_left_join_multi_index.<locals>.run_asserts at 0xc9928bb8>
self = <pandas.tests.reshape.merge.test_multi.TestMergeMulti object at 0xd20d8ff0>
sort = False
pandas/tests/reshape/merge/test_multi.py:158: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/tests/reshape/merge/test_multi.py:108: in run_asserts
 res = left.join(right, on=icols, how="left", sort=sort)
 bind_cols = <function TestMergeMulti.test_left_join_multi_index.<locals>.bind_cols at 0xc9928b18>
 icols = ['1st', '2nd', '3rd']
 left = 1st 2nd 3rd 4th
0 v 8.0 g 701.0
1 NaN 2.0 h 623.0
2 k NaN v 2110.0
3 l 2.0 NaN -9669.0
4 i 4.0 p 1548.0
5 NaN 8.0 s 1783.0
6 z 4.0 e 465.0
7 w NaN b 122.0
8 o 3.0 h 744.0
9 NaN 6.0 NaN -9737.0
10 h 8.0 o 1487.0
11 g 7.0 d 376.0
12 t NaN l 1119.0
13 NaN 1.0 r 1613.0
14 y 8.0 k 1104.0
15 f 0.0 NaN -9695.0
16 y 5.0 z 2574.0
17 NaN NaN r 1603.0
18 j 2.0 k 1029.0
19 b 6.0 e 461.0
20 i 3.0 i 838.0
21 NaN 5.0 NaN -9747.0
22 s NaN x 2318.0
23 w 1.0 u 2032.0
24 z 7.0 i 895.0
25 NaN 4.0 y 2343.0
26 f 6.0 m 1265.0
27 o NaN NaN -9686.0
28 s 9.0 c 308.0
29 NaN 4.0 c 143.0
30 y 2.0 f 544.0
31 l 6.0 w 2271.0
32 n NaN r 1713.0
33 NaN 9.0 NaN -9707.0
34 p 8.0 q 1695.0
35 l 6.0 k 1071.0
36 p 3.0 n 1345.0
37 NaN NaN p 1403.0
38 m 0.0 w 2212.0
39 f 1.0 NaN -9685.0
40 m 3.0 x 2342.0
41 NaN 3.0 p 1433.0
42 b NaN v 2101.0
43 l 5.0 m 1261.0
44 c 6.0 s 1862.0
45 NaN 8.0 NaN -9717.0
46 t 8.0 n 1399.0
47 b NaN f 501.0
48 g 9.0 c 296.0
49 NaN 3.0 b 33.0
 right = 5th
1st 2nd 3rd 
z 4.0 e -465.0
m 3.0 x -2342.0
nan 9.0 nan 9707.0
m 0.0 w -2212.0
w NaN b -122.0
t 8.0 n -1399.0
s 9.0 c -308.0
nan 8.0 nan 9717.0
 s -1783.0
p 8.0 q -1695.0
t NaN l -1119.0
j 2.0 k -1029.0
o NaN nan 9686.0
l 2.0 nan 9669.0
nan 6.0 nan 9737.0
f 1.0 nan 9685.0
b NaN v -2101.0
w 1.0 u -2032.0
v 8.0 g -701.0
f 6.0 m -1265.0
i 4.0 p -1548.0
h 8.0 o -1487.0
y 8.0 k -1104.0
nan 3.0 p -1433.0
y 5.0 z -2574.0
l 5.0 m -1261.0
f 0.0 nan 9695.0
g 9.0 c -296.0
nan 1.0 r -1613.0
z 7.0 i -895.0
i 3.0 i -838.0
nan 4.0 y -2343.0
s NaN x -2318.0
nan 3.0 b -33.0
k NaN v -2110.0
g 7.0 d -376.0
n NaN r -1713.0
c 6.0 s -1862.0
b NaN f -501.0
nan NaN r -1603.0
b 6.0 e -461.0
nan NaN p -1403.0
 5.0 nan 9747.0
 4.0 c -143.0
l 6.0 w -2271.0
y 2.0 f -544.0
l 6.0 k -1071.0
p 3.0 n -1345.0
o 3.0 h -744.0
nan 2.0 h -623.0
 sort = False
pandas/core/frame.py:10730: in join
 return merge(
 concat = <function concat at 0xec6bec58>
 how = 'left'
 lsuffix = ''
 merge = <function merge at 0xec217e88>
 on = ['1st', '2nd', '3rd']
 other = 5th
1st 2nd 3rd 
z 4.0 e -465.0
m 3.0 x -2342.0
nan 9.0 nan 9707.0
m 0.0 w -2212.0
w NaN b -122.0
t 8.0 n -1399.0
s 9.0 c -308.0
nan 8.0 nan 9717.0
 s -1783.0
p 8.0 q -1695.0
t NaN l -1119.0
j 2.0 k -1029.0
o NaN nan 9686.0
l 2.0 nan 9669.0
nan 6.0 nan 9737.0
f 1.0 nan 9685.0
b NaN v -2101.0
w 1.0 u -2032.0
v 8.0 g -701.0
f 6.0 m -1265.0
i 4.0 p -1548.0
h 8.0 o -1487.0
y 8.0 k -1104.0
nan 3.0 p -1433.0
y 5.0 z -2574.0
l 5.0 m -1261.0
f 0.0 nan 9695.0
g 9.0 c -296.0
nan 1.0 r -1613.0
z 7.0 i -895.0
i 3.0 i -838.0
nan 4.0 y -2343.0
s NaN x -2318.0
nan 3.0 b -33.0
k NaN v -2110.0
g 7.0 d -376.0
n NaN r -1713.0
c 6.0 s -1862.0
b NaN f -501.0
nan NaN r -1603.0
b 6.0 e -461.0
nan NaN p -1403.0
 5.0 nan 9747.0
 4.0 c -143.0
l 6.0 w -2271.0
y 2.0 f -544.0
l 6.0 k -1071.0
p 3.0 n -1345.0
o 3.0 h -744.0
nan 2.0 h -623.0
 rsuffix = ''
 self = 1st 2nd 3rd 4th
0 v 8.0 g 701.0
1 NaN 2.0 h 623.0
2 k NaN v 2110.0
3 l 2.0 NaN -9669.0
4 i 4.0 p 1548.0
5 NaN 8.0 s 1783.0
6 z 4.0 e 465.0
7 w NaN b 122.0
8 o 3.0 h 744.0
9 NaN 6.0 NaN -9737.0
10 h 8.0 o 1487.0
11 g 7.0 d 376.0
12 t NaN l 1119.0
13 NaN 1.0 r 1613.0
14 y 8.0 k 1104.0
15 f 0.0 NaN -9695.0
16 y 5.0 z 2574.0
17 NaN NaN r 1603.0
18 j 2.0 k 1029.0
19 b 6.0 e 461.0
20 i 3.0 i 838.0
21 NaN 5.0 NaN -9747.0
22 s NaN x 2318.0
23 w 1.0 u 2032.0
24 z 7.0 i 895.0
25 NaN 4.0 y 2343.0
26 f 6.0 m 1265.0
27 o NaN NaN -9686.0
28 s 9.0 c 308.0
29 NaN 4.0 c 143.0
30 y 2.0 f 544.0
31 l 6.0 w 2271.0
32 n NaN r 1713.0
33 NaN 9.0 NaN -9707.0
34 p 8.0 q 1695.0
35 l 6.0 k 1071.0
36 p 3.0 n 1345.0
37 NaN NaN p 1403.0
38 m 0.0 w 2212.0
39 f 1.0 NaN -9685.0
40 m 3.0 x 2342.0
41 NaN 3.0 p 1433.0
42 b NaN v 2101.0
43 l 5.0 m 1261.0
44 c 6.0 s 1862.0
45 NaN 8.0 NaN -9717.0
46 t 8.0 n 1399.0
47 b NaN f 501.0
48 g 9.0 c 296.0
49 NaN 3.0 b 33.0
 sort = False
 validate = None
pandas/core/reshape/merge.py:184: in merge
 return op.get_result(copy=copy)
 copy = None
 how = 'left'
 indicator = False
 left = 1st 2nd 3rd 4th
0 v 8.0 g 701.0
1 NaN 2.0 h 623.0
2 k NaN v 2110.0
3 l 2.0 NaN -9669.0
4 i 4.0 p 1548.0
5 NaN 8.0 s 1783.0
6 z 4.0 e 465.0
7 w NaN b 122.0
8 o 3.0 h 744.0
9 NaN 6.0 NaN -9737.0
10 h 8.0 o 1487.0
11 g 7.0 d 376.0
12 t NaN l 1119.0
13 NaN 1.0 r 1613.0
14 y 8.0 k 1104.0
15 f 0.0 NaN -9695.0
16 y 5.0 z 2574.0
17 NaN NaN r 1603.0
18 j 2.0 k 1029.0
19 b 6.0 e 461.0
20 i 3.0 i 838.0
21 NaN 5.0 NaN -9747.0
22 s NaN x 2318.0
23 w 1.0 u 2032.0
24 z 7.0 i 895.0
25 NaN 4.0 y 2343.0
26 f 6.0 m 1265.0
27 o NaN NaN -9686.0
28 s 9.0 c 308.0
29 NaN 4.0 c 143.0
30 y 2.0 f 544.0
31 l 6.0 w 2271.0
32 n NaN r 1713.0
33 NaN 9.0 NaN -9707.0
34 p 8.0 q 1695.0
35 l 6.0 k 1071.0
36 p 3.0 n 1345.0
37 NaN NaN p 1403.0
38 m 0.0 w 2212.0
39 f 1.0 NaN -9685.0
40 m 3.0 x 2342.0
41 NaN 3.0 p 1433.0
42 b NaN v 2101.0
43 l 5.0 m 1261.0
44 c 6.0 s 1862.0
45 NaN 8.0 NaN -9717.0
46 t 8.0 n 1399.0
47 b NaN f 501.0
48 g 9.0 c 296.0
49 NaN 3.0 b 33.0
 left_df = 1st 2nd 3rd 4th
0 v 8.0 g 701.0
1 NaN 2.0 h 623.0
2 k NaN v 2110.0
3 l 2.0 NaN -9669.0
4 i 4.0 p 1548.0
5 NaN 8.0 s 1783.0
6 z 4.0 e 465.0
7 w NaN b 122.0
8 o 3.0 h 744.0
9 NaN 6.0 NaN -9737.0
10 h 8.0 o 1487.0
11 g 7.0 d 376.0
12 t NaN l 1119.0
13 NaN 1.0 r 1613.0
14 y 8.0 k 1104.0
15 f 0.0 NaN -9695.0
16 y 5.0 z 2574.0
17 NaN NaN r 1603.0
18 j 2.0 k 1029.0
19 b 6.0 e 461.0
20 i 3.0 i 838.0
21 NaN 5.0 NaN -9747.0
22 s NaN x 2318.0
23 w 1.0 u 2032.0
24 z 7.0 i 895.0
25 NaN 4.0 y 2343.0
26 f 6.0 m 1265.0
27 o NaN NaN -9686.0
28 s 9.0 c 308.0
29 NaN 4.0 c 143.0
30 y 2.0 f 544.0
31 l 6.0 w 2271.0
32 n NaN r 1713.0
33 NaN 9.0 NaN -9707.0
34 p 8.0 q 1695.0
35 l 6.0 k 1071.0
36 p 3.0 n 1345.0
37 NaN NaN p 1403.0
38 m 0.0 w 2212.0
39 f 1.0 NaN -9685.0
40 m 3.0 x 2342.0
41 NaN 3.0 p 1433.0
42 b NaN v 2101.0
43 l 5.0 m 1261.0
44 c 6.0 s 1862.0
45 NaN 8.0 NaN -9717.0
46 t 8.0 n 1399.0
47 b NaN f 501.0
48 g 9.0 c 296.0
49 NaN 3.0 b 33.0
 left_index = False
 left_on = ['1st', '2nd', '3rd']
 on = None
 op = <pandas.core.reshape.merge._MergeOperation object at 0xc9857b30>
 right = 5th
1st 2nd 3rd 
z 4.0 e -465.0
m 3.0 x -2342.0
nan 9.0 nan 9707.0
m 0.0 w -2212.0
w NaN b -122.0
t 8.0 n -1399.0
s 9.0 c -308.0
nan 8.0 nan 9717.0
 s -1783.0
p 8.0 q -1695.0
t NaN l -1119.0
j 2.0 k -1029.0
o NaN nan 9686.0
l 2.0 nan 9669.0
nan 6.0 nan 9737.0
f 1.0 nan 9685.0
b NaN v -2101.0
w 1.0 u -2032.0
v 8.0 g -701.0
f 6.0 m -1265.0
i 4.0 p -1548.0
h 8.0 o -1487.0
y 8.0 k -1104.0
nan 3.0 p -1433.0
y 5.0 z -2574.0
l 5.0 m -1261.0
f 0.0 nan 9695.0
g 9.0 c -296.0
nan 1.0 r -1613.0
z 7.0 i -895.0
i 3.0 i -838.0
nan 4.0 y -2343.0
s NaN x -2318.0
nan 3.0 b -33.0
k NaN v -2110.0
g 7.0 d -376.0
n NaN r -1713.0
c 6.0 s -1862.0
b NaN f -501.0
nan NaN r -1603.0
b 6.0 e -461.0
nan NaN p -1403.0
 5.0 nan 9747.0
 4.0 c -143.0
l 6.0 w -2271.0
y 2.0 f -544.0
l 6.0 k -1071.0
p 3.0 n -1345.0
o 3.0 h -744.0
nan 2.0 h -623.0
 right_df = 5th
1st 2nd 3rd 
z 4.0 e -465.0
m 3.0 x -2342.0
nan 9.0 nan 9707.0
m 0.0 w -2212.0
w NaN b -122.0
t 8.0 n -1399.0
s 9.0 c -308.0
nan 8.0 nan 9717.0
 s -1783.0
p 8.0 q -1695.0
t NaN l -1119.0
j 2.0 k -1029.0
o NaN nan 9686.0
l 2.0 nan 9669.0
nan 6.0 nan 9737.0
f 1.0 nan 9685.0
b NaN v -2101.0
w 1.0 u -2032.0
v 8.0 g -701.0
f 6.0 m -1265.0
i 4.0 p -1548.0
h 8.0 o -1487.0
y 8.0 k -1104.0
nan 3.0 p -1433.0
y 5.0 z -2574.0
l 5.0 m -1261.0
f 0.0 nan 9695.0
g 9.0 c -296.0
nan 1.0 r -1613.0
z 7.0 i -895.0
i 3.0 i -838.0
nan 4.0 y -2343.0
s NaN x -2318.0
nan 3.0 b -33.0
k NaN v -2110.0
g 7.0 d -376.0
n NaN r -1713.0
c 6.0 s -1862.0
b NaN f -501.0
nan NaN r -1603.0
b 6.0 e -461.0
nan NaN p -1403.0
 5.0 nan 9747.0
 4.0 c -143.0
l 6.0 w -2271.0
y 2.0 f -544.0
l 6.0 k -1071.0
p 3.0 n -1345.0
o 3.0 h -744.0
nan 2.0 h -623.0
 right_index = True
 right_on = None
 sort = False
 suffixes = ('', '')
 validate = None
pandas/core/reshape/merge.py:886: in get_result
 join_index, left_indexer, right_indexer = self._get_join_info()
 copy = None
 self = <pandas.core.reshape.merge._MergeOperation object at 0xc9857b30>
pandas/core/reshape/merge.py:1142: in _get_join_info
 join_index, left_indexer, right_indexer = _left_join_on_index(
 left_ax = RangeIndex(start=0, stop=50, step=1)
 right_ax = MultiIndex([('z', 4.0, 'e'),
 ('m', 3.0, 'x'),
 (nan, 9.0, nan),
 ('m', 0.0, 'w'),
 ('w', nan, 'b'),
 ('t', 8.0, 'n'),
 ('s', 9.0, 'c'),
 (nan, 8.0, nan),
 (nan, 8.0, 's'),
 ('p', 8.0, 'q'),
 ('t', nan, 'l'),
 ('j', 2.0, 'k'),
 ('o', nan, nan),
 ('l', 2.0, nan),
 (nan, 6.0, nan),
 ('f', 1.0, nan),
 ('b', nan, 'v'),
 ('w', 1.0, 'u'),
 ('v', 8.0, 'g'),
 ('f', 6.0, 'm'),
 ('i', 4.0, 'p'),
 ('h', 8.0, 'o'),
 ('y', 8.0, 'k'),
 (nan, 3.0, 'p'),
 ('y', 5.0, 'z'),
 ('l', 5.0, 'm'),
 ('f', 0.0, nan),
 ('g', 9.0, 'c'),
 (nan, 1.0, 'r'),
 ('z', 7.0, 'i'),
 ('i', 3.0, 'i'),
 (nan, 4.0, 'y'),
 ('s', nan, 'x'),
 (nan, 3.0, 'b'),
 ('k', nan, 'v'),
 ('g', 7.0, 'd'),
 ('n', nan, 'r'),
 ('c', 6.0, 's'),
 ('b', nan, 'f'),
 (nan, nan, 'r'),
 ('b', 6.0, 'e'),
 (nan, nan, 'p'),
 (nan, 5.0, nan),
 (nan, 4.0, 'c'),
 ('l', 6.0, 'w'),
 ('y', 2.0, 'f'),
 ('l', 6.0, 'k'),
 ('p', 3.0, 'n'),
 ('o', 3.0, 'h'),
 (nan, 2.0, 'h')],
 names=['1st', '2nd', '3rd'])
 self = <pandas.core.reshape.merge._MergeOperation object at 0xc9857b30>
pandas/core/reshape/merge.py:2375: in _left_join_on_index
 lkey, rkey = _get_multiindex_indexer(join_keys, right_ax, sort=sort)
 join_keys = [<ArrowStringArrayNumpySemantics>
['v', nan, 'k', 'l', 'i', nan, 'z', 'w', 'o', nan, 'h', 'g', 't', nan, 'y',
 'f', 'y', nan, 'j', 'b', 'i', nan, 's', 'w', 'z', nan, 'f', 'o', 's', nan,
 'y', 'l', 'n', nan, 'p', 'l', 'p', nan, 'm', 'f', 'm', nan, 'b', 'l', 'c',
 nan, 't', 'b', 'g', nan]
Length: 50, dtype: string,
 array([ 8., 2., nan, 2., 4., 8., 4., nan, 3., 6., 8., 7., nan,
 1., 8., 0., 5., nan, 2., 6., 3., 5., nan, 1., 7., 4.,
 6., nan, 9., 4., 2., 6., nan, 9., 8., 6., 3., nan, 0.,
 1., 3., 3., nan, 5., 6., 8., 8., nan, 9., 3.]),
 <ArrowStringArrayNumpySemantics>
['g', 'h', 'v', nan, 'p', 's', 'e', 'b', 'h', nan, 'o', 'd', 'l', 'r', 'k',
 nan, 'z', 'r', 'k', 'e', 'i', nan, 'x', 'u', 'i', 'y', 'm', nan, 'c', 'c',
 'f', 'w', 'r', nan, 'q', 'k', 'n', 'p', 'w', nan, 'x', 'p', 'v', 'm', 's',
 nan, 'n', 'f', 'c', 'b']
Length: 50, dtype: string]
 left_ax = RangeIndex(start=0, stop=50, step=1)
 right_ax = MultiIndex([('z', 4.0, 'e'),
 ('m', 3.0, 'x'),
 (nan, 9.0, nan),
 ('m', 0.0, 'w'),
 ('w', nan, 'b'),
 ('t', 8.0, 'n'),
 ('s', 9.0, 'c'),
 (nan, 8.0, nan),
 (nan, 8.0, 's'),
 ('p', 8.0, 'q'),
 ('t', nan, 'l'),
 ('j', 2.0, 'k'),
 ('o', nan, nan),
 ('l', 2.0, nan),
 (nan, 6.0, nan),
 ('f', 1.0, nan),
 ('b', nan, 'v'),
 ('w', 1.0, 'u'),
 ('v', 8.0, 'g'),
 ('f', 6.0, 'm'),
 ('i', 4.0, 'p'),
 ('h', 8.0, 'o'),
 ('y', 8.0, 'k'),
 (nan, 3.0, 'p'),
 ('y', 5.0, 'z'),
 ('l', 5.0, 'm'),
 ('f', 0.0, nan),
 ('g', 9.0, 'c'),
 (nan, 1.0, 'r'),
 ('z', 7.0, 'i'),
 ('i', 3.0, 'i'),
 (nan, 4.0, 'y'),
 ('s', nan, 'x'),
 (nan, 3.0, 'b'),
 ('k', nan, 'v'),
 ('g', 7.0, 'd'),
 ('n', nan, 'r'),
 ('c', 6.0, 's'),
 ('b', nan, 'f'),
 (nan, nan, 'r'),
 ('b', 6.0, 'e'),
 (nan, nan, 'p'),
 (nan, 5.0, nan),
 (nan, 4.0, 'c'),
 ('l', 6.0, 'w'),
 ('y', 2.0, 'f'),
 ('l', 6.0, 'k'),
 ('p', 3.0, 'n'),
 ('o', 3.0, 'h'),
 (nan, 2.0, 'h')],
 names=['1st', '2nd', '3rd'])
 sort = False
pandas/core/reshape/merge.py:2309: in _get_multiindex_indexer
 zipped = zip(*mapped)
 index = MultiIndex([('z', 4.0, 'e'),
 ('m', 3.0, 'x'),
 (nan, 9.0, nan),
 ('m', 0.0, 'w'),
 ('w', nan, 'b'),
 ('t', 8.0, 'n'),
 ('s', 9.0, 'c'),
 (nan, 8.0, nan),
 (nan, 8.0, 's'),
 ('p', 8.0, 'q'),
 ('t', nan, 'l'),
 ('j', 2.0, 'k'),
 ('o', nan, nan),
 ('l', 2.0, nan),
 (nan, 6.0, nan),
 ('f', 1.0, nan),
 ('b', nan, 'v'),
 ('w', 1.0, 'u'),
 ('v', 8.0, 'g'),
 ('f', 6.0, 'm'),
 ('i', 4.0, 'p'),
 ('h', 8.0, 'o'),
 ('y', 8.0, 'k'),
 (nan, 3.0, 'p'),
 ('y', 5.0, 'z'),
 ('l', 5.0, 'm'),
 ('f', 0.0, nan),
 ('g', 9.0, 'c'),
 (nan, 1.0, 'r'),
 ('z', 7.0, 'i'),
 ('i', 3.0, 'i'),
 (nan, 4.0, 'y'),
 ('s', nan, 'x'),
 (nan, 3.0, 'b'),
 ('k', nan, 'v'),
 ('g', 7.0, 'd'),
 ('n', nan, 'r'),
 ('c', 6.0, 's'),
 ('b', nan, 'f'),
 (nan, nan, 'r'),
 ('b', 6.0, 'e'),
 (nan, nan, 'p'),
 (nan, 5.0, nan),
 (nan, 4.0, 'c'),
 ('l', 6.0, 'w'),
 ('y', 2.0, 'f'),
 ('l', 6.0, 'k'),
 ('p', 3.0, 'n'),
 ('o', 3.0, 'h'),
 (nan, 2.0, 'h')],
 names=['1st', '2nd', '3rd'])
 join_keys = [<ArrowStringArrayNumpySemantics>
['v', nan, 'k', 'l', 'i', nan, 'z', 'w', 'o', nan, 'h', 'g', 't', nan, 'y',
 'f', 'y', nan, 'j', 'b', 'i', nan, 's', 'w', 'z', nan, 'f', 'o', 's', nan,
 'y', 'l', 'n', nan, 'p', 'l', 'p', nan, 'm', 'f', 'm', nan, 'b', 'l', 'c',
 nan, 't', 'b', 'g', nan]
Length: 50, dtype: string,
 array([ 8., 2., nan, 2., 4., 8., 4., nan, 3., 6., 8., 7., nan,
 1., 8., 0., 5., nan, 2., 6., 3., 5., nan, 1., 7., 4.,
 6., nan, 9., 4., 2., 6., nan, 9., 8., 6., 3., nan, 0.,
 1., 3., 3., nan, 5., 6., 8., 8., nan, 9., 3.]),
 <ArrowStringArrayNumpySemantics>
['g', 'h', 'v', nan, 'p', 's', 'e', 'b', 'h', nan, 'o', 'd', 'l', 'r', 'k',
 nan, 'z', 'r', 'k', 'e', 'i', nan, 'x', 'u', 'i', 'y', 'm', nan, 'c', 'c',
 'f', 'w', 'r', nan, 'q', 'k', 'n', 'p', 'w', nan, 'x', 'p', 'v', 'm', 's',
 nan, 'n', 'f', 'c', 'b']
Length: 50, dtype: string]
 mapped = <generator object _get_multiindex_indexer.<locals>.<genexpr> at 0xc9b7a710>
 sort = False
pandas/core/reshape/merge.py:2306: in <genexpr>
 _factorize_keys(index.levels[n]._values, join_keys[n], sort=sort)
 .0 = <range_iterator object at 0xc9856710>
 index = MultiIndex([('z', 4.0, 'e'),
 ('m', 3.0, 'x'),
 (nan, 9.0, nan),
 ('m', 0.0, 'w'),
 ('w', nan, 'b'),
 ('t', 8.0, 'n'),
 ('s', 9.0, 'c'),
 (nan, 8.0, nan),
 (nan, 8.0, 's'),
 ('p', 8.0, 'q'),
 ('t', nan, 'l'),
 ('j', 2.0, 'k'),
 ('o', nan, nan),
 ('l', 2.0, nan),
 (nan, 6.0, nan),
 ('f', 1.0, nan),
 ('b', nan, 'v'),
 ('w', 1.0, 'u'),
 ('v', 8.0, 'g'),
 ('f', 6.0, 'm'),
 ('i', 4.0, 'p'),
 ('h', 8.0, 'o'),
 ('y', 8.0, 'k'),
 (nan, 3.0, 'p'),
 ('y', 5.0, 'z'),
 ('l', 5.0, 'm'),
 ('f', 0.0, nan),
 ('g', 9.0, 'c'),
 (nan, 1.0, 'r'),
 ('z', 7.0, 'i'),
 ('i', 3.0, 'i'),
 (nan, 4.0, 'y'),
 ('s', nan, 'x'),
 (nan, 3.0, 'b'),
 ('k', nan, 'v'),
 ('g', 7.0, 'd'),
 ('n', nan, 'r'),
 ('c', 6.0, 's'),
 ('b', nan, 'f'),
 (nan, nan, 'r'),
 ('b', 6.0, 'e'),
 (nan, nan, 'p'),
 (nan, 5.0, nan),
 (nan, 4.0, 'c'),
 ('l', 6.0, 'w'),
 ('y', 2.0, 'f'),
 ('l', 6.0, 'k'),
 ('p', 3.0, 'n'),
 ('o', 3.0, 'h'),
 (nan, 2.0, 'h')],
 names=['1st', '2nd', '3rd'])
 join_keys = [<ArrowStringArrayNumpySemantics>
['v', nan, 'k', 'l', 'i', nan, 'z', 'w', 'o', nan, 'h', 'g', 't', nan, 'y',
 'f', 'y', nan, 'j', 'b', 'i', nan, 's', 'w', 'z', nan, 'f', 'o', 's', nan,
 'y', 'l', 'n', nan, 'p', 'l', 'p', nan, 'm', 'f', 'm', nan, 'b', 'l', 'c',
 nan, 't', 'b', 'g', nan]
Length: 50, dtype: string,
 array([ 8., 2., nan, 2., 4., 8., 4., nan, 3., 6., 8., 7., nan,
 1., 8., 0., 5., nan, 2., 6., 3., 5., nan, 1., 7., 4.,
 6., nan, 9., 4., 2., 6., nan, 9., 8., 6., 3., nan, 0.,
 1., 3., 3., nan, 5., 6., 8., 8., nan, 9., 3.]),
 <ArrowStringArrayNumpySemantics>
['g', 'h', 'v', nan, 'p', 's', 'e', 'b', 'h', nan, 'o', 'd', 'l', 'r', 'k',
 nan, 'z', 'r', 'k', 'e', 'i', nan, 'x', 'u', 'i', 'y', 'm', nan, 'c', 'c',
 'f', 'w', 'r', nan, 'q', 'k', 'n', 'p', 'w', nan, 'x', 'p', 'v', 'm', 's',
 nan, 'n', 'f', 'c', 'b']
Length: 50, dtype: string]
 n = 0
 sort = False
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
lk = <pyarrow.lib.ChunkedArray object at 0xc983bb18>
[
 [
 "b",
 "c",
 "f",
 "g",
 "h",
 ...
 "t",
 "v",
 "w",
 "y",
 "z"
 ]
]
rk = <pyarrow.lib.ChunkedArray object at 0xc984c7f8>
[
 [
 "v",
 null,
 "k",
 "l",
 "i",
 ...
 null,
 "t",
 "b",
 "g",
 null
 ]
]
sort = False
 def _factorize_keys(
 lk: ArrayLike, rk: ArrayLike, sort: bool = True
 ) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]:
 """
 Encode left and right keys as enumerated types.
 
 This is used to get the join indexers to be used when merging DataFrames.
 
 Parameters
 ----------
 lk : ndarray, ExtensionArray
 Left key.
 rk : ndarray, ExtensionArray
 Right key.
 sort : bool, defaults to True
 If True, the encoding is done such that the unique elements in the
 keys are sorted.
 
 Returns
 -------
 np.ndarray[np.intp]
 Left (resp. right if called with `key='right'`) labels, as enumerated type.
 np.ndarray[np.intp]
 Right (resp. left if called with `key='right'`) labels, as enumerated type.
 int
 Number of unique elements in union of left and right labels.
 
 See Also
 --------
 merge : Merge DataFrame or named Series objects
 with a database-style join.
 algorithms.factorize : Encode the object as an enumerated type
 or categorical variable.
 
 Examples
 --------
 >>> lk = np.array(["a", "c", "b"])
 >>> rk = np.array(["a", "c"])
 
 Here, the unique values are `'a', 'b', 'c'`. With the default
 `sort=True`, the encoding will be `{0: 'a', 1: 'b', 2: 'c'}`:
 
 >>> pd.core.reshape.merge._factorize_keys(lk, rk)
 (array([0, 2, 1]), array([0, 2]), 3)
 
 With the `sort=False`, the encoding will correspond to the order
 in which the unique elements first appear: `{0: 'a', 1: 'c', 2: 'b'}`:
 
 >>> pd.core.reshape.merge._factorize_keys(lk, rk, sort=False)
 (array([0, 1, 2]), array([0, 1]), 3)
 """
 # TODO: if either is a RangeIndex, we can likely factorize more efficiently?
 
 if (
 isinstance(lk.dtype, DatetimeTZDtype) and isinstance(rk.dtype, DatetimeTZDtype)
 ) or (lib.is_np_dtype(lk.dtype, "M") and lib.is_np_dtype(rk.dtype, "M")):
 # Extract the ndarray (UTC-localized) values
 # Note: we dont need the dtypes to match, as these can still be compared
 lk, rk = cast("DatetimeArray", lk)._ensure_matching_resos(rk)
 lk = cast("DatetimeArray", lk)._ndarray
 rk = cast("DatetimeArray", rk)._ndarray
 
 elif (
 isinstance(lk.dtype, CategoricalDtype)
 and isinstance(rk.dtype, CategoricalDtype)
 and lk.dtype == rk.dtype
 ):
 assert isinstance(lk, Categorical)
 assert isinstance(rk, Categorical)
 # Cast rk to encoding so we can compare codes with lk
 
 rk = lk._encode_with_my_categories(rk)
 
 lk = ensure_int64(lk.codes)
 rk = ensure_int64(rk.codes)
 
 elif isinstance(lk, ExtensionArray) and lk.dtype == rk.dtype:
 if (isinstance(lk.dtype, ArrowDtype) and is_string_dtype(lk.dtype)) or (
 isinstance(lk.dtype, StringDtype)
 and lk.dtype.storage in ["pyarrow", "pyarrow_numpy"]
 ):
 import pyarrow as pa
 import pyarrow.compute as pc
 
 len_lk = len(lk)
 lk = lk._pa_array # type: ignore[attr-defined]
 rk = rk._pa_array # type: ignore[union-attr]
 dc = (
 pa.chunked_array(lk.chunks + rk.chunks) # type: ignore[union-attr]
 .combine_chunks()
 .dictionary_encode()
 )
 
 llab, rlab, count = (
 pc.fill_null(dc.indices[slice(len_lk)], -1)
 .to_numpy()
 .astype(np.intp, copy=False),
 pc.fill_null(dc.indices[slice(len_lk, None)], -1)
 .to_numpy()
 .astype(np.intp, copy=False),
 len(dc.dictionary),
 )
 
 if sort:
 uniques = dc.dictionary.to_numpy(zero_copy_only=False)
 llab, rlab = _sort_labels(uniques, llab, rlab)
 
 if dc.null_count > 0:
 lmask = llab == -1
 lany = lmask.any()
 rmask = rlab == -1
 rany = rmask.any()
 if lany:
 np.putmask(llab, lmask, count)
 if rany:
> np.putmask(rlab, rmask, count)
E ValueError: putmask: output array is read-only
count = 19
dc = <pyarrow.lib.DictionaryArray object at 0xc9083ed0>
-- dictionary:
 [
 "b",
 "c",
 "f",
 "g",
 "h",
 "i",
 "j",
 "k",
 "l",
 "m",
 "n",
 "o",
 "p",
 "s",
 "t",
 "v",
 "w",
 "y",
 "z"
 ]
-- indices:
 [
 0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 ...
 9,
 null,
 0,
 8,
 1,
 null,
 14,
 0,
 3,
 null
 ]
lany = False
len_lk = 19
lk = <pyarrow.lib.ChunkedArray object at 0xc983bb18>
[
 [
 "b",
 "c",
 "f",
 "g",
 "h",
 ...
 "t",
 "v",
 "w",
 "y",
 "z"
 ]
]
llab = array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
 17, 18])
lmask = array([False, False, False, False, False, False, False, False, False,
 False, False, False, False, False, False, False, False, False,
 False])
pa = <module 'pyarrow' from '/usr/lib/python3.11/site-packages/pyarrow/__init__.py'>
pc = <module 'pyarrow.compute' from '/usr/lib/python3.11/site-packages/pyarrow/compute.py'>
rany = True
rk = <pyarrow.lib.ChunkedArray object at 0xc984c7f8>
[
 [
 "v",
 null,
 "k",
 "l",
 "i",
 ...
 null,
 "t",
 "b",
 "g",
 null
 ]
]
rlab = array([15, -1, 7, 8, 5, -1, 18, 16, 11, -1, 4, 3, 14, -1, 17, 2, 17,
 -1, 6, 0, 5, -1, 13, 16, 18, -1, 2, 11, 13, -1, 17, 8, 10, -1,
 12, 8, 12, -1, 9, 2, 9, -1, 0, 8, 1, -1, 14, 0, 3, -1])
rmask = array([False, True, False, False, False, True, False, False, False,
 True, False, False, False, True, False, False, False, True,
 False, False, False, True, False, False, False, True, False,
 False, False, True, False, False, False, True, False, False,
 False, True, False, False, False, True, False, False, False,
 True, False, False, False, True])
sort = False
pandas/core/reshape/merge.py:2514: ValueError
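For context (my reading of the traceback, not something pandas documents): on a 32-bit build `np.intp` is `int32`, the same width as the dictionary indices pyarrow produces, so the `.astype(np.intp, copy=False)` in `_factorize_keys` is presumably a no-op over a zero-copy, read-only buffer, and the later `np.putmask` then writes into that read-only array. A minimal sketch of the failure mode, using a plain NumPy array with the write flag cleared to stand in for the zero-copy view:

```python
import numpy as np

# Stand-in for the read-only zero-copy array that to_numpy() can return.
rlab = np.array([15, -1, 7], dtype=np.intp)
rlab.setflags(write=False)

rmask = rlab == -1
try:
    np.putmask(rlab, rmask, 19)  # same call as in _factorize_keys
except ValueError as exc:
    print(exc)  # putmask: output array is read-only

# Forcing a writable copy (copy=True instead of copy=False) avoids it:
writable = rlab.astype(np.intp, copy=True)
np.putmask(writable, rmask, 19)
```

On 64-bit builds the `astype(np.intp, copy=False)` widens int32 indices to int64 and therefore copies anyway, which would explain why only 32-bit hits this.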

Full build & test log (2.5M .gz, 52M uncompressed): pandas.txt.gz

This is on a Gentoo/x86 systemd-nspawn container. I'm building with `-O2 -march=pentium-m -mfpmath=sse -pipe` to rule out i387-specific precision issues.

I've also filed apache/arrow#40153 for test failures in pyarrow itself. Some of those could possibly be bugs in pandas instead.

Expected Behavior

Tests passing ;-).

Installed Versions

INSTALLED VERSIONS

commit : fd3f571
python : 3.11.7.final.0
python-bits : 32
OS : Linux
OS-release : 6.7.5-gentoo-dist
Version : #1 SMP PREEMPT_DYNAMIC Sat Feb 17 07:30:27 -00 2024
machine : x86_64
processor : AMD Ryzen 5 3600 6-Core Processor
byteorder : little
LC_ALL : None
LANG : C.UTF8
LOCALE : en_US.UTF-8

pandas : 2.2.0
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.0.3
pip : None
Cython : 3.0.5
pytest : 7.4.4
hypothesis : 6.98.3
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 4.9.4
html5lib : 1.1
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.3
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.27
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.2.0
xlrd : 2.0.1
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

Metadata

Assignees

No one assigned

    Labels

    32bit (32-bit systems) · Arrow (pyarrow functionality) · Bug
