PARTIAL FIX: Improve leading zeros preservation with dtype=str for dict-based dtypes #62242


Open

dxdc wants to merge 20 commits into pandas-dev:main

from dxdc:patch-2


+98 −0

dxdc commented Sep 2, 2025 (edited)

Summary

This PR partially addresses issue #57666 by improving leading zeros preservation when dtype=str is used with dictionary-based dtype specifications. While the global dtype=str issue with pyarrow engine remains unfixed, this PR resolves the problem for more targeted dtype specifications.

Problem

Issue #57666 reports that the pyarrow engine does not preserve leading zeros in numeric-looking strings when dtype=str is specified, while other engines correctly preserve them.

Solution

- Fixed: Dictionary-based dtype specifications (dtype={'col': str}) now properly preserve leading zeros across all engines
- Partial: Global dtype=str still fails with pyarrow engine (marked with xfail for now)
- Added: Test coverage for dtype specification patterns

What's Fixed vs Still Broken

✅ Now Working:

# This now preserves leading zeros correctly across all engines:
pd.read_csv(data, dtype={'col2': str, 'col3': int, 'col4': str})

⚠️ Still Broken (pyarrow only):

# This still strips leading zeros with pyarrow engine:
pd.read_csv(data, dtype=str) # global string dtype
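
The two call patterns above can be compared with a small self-contained script. This is a minimal sketch, not code from the PR: the CSV contents and the names `CSV_BYTES` and `read_with` are made up for illustration, and the pyarrow comparison only runs if pyarrow is installed. No output is claimed for the pyarrow engine, since its behavior depends on the pandas version in use.

```python
import importlib.util
import io

import pandas as pd

# Hypothetical sample data: col2 and col4 hold numeric-looking strings
# with leading zeros, col3 holds ordinary integers.
CSV_BYTES = b"col1,col2,col3,col4\na,007,42,0010\nb,0123,7,000\n"


def read_with(engine, dtype):
    """Read the sample CSV with the given parser engine and dtype spec."""
    return pd.read_csv(io.BytesIO(CSV_BYTES), engine=engine, dtype=dtype)


# Dict-based dtype: only the listed columns are forced to the given types.
per_column = {"col2": str, "col3": int, "col4": str}
per_column_df = read_with("c", per_column)
print(per_column_df["col2"].tolist())  # ['007', '0123']

# Global dtype=str: every column is read as a string.
global_df = read_with("c", str)
print(global_df["col4"].tolist())  # ['0010', '000']

# Optional comparison against the pyarrow engine, the case this PR describes
# as still broken for global dtype=str.
if importlib.util.find_spec("pyarrow") is not None:
    print(read_with("pyarrow", per_column)["col2"].tolist())
    print(read_with("pyarrow", str)["col2"].tolist())
```

Running the last two lines before and after the change is one way to see which of the two dtype patterns is affected on a given installation.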

Next Steps

This PR provides a foundation for the complete fix. The remaining work involves:

- Fully resolving the pyarrow engine's global dtype handling
- Removing the xfail marker once completely resolved
- Improving the pyarrow engine's dtype enforcement during parsing rather than post-processing conversion

Checklist

- Tests added and passed
- All code checks passed
- closes #57666 (BUG: pyarrow stripping leading zeros with dtype=str)
- Added entry in doc/source/whatsnew/v3.0.0.rst

Files Changed

pandas/io/parsers/arrow_parser_wrapper.py - Fix for dict-based dtypes
pandas/tests/io/parser/test_preserve_leading_zeros.py - Comprehensive test suite

Test Output

C engine: ✅ All tests pass
Python engine: ✅ All tests pass
PyArrow engine:
- Dict-based dtypes now pass (with strings)
- ⚠️ Global dtype=str marked as xfail (temporary)
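
The test layout described above could look roughly like the following. This is only a sketch of the xfail pattern under stated assumptions: the test name, the sample data, and the `ENGINES` list are hypothetical and are not taken from test_preserve_leading_zeros.py, and the pandas test suite may structure engine parametrization differently.

```python
import io

import pandas as pd
import pytest

# Hypothetical sample data with leading zeros in col2.
CSV_BYTES = b"col1,col2\na,007\nb,0123\n"

# The pyarrow case is expected to fail until global dtype=str is fixed, so it
# carries an xfail marker instead of being skipped or deleted.
ENGINES = [
    "c",
    "python",
    pytest.param(
        "pyarrow",
        marks=pytest.mark.xfail(
            reason="global dtype=str strips leading zeros with pyarrow",
            strict=False,
        ),
    ),
]


@pytest.mark.parametrize("engine", ENGINES)
def test_global_str_dtype_preserves_leading_zeros(engine):
    if engine == "pyarrow":
        # Skip rather than fail when pyarrow is not installed.
        pytest.importorskip("pyarrow")
    result = pd.read_csv(io.BytesIO(CSV_BYTES), engine=engine, dtype=str)
    assert result["col2"].tolist() == ["007", "0123"]
```

With `strict=False`, the pyarrow case reports as xpassed rather than failing once the underlying behavior is fixed, which is a signal that the marker can be removed.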

dxdc added 2 commits


jbrockmendel commented Sep 2, 2025
dxdc commented Sep 2, 2025
dxdc commented Sep 3, 2025
jbrockmendel commented Sep 3, 2025
jbrockmendel Sep 3, 2025
dxdc Sep 3, 2025
jbrockmendel Sep 3, 2025
dxdc Sep 3, 2025
jbrockmendel commented Sep 3, 2025
dxdc commented Sep 3, 2025
dxdc commented Sep 4, 2025 (edited)

Known Issues / Remaining Work
