feat(load): fallback to load_from_disk() when loading a saved dataset directory #7653

Open
ArjunJagdale wants to merge 1 commit into huggingface:main
from ArjunJagdale:patch-15

@ArjunJagdale ArjunJagdale commented Jun 28, 2025

Related Issue

Fixes #7503
Partially addresses #5044 by allowing load_dataset() to auto-detect and gracefully delegate to load_from_disk() for locally saved datasets.


What does this PR do?

This PR introduces a minimal fallback mechanism in load_dataset() that detects when the provided path points to a dataset saved using save_to_disk(), and automatically redirects to load_from_disk().

🐛 Before (unexpected metadata-only rows):

```python
ds = load_dataset("/path/to/saved_dataset")
# → returns rows with only internal metadata (_data_files, _fingerprint, etc.)
```

✅ After (graceful fallback):

```python
ds = load_dataset("/path/to/saved_dataset")
# → logs a warning and internally switches to load_from_disk()
```
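
For reviewers, here is a minimal sketch of how such a fallback could work. It is not the code in this PR's diff. It assumes that `save_to_disk()` writes `state.json` plus `dataset_info.json` for a single `Dataset`, and `dataset_dict.json` for a `DatasetDict`. The helper names `looks_like_saved_dataset` and `load_dataset_with_fallback` are hypothetical and exist only for illustration.

```python
import os
import warnings

# File names assumed to be written by save_to_disk() in current `datasets` releases.
STATE_JSON = "state.json"
DATASET_INFO_JSON = "dataset_info.json"
DATASET_DICT_JSON = "dataset_dict.json"


def looks_like_saved_dataset(path: str) -> bool:
    """Hypothetical helper: True if `path` looks like a save_to_disk() directory."""
    if not os.path.isdir(path):
        return False
    # A single Dataset is assumed to carry both a state file and an info file.
    is_dataset = os.path.isfile(os.path.join(path, STATE_JSON)) and os.path.isfile(
        os.path.join(path, DATASET_INFO_JSON)
    )
    # A DatasetDict is assumed to carry a top-level index file.
    is_dataset_dict = os.path.isfile(os.path.join(path, DATASET_DICT_JSON))
    return is_dataset or is_dataset_dict


def load_dataset_with_fallback(path: str, **kwargs):
    """Hypothetical wrapper: delegate to load_from_disk() for saved directories."""
    # Imported lazily so the detection helper stays usable without `datasets`.
    from datasets import load_dataset, load_from_disk

    if looks_like_saved_dataset(path):
        warnings.warn(
            f"'{path}' looks like a dataset saved with save_to_disk(); "
            "falling back to load_from_disk()."
        )
        return load_from_disk(path)
    return load_dataset(path, **kwargs)
```

In this sketch the check only fires for local directories that contain the expected marker files, so Hub repository IDs and ordinary data folders still go through `load_dataset()` unchanged.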

Why is this useful?

  • Prevents confusion when reloading local datasets saved via save_to_disk().
  • Enables smoother compatibility with frameworks (e.g., TRL, lighteval) that rely on load_dataset() calls.
  • Fully backward-compatible — hub-based loading, custom builders, and streaming remain untouched.

...`load_dataset`
### Related Issue
Fixes huggingface#7503
### What does this PR do?
This PR introduces a fallback mechanism in `load_dataset()` that detects when the input `path` points to a dataset previously saved using `save_to_disk()`, and automatically redirects to `load_from_disk(path)`.
Previously, calling `load_dataset("/path/to/saved/dataset")` would misinterpret the local structure and return incorrect metadata rows. Now:
```python
# Before: unexpected result
ds = load_dataset("my_saved_dataset")  # Misinterprets metadata
# After: correct behavior
ds = load_dataset("my_saved_dataset")  # Auto-switches to load_from_disk()
```

Development

Successfully merging this pull request may close these issues.

Inconsistency between load_dataset and load_from_disk functionality
