Add columns support to JSON loader for selective key filtering #7652


Open
ArjunJagdale wants to merge 11 commits into huggingface:main
from ArjunJagdale:patch-14

Conversation

Contributor

@ArjunJagdale ArjunJagdale commented Jun 27, 2025

Fixes #7594
This PR adds support for loading only specific columns from .json or .jsonl files, similar to how the columns=... argument works for Parquet.

As suggested, load_dataset(...) now accepts the columns=... argument for JSON and JSONL as well, so you can load only the keys you need and skip the rest. This should help in cases where some fields are unclean, inconsistent, or simply unnecessary.

Example:

from datasets import load_dataset
dataset = load_dataset("json", data_files="your_data.jsonl", columns=["id", "title"])
print(dataset["train"].column_names)
# Output: ['id', 'title']

Summary of changes:

  • Added columns: Optional[List[str]] to JsonConfig
  • Updated _generate_tables() to filter selected columns
  • Forwarded columns argument from load_dataset() to the config
  • Added a test for validation (should be fine!)

Let me know if you'd like the same to be added for CSV or other formats as a follow-up. Happy to help.

aihao2000 and liziniu reacted with thumbs up emoji
@ArjunJagdale ArjunJagdale changed the title from "temp1" to "Add columns parameter to JSON loader to filter selected columns during loading" Jun 27, 2025
@ArjunJagdale ArjunJagdale changed the title from "Add columns parameter to JSON loader to filter selected columns during loading" to "Add columns support to JSON loader for selective key filtering" Jun 27, 2025

I need this feature right now. It would be great if it could automatically fill in None for non-existent keys instead of reporting an error.

ArjunJagdale reacted with thumbs up emoji

Contributor Author

I need this feature right now. It would be great if it could automatically fill in None for non-existent keys instead of reporting an error.

Hi @aihao2000, just to confirm: I have made the changes you asked for!
If you pass columns=["key1", "key2", "optional_key"] to load_dataset(...) and any of those keys are missing from the input JSON objects, the loader fills those columns with None values instead of raising an error.
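
To illustrate the behaviour described above, here is a minimal standalone sketch of the intended semantics. It is not the code from this PR: it uses only the standard library, and the helper name project_records and the sample records are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def project_records(lines: Iterable[str], columns: List[str]) -> Iterator[Dict[str, Any]]:
    """Keep only `columns` from each JSON line; absent keys become None."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # dict.get returns None for a missing key, so no KeyError is raised
        yield {name: record.get(name) for name in columns}


# Hypothetical JSONL content: the first record lacks "optional_key"
jsonl_lines = [
    '{"key1": 1, "key2": "a", "noise": [1, 2]}',
    '{"key1": 2, "key2": "b", "optional_key": "present"}',
]
rows = list(project_records(jsonl_lines, ["key1", "key2", "optional_key"]))
```

Every resulting row has exactly the requested keys, in the requested order, and the unrequested "noise" field is dropped.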

aihao2000 reacted with thumbs up emoji

Contributor Author

Hi! Any update on this PR?

Kitsunp reacted with thumbs up emoji Kitsunp reacted with eyes emoji

Member

@lhoestq lhoestq left a comment

Cool! I added a few comments :)

Comment on lines -116 to -131
# Use block_size equal to the chunk size divided by 32 to leverage multithreading
# Set a default minimum value of 16kB if the chunk size is really small
Member

revert this comment deletion and the 2 others

Contributor Author

@ArjunJagdale ArjunJagdale Aug 14, 2025

revert this comment deletion and the 2 others

I wanted clarification on "the 2 others" to make sure I didn't miss any comment restorations. I have restored the two missing comments above. Are they in the right place? :)

Comment on lines 145 to 150
if self.config.columns is not None:
    missing_cols = [col for col in self.config.columns if col not in pa_table.column_names]
    for col in missing_cols:
        pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
    pa_table = pa_table.select(self.config.columns)
yield (file_idx, batch_idx), self._cast_table(pa_table)
Member

I would keep this at the end, where you removed the yield. This way the try/except only covers the paj.read_json call.

for col in missing_cols:
    pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
pa_table = pa_table.select(self.config.columns)
yield (file_idx, batch_idx), self._cast_table(pa_table)
Member

same

Comment on lines 183 to 184
# Pandas fallback in case of ArrowInvalid
try:
Member

this code is not at the right location anymore: it should trigger on ArrowInvalid

Contributor Author

@ArjunJagdale ArjunJagdale Aug 26, 2025

I’ve moved the Pandas fallback into the except pa.ArrowInvalid block. Could you check?

Development

Successfully merging this pull request may close these issues.

Add option to ignore keys/columns when loading a dataset from jsonl(or any other data format)
