Add columns support to JSON loader for selective key filtering #7652


Open

ArjunJagdale wants to merge 11 commits into huggingface:main from ArjunJagdale:patch-14

ArjunJagdale (Contributor) commented Jun 27, 2025 (edited)

Fixes #7594
This PR adds support for filtering specific columns when loading datasets from .json or .jsonl files — similar to how the columns=... argument works for Parquet.

As suggested, support for the columns=... argument (previously available for Parquet) has now been extended to JSON and JSONL loading via load_dataset(...). You can now load only specific keys/columns and skip the rest — which should help in cases where some fields are unclean, inconsistent, or just unnecessary.

Example:

from datasets import load_dataset
dataset = load_dataset("json", data_files="your_data.jsonl", columns=["id", "title"])
print(dataset["train"].column_names)
# Output: ['id', 'title']

Summary of changes:

Added columns: Optional[List[str]] to JsonConfig
Updated _generate_tables() to filter selected columns
Forwarded columns argument from load_dataset() to the config
Added a test for validation (should be fine!)
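
To make the intended behaviour concrete, here is a minimal standard-library sketch of selective key filtering over JSON Lines input. It is not the code from this PR: the helper name `select_keys` and the sample records are hypothetical, and the real loader works on Arrow tables rather than Python dicts.

```python
import json


def select_keys(jsonl_text, columns):
    """Keep only the requested keys from each JSON Lines record (hypothetical helper)."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Drop every key that was not requested; preserve the requested order.
        rows.append({key: record[key] for key in columns if key in record})
    return rows


# Made-up sample data: the "noise" field has inconsistent types across rows.
sample = (
    '{"id": 1, "title": "a", "noise": [1, 2]}\n'
    '{"id": 2, "title": "b", "noise": "x"}\n'
)
rows = select_keys(sample, ["id", "title"])
print(rows)
```

The point of the sketch is that an inconsistent field such as `noise` never reaches the resulting rows, which is the use case described above.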

Let me know if you'd like the same to be added for CSV or others as a follow-up — happy to help.

ArjunJagdale added 3 commits on June 27, 2025 21:48:

temp1 (db75657)

temp2

Update load.py (c7872cb)

Update test_json.py (a0fedf5)

ArjunJagdale changed the title from "temp1" to "Add columns parameter to JSON loader to filter selected columns during loading" on Jun 27, 2025

ArjunJagdale changed the title from "Add columns parameter to JSON loader to filter selected columns during loading" to "Add columns support to JSON loader for selective key filtering" on Jun 27, 2025

aihao2000 commented Jul 3, 2025

I need this feature right now. It would be great if it could automatically fill in None for non-existent keys instead of reporting an error.

Commit by ArjunJagdale: Update json.py (d23a48b)

ArjunJagdale (Contributor Author) commented Jul 3, 2025

> I need this feature right now. It would be great if it could automatically fill in None for non-existent keys instead of reporting an error.

Hi @aihao2000, just to confirm: I have made the changes you asked for!
If you pass columns=["key1", "key2", "optional_key"] to load_dataset(...) and any of those keys are missing from the input JSON objects, the loader will automatically fill those columns with None values instead of raising an error.

ArjunJagdale (Contributor Author) commented Jul 14, 2025

Hi! Any update on this PR?

lhoestq (Member) reviewed Aug 13, 2025

Cool ! I added a few comments :)

src/datasets/load.py (outdated)

src/datasets/packaged_modules/json/json.py

Comment on lines -116 to -131

    # Use block_size equal to the chunk size divided by 32 to leverage multithreading
    # Set a default minimum value of 16kB if the chunk size is really small

lhoestq (Member), Aug 13, 2025:

revert this comment deletion and the 2 others

ArjunJagdale (Contributor Author), Aug 14, 2025:

> revert this comment deletion and the 2 others

I wanted clarification on "the 2 others" to make sure no comment restorations were missed. I have restored the two missing comments above. Are they in the right place? :)

src/datasets/packaged_modules/json/json.py (outdated)

Comment on lines 145 to 150

    if self.config.columns is not None:
        missing_cols = [col for col in self.config.columns if col not in pa_table.column_names]
        for col in missing_cols:
            pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
        pa_table = pa_table.select(self.config.columns)
    yield (file_idx, batch_idx), self._cast_table(pa_table)

lhoestq (Member), Aug 13, 2025:

I would keep this at the end, where you removed the yield - this way the try/except is only about the paj.read_json call

src/datasets/packaged_modules/json/json.py (outdated)

        for col in missing_cols:
            pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
        pa_table = pa_table.select(self.config.columns)
    yield (file_idx, batch_idx), self._cast_table(pa_table)

lhoestq (Member), Aug 13, 2025:

same

ArjunJagdale and others added 5 commits on August 15, 2025 00:18:

Update src/datasets/load.py (5d3cc12)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Update src/datasets/load.py (eec7df9)
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Update json.py (5e93f70)

Update json.py (608ed21)

Merge branch 'huggingface:main' into patch-14 (9fa38b4)

src/datasets/packaged_modules/json/json.py (outdated)

Comment on lines 183 to 184

    # Pandas fallback in case of ArrowInvalid
    try:

lhoestq (Member), Aug 18, 2025:

this code is not at the right location anymore: it should trigger on ArrowInvalid

ArjunJagdale (Contributor Author), Aug 26, 2025:

I’ve moved the Pandas fallback into the except pa.ArrowInvalid block. Could you check?

ArjunJagdale added 2 commits on August 26, 2025 23:03:

Merge branch 'huggingface:main' into patch-14 (d05759a)

Update json.py (428444d)


3 participants

@ArjunJagdale @aihao2000 @lhoestq
