[codex] Resolve dataset attribute schema updates #664

Open
faizzyhon wants to merge 1 commit into
ARBML:main from
faizzyhon:codex/resolve-issue-663

faizzyhon commented Jun 10, 2026

Summary

Closes #663.

This updates the catalogue schema, all existing dataset records, and the website consumers for the requested attribute changes.

What changed

  • adds the Partial attribute and the conversations unit
  • renames the existing source-oriented Domain values to Source
  • adds topical Domain values: General, Legal, News, Quran, and Science
  • restricts Collection Style to either crawling or manual curation
  • adds Annotation Style, including no annotation
  • renames Test Split to Has Splits
  • renames existing dialect-specific Subsets to Dialect Subsets
  • adds a separate generic Subsets field for non-dialect dataset subsets
  • normalizes existing venue names and defines them as schema options, with other as an escape hatch
  • updates cards, search, statistics, fuzzy search, plots, validation, and README documentation

Migration rules

  • existing source categories are preserved under Source
  • Domain defaults to General, with conservative classification for clear News, Quran, Legal, and Science records
  • Collection Style is crawling when crawling was previously present; otherwise it is manual curation
  • existing human, machine, and LLM annotation values move to Annotation Style; entries without an explicit annotation method use no annotation, while unknown values remain other
  • existing dialect subset data moves to Dialect Subsets; generic Subsets starts empty
  • existing entries default to Partial: false
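
The rules above can be sketched as a per-record transformation. This is only an illustration, not the migration script used in this PR: `migrate_record` is a hypothetical helper, it assumes each record is a dict keyed by attribute name with the old Collection Style stored as a list of strings, and the legacy annotation value names are guesses. The conservative News, Quran, Legal, and Science classification is left out, so Domain always falls back to General here.

```python
from typing import Any

# Assumed legacy value names; the actual schema may spell these differently.
ANNOTATION_VALUES = ("human annotation", "machine annotation", "LLM generated")


def migrate_record(old: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of one dataset record with the migration rules applied."""
    new = dict(old)

    # Existing source categories are preserved under Source.
    new["Source"] = new.pop("Domain", [])
    # Domain defaults to General; topical classification is omitted in this sketch.
    new["Domain"] = ["General"]

    # Assumes the old Collection Style is a list of strings.
    styles = list(old.get("Collection Style", []))
    new["Collection Style"] = "crawling" if "crawling" in styles else "manual curation"

    # Annotation values move to Annotation Style; "other" stays "other";
    # records with no explicit annotation method get "no annotation".
    annotation = [s for s in styles if s in ANNOTATION_VALUES]
    if annotation:
        new["Annotation Style"] = annotation
    elif "other" in styles:
        new["Annotation Style"] = ["other"]
    else:
        new["Annotation Style"] = ["no annotation"]

    # Test Split is renamed to Has Splits.
    if "Test Split" in new:
        new["Has Splits"] = new.pop("Test Split")

    # Existing dialect subset data moves to Dialect Subsets; Subsets starts empty.
    new["Dialect Subsets"] = new.pop("Subsets", [])
    new["Subsets"] = []

    # Existing entries default to Partial: false.
    new.setdefault("Partial", False)
    return new
```

Returning a new dict rather than mutating the input makes it easy to diff a record before and after migration when spot-checking the result.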

Validation

  • USE_LOCAL_SCHEMA=true python validate_schema.py
  • node --check on all modified JavaScript files
  • python -m py_compile validate_schema.py plots.py
  • full catalogue audit confirming all 1,120 tracked dataset files contain the new required fields and no Test Split field
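
A catalogue audit of the kind described in the last item might look roughly like the sketch below. It is not the script that was run for this PR: `audit_record` and `audit_directory` are hypothetical names, the sketch assumes the tracked dataset files are JSON objects under a single directory, and treating every attribute named in this PR as required is an assumption.

```python
import json
from pathlib import Path
from typing import Any

# Attribute names taken from this PR; whether each one is required is an assumption.
REQUIRED_FIELDS = (
    "Partial",
    "Source",
    "Domain",
    "Collection Style",
    "Annotation Style",
    "Has Splits",
    "Dialect Subsets",
    "Subsets",
)
REMOVED_FIELDS = ("Test Split",)


def audit_record(record: dict[str, Any]) -> list[str]:
    """List the problems found in one record; an empty list means it passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    problems += [f"stale field: {name}" for name in REMOVED_FIELDS if name in record]
    return problems


def audit_directory(root: Path) -> dict[str, list[str]]:
    """Audit every JSON file under root and report only the failing files."""
    report: dict[str, list[str]] = {}
    for path in sorted(root.rglob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        problems = audit_record(record)
        if problems:
            report[str(path)] = problems
    return report
```

An empty report from `audit_directory` would correspond to the claim that every tracked file has the new fields and no Test Split field.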

Jekyll rendering was not run because Ruby/Bundler is unavailable in the local environment.

Development

Successfully merging this pull request may close these issues.

Attributes to add/modify
