hssn-20
commented
Apr 13, 2023
Hopefully this PR is OK as a first version of this dataset. In the next version, I'd like to remove exercises along with their solutions from the dataset and encode chemicals in a consistent format. P.S.
pre-commit.ci autofix
for more information, see https://pre-commit.ci
Hey @hssn-20, thank you very much for the PR! 🙏
I just had a look and triggered the pre-commit checks on GitHub; see the results here: https://results.pre-commit.ci/run/github/601226793/1681519715.6rdNlKF6QWaniPzvuAnS1g (the link is at the end below too).
Best would be to merge the latest main again, then make sure the latest pre-commit hooks are installed properly with pre-commit install, and then run black . (both in the main directory) to auto-format the code.
Then you can rerun the yaml creation with python transform.py and add those changes in a new commit to the PR.
Just let me know if you can add those changes, if not, I can also have a look. 😃
This is not used below. Are those lines already removed on the HF dataset upload?
data/libre_textbooks/transform.py
Outdated
Did the commit hooks run through with "OTHER" (capital letters)?
This script imports the uploaded libre chemistry textbooks dataset from Hugging Face, cleans the data by removing hyperlinks, licenses, and chapter headers, and then removes specific lines based on manual selection. The cleaned data is saved, and a metadata YAML file is generated from a template. Here's a colab notebook which implements the process.
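For reference, the cleaning and metadata steps described above could be sketched roughly like this. This is a hedged, stdlib-only illustration, not the actual transform.py: the function names, regex patterns, and metadata fields are assumptions made for the example.

```python
import re
from string import Template

# Illustrative patterns for the cleanup steps described above
# (hyperlinks, license notices, chapter headers); the real
# transform.py may match these differently.
URL_RE = re.compile(r"https?://\S+")
LICENSE_RE = re.compile(r"^.*(CC BY|creativecommons).*$",
                        re.IGNORECASE | re.MULTILINE)
CHAPTER_RE = re.compile(r"^\s*Chapter \d+.*$", re.MULTILINE)


def clean_text(text: str, drop_lines: frozenset = frozenset()) -> str:
    """Remove hyperlinks, license lines, and chapter headers, then
    drop manually selected line indices and blank lines."""
    text = URL_RE.sub("", text)
    text = LICENSE_RE.sub("", text)
    text = CHAPTER_RE.sub("", text)
    kept = [line for i, line in enumerate(text.splitlines())
            if i not in drop_lines and line.strip()]
    return "\n".join(kept)


# Metadata YAML generated from a template, as the description says;
# the field names here are hypothetical.
META_TEMPLATE = Template("name: $name\ndescription: $description\n")


def make_metadata(name: str, description: str) -> str:
    return META_TEMPLATE.substitute(name=name, description=description)
```

Usage would be something like `clean_text(raw_page)` per textbook page, followed by writing `make_metadata(...)` out as the YAML file that `python transform.py` regenerates.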