hssn-20
commented
Apr 13, 2023
Hopefully this PR is OK as a first version of this dataset. In the next version, I'd like to remove exercises along with their solutions from the dataset and encode chemicals in a consistent format. P.S.
pre-commit.ci autofix
for more information, see https://pre-commit.ci
Hey @hssn-20, thank you very much for the PR! 🙏
I just had a look and triggered the pre-commit checks on GitHub; see the results here: https://results.pre-commit.ci/run/github/601226793/1681519715.6rdNlKF6QWaniPzvuAnS1g (the link is at the end below too).
Best would be to merge the latest main again, then make sure the latest pre-commit hooks are installed properly with pre-commit install, and then run black . (both in the main directory) to auto-format the code.
Then you can rerun the yaml creation with python transform.py and add those changes in a new commit to the PR.
Just let me know if you can add those changes, if not, I can also have a look. 😃
This is not used below. Are those lines already removed on the HF dataset upload?
data/libre_textbooks/transform.py
Outdated
Did the commit hooks run through with "OTHER" (capital letters)?
This script imports the uploaded libre chemistry textbooks dataset from Hugging Face, cleans the data by removing hyperlinks, licenses, and chapter headers, and then removes specific lines based on manual selection. The cleaned data is saved, and a metadata YAML file is generated from a template. Here's a colab notebook which implements the process.
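For reference, the cleaning and metadata steps described above could be sketched roughly like this. This is a hedged, stdlib-only illustration, not the actual transform.py: the function names, regex patterns, and metadata fields are assumptions made for the example.

```python
import re
from string import Template

# Illustrative patterns for the cleanup steps described above
# (hyperlinks, license notices, chapter headers); the real
# transform.py may match these differently.
URL_RE = re.compile(r"https?://\S+")
LICENSE_RE = re.compile(r"^.*(CC BY|creativecommons).*$",
                        re.IGNORECASE | re.MULTILINE)
CHAPTER_RE = re.compile(r"^\s*Chapter \d+.*$", re.MULTILINE)


def clean_text(text: str, drop_lines: frozenset = frozenset()) -> str:
    """Remove hyperlinks, license lines, and chapter headers, then
    drop manually selected line indices and blank lines."""
    text = URL_RE.sub("", text)
    text = LICENSE_RE.sub("", text)
    text = CHAPTER_RE.sub("", text)
    kept = [line for i, line in enumerate(text.splitlines())
            if i not in drop_lines and line.strip()]
    return "\n".join(kept)


# Metadata YAML generated from a template, as the description says;
# the field names here are hypothetical.
META_TEMPLATE = Template("name: $name\ndescription: $description\n")


def make_metadata(name: str, description: str) -> str:
    return META_TEMPLATE.substitute(name=name, description=description)
```

Usage would be something like `clean_text(raw_page)` per textbook page, followed by writing `make_metadata(...)` out as the YAML file that `python transform.py` regenerates.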