DS4SD/PatCID

PatCID

This repository accompanies PatCID, an open-access dataset of chemical structures in patent documents. Each molecule in PatCID is linked to the patent documents that display it.

Citation

If you find this repository useful, please consider citing:

@article{Morin2024,
	title = {{PatCID: an open-access dataset of chemical structures in patent documents}},
	author = {Morin, Lucas and Weber, Val{\'e}ry and Meijer, Gerhard Ingmar and Yu, Fisher and Staar, Peter W. J.},
	year = 2024,
	month = {Aug},
	day = {02},
	journal = {Nature Communications},
	volume = 15,
	number = 1,
	pages = 6532,
	doi = {10.1038/s41467-024-50779-y},
	issn = {2041-1723},
	url = {https://doi.org/10.1038/s41467-024-50779-y}
}

Installation

Create a virtual environment.

conda create -n patcid python=3.11
conda activate patcid

Install poppler.

Linux: apt-get install poppler-utils 
Mac: brew install poppler 

Install python dependencies.

pip install -e .
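
Before running the examples, it can help to confirm that the Poppler command-line tools are actually on the PATH of the activated environment. The sketch below is only a sanity check. The tool names pdftoppm and pdfinfo are an assumption: they ship with Poppler, but this README does not state which Poppler binaries the repository calls.

```python
import shutil

# Command-line tools shipped with Poppler (poppler-utils on Debian/Ubuntu).
# Assumption: which of them this repository relies on is not stated in the README.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If a tool is reported as not found, re-run the Poppler installation step for your platform and open a new shell.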

Download PatCID Dataset

The PatCID dataset is available on Zenodo.

wget "https://zenodo.org/records/10572870/files/patcid.zip?download=1" -O patcid.zip
unzip patcid.zip -d ./data/patcid/

(Download size: 5.7 GB; file format: .jsonl)
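
Since the record schema is not described in this README, a reasonable first step after unzipping is to look at which top-level keys the records contain. The following is a minimal sketch that assumes only what is stated above: the files are in JSON Lines format and were extracted under ./data/patcid/. The helper names iter_jsonl and peek_keys are illustrative and not part of the repository.

```python
import json
from itertools import islice
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek_keys(path, n=5):
    """Return the sorted union of top-level keys seen in the first n records."""
    keys = set()
    for record in islice(iter_jsonl(path), n):
        if isinstance(record, dict):
            keys.update(record.keys())
    return sorted(keys)


if __name__ == "__main__":
    root = Path("./data/patcid")
    if root.is_dir():
        for jsonl_path in sorted(root.rglob("*.jsonl")):
            print(jsonl_path, peek_keys(jsonl_path))
```

Reading line by line keeps memory use low, which matters for a multi-gigabyte download.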

Document Retrieval

Run the notebook ./examples/molecule_query.ipynb to use PatCID to retrieve documents referencing a molecule of interest.

Molecule Retrieval

Run the notebook ./examples/patent_query.ipynb to use PatCID to retrieve molecules displayed in a given patent document.
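
The two notebooks are the supported way to query PatCID. To illustrate the underlying idea only, the sketch below builds lookup tables in both directions (molecule to documents, document to molecules) from an iterable of records. The field names "molecule" and "document" are hypothetical placeholders, as are the toy identifiers; replace them with the keys actually present in the dataset files. This is not the notebooks' implementation.

```python
from collections import defaultdict

# Hypothetical field names: replace them with the keys found in the real files.
MOLECULE_KEY = "molecule"
DOCUMENT_KEY = "document"


def build_indexes(records, molecule_key=MOLECULE_KEY, document_key=DOCUMENT_KEY):
    """Build molecule -> documents and document -> molecules lookup tables."""
    docs_by_molecule = defaultdict(set)
    molecules_by_doc = defaultdict(set)
    for record in records:
        molecule = record.get(molecule_key)
        document = record.get(document_key)
        if molecule is None or document is None:
            continue  # skip records that lack either field
        docs_by_molecule[molecule].add(document)
        molecules_by_doc[document].add(molecule)
    return docs_by_molecule, molecules_by_doc


if __name__ == "__main__":
    # Toy records with made-up identifiers, for illustration only.
    toy_records = [
        {"molecule": "CCO", "document": "DOC-A"},
        {"molecule": "CCO", "document": "DOC-B"},
        {"molecule": "c1ccccc1", "document": "DOC-A"},
    ]
    docs_by_molecule, molecules_by_doc = build_indexes(toy_records)
    print(sorted(docs_by_molecule["CCO"]))    # ['DOC-A', 'DOC-B']
    print(sorted(molecules_by_doc["DOC-A"]))  # ['CCO', 'c1ccccc1']
```

Two caveats apply. Exact string matching only finds a molecule if the query uses the same identifier form as the dataset, and in-memory indexes over the full dataset may need more memory than a laptop provides, so filtering while streaming the files is often the safer choice.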

User Interface

Video: user_interface.mp4

To request access to the above user interface, please contact IBM's Deep Search team at deepsearch-core@zurich.ibm.com.

Benchmark Datasets

The benchmark datasets D2C-UNI and D2C-RND are available on Zenodo.

Code

The code repositories used to build and evaluate PatCID are available:

For segmenting chemical-structure images from documents, we use DECIMER Segmentation from K. Rajan, H. O. Brinkhaus, M. Sorokina, A. Zielesny and C. Steinbeck.

Models

The model weights are available on Hugging Face:

Training Datasets

The training datasets are available on Zenodo and Hugging Face:

Additional Visualization

To test our processing pipeline outside its main application domain, we processed a scientific publication from ChemRxiv. The directory ./data/extra/scientific_paper_example/ contains the pages of the document (page_*.png) annotated with the segmentation and classification predictions. For pages containing molecules, the predicted molecules are provided in page_*_molecules.txt.
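
To browse these outputs programmatically, the sketch below pairs each page_*_molecules.txt file with its page image using the naming patterns given above. It assumes that each non-empty line of a molecules file holds one predicted molecule, which this README does not confirm. The function name load_page_molecules is illustrative.

```python
from pathlib import Path

EXAMPLE_DIR = Path("./data/extra/scientific_paper_example")


def load_page_molecules(example_dir=EXAMPLE_DIR):
    """Map each page image path to the non-empty lines of its molecules file."""
    pages = {}
    # Files are visited in lexical order (page_10 sorts before page_2).
    for txt_path in sorted(Path(example_dir).glob("page_*_molecules.txt")):
        # Derive the image name: page_3_molecules.txt -> page_3.png
        page_stem = txt_path.stem[: -len("_molecules")]
        image_path = txt_path.with_name(page_stem + ".png")
        lines = txt_path.read_text(encoding="utf-8").splitlines()
        # Assumption: one predicted molecule per non-empty line.
        pages[image_path] = [line.strip() for line in lines if line.strip()]
    return pages


if __name__ == "__main__":
    for image_path, molecules in load_page_molecules().items():
        print(f"{image_path.name}: {len(molecules)} line(s)")
```

Pages without molecules have no companion text file, so they do not appear in the returned mapping.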

About

[Nat. Commun.] PatCID: an open-access dataset of chemical structures in patent documents
