@article{Morin2024,
	title = {{PatCID: an open-access dataset of chemical structures in patent documents}},
	author = {Morin, Lucas and Weber, Val{\'e}ry and Meijer, Gerhard Ingmar and Yu, Fisher and Staar, Peter W. J.},
	year = 2024,
	month = {Aug},
	day = {02},
	journal = {Nature Communications},
	volume = 15,
	number = 1,
	pages = 6532,
	doi = {10.1038/s41467-024-50779-y},
	issn = {2041-1723},
	url = {https://doi.org/10.1038/s41467-024-50779-y}
}

Installation

Create a virtual environment.

conda create -n patcid python=3.11
conda activate patcid

Install poppler.

Linux: apt-get install poppler-utils 
Mac: brew install poppler

Install python dependencies.

pip install -e .

Download PatCID Dataset

The PatCID dataset is available on Zenodo.

wget "https://zenodo.org/records/10572870/files/patcid.zip?download=1" -O patcid.zip
unzip patcid.zip -d ./data/patcid/

(Download size: 5.7 GB; file format: .jsonl)
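
Before running the notebooks, it can help to check what the extracted files contain. The sketch below assumes only what is stated above (the archive unpacks to .jsonl files under ./data/patcid/) plus one unverified assumption: that each line is a JSON object. The helper names iter_jsonl and peek_dataset are illustrative and not part of the PatCID code base.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of a .jsonl file."""
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def peek_dataset(data_dir: Path, limit: int = 1) -> dict[str, list[list[str]]]:
    """Map each .jsonl file name to the sorted keys of its first `limit` records.

    Assumes every record is a JSON object (dict); this is not confirmed
    by the README and should be checked against the real files.
    """
    summary = {}
    for path in sorted(data_dir.rglob("*.jsonl")):
        keys = []
        for i, record in enumerate(iter_jsonl(path)):
            if i >= limit:
                break
            keys.append(sorted(record))
        summary[path.name] = keys
    return summary
```

Calling peek_dataset(Path("./data/patcid/")) after unzipping lists the field names of the first record in each file, without loading the full 5.7 GB into memory.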

Document Retrieval

Run the notebook ./examples/molecule_query.ipynb to use PatCID to retrieve documents referencing a molecule of interest.

Molecule Retrieval

Run the notebook ./examples/patent_query.ipynb to use PatCID to retrieve molecules displayed in a given patent document.
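
The two retrieval directions (molecule to documents, document to molecules) amount to a pair of lookups over the same records. The sketch below is not the notebooks' implementation: the field names are passed in as parameters because the record schema is not described here, the toy records are synthetic, and matching by exact string is an assumption (the notebooks may normalize molecule identifiers differently).

```python
from collections import defaultdict
from typing import Iterable


def build_indexes(records: Iterable[dict], doc_key: str, mol_key: str):
    """Build molecule -> documents and document -> molecules lookups.

    doc_key and mol_key are placeholders for whichever fields hold the
    patent identifier and the molecule identifier in the real records.
    """
    docs_by_molecule = defaultdict(set)
    molecules_by_doc = defaultdict(set)
    for record in records:
        doc_id, mol_id = record.get(doc_key), record.get(mol_key)
        if doc_id is None or mol_id is None:
            continue  # skip records missing either field
        docs_by_molecule[mol_id].add(doc_id)
        molecules_by_doc[doc_id].add(mol_id)
    return docs_by_molecule, molecules_by_doc


# Synthetic records for illustration only; not taken from PatCID.
toy_records = [
    {"doc": "DOC-A", "mol": "CCO"},
    {"doc": "DOC-B", "mol": "CCO"},
    {"doc": "DOC-A", "mol": "c1ccccc1"},
]
docs_by_molecule, molecules_by_doc = build_indexes(toy_records, "doc", "mol")
print(sorted(docs_by_molecule["CCO"]))    # documents referencing the molecule
print(sorted(molecules_by_doc["DOC-A"]))  # molecules displayed in the document
```

For the actual field names and query logic, refer to the two notebooks in ./examples/.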

User Interface

user_interface.mp4

To request access to the user interface shown above, please contact IBM's Deep Search team at deepsearch-core@zurich.ibm.com.

Benchmark Datasets

The benchmark datasets D2C-UNI and D2C-RND are available on Zenodo.

Code

The code repositories used to build and evaluate PatCID are available:

For segmenting chemical-structure images from documents, we use DECIMER Segmentation from K. Rajan, H. O. Brinkhaus, M. Sorokina, A. Zielesny and C. Steinbeck.

Models

The model weights are available on Hugging Face:

The classification model
The recognition model

Training Datasets

The training datasets are available on Zenodo and Hugging Face:

Additional Visualization

To test our processing pipeline outside its main application domain, we processed a scientific publication from ChemRxiv. The directory ./data/extra/scientific_paper_example/ contains the pages of the document (page_*.png), annotated with the segmentation and classification predictions. For pages containing molecules, the predicted molecules are provided in page_*_molecules.txt.
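
To browse this example programmatically, the annotated pages can be paired with their prediction files. The sketch below relies on the file patterns named above and makes two assumptions that should be checked against the actual directory: that a molecules file shares the stem of its page image (for example, page_3.png with page_3_molecules.txt), and that it holds one prediction per line. The function name is illustrative.

```python
from pathlib import Path


def pair_pages_with_molecules(example_dir: Path) -> dict[str, list[str]]:
    """Map each annotated page image name to the lines of its molecules file.

    Pages without a matching page_*_molecules.txt map to an empty list.
    """
    pairs = {}
    for image in sorted(example_dir.glob("page_*.png")):
        # Assumed naming: <image stem>_molecules.txt next to the image.
        molecules_file = image.with_name(f"{image.stem}_molecules.txt")
        if molecules_file.exists():
            text = molecules_file.read_text(encoding="utf-8")
            pairs[image.name] = [ln.strip() for ln in text.splitlines() if ln.strip()]
        else:
            pairs[image.name] = []
    return pairs
```

Calling pair_pages_with_molecules(Path("./data/extra/scientific_paper_example/")) returns a dictionary keyed by page image name. Keys follow lexicographic file order, so page_10.png sorts before page_2.png.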
