Nat Commun. 2020 Nov 13;11(1):5753.
doi: 10.1038/s41467-020-19594-z.

Machine learning with physicochemical relationships: solubility prediction in organic solvents and water

Samuel Boobier et al. Nat Commun. 2020.

Abstract

Solubility prediction remains a critical challenge in drug development, synthetic route and chemical process design, extraction and crystallisation. Here we report a successful approach to solubility prediction in organic solvents and water using a combination of machine learning (ANN, SVM, RF, ExtraTrees, Bagging and GP) and computational chemistry. Rational interpretation of the dissolution process as a numerical problem led to a small set of selected descriptors and to predictions that are independent of the applied machine learning method. These models gave significantly more accurate predictions than benchmarked open-access and commercial tools, achieving accuracy close to the expected level of noise in the training data (LogS ± 0.7). Finally, they reproduced the physicochemical relationships between solubility and molecular properties in different solvents, which led to rational approaches to improving the accuracy of each model.
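
To make the accuracy figure quoted above concrete, the following is a minimal sketch of how predictions might be scored against a LogS ± 0.7 tolerance. It is not the authors' code. It assumes LogS means log10 of molar solubility (mol/L); the function names are hypothetical and the numbers are made up for illustration only.

```python
import math


def log_s(solubility_mol_per_l):
    """Return LogS, assumed here to be log10 of solubility in mol/L."""
    if solubility_mol_per_l <= 0:
        raise ValueError("solubility must be positive")
    return math.log10(solubility_mol_per_l)


def rmse(predicted, experimental):
    """Root-mean-square error between predicted and experimental LogS."""
    errors = [p - e for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(err * err for err in errors) / len(errors))


def fraction_within(predicted, experimental, tolerance=0.7):
    """Fraction of predictions within +/- tolerance LogS units."""
    hits = sum(
        1 for p, e in zip(predicted, experimental) if abs(p - e) <= tolerance
    )
    return hits / len(predicted)


# Illustrative values only; these are NOT data from the paper.
experimental = [-2.0, -3.5, -1.0, -4.2]
predicted = [-2.5, -3.0, -2.0, -4.0]

print(f"RMSE: {rmse(predicted, experimental):.2f}")
print(f"Within +/-0.7 LogS: {fraction_within(predicted, experimental):.0%}")
```

A model whose errors mostly fall inside the tolerance is performing about as well as the stated noise in the training data allows, so further gains would likely require cleaner measurements rather than a different algorithm.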

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Concepts of solubility prediction and data availability.
a Physical aspects of the dissolution process of a solid and the corresponding descriptors. b Curated solubility datasets for this study and their LogS distributions (N = number of datapoints, T = number of datapoints in training set, S = number of datapoints in test set).
Fig. 2. Results of initial machine learning prediction models.
a Descriptor correlation analysis; b principal component analysis of the descriptors with Water_set_wide; plots of predicted vs experimental LogS, with predicted errors, using the GP algorithm for c Water_set_wide, d Water_set_narrow, e Ethanol_set, f Benzene_set and g Acetone_set; h distributions of predicted errors (1 standard deviation) for each dataset with GP; i impact of the removal of a single descriptor on ET prediction models (blue: Water_set_wide, orange: Benzene_set); j feature importance plot for ET prediction models (blue: Water_set_wide, orange: Benzene_set).
Fig. 3. Benchmarking results against other predictive models.
Predicted vs experimental LogS for Water_set_wide: a ET model; b GSE model; c AquaSol model; d EPI Suite 1 model; e EPI Suite 2 model; f COSMOtherm calculations; for Ethanol_set: g ET model; h COSMOtherm calculations; for Benzene_set: i ET model; j COSMOtherm calculations; for Acetone_set: k ET model; l COSMOtherm calculations. Prediction results using datasets from AstraZeneca: m functional group distribution analysis for the AstraZeneca dataset and Water_set_wide; predicted vs experimental LogS for n ET model for AZ_water (without m.p.); o ET model for AZ_ethanol (without m.p.); p ET model for AZ_acetone (without m.p.); q COSMOtherm calculations for AZ_water; r COSMOtherm calculations for AZ_ethanol; and s COSMOtherm calculations for AZ_acetone.
