PLoS Comput Biol. 2022 Dec 15;18(12):e1010718.
doi: 10.1371/journal.pcbi.1010718. eCollection 2022 Dec.

Eleven quick tips for data cleaning and feature engineering

Davide Chicco et al. PLoS Comput Biol.

Abstract

Applying computational statistics or machine learning methods to data is a key component of many scientific studies in any field, but on its own it might not be sufficient to generate robust and reliable results. Before applying any discovery method, preprocessing steps are necessary to prepare the data for computational analysis. In this framework, data cleaning and feature engineering are key pillars of any scientific study involving data analysis, and they should be adequately designed and performed from the first phases of the project. We call "feature" a variable describing a particular trait of a person or an observation, usually recorded as a column in a dataset. Although pivotal, these data cleaning and feature engineering steps are sometimes done poorly or inefficiently, especially by beginners and inexperienced researchers. For this reason, we propose here our quick tips for data cleaning and feature engineering, which explain how to carry out these important preprocessing steps correctly while avoiding common mistakes and pitfalls. Although we designed these guidelines with bioinformatics and health informatics scenarios in mind, we believe they can be applied more generally to any scientific area. We therefore target these guidelines to any researcher or practitioner wanting to perform data cleaning or feature engineering. We believe our simple recommendations can help researchers and scholars perform better computational analyses that can lead, in turn, to more solid outcomes and more reliable discoveries.

Conflict of interest statement

The authors declare they have no conflict of interest.
